Sure is a lot of posts today. With all the Java performance myths out of the way, let's take a look at the threading system of WSW.

TL;DR of this entire post: Insomnia achieves approximately 3.5x scaling on a quad core. It fails to reach 4.0x because the graphics driver's thread is slow and competing with the game's threads. Future versions of OpenGL (pretty much the same as DirectX but cross-platform) should allow us to reach 4.0x scaling.



So what's a thread? To simplify this a lot, threads are essentially what allows your (single-core) processor to run multiple programs at the same time. If a processor has 4 threads to run, it'll switch between them extremely fast, so from the programmer's perspective, it looks like all 4 threads are running at the same time, but 1/4th as fast. However, it's also useful to have a single program using multiple threads. This can allow the program to do heavy calculations in the background while keeping the user interface responsive.
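To make that concrete, here's a tiny Java sketch of the "heavy work in the background while the main thread stays responsive" idea. The class and numbers are made up for illustration, not taken from Insomnia:

```java
// Minimal sketch: expensive work runs on a background thread while the main thread stays free.
public class BackgroundWorkDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            // Pretend this loop is an expensive calculation.
            long sum = 0;
            for (long i = 0; i < 1_000_000_000L; i++) sum += i;
            System.out.println("Background result: " + sum);
        });
        worker.start();   // runs concurrently with the code below

        System.out.println("Main thread is still free to respond to the user.");
        worker.join();    // wait for the background work to finish
    }
}
```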

At some point, hardware developers realized that increasing the clock speed of CPUs was becoming unsustainable. CPUs were getting too hot and used too much power. It turned out to be much cheaper to drop the clock rate a little and put more cores in instead. Doubling the clock rate roughly increases power usage (and therefore heat) by a factor of 8. This means that at the same power consumption, you can get:

 - a single core processor at 1.00 GHz with 1.0x total performance.
 - a dual core processor at 0.80 GHz with 1.6x total performance.
 - a quad core processor at 0.63 GHz with 2.52x total performance.
 - an octa core processor at 0.50 GHz with 4.0x total performance.

These 4 processors all use the same amount of power, but efficiency increases massively as more cores are added.
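For the curious, those numbers fall out of the rough rule of thumb that a core's power draw grows with the cube of its clock speed (which is where the factor of 8 for doubling the clock comes from). Holding total power fixed at the single-core baseline, a quick back-of-the-envelope check under that cubic assumption:

```latex
% Rough model: per-core power ~ f^3, so total power ~ N f^3 and total performance ~ N f.
\begin{align*}
N = 1:&\quad f = 1.00\ \text{GHz}, & N f^3 &= 1.00, & N f &= 1.00\times\\
N = 2:&\quad f = 0.80\ \text{GHz}, & N f^3 &\approx 1.02, & N f &= 1.60\times\\
N = 4:&\quad f = 0.63\ \text{GHz}, & N f^3 &\approx 1.00, & N f &= 2.52\times\\
N = 8:&\quad f = 0.50\ \text{GHz}, & N f^3 &= 1.00, & N f &= 4.00\times
\end{align*}
```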

So now, all of a sudden, our quad core processors can take those 4 threads our original single-core processor had and actually run all 4 of them at the same time at full speed. It doesn't matter that each core is slower; running 4 of them at more than half speed is still 2.5x faster. That's the theory, of course. In practice, the CPU cores share a lot of resources (particularly the RAM and memory controller), so you're not gonna see a perfect 4x performance boost from utilizing all 4 cores. At best, you might see a 3.5-3.9x increase in performance.



The problem today is that games aren't good at using the resources they have available. Having more cores doesn't mean anything unless you have threads to run on them. Even today, many years after the introduction of multi-core CPUs, most games still don't utilize more than 1 or 2 cores (*cough* Planetside 2 *cough*), but some games do show that it's doable (the recent Battlefield games for example). Insomnia's not going to lose when it comes to threading.

Insomnia's thread system is based on splitting up the game's rendering and logic code into distinct tasks. These tasks are organized like a flow chart, with certain tasks requiring other tasks to be completed before they're executed. The tasks are then put into a queue, and a number of threads can be created that run these tasks one by one from the queue.
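Here's a minimal sketch of that idea in Java. The class and method names are illustrative rather than Insomnia's actual code, and it leans on the standard ExecutorService instead of a custom scheduler:

```java
import java.util.List;
import java.util.concurrent.*;

// Sketch of a dependency-aware task queue: each task waits for its prerequisites
// to finish before it runs, and a fixed pool of worker threads executes the tasks.
class TaskGraph {
    private final ExecutorService workers =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    // Submit a task that only starts once all of its dependencies have completed.
    Future<?> submit(Runnable work, List<Future<?>> dependencies) {
        return workers.submit(() -> {
            for (Future<?> dependency : dependencies) {
                dependency.get();   // block until the prerequisite task is done
            }
            work.run();
            return null;
        });
    }
}
```

A real scheduler wouldn't park a worker thread on get() like this, but the shape is the same: a frame becomes a handful of submit() calls, with the task that builds the draw lists depending on the animation and culling tasks that feed it.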

Insomnia directly or indirectly uses a large number of threads.

 - N generic worker threads, where N is the number of cores the CPU has.
 - 1 main thread, which is the only thread allowed to communicate with the graphics driver.
 - The graphics driver has its own internal thread which is beyond Insomnia's control. Insomnia's main thread offloads work to this thread so that the main thread can work on AI and other stuff.

For the graphics and physics code, almost everything can be run on any number of cores. The only tasks that cannot be run on multiple threads are the tasks that require communication with the graphics card. Almost all of these are just small high-fives with the driver to ensure that everything's still correct, but some are pretty large. This is where the graphics driver's thread comes in and splits the work with the main thread automatically. It took a lot of work to avoid stepping on the driver's thread's toes, but I've managed to let the driver thread work completely undisturbed. It's not perfect (as will be evident later), but I'm not sure it's possible to improve this with the current version of OpenGL.
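One way to picture how the "only the main thread talks to the driver" rule plays out in code is a simple hand-off queue. This is a simplified, hypothetical sketch, not the engine's real implementation:

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Worker threads never touch the OpenGL context directly; they enqueue the small
// "check in with the driver" tasks here, and the main thread drains the queue
// between its own chunks of work.
class MainThreadTasks {
    private final ConcurrentLinkedQueue<Runnable> glTasks = new ConcurrentLinkedQueue<>();

    // Safe to call from any worker thread.
    void post(Runnable glTask) {
        glTasks.add(glTask);
    }

    // Called on the main thread, which owns the OpenGL context.
    void drain() {
        Runnable task;
        while ((task = glTasks.poll()) != null) {
            task.run();
        }
    }
}
```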

Here's a rather large flow-chart-like visualization of the tasks that the rendering code is split up into. Tasks marked in red are tasks that require communication with the graphics driver, so they must be run on the main thread.

How much does this improve performance though? If I run this on a quad core, do I see 4 times higher FPS? Almost.

Here are some of the results I get on my Intel i7-4770K quad core CPU:

 - The rendering code achieves 3.64x scaling.
 - The physics code achieves 3.19x scaling.
 - The actual increase in frame rate is only 2.82x (which is still a 182% increase).

I blame this on the driver's internal thread, which competes for CPU time with Insomnia's threads. This is evident from the fact that the engine spends around 1/3rd of its time waiting for the driver thread to finish its work. The next generation of OpenGL should remove the restriction on the red tasks and also remove the internal driver thread, which would allow us to improve this scaling even further, but until then, this is about as good as it gets.

Holy shit, someone read all the way down here. Uh, not sure what to say... Hi, mum?

Hello again, everyone! Double post this week! I thought I'd rant a bit about our choice of Java.



As some of you know, we're using Java to develop WSW and Insomnia. No, we're not using Unreal Engine, but thanks for the compliment. ^^ Now, a lot of people are skeptical of our choice of programming language. Java doesn't exactly have a flawless reputation when it comes to performance (and security, although that only applies to the Java browser plugin, which is not required in any way for Insomnia), but I thought I'd kill the two most common misconceptions about Java here.


a + b is equally fast in Java and C++.

Any basic arithmetic operation is equally fast in Java and C++. The Java Virtual Machine (JVM) compiles those instructions to essentially the same machine code a C++ compiler produces, although Java needs a few seconds after startup for the hot code to be compiled, so the game takes a moment to reach optimal performance when it's first started. There are some special instructions available from C++ that can improve performance in certain math-intensive areas (matrix math, for example). In our case, we actually take advantage of some of these by using math libraries with native C++ code for the most performance-heavy places, like skeleton animation, so again, our performance with Java stays within 90+% of C++ here.
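If you want to see the warm-up effect for yourself, a crude experiment like the one below shows the first passes running slower until the JIT has compiled the hot loop. It's hand-rolled, so take the exact numbers with a grain of salt; a real measurement would use a proper harness such as JMH:

```java
// Rough illustration of JIT warm-up: early passes run in the interpreter,
// later passes run the compiled loop and settle to a steady time.
public class WarmupDemo {
    static long sum(int[] values) {
        long total = 0;
        for (int v : values) {
            total += v;   // plain arithmetic, the kind of code the JIT compiles to tight machine code
        }
        return total;
    }

    public static void main(String[] args) {
        int[] data = new int[10_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i;

        for (int pass = 0; pass < 10; pass++) {
            long start = System.nanoTime();
            long result = sum(data);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println("pass " + pass + ": " + millis + " ms (result " + result + ")");
        }
    }
}
```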


Java's garbage collection is not a problem.

Many games written in Java have problems with performance and stuttering due to the Java garbage collector, which automatically frees memory that is no longer in use. An automatic collection pass can suddenly trigger and interfere with the game's smoothness. There are three reasons why this is not a problem for us.
First, garbage collection only happens if you're actually generating garbage. It's not hard to make a completely garbage-free game loop that allocates all its resources once and then reuses them indefinitely, and this is what we're aiming for (there's a small sketch of the pattern below).
Secondly, the garbage collection passes are fast and mostly run in parallel with the game, so the actual time the game is paused for a collection is in the range of a few milliseconds, which the CPU easily absorbs without dropping a single frame in almost all cases. The stuttering we get from garbage collection is a tenth as frequent and as intense as the stuttering we get from deep within the graphics driver, far outside any game developer's control.
Thirdly, the idea that allocating and freeing memory is slower in Java than in C++ is a myth in the first place. The fact that memory management is left entirely to the JVM is actually an advantage, as it can avoid fragmenting the heap, a common problem for C++ programs that degrades performance over time. Another massive advantage of garbage collection is that it's a lot easier for us developers to work with, so we can spend more time on new features and on optimizing our algorithms instead of figuring out where the memory leak that makes the game crash and burn after 30 minutes of playing is hiding.
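Here's a small, hypothetical sketch of that "allocate once, reuse forever" pattern mentioned above. The names are illustrative, not our actual classes:

```java
// Everything is allocated up front; update() only mutates preallocated objects,
// so it produces no garbage for the collector to clean up mid-game.
public class GarbageFreeLoop {
    // Simple mutable vector reused in place instead of being re-created each frame.
    static final class Vec3 {
        float x, y, z;
        void addScaled(Vec3 other, float scale) {
            x += other.x * scale;
            y += other.y * scale;
            z += other.z * scale;
        }
    }

    private final Vec3[] positions  = new Vec3[1024];   // allocated once at startup
    private final Vec3[] velocities = new Vec3[1024];

    public GarbageFreeLoop() {
        for (int i = 0; i < positions.length; i++) {
            positions[i]  = new Vec3();
            velocities[i] = new Vec3();
        }
    }

    // Called every frame: no 'new' anywhere in here.
    void update(float dt) {
        for (int i = 0; i < positions.length; i++) {
            positions[i].addScaled(velocities[i], dt);
        }
    }
}
```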



So where is Java actually slower, then? The biggest loss of performance in Java compared to C++ comes from memory layout. In C++ you can use a number of techniques to force memory locality, so that data which is often used together sits in a contiguous block of RAM. This makes the program more cache-friendly: the CPU always loads memory in relatively large blocks, so it "accidentally" loads and caches all the related information when the first piece of memory is accessed. In Java we have no way of forcing this, as placement in memory is left to the JVM, which may even reorder things later (again, this has other advantages). If you're aware of it, though, it's not that difficult to minimize the impact. In addition, many Intel CPUs have hardware that pretty much eliminates this difference, which I'll go into detail about in my next post.
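One common way to minimize it is to keep hot data in flat primitive arrays, structure-of-arrays style, so that neighbouring elements really are neighbours in RAM. This is a hedged sketch of the general technique, not necessarily how Insomnia lays out its data:

```java
// One contiguous primitive array per field instead of an array of scattered objects.
public class ParticleStore {
    private final float[] posX, posY, posZ;
    private final float[] velX, velY, velZ;

    ParticleStore(int capacity) {
        posX = new float[capacity]; posY = new float[capacity]; posZ = new float[capacity];
        velX = new float[capacity]; velY = new float[capacity]; velZ = new float[capacity];
    }

    // Streaming over contiguous arrays is cache-friendly: each cache line the CPU
    // pulls in already contains the next few particles' data.
    void integrate(float dt) {
        for (int i = 0; i < posX.length; i++) {
            posX[i] += velX[i] * dt;
            posY[i] += velY[i] * dt;
            posZ[i] += velZ[i] * dt;
        }
    }
}
```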

