Sure is a lot of posts today. With all the Java performance myths out of the way, let's take a look at the threading system of WSW.

TL;DR of this entire post: Insomnia achieves approximately 3.5x scaling on a quad core. It fails to reach 4.0x because the graphics driver's thread is slow and competing with the game's threads. Future versions of OpenGL (pretty much the same as DirectX but cross-platform) should allow us to reach 4.0x scaling.

So what's a thread? To simplify a lot, threads are what allow your (single-core) processor to run multiple programs at the same time. If a processor has 4 threads to run, it switches between them extremely quickly, so from the programmer's perspective it looks like all 4 threads are running at the same time, each at 1/4th speed. It's also useful for a single program to use multiple threads. This allows the program to do heavy calculations in the background while keeping the user interface responsive.
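Here's what that looks like in practice. This is just a minimal illustration (not engine code): a heavy calculation runs on a background thread while the main thread remains free to do other work until it needs the result.

```java
public class BackgroundWork {
    // Run a heavy calculation (sum of 1..n) on a background thread.
    static long sumOnWorkerThread(long n) throws InterruptedException {
        final long[] result = new long[1];
        Thread worker = new Thread(() -> {
            long sum = 0;
            for (long i = 1; i <= n; i++) sum += i;
            result[0] = sum;
        });
        worker.start();   // the main thread is free to do other work here
        worker.join();    // wait for the background thread to finish
        return result[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(sumOnWorkerThread(1_000_000)); // 500000500000
    }
}
```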

At some point, hardware developers realized that increasing the clock speed of CPUs was becoming unsustainable: CPUs were getting too hot and drawing too much power. It turned out to be much cheaper to drop the clock rate a little and add more cores instead. Doubling the clock rate increases power usage (and therefore heat) by roughly a factor of 8. This means that at the same power consumption, you can get:

 - a single-core processor at 1.00 GHz, with 1.00x total performance.
 - a dual-core processor at 0.80 GHz, with 1.60x total performance.
 - a quad-core processor at 0.63 GHz, with 2.52x total performance.
 - an octa-core processor at 0.50 GHz, with 4.00x total performance.

These 4 processors all use the same amount of power, but efficiency increases massively as more cores are added.
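Those numbers fall out of assuming power grows with the cube of the clock rate, so at equal power the per-core clock scales as cores^(-1/3). A quick sketch of the arithmetic (to rounding, this reproduces the list above):

```java
public class CubeLaw {
    // Per-core clock at equal total power, assuming power ~ clock^3 per core:
    // cores * clock^3 = 1  =>  clock = cores^(-1/3)
    static double clockAtEqualPower(int cores) {
        return Math.pow(cores, -1.0 / 3.0);
    }

    // Total throughput = number of cores * per-core clock.
    static double totalPerformance(int cores) {
        return cores * clockAtEqualPower(cores);
    }

    public static void main(String[] args) {
        for (int cores : new int[] {1, 2, 4, 8}) {
            System.out.printf("%d core(s): %.2f GHz, %.2fx performance%n",
                    cores, clockAtEqualPower(cores), totalPerformance(cores));
        }
    }
}
```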

So now, all of a sudden, our quad-core processor can take those 4 threads our original single-core processor had and actually run all 4 of them at the same time at full speed. It doesn't matter that each core is slower; running 4 of them at more than half speed is still 2.5x faster. That's the theory, of course. In practice, the CPU cores share a lot of their resources (particularly the memory controller and RAM), so you're not gonna see a perfect 4x performance boost from utilizing all 4 cores. At best, you might see a 3.5-3.9x increase in performance.

The problem today is that games aren't good at using the resources they have available. Having more cores doesn't mean anything unless you have threads to run on them. Even today, many years after the introduction of multi-core CPUs, most games still don't utilize more than 1 or 2 cores (*cough* Planetside 2 *cough*), but some games do show that it's doable (the recent Battlefield games for example). Insomnia's not going to lose when it comes to threading.

Insomnia's thread system is based on splitting the game's rendering and logic code into distinct tasks. These tasks are organized like a flow chart, with certain tasks requiring other tasks to be completed before they can execute. The tasks are then put into a queue, and any number of threads can be created to run these tasks one by one from the queue.
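A toy sketch of the idea (hypothetical names, not Insomnia's actual code): each task tracks how many unfinished dependencies it has, and only becomes eligible for the worker queue once that count hits zero.

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class TaskGraph {
    static class Task {
        final Runnable work;
        final List<Task> dependents = new ArrayList<>();
        final AtomicInteger remainingDeps = new AtomicInteger(0);
        Task(Runnable work) { this.work = work; }
    }

    // Register that 'task' can only run after every task in 'deps'.
    static Task dependsOn(Task task, Task... deps) {
        for (Task d : deps) {
            d.dependents.add(task);
            task.remainingDeps.incrementAndGet();
        }
        return task;
    }

    // Run the whole graph on a fixed pool of worker threads.
    static void run(List<Task> tasks, int workers) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        CountDownLatch done = new CountDownLatch(tasks.size());
        for (Task t : tasks)
            if (t.remainingDeps.get() == 0) submit(pool, t, done);
        done.await();
        pool.shutdown();
    }

    private static void submit(ExecutorService pool, Task t, CountDownLatch done) {
        pool.execute(() -> {
            t.work.run();
            // Unblock dependents; whoever drops a counter to zero queues it.
            for (Task d : t.dependents)
                if (d.remainingDeps.decrementAndGet() == 0) submit(pool, d, done);
            done.countDown();
        });
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> order = Collections.synchronizedList(new ArrayList<>());
        Task update = new Task(() -> order.add("update"));
        Task cull   = new Task(() -> order.add("cull"));
        Task draw   = dependsOn(new Task(() -> order.add("draw")), update, cull);
        run(Arrays.asList(update, cull, draw), 4);
        System.out.println(order); // "draw" always comes last
    }
}
```

"update" and "cull" may run in either order on different workers, but "draw" is guaranteed to run after both.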

Insomnia directly or indirectly uses a large number of threads.

 - N generic worker threads, where N is the number of cores the CPU has.
 - 1 main thread, which is the only thread allowed to communicate with the graphics driver.
 - The graphics driver has its own internal thread which is beyond Insomnia's control. Insomnia's main thread offloads work to this thread so that the main thread can work on AI and other stuff.

For the graphics and physics code, almost everything can be run on any number of cores. The only tasks that cannot be run on multiple threads are the tasks that require communication with the graphics card. Almost all of these are just small high-fives with the driver to ensure that everything's still correct, but some are pretty large. This is where the graphics driver's thread comes in and splits the work with the main thread automatically. It took a lot of work to avoid stepping on the driver's thread's toes, but I've managed to let the driver thread work completely undisturbed. It's not perfect (as will be evident later), but I'm not sure it's possible to improve this with the current version of OpenGL.
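One common way to enforce that "main thread only" rule (a generic sketch, not Insomnia's implementation) is to have worker threads enqueue driver work into a thread-safe queue that only the main thread ever drains:

```java
import java.util.concurrent.ConcurrentLinkedQueue;

public class MainThreadQueue {
    private final ConcurrentLinkedQueue<Runnable> glCommands =
            new ConcurrentLinkedQueue<>();

    // Any thread may call this to schedule driver work.
    public void submit(Runnable glCommand) {
        glCommands.add(glCommand);
    }

    // Only the main thread calls this (e.g. once per frame), so all
    // communication with the graphics driver stays on a single thread.
    public int drain() {
        int executed = 0;
        Runnable cmd;
        while ((cmd = glCommands.poll()) != null) {
            cmd.run();
            executed++;
        }
        return executed;
    }
}
```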

Here's a rather large flow-chart-like visualization of the tasks that the rendering code is split up into. Tasks marked with red are tasks that require communication with the graphics thread, so they must be run on the main thread.

How much does this improve performance though? If I run this on a quad core, do I see 4 times higher FPS? Almost.

Here are some of the results I get on my Intel i7-4770K quad core CPU:

 - The rendering code achieves 3.64x scaling.
 - The physics code achieves 3.19x scaling.
 - The actual increase in frame rate is only 2.82x (which is still a 182% increase).

I blame this on the driver's internal thread, which competes for CPU time with Insomnia's threads. This is evident from the fact that the engine spends around 1/3rd of its time waiting for the driver thread to finish its work. The next generation of OpenGL should remove the restriction on the red tasks and also eliminate the internal driver thread, which would allow us to improve this scaling even further, but until then, this is about as good as it gets.

Holy shit, someone read all the way down here. Uh, not sure what to say... Hi, mum?