performance – Nuclex Games Blog

When you’re writing some code that needs to notify code in othe r parts of the program, your weapon of choice is the "signal / slot concept". A signal is a connection point where any interested party can register a callback function to be invoked when the signal emits/fires.

There’s already an ocean of libraries out there providing this functionality to C++, but as you will see in this article, they’re all suffering from performance issues in one way or other. Plus, most don’t compile without warnings, have inconvenient sytax or lack unit tests.

So here’s the signal/slot "library" (it’s just three headers) I wrote to fix those issues for me, together with a summary of my design goals and a comprehensive benchmark on different compilers and CPUs.

Sometimes, games have to render highly dynamic geometry such as sparks, bullet trails, muzzle flashes and lightning arcs. Sometimes it’s possible to off-load the work of simulating these things to the GPU, but there are effects than can’t be done by the GPU alone.

These cases usually occur when effects require heavy interaction with level geometry or when they require lots of conditionals to mutate a persistent effect state. And sometimes, the effort of simulating an effect on the GPU is just not worth the results. If you have maybe a hundred instances of a bullet trail at once, letting the GPU orient the constrained billboards for the trails instead of generating the vertices on the CPU might just not yield any tangible benefits.

However, there are still a lot of traps you can run into. A typical mistake of the unknowing developer is to render the primitives one-by-one either using one Draw[Indexed]UserPrimitives() call per spark/trail/arc.

This is not a good idea because modern GPUs are optimized for rendering large numbers of polygons at once. When you call Draw[Indexed]UserPrimitives(), XNA will call into Direct3D, which will cause a call into driver (which means a call from code running in user mode to code running in kernel mode, which is especially slow). Then the vertices are added to the GPU’s processing queue.

By sending single primitives to the GPU, this per-call overhead is multiplied and can become a real bottleneck. To demonstrate the effects of making lots of small drawing calls, I wrote a small benchmark and measured the results on my GeForce 8800 GTS 512 and on whatever the XBox 360 uses as its GPU.

Chart showing a drastic performance bottleneck for excessive DrawPrimitive() calls

Tag: performance

Nuclex Signal/Slot Library: Benchmarks

Efficiently Rendering Dynamic Vertices