2018-03-19

I spent a couple of weeks working on improving the performance of SZLib. Specifically, I wanted to eliminate two of the major bottlenecks in the system: recording areas to be redrawn; and determining which layers within the hierarchy will draw those areas.

The algorithm for recording damaged areas looks like this:

• Add the rect to the damaged rect manager:
• For each existing damaged rect:
• Remove the intersection from the incoming rect.
• Add the remaining areas of the damaged rect to the damaged rect array.

The algorithm for determining the layer responsible for redrawing a rect looks like this:

• For each rect in the damaged rect manager:
• For each layer in the layer hierarchy:
• Add the layer/rect intersection tuple to an array.
• Remove the intersection from the damaged rect manager.

There are obvious issues in both of these algorithms. In the first we have to compare every new damaged rect with every existing damaged rect. In the second we have to compare every damaged rect with every layer, which gives us a worst-case complexity of O(n^2).

My solution to both problems was to create a simple quad tree. I’d tried a complex quad tree before but it failed miserably. This time I:

• Stored only rectangles, not rectangles and the layers that they represented;
• Created the entire fixed-depth quad tree immediately, rather than trying dynamically to split and join nodes;
• Reset and re-used quad tree instances instead of creating new trees each time I needed one;
• Treated the quad tree as an immutable structure once it had been built by eliminating methods that would remove rectangles.

The time complexity of the two algorithms above is now the worst-case rather than every case. This change, coupled with some other minor optimizations, produced some great results.

I put together a test app that has 80 overlapping Hanky Alien bitmaps moving around the screen with a few nested text boxes. Numbers from my Mac suggest that the test app is around 40% faster now.

Some more numbers:

• Ignoring time spent in SDL and during vertical blanks, the test takes up less than half a second of CPU time for every 10 seconds of runtime on my Mac.
• Rendering time for every 10 seconds of runtime is about 30ms on my Mac.
• The PSP can handle moving 68 aliens at a solid 60fps.
• The DS can handle moving 13 aliens at a solid 60fps.