Performance Problems - Not Where You'd Expect

The new rendering system in Woopsi is coming along nicely. In most areas it increases rendering speed and reduces flicker. In one area, though, it hasn’t been an improvement at all. Scrolling has become a problem.

In the old system, scrolling regions were extremely quick, mainly because most of the scrolling is done by just using the DMA to copy the existing pixels to a new location. In the new system, however, scrolling is sluggish and annoying. It seems to take more than one frame to perform a single scroll operation, which results in the display lagging behind the stylus and the system appearing to be, well, a bit rubbish.

I’ve already made one major optimisation. Instead of asking every gadget to draw every damaged rectangle until there are no rectangles left (which results in dozens of pointless, complex clipping operations), Woopsi uses the hierarchy of the gadgets to its advantage. If a parent gadget does not intersect a damaged rectangle, there is no way for its children to intersect it either. By not attempting to draw those children, we save a lot of time.
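The culling idea can be sketched roughly like this (the `Rect`, `Gadget` and `drawDamagedRect` names here are illustrative, not Woopsi's actual API):

```cpp
#include <vector>

// Minimal sketch of hierarchy-based damaged-rect culling.
struct Rect {
    int x, y, width, height;

    // True if this rect overlaps the other.
    bool intersects(const Rect& other) const {
        return x < other.x + other.width  && other.x < x + width &&
               y < other.y + other.height && other.y < y + height;
    }
};

struct Gadget {
    Rect bounds;
    std::vector<Gadget*> children;

    void drawDamagedRect(const Rect& damaged) {
        // Children are clipped to their parent's bounds, so if the
        // parent misses the damaged rect, no child can hit it.
        // Skip the whole subtree and save the clipping work.
        if (!bounds.intersects(damaged)) return;

        drawSelf(damaged);
        for (Gadget* child : children) {
            child->drawDamagedRect(damaged);
        }
    }

    void drawSelf(const Rect&) { /* clip and render this gadget */ }
};
```

The key observation is that the rejection test at the top of `drawDamagedRect` prunes an entire subtree with a single intersection check.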

However, this did not solve the scrolling problem. So what could it be? More to the point, how can it have happened, as the scrolling code hasn’t actually changed?

I had a few guesses. Perhaps the scrolling operation was running more than once; putting breakpoints into the Xcode version of Woopsi quickly killed that idea. Perhaps I needed to cache the contents of ScrollingPanels in some sort of off-screen bitmap - maybe all of the redrawing was the problem. But it couldn’t be that - there weren’t any problems before. Perhaps the list of newly-revealed rectangles needing a redraw, returned by the scrolling routine, was inaccurate. Testing it overturned that idea too.

So I turned to Xcode’s “Instruments”, something that I wish Visual Studio would copy. Running Woopsi through the CPU Sampler suggested that the only thing Woopsi was doing was creating Rect objects. Creating rectangles is making Woopsi’s scrolling rubbish? That makes no sense.

Except, perhaps it does.

All of the new rendering functions create WoopsiArrays of Rect objects. As the initial size of a WoopsiArray is 100, and as the arrays store objects rather than pointers to objects, each call to one of these methods constructs at least 100 Rect objects. Worse, these functions are recursive. There are probably tens of thousands of Rect objects being created every time the screen redraws.

I tried switching to arrays of pointers to Rect objects, but quickly got bogged down in recursive memory-allocation hell. Instead, I altered the WoopsiArray class so that it accepts an initial reserved size parameter. In most cases, none of the methods needs more than 4 rects in a list, so I’ve gone with that as the initial size. I’ve also changed the way the array class resizes itself: it now doubles in size each time it needs to allocate more space, instead of increasing by 100 elements. I believe this is the behaviour of the STL’s vector class.
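The two changes can be sketched with a hypothetical `SimpleArray` (not Woopsi's actual `WoopsiArray` code): a caller-supplied initial reservation, plus doubling growth. Copy semantics are omitted for brevity.

```cpp
#include <cstddef>

// Illustrative dynamic array: small initial reserve, doubling growth.
template <typename T>
class SimpleArray {
public:
    // Callers that rarely need more than a few elements (e.g. 4 rects)
    // can avoid a large up-front allocation. Pass a reserve of at least 1.
    explicit SimpleArray(size_t initialReserve = 4)
        : mData(new T[initialReserve]),
          mSize(0),
          mCapacity(initialReserve) {}

    ~SimpleArray() { delete[] mData; }

    void push_back(const T& value) {
        if (mSize == mCapacity) {
            // Double the capacity instead of adding a fixed increment,
            // giving amortised-constant append like std::vector.
            size_t newCapacity = mCapacity * 2;
            if (newCapacity == 0) newCapacity = 1;
            T* newData = new T[newCapacity];
            for (size_t i = 0; i < mSize; ++i) newData[i] = mData[i];
            delete[] mData;
            mData = newData;
            mCapacity = newCapacity;
        }
        mData[mSize++] = value;
    }

    size_t size() const { return mSize; }
    size_t capacity() const { return mCapacity; }
    T& operator[](size_t i) { return mData[i]; }

private:
    T* mData;
    size_t mSize;
    size_t mCapacity;
};
```

With a reserve of 4, pushing a fifth element triggers a single reallocation to capacity 8, rather than jumping straight to a 100-element buffer.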

Surprisingly enough, these very minor changes fixed the scrolling problems and boosted the speed of Woopsi everywhere else. I’m not sure I can remember it being this quick.

It’s a very old programmers’ cliché, but it’s true: performance problems aren’t always where you’d expect.


Lakedaemon on 2010-09-10 at 14:35 said:

Yes, resizing is usually done exponentially because :

if you have to grow something to size m by fixed increments, you get O(m) resize operations, which means O(m^2) element copies overall

if you double the size instead, you only do O(log(m)) resize operations, and the total number of element copies drops to O(m)
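The arithmetic behind this can be checked with a quick back-of-envelope sketch (hypothetical helper functions, not from Woopsi or xmlbox) that counts the element copies each growth strategy performs while appending m elements:

```cpp
#include <cstddef>

// Copies performed growing by a fixed increment (e.g. +100 each resize).
size_t copiesFixedIncrement(size_t m, size_t step) {
    size_t copies = 0, capacity = step;
    for (size_t n = 0; n < m; ++n) {
        if (n == capacity) {      // full: reallocate and copy n elements
            copies += n;
            capacity += step;
        }
    }
    return copies;
}

// Copies performed when capacity doubles on each resize.
size_t copiesDoubling(size_t m, size_t initial) {
    size_t copies = 0, capacity = initial;
    for (size_t n = 0; n < m; ++n) {
        if (n == capacity) {
            copies += n;
            capacity *= 2;
        }
    }
    return copies;
}
```

For m = 10 000 elements, fixed increments of 100 cost 495 000 copies, while doubling from 4 costs only 16 380 - the quadratic versus linear totals described above.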

Beware of malloc/free/new/delete too : a call to “new” takes around 1300 clock cycles on the Nintendo DS…

I’m considering rewriting the algorithms in xmlbox to remove most of the new/malloc calls. I think it would give drastic speed improvements, like 100x (xmlbox currently uses 3 000 000 clock cycles to parse CSS and 8 000 000 to display an XML document… )

Lakedaemon on 2010-09-15 at 21:34 said:

I was over-optimistic in that last post. A 100x speed-up… ahah… how ignorant of me. I know better now, after having spent a few days playing with memory allocators and trying to speed up CSS parsing on the NDS.

First : I tried parsing a 2 kb .css file, loaded into a string in memory, with different data structures. It takes between 350 000 clock cycles (badly designed data structures/methods) and 200 000 (better-designed, simpler, faster structures/methods) to do so…

which means that the 3 000 000 clock cycles figure I gave in the previous post was plain wrong.

Second : I played a bit with memory allocators. There are good speed gains to be had for small fixed-size structures, or with “scratch” memory (you write lots of data into it once, and later delete it all by freeing the whole memory slab). Yet malloc/new does a really good job of allocating variable-length memory chunks, so this is a trade-off between flexibility and speed (though you can have the best of both worlds). The gains are more like 3x for fixed-size data chunks, not 1000x, and new/malloc calls are much faster than 1300 CPU cycles (for new, it really depends on whether the constructor calls malloc a few more times).
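The fixed-size-structure case the comment describes is typically a free-list pool: allocation and deallocation become a pointer swap, which is where the gain over general-purpose malloc/new comes from. A minimal sketch, with a hypothetical `FixedPool` rather than xmlbox's actual allocator (callers would placement-new into the returned storage):

```cpp
#include <cstddef>

// Pool of N fixed-size slots threaded onto a free list.
template <typename T, size_t N>
class FixedPool {
public:
    FixedPool() : mFreeList(nullptr) {
        // Thread every slot onto the free list up front.
        for (size_t i = 0; i < N; ++i) {
            mStorage[i].next = mFreeList;
            mFreeList = &mStorage[i];
        }
    }

    // O(1): pop the head of the free list.
    T* allocate() {
        if (!mFreeList) return nullptr;   // pool exhausted
        Slot* slot = mFreeList;
        mFreeList = slot->next;
        return reinterpret_cast<T*>(slot);
    }

    // O(1): push the slot back onto the free list.
    void deallocate(T* p) {
        Slot* slot = reinterpret_cast<Slot*>(p);
        slot->next = mFreeList;
        mFreeList = slot;
    }

private:
    union Slot {
        Slot* next;                            // link while the slot is free
        alignas(T) unsigned char data[sizeof(T)];  // storage while in use
    };
    Slot mStorage[N];
    Slot* mFreeList;
};
```

The LIFO free list also means a just-freed slot is the next one handed out, which tends to be friendly to the cache.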

Lakedaemon on 2010-09-15 at 21:38 said:

Oh, and you were definitely right : performance bottlenecks are sometimes not where you’d expect (funny things, cache lines…).