Performance Problems - Not Where You'd Expect

The new rendering system in Woopsi is coming along nicely. In most areas it increases rendering speed and reduces flicker. In one area, though, it hasn’t been an improvement at all. Scrolling has become a problem.

In the old system, scrolling regions were extremely quick, mainly because most of the scrolling is done by just using the DMA to copy the existing pixels to a new location. In the new system, however, scrolling is sluggish and annoying. It seems to take more than one frame to perform a single scroll operation, which results in the display lagging behind the stylus and the system appearing to be, well, a bit rubbish.

I’ve already made one major optimisation. Instead of asking every gadget to draw every damaged rectangle until there are no rectangles left (which results in dozens of pointless, complex clipping operations), Woopsi uses the gadget hierarchy to its advantage. If a parent gadget does not intersect a damaged rectangle, there is no way for its children to intersect it either. By skipping those children entirely, we save a lot of time.

However, this did not solve the scrolling problem. So what could it be? More to the point, how can it have happened, as the scrolling code hasn’t actually changed?

I had a few guesses. Perhaps the scrolling operation was running more than once. Putting breakpoints into the Xcode version of Woopsi quickly killed that idea. Perhaps I needed to cache the contents of ScrollingPanels in some sort of off-screen bitmap - maybe all of the redrawing was the problem. But it couldn’t be that - there weren’t any problems before. Perhaps the list of newly-revealed rectangles that need to be redrawn, returned by the scrolling routine, was inaccurate. Testing it overturned that idea.

So I turned to Xcode’s “Instruments”, something that I wish Visual Studio would copy. Running Woopsi through the CPU Sampler suggested that the only thing Woopsi was doing was creating Rect objects. Creating rectangles is making Woopsi’s scrolling rubbish? That makes no sense.

Except, perhaps it does.

All of the new rendering functions create WoopsiArrays of Rect objects. As the initial size of a WoopsiArray is 100, and as they store objects, not pointers to objects, each call to one of these methods creates hundreds of Rect objects. Worse, these functions are recursive. There are probably tens of thousands of Rect objects being created every time the screen redraws.

I tried switching to arrays of pointers to Rect objects, but quickly got bogged down in recursive memory allocation hell. Instead, I altered the WoopsiArray class so that it accepts an initial reserved size parameter. In most cases, none of the methods needs more than 4 rects in a list, so I’ve gone with that as the initial size. I’ve also changed the way that the array class resizes itself; it now doubles in size every time it needs to allocate more space instead of increasing by 100 elements. I believe this is the behaviour of the STL’s vector class.

Surprisingly enough, these very minor changes fixed the scrolling problems and boosted the speed of Woopsi everywhere else. I’m not sure I can remember it being this quick.

It’s a very old programmer’s cliche, but it’s true. Performance problems aren’t always where you’d expect.


A Better UTF-8 String

The WoopsiString Class - Optimised for ASCII

The current WoopsiString class stores its data in the usual way: an array of chars. When the string is initially set, the object creates an array large enough to hold the data it should store. It then copies the data into that new array.

When appending to the string, the object checks to see if it contains enough free space within the existing array to contain the new data. If so, the data is just copied into the free space. If not, a new array is allocated large enough to contain both the existing data and the new data. The existing data is copied into it, and the new data is copied to the end.

Insertion works in a similar way, except that two portions of the existing data will be copied separately into disjoint sections of the new array if the insertion point falls within the existing string.

In contrast, removal is very simple. A new terminator character is placed at the new endpoint of the string, and everything following it is deemed to be free space, ready for future insert/append operations.

Although it involves a lot of copying when inserting and appending, this system works exceptionally well for the two most frequent string operations - random access to a character within a string, and iterating sequentially over the characters in a string. Keeping all of the characters in a sequential block of RAM makes these two operations trivial.

Reducing Data Copying Via Linked Lists

When I was writing the WoopsiString class, however, the amount of copying annoyed me. I came up with a solution that performed a lot less copying by breaking the string data up into a linked list of string fragments. For example, imagine that we have this string:

“hello”

Drawn with pipe characters as byte boundaries, it looks like this:

| h | e | l | l | o | \0 |

Suppose we want to append the text “ world!” onto that string. The current WoopsiString class would do this:

  1. Allocate a new char array of length 13.
  2. Copy “hello” from existing array into new array.
  3. Copy “ world!” from second array into new array.
  4. Append terminator to end of new array.
  5. Delete existing array.
  6. Update array pointer to point to new array.

We end up with this arrangement:

| h | e | l | l | o |   | w | o | r | l | d | ! | \0 |

The proposed new WoopsiString class would avoid all of the tedious copying by appending the “ world!” text as a new string fragment:

  1. Allocate a new linked list item containing a char array of length 7.
  2. Copy “ world!” into the new array.
  3. Point the “next” pointer of the “hello” fragment to the new “ world!” fragment.

The data structure we end up with looks like this (“-->” indicates a pointer):

| h | e | l | l | o | --> |   | w | o | r | l | d | ! |

This new system has clear advantages over the current class, at least when it comes to appending. Very little copying needs to be performed.

Inserting has similar benefits. If we want to insert a comma after the word “hello” we would need to perform the following operations in the current class:

  1. Allocate a new char array of length 14.
  2. Copy “hello” from existing array into new array.
  3. Copy “,” from second array into new array after the word “hello”.
  4. Copy “ world!” from existing array into new array.
  5. Append terminator to end of new array.
  6. Delete existing array.
  7. Update array pointer to point to new array.

We end up with this:

| h | e | l | l | o | , |   | w | o | r | l | d | ! | \0 |

Again, lots of copying. The new class would do this instead:

  1. Allocate a new linked list item containing a char array of length 1.
  2. Copy “,” into the new array.
  3. Point the “next” pointer of the “,” fragment to the “ world!” fragment.
  4. Point the “next” pointer of the “hello” fragment to the “,” fragment.

We get this:

| h | e | l | l | o | --> | , | --> |   | w | o | r | l | d | ! |

No copying involved. However, we are cheating a little here; we’re taking advantage of the fact that we’re inserting between two existing fragments. What happens if we try to insert “llo he” into our “hello, world!” string just after the second character?

  1. Allocate a new linked list item containing a char array of length 2.
  2. Copy “he” from the “hello” fragment into the new array.
  3. Allocate a new linked list item containing a char array of length 6.
  4. Copy “llo he” from the second array into the new array.
  5. Allocate a new linked list item containing a char array of length 3.
  6. Copy “llo” from the “hello” fragment into the new array.
  7. Point the “next” pointer of the “he” fragment to the “llo he” fragment.
  8. Point the “next” pointer of the “llo he” fragment to the “llo” fragment.
  9. Point the “next” pointer of the “llo” fragment to the “,” fragment.
  10. Delete the “hello” fragment.

The final data structure looks like this:

| h | e | --> | l | l | o |   | h | e | --> | l | l | o | --> | , | -->
|   | w | o | r | l | d | ! |

(We could optimise that slightly by making use of the existing “hello” fragment instead of deleting it, which would ultimately give us “hello”-->“ hello”-->“,”-->“ world!” instead. It makes the algorithm slightly more complex but slightly more efficient.)

The existing class would need to make 1 memory allocation, copy 19 characters and perform 1 memory free to be able to perform this operation. The new class would make 3 memory allocations, copy 11 characters and perform 1 memory free to do the same thing. Though this doesn’t look like much of an improvement, imagine that the string we’re inserting into is 5,000 characters long, and has been created by concatenating five 1,000 character strings together. The existing class would copy all 5,000 characters during the insert. The proposed new class would copy just 1,000, as it could work with an individual fragment instead of the entire string.

Problems with Random Access and Iteration

If this is such a great improvement, then, why didn’t I implement it? The answer lies right at the top of this post. The two most frequent operations performed on strings are random access to individual characters and iterating sequentially over their contents. If a string is stored non-contiguously, these two operations become much more complex. In a sequential string, extracting a character can be done in O(1) time, and is as simple as getting a pointer to the start of the string and adding an offset. Iterating over the string is simply a matter of adding 1 to the pointer.

In the proposed class, extracting a character is far more laborious. Assuming each fragment includes an int that identifies its length, the algorithm for finding a character looks like this:

Store a pointer to the head of the list in "listPointer"
Store 0 in "fragmentStart"
While the desired index is greater than or equal to "fragmentStart" plus
the length of the "listPointer" fragment
    Add the length of the "listPointer" fragment to "fragmentStart"
    Point "listPointer" at the next fragment in the list

Retrieve the character from the "listPointer" char array at the desired
index minus "fragmentStart"

The worst-case scenario, in which the desired index is the last character in a string that has been built by appending single characters, will result in a retrieval time of O(n).

For ASCII strings, this is disastrous.

When dealing with UTF-8 strings, however, this approach starts to look far more appealing. Retrieving an arbitrary character from a contiguous UTF-8 string already takes O(n) time, because there is no way of knowing where a character is stored without iterating over the string from the start.

In Defense of UTF-8

To explain this, here’s a quick description of the problem that UTF-8 tries to solve. In ASCII text, each character is represented by a single byte. The most significant bit is not used, so each byte can represent one of 128 different characters. This is great for the English-speaking world, as ASCII text can represent all characters from the standard English alphabet, all numerals, and most of the important symbols (though us Brits are screwed when trying to talk about money, as the symbol for the British pound is not in the ASCII set).

The rest of the world is not so lucky. The ASCII set does not include accent marks, so the French are stuck. It doesn’t include umlauts or the Eszett (double ’s’) so the Germans are stuck. And it definitely doesn’t include the tens of thousands of characters in any of the multiple Chinese languages, so they’re really stuck.

The solution is obviously to make characters longer than a single byte. Suppose we make all characters 32 bits wide, for example. Now we have another problem - all of our existing applications, which assume characters are 8 bits wide, no longer work. Plus we’ve quadrupled the size of all strings in all future applications. If those applications are for American users, three-quarters of that is wasted space.

UTF-8 solves this by using multiple bytes for characters if necessary. It will represent ASCII characters as single bytes, but other characters will be represented as either 2, 3 or 4-byte sequences.

The problem with variable-length encodings is that it is impossible to know where a particular character is. For example, suppose we decide that the letter “w” should be represented by 4 bytes. “Hello world” now looks like this:

| H | e | l | l | o |   | w | w | w | w | o | r | l | d | \0 |

In order to find the first “r” in the string (character 8, counting from zero) we need to retrieve byte 11. Since all of the characters in the string could be multibyte characters, the only way we can possibly guarantee that we retrieve the correct character is by starting at “H” and working forwards, a character at a time.

Optimising UTF-8 with Linked Lists

This is where the new WoopsiString idea becomes extraordinarily helpful. Suppose that the string above has been created by concatenating the strings “Hello ” and “world”:

| H | e | l | l | o |   | --> | w | w | w | w | o | r | l | d |

Locating the “r” in the string is now considerably faster. We can bypass the entire “Hello ” portion of the string simply by comparing the index we’re looking for (8) with the length of the “Hello ” fragment (6), which tells us that the character we’re looking for isn’t contained in that fragment. For ASCII strings this lookup scheme was a horrible performance decrease, but for UTF-8 strings it represents a considerable performance increase.

If strings were automatically divided into fragments (32 characters - not bytes - wide, for example), and if the fragments included “previous” as well as “next” pointers (i.e. a doubly-linked list), random access performance could be improved even further.

Iterating with Linked Lists - The StringIterator Class

As always, there is a penalty to pay for this performance increase. Iterating over a UTF-8 string is more complex than iterating over an ASCII string, but not much more complex. Once the starting character has been located, iterating over the string is simply a question of consuming bytes until a valid codepoint (the “value” of the character, sans the UTF-8 multibyte markers) has been identified, returning the codepoint, moving to the next byte and starting the process again.

If linked lists are introduced into the mix, finding the starting character becomes much faster, but iterating over the string from that point (assuming the string section being iterated over crosses a fragment boundary) becomes more complicated.

At the moment, Woopsi handles iteration over strings by extracting a char pointer from the WoopsiString object, then iterating over it by adding the size of each character to the pointer (the same algorithm as outlined above). If linked lists were introduced, this would not be possible. In fact, the only practical and sturdy solution I can come up with is a separate StringIterator class which would handle all string iteration. You’d need to give it a starting character to locate, then use its “MoveNext()” and “MovePrevious()” methods to iterate over the string.


In summary - if it’s possible to sum up these disparate thoughts - a string class built on linked lists would require far fewer copy operations than the existing WoopsiString class for standard string operations. Unfortunately, for ASCII text it would be counterproductive to build one. Conversely, the same design presents a significant performance improvement for UTF-8 text, at the cost of making iteration more complex and time consuming.


Unicode News

Unicode support in Woopsi is coming along nicely. Most of the gadgets now expect to receive WoopsiString objects instead of chars or char arrays when dealing with text. As the WoopsiString now works with UTF-8, that means most of the gadgets now also support UTF-8.

I’ve overloaded the WoopsiString’s “=” operator for WoopsiString objects, char arrays and u32s, which means that any parameter that should be a WoopsiString can accept char arrays or u32s instead. Converting from string literals to WoopsiStrings is therefore seamless and automatic.

The glyphs are now stored in a separate “GlyphFont” font. The default glyph font is stored in the defaultGadgetStyle global variable, in the same way that the default system font is stored. Converting this unearthed a couple of bugs in the Bmp2Font program - it doesn’t write out the resultant font constructor properly (it misses out the transparent colour and uses the raw bitmap data instead of the bitmap wrapper object). I still need to fix those.

The most problematic changes were convincing buttons that displayed glyphs to centre their text properly (turns out that they were centring based on the wrong font) and getting the MultiLineTextBox to redraw at a reasonable speed. After some cunning hacking about with the WoopsiString and the MultiLineTextBox I’ve got it back up to something approaching its old speed.

Regarding speed, Unicode definitely has a negative impact. Having to scan through an entire string to locate a codepoint is a horrible performance killer. WoopsiString::getCharAt() is definitely not a function you should use if you’re reading out sequential characters, as it looks for the codepoint from the start of the string each time it is called. It’s much more advisable to use getToken() to get a pointer to the first token, read out the codepoint, then increase the pointer by the size of each character as it is read. Before I went back to optimise things, the MultiLineTextBox was using getCharAt() when drawing and so was the PackedFontBase, meaning the string was being scanned through twice for each character being printed. Xcode’s “Shark” profiling utility noted that the MultiLineTextBox was using 31% of the total CPU time when printing a line of text solely because of all the UTF-8 parsing. Yikes.

There are still a lot of gadgets to convert over, and probably a hefty set of bugfixes and optimisations to make.


SDL and Woopsi2x Mark 2

Continuing the bitmap changes released in Woopsi 0.40, I’ve altered the bitmap and graphics classes so that all interaction with a bitmap’s data must occur through the bitmap class’ accessor methods. This has enabled me to change the FrameBuffer class so that it includes the SDL code that was previously floating around in the woopsifuncs files.

Instead of Woopsi mimicking the DS’ framebuffer with an intermediate array of u16s, the FrameBuffer class accesses the SDL surface directly. The FrameBuffer class still includes a u16 array, but that is used as a way of avoiding lengthy 24-bit to 16-bit conversions every time a pixel is read or written. Woopsi therefore maintains a 24-bit SDL surface and a 16-bit array and keeps them in sync by making changes to both whenever one is altered. This has resulted in a massive speed increase. Whereas the entire framebuffer simulator array was previously written to the SDL surface every VBL, only changed pixels are now written.

As a test of the new system I’ve compiled it in Xcode (no noticeable change) and Code::Blocks for Windows. In the latter Woopsi is now far too fast to be usable; I probably need to put a call to SDL_Delay in there somewhere.

I’ve also compiled it for the GP2x F-200. I ported it a while ago but gave up because the GP2x’s touch screen is utterly atrocious. At the time Woopsi was too slow to be usable due to the extra copying the SDL version was performing. Now it runs at the same speed as the DS version.


Here’s a pre-compiled binary and the source code for Woopsi2x. Note that this is simply a test; I have no intention of maintaining a GP2x port because of the hopeless touchscreen. I’m hoping to make a Pandora port when the hardware is released.

Try it for yourself to find out just how bad the GP2x touch screen is!


Scrolling Changes

Building on the GraphicsPort::copy() method reported last time, I’ve added a new GraphicsPort::scroll() method. This takes in x and y co-ordinates as the location of the region to scroll, in addition to the width and height of the region to be scrolled. This information represents a rectangle that will be shifted through a specified distance both vertically and horizontally, which are also parameters to the function.

The method uses the copy() method to do the actual pixel moving, but does some crazy clipping calculations in order to limit the region being copied to within the bounds of the visible regions of the relevant gadget. To be specific, it uses the copy() routine where possible, but there are a number of situations where no data is actually available to copy (ie. destination falls outside the bounds of the clipping region). To cater for this, the function expects to be passed a pointer to an empty WoopsiArray of rects. Where the copy cannot be performed, a rect representing the region to be refreshed is appended to the array. These can be redrawn (using a loop around the draw(Rect) method) once the scroll function exits.

In addition to this, the function adds all revealed regions to the array too. So, if you scroll a gadget horizontally by 10 pixels, the array will contain a rect 10 pixels wide representing the newly-visible area that needs to be redrawn.

The scroll() method rendered much of the code in the ScrollingPanel redundant. In fact, it reduced it from ~500 lines of code down to ~200, and removed another set of functions that dealt directly with the framebuffer and recalculated visible regions. Unfortunately, it doesn’t appear to have made any noticeable difference to the speed of the ScrollingPanel’s dragging function - it’s still flickery. The only way around that would appear to be double-buffering the display, which is counter-productive given the way I’m using the DMA hardware to copy pixels around.


More Optimisations and Framebuffer Copying

Following on from yesterday’s post, I have renamed the two types of clipping rect to “foreground” and “background”. It should be possible to optimise it further, and have the cache selectively invalidate either set of regions depending on what exactly has invalidated the cache.

The Gadget::clipRectToHierarchy() function, which clips drawing regions to a gadget’s ancestors, is now non-recursive, which should give a minor speed boost.

A feature that Jeff asked for nearly 18 months ago was a ScrollRect() function in the graphics port that would shift a region of the framebuffer by n pixels. I’ve now got part of this implemented - the GraphicsPort has a “copy()” method that will copy a rectangular region of the framebuffer from one location to another. It uses all of the scrolling tricks I worked out when writing the ScrollingPanel class, such as an off-screen buffer for horizontal copies (wherein the destination overlaps the source), and copying top-to-bottom or bottom-to-top depending on the vertical difference between source and destination. I still need to add in clipping methods to enable this to be used as a scrolling function.

The existing copy() method has allowed me to trim the Screen::drag() method. That was previously recalculating its foreground visible rects (for some inexplicable reason, as they’ve been cached for ages now) and had implemented its own copying routine that worked directly with the framebuffer. I’m trying to eliminate that from Woopsi to make it more portable. Anyway, the upshot of all this is that screen dragging is now faster, and the code is tidier. Hurrah!


Woopsi Optimisations and Fixes

I’ve made a number of fixes to Woopsi today. First of all, sliders and scrollbars. The slider grip now automatically resizes itself, meaning there’s no need to call resizeGrip() when setting the max/min/value properties of a slider/scrollbar. The resizeGrip() method is now private. Still on the topic of scrollbars, the grip in the ScrollingTextbox’s scrollbar now adjusts to the correct position when the textbox is first initialised. It was previously placed at the top of the scrollbar regardless of the position of the text.

Next, some fixes related to the event refactoring. The Alert, Requester and WoopsiKeyboard gadgets now handle events from their decorations correctly. Previously, the XOR dragging rect wasn’t being drawn correctly when they were first clicked.

Lastly, I’ve made some improvements to the visible region caching system. There are two types of rectangles used in the system. The first, which was already being cached, represents the entire visible surface of a gadget, ignoring any children it might have. The closest Java analogy would be Swing’s “glass pane”. Drawing to these rects draws over the top of the entire gadget, even over the top of any children the gadget might have. The second type of rects represent the visible portions of a gadget once its children have been drawn; they can be thought of as background rects, behind everything else in the gadget. Woopsi only draws the background rects; everything else is left to children to draw (and so on, recursively, until the screen is filled).

Whilst looking at the Gadget::redraw() method, I noticed that the background rects were being recalculated every time redraw() was called. I’ve now changed this so that both foreground and background rects are cached, which should provide a small speed boost. Note that, as I type this, I’ve realised that calling the rects “foreground” and “background” is a far better convention than their current names, so I’ll probably rename them soon.

Still on the subject of caching, I’ve moved the caching code and the rect splitting code into a new class, “RectCache”. The Gadget class is overloaded with functionality at the moment, so I want to extract some of it out to make it easier to work with. As a byproduct, I’ve made the “removeOverlappedRects()” method non-recursive, which should make that a little faster, too.

A major reason for trying to move this code into a separate class is to try and either optimise it (by making the rect splitting smarter) or replace it with the BSP tree code I came up with - oops - a year ago. I’ve fixed a raft of bugs in that code, but I’m still struggling to work out how exactly I can integrate it into Woopsi, or if I even need to. On a system with a little more CPU oomph and no DMA hardware (GP2x or Wiz, for example), it makes sense to replace the current system with the BSP code. It’s tidier, more efficient, and simpler. On the DS, however, I keep encountering optimisations that I’ve made in Woopsi that can’t be replicated using the BSP method and that offer significant speed gains. I might put up a longer post specifically on this topic later.


Iterators and Bugfixes

Something I’ve been meaning to do for a while is replace any non-int iterator variables with ints. This was mentioned nearly a year ago by a chap called cearn, whose website I’ve been browsing through lately. In addition to his advice, I’ve spotted the same information in the documentation for the official ARM C compiler. That book was written for the ARM4, so it’s probably ancient now and not entirely appropriate to GCC, but it does reiterate the same information:

Use ints.

Variables that aren’t 32 bits wide need to be masked out to be read correctly, and in the case of the ARM compiler, using non-int typed variables as loop iterators could result in massive amounts of extra assembly being produced.

I’ve gone through all of the iterators in the main Woopsi codebase (still need to do the bonus folder) and swapped all of the iterators to u32s (or s32s, as appropriate). As a side effect, this should fix any remaining problems that could arise from complex layouts (too many child gadgets) or complex display splits (too many rects for the u8 to represent).

Secondly, I’ve fixed a bug reported by Azaydius on the forum. The ScrollingListBox wasn’t passing calls to setAllowMultipleSelections() to its ListData object.