2010-01-12

Unicode Complete

Woopsi is now 100% unicode. Except for anything in the “bonus” folder, which I’ll get around to at some point.

I’ve tidied up the WoopsiString, removing redundant methods and sorting out which should be public and which shouldn’t. To cater for iteration, rather than trying to redesign the string, I’ve added a new StringIterator class. The WoopsiString can produce instances of these (in the same way that Gadgets produce instances of the GraphicsPort class) that can iterate over the string. It has some cunning optimisations so that it will always choose the optimal way of moving to a new character index.

On top of that, I’ve moved the FileRequester and associated classes into the main codebase. Lakedaemon pointed out that, unless you use any libfat features, it isn’t included in the final ROM. A quick test revealed that this was indeed the case, so there’s really no reason not to enable libfat in the makefile by default and include the FileRequester in the main library.

I’ve fixed a few problems in the FileRequester. Its background was transparent in places, which was caused by one of the gadgets in its children not having a draw() method. The sorting was broken - I thought I’d fixed this - so although the items were sorted alphabetically, directories were intermingled with files. Lastly, it doesn’t try to draw before the rest of Woopsi has been drawn, a problem which caused some graphical corruption.

The Bmp2Font .NET utility for converting BMP files to Woopsi fonts now creates working Font subclasses. I’ve removed four fonts from Woopsi as they didn’t seem to be converted properly by Jeff’s original process - they either had glyphs with missing sections or the vertical positioning of some of the glyphs was off.

2010-01-07

A Better UTF-8 String

The WoopsiString Class - Optimised for ASCII

The current WoopsiString class stores its data in the usual way: an array of chars. When the string is initially set, the object creates an array large enough to hold the data it should store. It then copies the data into that new array.

When appending to the string, the object checks to see if it contains enough free space within the existing array to contain the new data. If so, the data is just copied into the free space. If not, a new array is allocated large enough to contain both the existing data and the new data. The existing data is copied into it, and the new data is copied to the end.

Insertion works in a similar way, except two portions of the existing data will be copied separately into disjointed sections of the new array if the insertion point falls within the existing string.

In contrast, removal is very simple. A new terminator character is placed at the new endpoint of the string, and everything following it is deemed to be free space, ready for future insert/append operations.

Although it involves a lot of copying when inserting and appending, this system works exceptionally well for the two most frequent string operations - random access to a character within a string, and iterating sequentially over the characters in a string. Keeping all of the characters in a sequential block of RAM makes these two operations trivial.

Reducing Data Copying Via Linked Lists

When I was writing the WoopsiString class, however, the amount of copying annoyed me. I came up with a solution that performed a lot less copying by breaking the string data up into a linked list of string fragments. For example, imagine that we have this string:

hello

Drawn with pipe characters as byte boundaries, it looks like this:

| h | e | l | l | o | \0 |

Suppose we want to append the text “ world!” onto that string. The current WoopsiString class would do this:

  1. Allocate a new char array of length 13.
  2. Copy “hello” from existing array into new array.
  3. Copy “ world!” from second array into new array.
  4. Append terminator to end of new array.
  5. Delete existing array.
  6. Update array pointer to point to new array.

We end up with this arrangement:

| h | e | l | l | o |   | w | o | r | l | d | ! | \0 |

The proposed new WoopsiString class would avoid all of the tedious copying by appending the “ world!” text as a new string fragment:

  1. Allocate a new linked list item containing a char array of length 7.
  2. Copy “ world!” into the new array.
  3. Point the “next” pointer of the “hello” fragment to the new “ world!” fragment.

The data structure we end up with looks like this (“-->” indicates a pointer):

| h | e | l | l | o | --> |   | w | o | r | l | d | ! |"

This new system has clear advantages over the current class, at least when it comes to appending. Very little copying needs to be performed.

Inserting has similar benefits. If we want to insert a comma after the word “hello” we would need to perform the following operations in the current class:

  1. Allocate a new char array of length 14.
  2. Copy “hello” from existing array into new array.
  3. Copy “,” from second array into new array after the word “hello”.
  4. Copy “ world!” from existing array into new array.
  5. Append terminator to end of new array.
  6. Delete existing array.
  7. Update array pointer to point to new array.

We end up with this:

| h | e | l | l | o | , |   | w | o | r | l | d | ! | \0 |

Again, lots of copying. The new class would do this instead:

  1. Allocate a new linked list item containing a char array of length 1.
  2. Copy “,” into the new array.
  3. Point the “next” pointer of the “,” fragment to the “ world!” fragment.
  4. Point the “next” pointer of the “hello” fragment to the “,” fragment.

We get this:

| h | e | l | l | o | --> | , | --> |   | w | o | r | l | d | ! |"

No copying involved. However, we are cheating a little here; we’re taking advantage of the fact that we’re inserting between two existing fragments. What happens if we try to insert “llo he” into our “hello, world!” string just after the second character?

  1. Allocate a new linked list item containing a char array of length 2.
  2. Copy “he” from the “hello” fragment into the new array.
  3. Allocate a new linked list item containing a char array of length 6.
  4. Copy “llo he” from the second array into the new array.
  5. Allocate a new linked list item containing a char array of length 3.
  6. Copy “llo” from the “hello” fragment into the new array.
  7. Point the “next” pointer of the “he” fragment to the “llo he” fragment.
  8. Point the “next” pointer of the “llo he” fragment to the “llo” fragment.
  9. Point the “next” pointer of the “llo” fragment to the “,” fragment.
  10. Delete the “hello” fragment.

The final data structure looks like this:

| h | e | -- > | l | l | o |   | h | e | --> l | l | o | --> | , | -->
|   | w | o | r | l | d | ! |"

(We could optimise that slightly by making use of the existing “hello” fragment instead of deleting it, which would ultimately give us “hello”-->” hello”->“,”-->” world!” instead. It makes the algorithm slightly more complex but slightly more efficient.)

The existing class would need to make 1 memory allocation, copy 19 characters and perform 1 memory free to be able to perform this operation. The new class would make 3 memory allocations, copy 11 characters and perform 1 memory free to do the same thing. Though this doesn’t look like much of an improvement, imagine that the string we’re inserting into is 5,000 characters long, and has been created by concatenating five 1,000 character strings together. The existing class would copy all 5,000 characters during the insert. The proposed new class would copy just 1,000, as it could work with an individual fragment instead of the entire string.

Problems with Random Access and Iteration

If this is such a great improvement, then, why didn’t I implement it? The answer lies right at the top of this post. The two most frequent operations performed on strings are extracting substrings and iterating over their characters. If a string is stored non-contiguously, these two operations become much more complex. In a sequential string, extracting a character from a string can be done in O(1) time, and is as simple as getting a pointer to the start of the string and adding an offset. Iterating over the string is simply a matter of adding 1 to the pointer.

In the proposed class, extracting a character is far more laborious. Assuming each fragment includes an int that identifies its length, the algorithm for finding a character looks like this:

Store a pointer to the head of the list in "listPointer"
Store its length in a temporary variable "totalLength"
While the desired index is greater "totalLength"
    Point "listPointer" to its own "next" pointer
    Add the length of the "listPointer" fragment to "totalLength"

Retrieve the character from the "listPointer" char array at desired index
minus "totalLength"

The worst-case scenario, in which the desired index is the last character in a string that has been built by appending single characters, will result in a retrieval time of O(n).

For ASCII strings, this is disastrous.

When dealing with UTF-8 strings, however, this approach starts to look far more appealing. In the best-case scenario for retrieving a character from a UTF-8 string the time is O(n), because there is no way of knowing where a character is stored without iterating over the entire string.

In Defense of UTF-8

To explain this, here’s a quick description of the problem that UTF-8 tries to solve. In ASCII text, each character is represented by a single byte. The most significant bit is not used, so each byte can represent one of 128 different characters. This is great for the English-speaking world, as ASCII text can represent all characters from the standard English alphabet, all numerals, and most of the important symbols (though us Brits are screwed when trying to talk about money, as the symbol for the British pound is not in the ASCII set).

The rest of the world is not so lucky. The ASCII set does not include accent marks, so the French are stuck. It doesn’t include umlauts or the Eszett (double ’s’) so the Germans are stuck. And it definitely doesn’t include the tens of thousands of characters in any of the multiple Chinese languages, so they’re really stuck.

The solution is obviously to make characters longer than a single byte. Suppose we make all characters 32-bits wide, for example. Now we have another problem - all of our existing applications, that assume characters are 8-bits wide, no longer work. Plus we’ve quadrupled the size of all strings in all future applications. If those applications are for American users, ¾ths of that is wasted space.

UTF-8 solves this by using multiple bytes for characters if necessary. It will represent ASCII characters as single bytes, but other characters will be represented as either 2, 3 or 4-byte sequences.

The problem with variable-length encodings is that it is impossible to know where a particular character is. For example, suppose we decide that the letter “w” should be represented by 4 bytes. “Hello world” now looks like this:

| H | e | l | l | o |   | w | w | w | w | o | r | l | d | \0 |

In order to find the first “r” in the string (character 8 ) we need to retrieve byte 11. Since all of the characters in the string could be multibyte characters, the only way we can possibly guarantee that we retrieve the correct character is by starting at “H” and working forwards, a character at a time.

Optimising UTF-8 with Linked Lists

This is where the new WoopsiString idea becomes extraordinarily helpful. Suppose that the string above has been created by concatenating the strings “Hello ” and “world”:

| H | e | l | l | o |   | --> | w | w | w | w | o | r | l | d |

Locating the “r” in the string is now considerably faster. We can bypass the entire “Hello” portion of the string simply by comparing the index we’re looking for (8) with the length of the “Hello” fragment (5), which tells us that the character we’re looking for isn’t contained in that fragment. For ASCII strings, this was a horrible performance decrease, but for UTF-8 strings represents a considerable performance increase.

If strings were automatically divided into fragments (32 characters - not bytes - wide, for example), and if the fragments included “previous” as well as “next” pointers (ie. a doubly-linked list), random access performance can be improved even further.

Iterating with Linked Lists - The StringIterator Class

As always, there is a penalty to pay for this performance increase. Iterating over a UTF-8 string is more complex than iterating over an ASCII string, but not much more complex. Once the starting character has been located, iterating over the string is simply a question of consuming bytes until a valid codepoint (the “value” of the character, sans the UTF-8 multibyte markers) has been identified, returning the codepoint, moving to the next byte and starting the process again.

If linked lists are introduced into the mix, finding the starting character becomes much faster, but iterating over the string from that point (assuming the string section being iterated over crosses a fragment boundary) becomes more complicated.

At the moment, Woopsi handles iteration over strings by extracting a char pointer from the WoopsiString object, then iterates over it by adding the size of each character to the pointer (the same algorithm as outlined above). If linked lists were introduced, this would not be possible. In fact, the only practical and sturdy solution I can some up with is to have a separate StringIterator class which would handle all string iteration. You’d need to give it a starting character to locate, then use its “MoveNext()” and “MovePrevious()” methods to iterate over the string.

Summary

In summary, if it’s possible to sum up these disparate thoughts, it’s possible to create a string class that uses linked lists in order to require far fewer copy operations than the existing WoopsiString class for standard string functions. Unfortunately, for ASCII text, it would be counterproductive to do so. Conversely, the same design presents a significant performance improvement for UTF-8 text, at the cost of making iteration more complex and time consuming.

2010-01-05

Unicode News

Unicode support in Woopsi is coming along nicely. Most of the gadgets now expect to receive WoopsiString objects instead of chars or char arrays when dealing with text. As the WoopsiString now works with UTF-8, that means most of the gadgets now also support UTF-8.

I’ve overloaded the WoopsiString’s “=” operator for WoopsiString objects, char arrays and u32s, which means that any parameter that should be a WoopsiString can accept char arrays or u32s instead. Converting from string literals to WoopsiStrings is therefore seamless and automatic.

The glyphs are now stored in a separate “GlyphFont” font. The default glyph font is stored in the defaultGadgetStyle global variable, in the same way that the default system font is stored. Converting this unearthed a couple of bugs in the Bmp2Font program - it doesn’t write out the resultant font constuctor properly (it misses out the transparent colour and uses the raw bitmap data instead of the bitmap wrapper object). I still need to fix those.

The most problematic changes were convincing buttons that displayed glyphs to centre their text properly (turns out that they were centring based on the wrong font) and getting the MultiLineTextBox to redraw at a reasonable speed. After some cunning hacking about with the WoopsiString and the MultiLineTextBox I’ve got it back up to something approaching its old speed.

Regarding speed, unicode definitely has a negative impact. Having to scan through an entire string to locate a codepoint is a horrible performance killer. WoopsiString::getCharAt() is definitely not a function you should use if you’re reading out sequential characters, as it looks for the codepoint from the start of the string each time it is called. It’s much more advisable to use getToken() to get a pointer to the first token, read out the codepoint, then increase the pointer by the size of each character as it is read. Before I went back to optimise things, the MultiLineTextBox was using getCharAt() when drawing and so was the PackedFontBase, meaning the string was being scanned through twice for each character being printed. Xcode’s “Shark” profiling utility noted that the MultiLineTextBox was using 31% of the total CPU time when printing a line of text solely because of all the UTF-8 parsing. Yikes.

There are still a lot of gadgets to convert over, and probably a hefty set of bugfixes and optimisations to make.

2010-01-03

UTF-8 and TrueType Fonts

If you haven’t been following this thread lately, you won’t be aware of Woopsi’s latest contributor, Lakedaemon. He’s doing a fantastic job of introducing two new, much sought-after features to Woopsi - unicode (UTF-8) and TrueType font support.

Both features seem to work well. Both of these features have pros and cons, which I’ve been contemplating a lot lately. Unicode support, for example, has a very negative impact upon random access within strings. Unicode is a variable width encoding system, so there is no way to locate a random character within a string without first parsing all of the bytes that come before the character. However, Woopsi doesn’t really do much random string access, so this is not a huge problem.

The real issue with supporting extra character sets is the amount of memory consumed by each string. As a variable width system, UTF-8 uses single bytes for the standard ASCII set and so the problem is a non-issue.

In order to have system-wide support for UTF-8, all gadgets that have previously worked with char arrays must be altered to work with the WoopsiString instead. This implies some overhead, which is unavoidable, but as support for non-ASCII character sets is one of the most requested features for Woopsi, I think we can live with it.

Supporting TrueType fonts is more complex. Lakedaemon has used the FreeType library. It looks like a solid project and is dual licenced under the GPL and MIT licences. The MIT licence is basically the BSD licence with a different name, so there are no legal problems with using it. However, it does make the build process more complicated, and it adds substantial bulk to the size of a Woopsi ROM file. The TrueType font must be loaded from the flash disk, so the ROM must be linked with libfat, further adding to the ROM size.

I haven’t tried out a Woopsi GUI that uses TrueType fonts yet, but I’m guessing that it will be slower than a simple bitmap font-based GUI. FreeType is designed for embedded devices, however, which may mean that the difference in performance isn’t noticeable.

Bearing these points in mind, I think I’ll add TrueType support as an optional extra that can be enabled in the makefile. I’m not sure how that’s going to work yet; it may be that I split Woopsi into two pre-compiled libraries, one of which is lightweight (no libfat/freetype) and one of which is heavyweight (libfat and freetype enabled).

Someone using a TrueType font as their system font will soon come across an interesting problem - the glyphs on their buttons (close gadgets, screen flip gadgets, radio buttons, etc) will be drawn as characters, not images. The reason for this is that the glyphs are stored in the extended ASCII portion of the bitmap fonts. At this point, it has clearly become a bad idea, and I’ll need to split the glyphs into a separate glyph font. It is unlikely that anyone will want two or more glyph fonts active at once, so I’ll make the glyph font a global variable initialised in the woopsifuncs files. It’ll be user-definable, just like the system font.

Aside from that, I’ve made a couple of other fixes. I’ve removed the “raisesEvents” flag from the Gadget class, as that is now handled by the GadgetEventHandlerList class. The setRaisesEvents() and raisesEvents() functions interact with the GadgetEventHandlerList instance correctly, which has fixed a glitch in the ScrollingPanel. The ScrollingPanel::raiseScrollEvent() method refers to the GadgetEventHandler’s isEnabled() function to see if events should be enabled or not.

Finally, the FontBase class now includes a getCharHeight() method, which should hopefully help out Lakedaemon’s desire to support variable-height rows of text.