A Better UTF-8 String

The WoopsiString Class - Optimised for ASCII

The current WoopsiString class stores its data in the usual way: an array of chars. When the string is initially set, the object creates an array large enough to hold the data it should store. It then copies the data into that new array.

When appending to the string, the object checks to see if it contains enough free space within the existing array to contain the new data. If so, the data is just copied into the free space. If not, a new array is allocated large enough to contain both the existing data and the new data. The existing data is copied into it, and the new data is copied to the end.

Insertion works in a similar way, except two portions of the existing data will be copied separately into disjointed sections of the new array if the insertion point falls within the existing string.

In contrast, removal is very simple. A new terminator character is placed at the new endpoint of the string, and everything following it is deemed to be free space, ready for future insert/append operations.

Although it involves a lot of copying when inserting and appending, this system works exceptionally well for the two most frequent string operations - random access to a character within a string, and iterating sequentially over the characters in a string. Keeping all of the characters in a sequential block of RAM makes these two operations trivial.

Reducing Data Copying Via Linked Lists

When I was writing the WoopsiString class, however, the amount of copying annoyed me. I came up with a solution that performed a lot less copying by breaking the string data up into a linked list of string fragments. For example, imagine that we have this string:


Drawn with pipe characters as byte boundaries, it looks like this:

| h | e | l | l | o | \0 |

Suppose we want to append the text “ world!” onto that string. The current WoopsiString class would do this:

  1. Allocate a new char array of length 13.
  2. Copy “hello” from existing array into new array.
  3. Copy “ world!” from second array into new array.
  4. Append terminator to end of new array.
  5. Delete existing array.
  6. Update array pointer to point to new array.

We end up with this arrangement:

| h | e | l | l | o |   | w | o | r | l | d | ! | \0 |

The proposed new WoopsiString class would avoid all of the tedious copying by appending the “ world!” text as a new string fragment:

  1. Allocate a new linked list item containing a char array of length 7.
  2. Copy “ world!” into the new array.
  3. Point the “next” pointer of the “hello” fragment to the new “ world!” fragment.

The data structure we end up with looks like this (“-->” indicates a pointer):

| h | e | l | l | o | --> |   | w | o | r | l | d | ! |"

This new system has clear advantages over the current class, at least when it comes to appending. Very little copying needs to be performed.

Inserting has similar benefits. If we want to insert a comma after the word “hello” we would need to perform the following operations in the current class:

  1. Allocate a new char array of length 14.
  2. Copy “hello” from existing array into new array.
  3. Copy “,” from second array into new array after the word “hello”.
  4. Copy “ world!” from existing array into new array.
  5. Append terminator to end of new array.
  6. Delete existing array.
  7. Update array pointer to point to new array.

We end up with this:

| h | e | l | l | o | , |   | w | o | r | l | d | ! | \0 |

Again, lots of copying. The new class would do this instead:

  1. Allocate a new linked list item containing a char array of length 1.
  2. Copy “,” into the new array.
  3. Point the “next” pointer of the “,” fragment to the “ world!” fragment.
  4. Point the “next” pointer of the “hello” fragment to the “,” fragment.

We get this:

| h | e | l | l | o | --> | , | --> |   | w | o | r | l | d | ! |"

No copying involved. However, we are cheating a little here; we’re taking advantage of the fact that we’re inserting between two existing fragments. What happens if we try to insert “llo he” into our “hello, world!” string just after the second character?

  1. Allocate a new linked list item containing a char array of length 2.
  2. Copy “he” from the “hello” fragment into the new array.
  3. Allocate a new linked list item containing a char array of length 6.
  4. Copy “llo he” from the second array into the new array.
  5. Allocate a new linked list item containing a char array of length 3.
  6. Copy “llo” from the “hello” fragment into the new array.
  7. Point the “next” pointer of the “he” fragment to the “llo he” fragment.
  8. Point the “next” pointer of the “llo he” fragment to the “llo” fragment.
  9. Point the “next” pointer of the “llo” fragment to the “,” fragment.
  10. Delete the “hello” fragment.

The final data structure looks like this:

| h | e | -- > | l | l | o |   | h | e | --> l | l | o | --> | , | -->
|   | w | o | r | l | d | ! |"

(We could optimise that slightly by making use of the existing “hello” fragment instead of deleting it, which would ultimately give us “hello”-->” hello”->“,”-->” world!” instead. It makes the algorithm slightly more complex but slightly more efficient.)

The existing class would need to make 1 memory allocation, copy 19 characters and perform 1 memory free to be able to perform this operation. The new class would make 3 memory allocations, copy 11 characters and perform 1 memory free to do the same thing. Though this doesn’t look like much of an improvement, imagine that the string we’re inserting into is 5,000 characters long, and has been created by concatenating five 1,000 character strings together. The existing class would copy all 5,000 characters during the insert. The proposed new class would copy just 1,000, as it could work with an individual fragment instead of the entire string.

Problems with Random Access and Iteration

If this is such a great improvement, then, why didn’t I implement it? The answer lies right at the top of this post. The two most frequent operations performed on strings are extracting substrings and iterating over their characters. If a string is stored non-contiguously, these two operations become much more complex. In a sequential string, extracting a character from a string can be done in O(1) time, and is as simple as getting a pointer to the start of the string and adding an offset. Iterating over the string is simply a matter of adding 1 to the pointer.

In the proposed class, extracting a character is far more laborious. Assuming each fragment includes an int that identifies its length, the algorithm for finding a character looks like this:

Store a pointer to the head of the list in "listPointer"
Store its length in a temporary variable "totalLength"
While the desired index is greater "totalLength"
    Point "listPointer" to its own "next" pointer
    Add the length of the "listPointer" fragment to "totalLength"

Retrieve the character from the "listPointer" char array at desired index
minus "totalLength"

The worst-case scenario, in which the desired index is the last character in a string that has been built by appending single characters, will result in a retrieval time of O(n).

For ASCII strings, this is disastrous.

When dealing with UTF-8 strings, however, this approach starts to look far more appealing. In the best-case scenario for retrieving a character from a UTF-8 string the time is O(n), because there is no way of knowing where a character is stored without iterating over the entire string.

In Defense of UTF-8

To explain this, here’s a quick description of the problem that UTF-8 tries to solve. In ASCII text, each character is represented by a single byte. The most significant bit is not used, so each byte can represent one of 128 different characters. This is great for the English-speaking world, as ASCII text can represent all characters from the standard English alphabet, all numerals, and most of the important symbols (though us Brits are screwed when trying to talk about money, as the symbol for the British pound is not in the ASCII set).

The rest of the world is not so lucky. The ASCII set does not include accent marks, so the French are stuck. It doesn’t include umlauts or the Eszett (double ’s’) so the Germans are stuck. And it definitely doesn’t include the tens of thousands of characters in any of the multiple Chinese languages, so they’re really stuck.

The solution is obviously to make characters longer than a single byte. Suppose we make all characters 32-bits wide, for example. Now we have another problem - all of our existing applications, that assume characters are 8-bits wide, no longer work. Plus we’ve quadrupled the size of all strings in all future applications. If those applications are for American users, ¾ths of that is wasted space.

UTF-8 solves this by using multiple bytes for characters if necessary. It will represent ASCII characters as single bytes, but other characters will be represented as either 2, 3 or 4-byte sequences.

The problem with variable-length encodings is that it is impossible to know where a particular character is. For example, suppose we decide that the letter “w” should be represented by 4 bytes. “Hello world” now looks like this:

| H | e | l | l | o |   | w | w | w | w | o | r | l | d | \0 |

In order to find the first “r” in the string (character 8 ) we need to retrieve byte 11. Since all of the characters in the string could be multibyte characters, the only way we can possibly guarantee that we retrieve the correct character is by starting at “H” and working forwards, a character at a time.

Optimising UTF-8 with Linked Lists

This is where the new WoopsiString idea becomes extraordinarily helpful. Suppose that the string above has been created by concatenating the strings “Hello ” and “world”:

| H | e | l | l | o |   | --> | w | w | w | w | o | r | l | d |

Locating the “r” in the string is now considerably faster. We can bypass the entire “Hello” portion of the string simply by comparing the index we’re looking for (8) with the length of the “Hello” fragment (5), which tells us that the character we’re looking for isn’t contained in that fragment. For ASCII strings, this was a horrible performance decrease, but for UTF-8 strings represents a considerable performance increase.

If strings were automatically divided into fragments (32 characters - not bytes - wide, for example), and if the fragments included “previous” as well as “next” pointers (ie. a doubly-linked list), random access performance can be improved even further.

Iterating with Linked Lists - The StringIterator Class

As always, there is a penalty to pay for this performance increase. Iterating over a UTF-8 string is more complex than iterating over an ASCII string, but not much more complex. Once the starting character has been located, iterating over the string is simply a question of consuming bytes until a valid codepoint (the “value” of the character, sans the UTF-8 multibyte markers) has been identified, returning the codepoint, moving to the next byte and starting the process again.

If linked lists are introduced into the mix, finding the starting character becomes much faster, but iterating over the string from that point (assuming the string section being iterated over crosses a fragment boundary) becomes more complicated.

At the moment, Woopsi handles iteration over strings by extracting a char pointer from the WoopsiString object, then iterates over it by adding the size of each character to the pointer (the same algorithm as outlined above). If linked lists were introduced, this would not be possible. In fact, the only practical and sturdy solution I can some up with is to have a separate StringIterator class which would handle all string iteration. You’d need to give it a starting character to locate, then use its “MoveNext()” and “MovePrevious()” methods to iterate over the string.


In summary, if it’s possible to sum up these disparate thoughts, it’s possible to create a string class that uses linked lists in order to require far fewer copy operations than the existing WoopsiString class for standard string functions. Unfortunately, for ASCII text, it would be counterproductive to do so. Conversely, the same design presents a significant performance improvement for UTF-8 text, at the cost of making iteration more complex and time consuming.