2009-11-05

Strings, C and C++

A function to read a single line of text from a text file is a tricky thing to write in C++, at least when one is limited to fgetc() and fgets() from the standard C library.

Memory allocation is the main problem. A readline() function must obviously store its output somewhere. There are three possible approaches. In the first, and most usually recommended, the function should receive a block of pre-allocated memory. In the second, the function creates and returns a new block of memory. An in-between method could receive pre-allocated memory but enlarge it as needed.

The fgets() function is the most primitive example of a readline() function in the standard C library. It relies on the caller to supply it with pre-allocated memory in which to store its output, and it looks like this:

char* fgets(char* buffer, int count, FILE* stream)

The function reads from stream until it hits either a newline, end of file, or has read the number of characters indicated by count (this should be at most the size of buffer). It stores the output in buffer, and also returns a pointer to buffer when the function ends.

fgets() has a major problem, of course. How many characters are there in the line being read? Answer: No idea. fgets() is a low-level function that may read a whole line or just part of a line, and it is the programmer’s responsibility to identify when a partial line has been returned and repeat the process until all characters are retrieved.

A function that solves this problem is getline() from the GNU C library (not the same as the getline() method available to C++ istreams). It looks like this:

ssize_t getline(char** lineptr, size_t* count, FILE* stream)

The function is guaranteed to read an entire line of text. lineptr performs the same function as buffer in fgets() - the text retrieved by the function is stored here. count contains the size of the lineptr buffer. stream is a pointer to the file being read. In this function, memory can be allocated before the function is called, but it does not need to be (in that case - deep breath - the pointer to which lineptr points must point to NULL and count must be 0). The function uses realloc() to increase the size of the lineptr buffer as necessary. The eventual size of the buffer is stored in count and returned as the output of the function.

It can be thought of as an example of the “in-between” approach described earlier. It relies on externally-created memory but can enlarge the memory area to accommodate the full string. Choosing a generous initial buffer size will reduce the number of realloc() operations performed. Assuming realloc() is usually successful in increasing a block of allocated memory, rather than needing to allocate a new region and copy the old data into it, getline() should be comparable in speed to fgets(). The minor speed penalty is worth the ease of use the function affords.

However, this function has its own set of problems, at least within the scope of Woopsi. It is not available for starters - devkitARM does not include the GNU C library. Even if I were to borrow the code, it really highlights one of the problems with C++.

getline() expects to be passed a pointer to either memory allocated via malloc() or NULL via the lineptr parameter. C++, however, essentially deprecates the malloc/free method of allocating heap memory. A C++ programmer should instinctively use the new/delete memory allocation system. This will potentially cause all manner of chaos when getline() calls realloc() on memory that was allocated using the “new” keyword. Since memory allocated with “new” cannot be reallocated, a C++ re-implementation of getline() will always copy memory every time the buffer is resized. getline() is a good example of why C++’s palimpsest-like layering over C was perhaps not such a good idea.

Rather than allowing memory to be allocated outside of the function and populated within the function, a different approach is to handle all memory allocation internally:

char* getline(FILE* file)

This version of getline() would be more familiar to programmers using higher-level languages such as C# and Java. It doesn’t perform magic with pointers-to-pointers. It will allocate memory within itself, grow it as necessary, and simply return a pointer to the string once it has been retrieved.

Outside of memory-managed languages, though, this function has problems. How has the memory been allocated - new or malloc()? We can document this in the function’s comments, and we can also follow the C++ standard of using “new” to make this more intuitive, but the question still needs to be asked before the function can be used. More importantly, who is responsible for deleting the memory when it is no longer being used? If this is a standalone function it is fairly obvious that it is the caller’s responsibility to delete the string once it has been used, but what if the function is actually a public method of a class? Does the class use the memory internally, or has it allocated the memory solely for the caller’s exclusive use?

My reason for pondering this is directly related to the last post. I’m writing a simple textfile class that abstracts away the filesystem and can do useful things such as reading individual lines of text. The problem I’m coming up against is how to make the API for the class familiar to C and C++ programmers, embody the ease-of-use of a Java or C# class, whilst simultaneously trying to make it efficient for the DS hardware.

Following the fgets() pattern makes sense since fgets() is one of the functions offered by libfat. The standard getline() method is considerably more user-friendly, however, as it eliminates the tedium of ensuring that fgets() does really return a complete line. The alternative version of getline() is easier to use than both of the others, but ownership of the resultant memory is not at all clear solely from the code itself.

It is worth mentioning that all of the usual C++ ways of handling this are off-limits due to Woopsi’s strict adherence to binary-size-reducing “no STL” policy.