2013-02-21

Encoding HTML Entities with libXML

iOS doesn’t have a standard, easy-to-use way of encoding UTF-8 strings into HTML entities (ie. from “>” into “>“). One library that can be used to achieve this is libXML, but its API is particularly unpleasant:

int htmlEncodeEntities (unsigned char * out, 
                        int * outlen, 
                        const unsigned char * in, 
                        int * inlen, 
                        int quoteChar)

out: a pointer to an array of bytes to store the result
outlen: the length of @out
in: a pointer to an array of UTF-8 chars
inlen: the length of @in
quoteChar: the quote character to escape (' or ") or zero.

Returns: 0 if success, -2 if the transcoding fails, or -1 otherwise
The value of @inlen after return is the number of octets consumed as
the return value is positive, else unpredictable. The value of
@outlen after return is the number of octets consumed.

What’s wrong with that function signature, you ask? Other than the quality of the documentation, that is? Consider this situation. You need to encode this string:

<p>

That will encode into this string:

&lt;p&gt;

Now consider this braindead catch-22 situation: You must allocate the memory for the “out” parameter before you call the function, but until you call the function you won’t know how much memory you need to allocate.

The raw string in the example above is 4 bytes long; the encoded version is 10 bytes (don’t forget that this is C and all strings are NULL-terminated). If you follow the advice from Stack Overflow and double the initial size you’ll end up with 7 bytes (1 terminator + (3 characters * 2)), which is still too short. Another alternative is to figure out the maximum amount of memory that an encoded string could possibly consume (9 bytes per character) and use that, but then you’ll potentially be wasting massive amounts of memory.

If I were writing the function I’d probably return a pointer to a block of memory allocated inside the function itself. It could re-allocate the memory as needed and users of the function wouldn’t need to guesstimate the buffer size. That would raise my favourite C question, though: Who owns this memory?

I couldn’t find any examples of how to use the httpEncodeEntities() method, so I came up with my own. This solution uses a loop and encodes the string in chunks. It uses an encoded buffer twice the size of the initial string, but will resize it if it finds that the buffer isn’t large enough for a single encoded character. It’s implemented as a category method on NSString.


#import <libxml/htmlparser.h>

@implementation NSString (SZHTMLEncoding)

- (NSString *)stringByEncodingHTMLEntities {

    if (self.length == 0) return;
    
    NSData *data = [self dataUsingEncoding:NSUTF8StringEncoding];
    
    int remainingBytes = data.length;
    int bufferSize = (data.length * 2) + 1;
    const unsigned char *bytes = (const unsigned char *)[data bytes];
    
    // We have to add an extra byte on the end of the encoded string to enable
    // us to add a terminator character.
    unsigned char *buffer = malloc(bufferSize);
    buffer[bufferSize - 1] = '\0';
    
    NSMutableString *output = [NSMutableString stringWithCapacity:remainingBytes];
    
    do {
        
        int outLen = bufferSize - 1;
        int inLen = remainingBytes;
    
        int result = htmlEncodeEntities(buffer, &outLen, bytes, &inLen, '"');
        
        // libXML doesn't append a terminator to the string - presumably because
        // NSString doesn't include one - so we'll have to take care of that in
        // order to convert back to an NSString.  We only add this if we haven't
        // completely filled the buffer.  If we've filled it, we've already
        // added the terminator character.
        if ((NSUInteger)outLen < bufferSize - 1) {
            buffer[outLen] = '\0';
        }
        
        if (result == 0) {
            
            NSString *string = [NSString stringWithCString:(const char *)buffer encoding:NSUTF8StringEncoding];
            
            [output appendString:string];
            
            remainingBytes -= inLen;
            
            if (remainingBytes > 0 && inLen == 0) {
                
                // Oh no!  We've got characters left to encode but they aren't
                // encoding.  This happens if our buffer isn't big enough, so
                // we'll resize it.
                free(buffer);
                bufferSize = ((bufferSize - 1) * 2) + 1;
                buffer = malloc(bufferSize);
                buffer[bufferSize - 1] = '\0';
            }
        } else {
            
            // Something bad happened
            break;
        }

        bytes += inLen;
        
    } while(remainingBytes > 0);

    free(buffer);
        
    return output;
}

@end