2013-02-21

Encoding HTML Entities with libXML

iOS doesn’t have a standard, easy-to-use way of encoding UTF-8 strings into HTML entities (i.e. from “>” into “&gt;”). One library that can be used to achieve this is libXML, but its API is particularly unpleasant:

int htmlEncodeEntities (unsigned char * out, 
                        int * outlen, 
                        const unsigned char * in, 
                        int * inlen, 
                        int quoteChar)

out: a pointer to an array of bytes to store the result
outlen: the length of @out
in: a pointer to an array of UTF-8 chars
inlen: the length of @in
quoteChar: the quote character to escape (' or ") or zero.

Returns: 0 if success, -2 if the transcoding fails, or -1 otherwise
The value of @inlen after return is the number of octets consumed as
the return value is positive, else unpredictable. The value of
@outlen after return is the number of octets consumed.

What’s wrong with that function signature, you ask? Other than the quality of the documentation, that is? Consider this situation: you need to encode this string:

<p>

That will encode into this string:

&lt;p&gt;

Now consider this braindead catch-22 situation: You must allocate the memory for the “out” parameter before you call the function, but until you call the function you won’t know how much memory you need to allocate.

The raw string in the example above is 4 bytes long; the encoded version is 10 bytes (don’t forget that this is C and all strings are NULL-terminated). If you follow the advice from Stack Overflow and double the initial size you’ll end up with 7 bytes (1 terminator + (3 characters * 2)), which is still too short. Another alternative is to figure out the maximum amount of memory that an encoded string could possibly consume (9 bytes per character) and use that, but then you’ll potentially be wasting massive amounts of memory.
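
To put numbers on that, here’s the arithmetic from this paragraph expressed as C:

const char *raw     = "<p>";        // 3 characters + NUL terminator = 4 bytes
const char *encoded = "&lt;p&gt;";  // 9 characters + NUL terminator = 10 bytes

int doubled   = (3 * 2) + 1;        // 7 bytes: doubling still falls short
int worstCase = (3 * 9) + 1;        // 28 bytes: 9 bytes per character is always safe, but wasteful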

If I were writing the function I’d probably return a pointer to a block of memory allocated inside the function itself. It could re-allocate the memory as needed and users of the function wouldn’t need to guesstimate the buffer size. That would raise my favourite C question, though: Who owns this memory?
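
Just to illustrate, a hypothetical signature for that design might look like this (the function name is made up; nothing like it exists in libXML):

// Hypothetical alternative - NOT a real libXML function.  It would grow its
// own output buffer internally (via realloc()) as it encodes and return it,
// with the caller taking ownership and calling free() when done.
unsigned char *htmlEncodeEntitiesAlloc(const unsigned char *in,
                                       int inlen,
                                       int quoteChar);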

I couldn’t find any examples of how to use the htmlEncodeEntities() function, so I came up with my own. This solution uses a loop and encodes the string in chunks. It uses an encoded buffer twice the size of the initial string, but will resize it if it finds that the buffer isn’t large enough for a single encoded character. It’s implemented as a category method on NSString.


#import <libxml/HTMLparser.h>

@implementation NSString (SZHTMLEncoding)

- (NSString *)stringByEncodingHTMLEntities {

    if (self.length == 0) return self;
    
    NSData *data = [self dataUsingEncoding:NSUTF8StringEncoding];
    
    int remainingBytes = (int)data.length;
    int bufferSize = ((int)data.length * 2) + 1;
    const unsigned char *bytes = (const unsigned char *)[data bytes];
    
    // We have to add an extra byte on the end of the encoded string to enable
    // us to add a terminator character.
    unsigned char *buffer = malloc(bufferSize);
    buffer[bufferSize - 1] = '\0';
    
    NSMutableString *output = [NSMutableString stringWithCapacity:remainingBytes];
    
    do {
        
        int outLen = bufferSize - 1;
        int inLen = remainingBytes;
    
        int result = htmlEncodeEntities(buffer, &outLen, bytes, &inLen, '"');
        
        // libXML doesn't append a terminator to the string - presumably because
        // NSString doesn't include one - so we'll have to take care of that in
        // order to convert back to an NSString.  We only add this if we haven't
        // completely filled the buffer.  If we've filled it, we've already
        // added the terminator character.
        if (outLen < bufferSize - 1) {
            buffer[outLen] = '\0';
        }
        
        if (result == 0) {
            
            NSString *string = [NSString stringWithCString:(const char *)buffer encoding:NSUTF8StringEncoding];
            
            [output appendString:string];
            
            remainingBytes -= inLen;
            
            if (remainingBytes > 0 && inLen == 0) {
                
                // Oh no!  We've got characters left to encode but they aren't
                // encoding.  This happens if our buffer isn't big enough, so
                // we'll resize it.
                free(buffer);
                bufferSize = ((bufferSize - 1) * 2) + 1;
                buffer = malloc(bufferSize);
                buffer[bufferSize - 1] = '\0';
            }
        } else {
            
            // Something bad happened
            break;
        }

        bytes += inLen;
        
    } while(remainingBytes > 0);

    free(buffer);
        
    return output;
}

@end
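
Usage, with the example string from earlier in the post:

NSString *encoded = [@"<p>" stringByEncodingHTMLEntities];

// Prints "&lt;p&gt;"
NSLog(@"%@", encoded);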

Comments

Jeff on 2013-02-21 at 07:10 said:

From an efficiency perspective, surely it’s easier to just do:

p = bytes; count = 0;
while (*p) {
  switch (*p) {
    default: 
       if (*p & 0x80) count += 4;  // &#xx;
       count++; break;
    case '&': count += 5; break; // &amp;
    case '<': count += 4; break; // &lt;
  }
  p++;
}

repeating appropriately for the problematic case values, and be done with it. A switch statement and a one-pass trawl through UTF-8 is going to be way more efficient than encoding and re-encoding until you get there…

The pure performance chaser would set up a bit mask that told them if (*p) needed 3 bytes or 4, and then compute a couple of 256/8-byte arrays to index into. Yes, memory access is expensive, but again, not as expensive as a cross-framework call. The problematic set of characters that need special encoding isn’t variable, nor is it so large that it’s not possible to code for.

And since you are allocating into a buffer, then converting to a string anyway, what’s the big deal if you overestimate a few bytes? The bit tells you “this guy needs 5”; you don’t need the bitmask for anything > 0x80 (since you know they will be 5 chars).

Jeff on 2013-02-21 at 07:11 said:

Half my previous comment got lost somewhere… I had a better answer involving indexing into a precomputed bitmask that identified which chars < 0x80 require up to 5 bytes.
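
Something along these lines (rough and untested):

// One bit per ASCII char that needs special handling ('&', '<', '>', '"'),
// packed into a 16-byte mask.
static const unsigned char escapeBits[16] = {
    0x00, 0x00, 0x00, 0x00, 0x44, 0x00, 0x00, 0x50,  // '"'=0x22, '&'=0x26, '<'=0x3C, '>'=0x3E
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
};

static int needsEscape(unsigned char c) {
    return c < 0x80 && (escapeBits[c >> 3] & (1u << (c & 7)));
}

// One pass to get a safe overestimate of the encoded length, including the
// NUL terminator.  6 covers the worst ASCII case ("&quot;"); using it for
// high bytes too over-counts multi-byte UTF-8 sequences, but overestimating
// a few bytes is the whole point.
static int encodedLength(const unsigned char *p) {
    int count = 1;
    for (; *p; p++) {
        count += ((*p & 0x80) || needsEscape(*p)) ? 6 : 1;
    }
    return count;
}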

Ant on 2013-02-21 at 13:27 said:

That’s exactly the solution I used in the end, via an existing class from Google:

http://code.google.com/p/google-toolbox-for-mac/source/browse/trunk/Foundation/GTMNSString%2BHTML.m?r=314

This class handles all HTML entities, including characters like upsilon (two bytes in UTF-8; nine bytes when encoded as an HTML entity).

I switched away from libXML when I realised the catch-22 allocation problem, but I was still interested in figuring out exactly how libXML was supposed to be used.
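
For anyone who lands here later, usage looks something like this (going from memory of the selector names in that file, so double-check against the header):

#import "GTMNSString+HTML.h"

// gtm_stringByEscapingForAsciiHTML also converts non-ASCII characters (like
// upsilon) into entities; gtm_stringByEscapingForHTML leaves them alone.
NSString *encoded = [@"<p>" gtm_stringByEscapingForAsciiHTML];

// Prints "&lt;p&gt;"
NSLog(@"%@", encoded);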