2012-12-04

More Barcode Disasters

Here’s a follow-up to my last rant about driving licence barcode standards that aren’t standard.

Header

The barcode header is supposed to be “@\n\x1e\rANSI ”. If the first nine characters of the decoded data equal this string, we’ve got a valid driving licence barcode. South Carolina’s driving licences, however, have the file separator character (ASCII 0x1c) as the third byte instead of the record separator character (ASCII 0x1e) defined by the standard.
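A header check that copes with the South Carolina quirk might look something like this in Python. It is only a minimal sketch: the name `has_dlid_header` is my own invention, and tolerating the 0x1c variant is a pragmatic choice rather than anything the spec sanctions.

```python
# Header as described above: "@", LF, record separator (0x1e), CR, "ANSI ".
STANDARD_HEADER = "@\n\x1e\rANSI "

# South Carolina variant: file separator (0x1c) where 0x1e should be.
SOUTH_CAROLINA_HEADER = "@\n\x1c\rANSI "


def has_dlid_header(decoded):
    """Return True if the decoded barcode text starts with a known header.

    Hypothetical helper: accepting the 0x1c variant is an assumption made
    for practicality, not something the standard permits.
    """
    prefix = decoded[:9]  # the header is exactly nine characters long
    return prefix in (STANDARD_HEADER, SOUTH_CAROLINA_HEADER)
```

Anything that differs in any other byte is still rejected, so the tolerance is limited to the one deviation described above.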

ZIP Codes

US zip codes come in two formats. The first is the standard 5-digit code, such as “90210”. This was found to be insufficiently precise, so a “+4” extension is often tagged on the end, giving the format “90210-1234”.

In the first version of the DLID spec, the zip code field was 11 characters long. If the zip code didn’t fill the entire 11 characters, the extra places were padded with spaces.

Consider the file format. Each record in the file is split into two parts: an identifier, which is a 3-character header (“DAQ”, “DBC”, etc) that indicates what the data represents; and the data itself (“JOE”, “BLOGGS”, etc). Records are separated by line breaks. If the data has separators, why does the zip code field have a fixed width? Most (all?) of the other fields are variable width.
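To make the record layout concrete, here is a rough Python sketch of splitting the records into identifier/data pairs. It assumes the header has already been stripped and that records are separated by “\n”. The function name and the sample values are made up for illustration.

```python
def parse_records(body):
    """Split record text into a {identifier: data} dictionary.

    Assumes `body` holds only the records (header already removed), one per
    line, each starting with a 3-character identifier such as "DAQ".
    """
    records = {}
    for line in body.split("\n"):
        line = line.rstrip("\r")  # tolerate CRLF line endings
        if len(line) < 3:
            continue  # too short to contain an identifier
        identifier, data = line[:3], line[3:]
        records[identifier] = data  # padding spaces in the data are kept
    return records
```

Note that the data is deliberately left untrimmed, because fixed-width fields such as the zip code carry their padding inside the record.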

Assuming there’s a reason for the field to be fixed-width, you should be able to see immediately that the spec is still broken. If all zip codes are at most 10 characters long, why does the field allow for 11 characters? Even 10 characters is too long, because the separator carries no information: if the data is 5 characters long, a parser can infer that it has no +4 extension; if it is 9 characters long, a parser can infer that it has the extension and split it up accordingly.

Version 3 of the spec tried to rectify the situation. The field was shortened to 9 characters, but this time zeros were used as padding instead of spaces. The upshot is that every parser must extract the zip and +4 sections by dividing up based on expected data lengths (5 and 4 respectively) and then ditch the extension if it is equal to “0000”. Why not just make the field a variable width? Why not pad it with spaces that can be trimmed without potentially losing a trailing zero in the zip?
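The version 3 dance can be sketched in a few lines of Python. This is an illustration only: `split_v3_zip` is a hypothetical name, and it deals with US zip codes only.

```python
def split_v3_zip(field):
    """Split a 9-digit version 3 zip field into (zip, extension).

    The extension is None when it is the "0000" padding.
    """
    if len(field) != 9 or not (field.isascii() and field.isdigit()):
        raise ValueError("expected 9 digits, got %r" % (field,))
    zip_code, extension = field[:5], field[5:]
    if extension == "0000":
        return zip_code, None  # padding, not a real +4 extension
    return zip_code, extension
```

Splitting by position is the important part. Trimming trailing zeros instead would turn “100000000” into “1”, rather than the zip “10000” with no extension.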

There is no documentation as to how to format the field in versions 1 and 2 of the DLID spec. Thus, Colorado just uses the 5-character zip. South Carolina uses both the zip and the extension and smooshes them together into one 9-character string padded with two spaces. Massachusetts includes both sections of the zip separated by a hyphen and pads with one space. Who knows what the other states do.
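A parser for the older versions therefore has to be forgiving. The Python sketch below accepts the three layouts described above (5 digits, 9 digits run together, and 5-4 with a hyphen), each padded with spaces. The function name and the sample zip codes are mine, and other states may well do something this doesn’t anticipate.

```python
import re

# Five digits, optionally followed by a four-digit extension, with or
# without a hyphen between the two sections.
_ZIP_PATTERN = re.compile(r"([0-9]{5})(?:-?([0-9]{4}))?")


def parse_v1_zip(field):
    """Return (zip, extension or None) from a space-padded v1/v2 zip field."""
    match = _ZIP_PATTERN.fullmatch(field.strip())
    if match is None:
        raise ValueError("unrecognised zip field: %r" % (field,))
    return match.group(1), match.group(2)
```

For example, `parse_v1_zip("80202      ")` gives `("80202", None)`, whilst `parse_v1_zip("02108-1234 ")` gives `("02108", "1234")`.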

At version 3 of the spec, the standard embraced the Canadians. Canadians have a 6-character post code that looks just like the UK standard. Nothing documents how these post codes should be formatted: which padding character is used (if indeed one is used), or whether the two 3-character sections should be separated by a space.

Names

The 7 versions of the DLID spec include 3 ways of storing names. They started out with a single record that stored a comma-delimited list of names in the format “LAST,FIRST,MIDDLE,…”. Colorado, being unique and special, uses the format “FIRST,MIDDLE,…,LAST”.

Presumably to prevent this foolishness, the standards body changed this in the second version of the spec. This version included a standalone “last name” field and a field for other names in the format “FIRST,MIDDLE,…”. Actually, that’s not strictly true; the documented format is “FIRSTxMIDDLEx…”, where “x” is an undocumented separator. Wisconsin used a space whilst Virginia used a comma.

The fourth version finally seems to have fixed it. Names are divided into three fields: “first”, “last” and “middle”, where “middle” can contain multiple comma-separated names. Documentation at last!
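Coping with the first two layouts needs something like the Python sketch below. The function names and the `first_name_leads` flag are hypothetical. Accepting either a comma or a space as the version 2 separator is a guess based on the two states mentioned above.

```python
import re


def parse_v1_name(field, first_name_leads=False):
    """Parse a version 1 name field into first, middle and last names.

    The default layout is "LAST,FIRST,MIDDLE,...". Pass
    first_name_leads=True for the Colorado layout "FIRST,MIDDLE,...,LAST".
    """
    parts = [p.strip() for p in field.split(",") if p.strip()]
    if not parts:
        raise ValueError("empty name field")
    if first_name_leads:
        last, given = parts[-1], parts[:-1]
    else:
        last, given = parts[0], parts[1:]
    first = given[0] if given else ""
    return {"first": first, "middle": given[1:], "last": last}


def split_given_names(field):
    """Split a version 2 "FIRSTxMIDDLEx..." field on commas or whitespace."""
    return [name for name in re.split(r"[,\s]+", field.strip()) if name]
```

The obvious weakness is that a space-separated field can’t distinguish two given names from one double-barrelled given name, which is exactly the kind of ambiguity a documented separator would have avoided.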

Social Security Numbers

Version 1 of the spec optionally allowed states to include their drivers’ social security numbers on their licences. Careful with that licence, now…

Gender

Version 1 of the spec allowed gender to be expressed using 6 possible values: M, F, 0, 1, 2 and 9. “M” and “F” are self-explanatory. The others are pulled from the ANSI-D20 gender codes, in which the values mean “Unknown”, “Male”, “Female” and “Not specified” respectively. Obviously two ways of representing the same piece of data is better than one. Version 2 dumped all but values “1” and “2”. I imagine that the standards body figured that, if they were going to allow someone to be in control of a 26,000lb vehicle, they should take enough of an interest in the driver to know his or her gender.
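Normalising the six version 1 values is at least straightforward. Here’s a small Python sketch; the function name, the `allow_v1_codes` flag and the output strings are my own choices.

```python
# The six values allowed by version 1, mapped to one representation each.
_GENDER_CODES = {
    "M": "male",
    "F": "female",
    "0": "unknown",
    "1": "male",
    "2": "female",
    "9": "not specified",
}


def normalise_gender(value, allow_v1_codes=True):
    """Map a gender field to a single representation.

    With allow_v1_codes=False only "1" and "2" are accepted, mirroring the
    version 2 behaviour described above.
    """
    code = value.strip().upper()
    if not allow_v1_codes and code not in ("1", "2"):
        raise ValueError("gender code not allowed: %r" % (value,))
    if code not in _GENDER_CODES:
        raise ValueError("unknown gender code: %r" % (value,))
    return _GENDER_CODES[code]
```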

2012-11-30

Parsing US Driving Licence Barcodes

Most states in the US and some Canadian provinces include a PDF417 barcode on the back of their driving licences. The barcode contains a host of information about its owner, such as names, address, height, weight, eye colour, date of birth, etc. There are currently 7 different versions of the standard, which you can download here (click on the “Documentation” tab):

Unfortunately, the standards are full of breathtakingly stupid mistakes. Dates are currently my favourite.

This is the date format used in version 1:

yyyymmdd

That’s the basic (unseparated) calendar date format from ISO 8601.

In version 2 they switched to this:

mmddyyyy

That’s the standard US way of representing dates (I like to think of them as “lumpy” dates, because the format goes “medium-small-large”, whereas ISO dates are big-endian). I have no idea why they did this. I presume they got a lot of complaints from Americans who were stumped by the unusual date format whilst decoding the PDF417 barcodes with nothing more sophisticated than their eyes. Any automated parser would naturally re-format the date into the local standard, so they must have been doing it manually. An impressive skill.

In version 3 the Canadians decided to get in on the barcode action. Canadians use the big-endian date format, so the spec now states that date fields can store the dates in one of two ways:

yyyymmdd
mmddyyyy

Every parser now needs to check the licence’s country code before it can parse a date. Not only does this version of the spec introduce a new standard, but it also allows multiple standards within a single field.
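In practice the date parser ends up looking something like this Python sketch. The function name and the “USA”/“CAN” country values are assumptions for illustration. It also assumes that versions after 3 keep the version 3 rule, which I haven’t verified.

```python
from datetime import datetime


def parse_dlid_date(field, spec_version, country="USA"):
    """Parse an 8-digit date field according to the spec version.

    Version 1 uses yyyymmdd and version 2 uses mmddyyyy. From version 3
    onwards the layout depends on the issuing country.
    """
    if spec_version == 1:
        layout = "%Y%m%d"
    elif spec_version == 2:
        layout = "%m%d%Y"
    elif country == "CAN":
        layout = "%Y%m%d"  # Canadian licences keep the big-endian form
    else:
        layout = "%m%d%Y"
    return datetime.strptime(field, layout).date()
```

So the same eight digits can only be interpreted once the parser knows both the spec version and the country.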

Wow.

2012-05-01

Email Address Validation For iOS

If you’ve ever needed to validate an email address, the chances are you used a regex engine and validated the address’ structure against something like this:

^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$

It works well enough for the average case. But what about this email address?

joe+bloggs@example.com

Not so good. You could modify the regex to support the “+” character, but if you read RFC 822 and RFC 5321 you’ll quickly discover that the structure of an email address isn’t just “alphanumeric@alphanumeric.alphanumeric”. Email addresses include a built-in comment system using parentheses, escape characters, quoted sections, “dot-atoms” and more. A regex to correctly parse the entirety of RFC 822 would be huge, and that’s the older, simpler version of an email address.
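To see the failure for yourself, here’s a quick check using Python’s `re` module. It is just a demonstration of the regex above, and the helper name is made up.

```python
import re

# The regex from above, unchanged.
SIMPLE_EMAIL = re.compile(
    r"^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$"
)


def passes_simple_regex(address):
    """Return True if the address matches the simple regex."""
    return SIMPLE_EMAIL.match(address) is not None
```

`passes_simple_regex("joe@example.com")` is true, but `passes_simple_regex("joe+bloggs@example.com")` is false, as are a quoted local part such as `"Fred Bloggs"@example.com` and any address whose top-level domain is longer than four letters.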

Once you start really digging into the syntax of an email address, you find that:

  • Domain names (the part after the “@”) can now be in character sets other than Latin, so your primitive [a-z0-9] regex won’t work;
  • The local part of the address (the part before the “@”) can also now be in other character sets, but as this idea is about 2 months old nothing supports it yet;
  • Trying to figure out if Exchange supports internationalised domain names (IDN) or not is basically guesswork;
  • There are multiple standards for email addresses, most of which disagree with each other and most of which appear to be deprecated. Figuring out which is the current standard is like trying to find a particular Skittle in a bag of Skittles.

Pretty much the only fact you’ll be entirely sure of after your research is that regex is a really bad solution for validating email addresses.

My favourite part of RFC 3696 is where it says these addresses are valid:

Fred\ Bloggs@example.com
"Fred Bloggs"@example.com

This makes sense. The backslash is used to escape a single character, whereas the quotes are used to signify that everything between them should be automatically escaped. One’s really a shortcut for the other. Simple.

However, the errata suggests that this is a mistake, and that this is the correct form:

 "Fred\ Bloggs"@example.com

Now we have a backslash system embedded within a quoting system. The former makes the latter entirely superfluous. I think this amendment is probably why so many developers just stick with regex: The standard is crap.

So, you have two choices:

  1. Ignore the complexity, do what everyone else does, and use regex regardless;
  2. Write an email address parser.

As I’m crazy, here’s an email address parser written in Objective-C. It has some limitations:

  • It does not allow the local part of the address to be internationalised as nothing supports this yet.
  • It does not allow domain name literals (i.e. joe@[10.0.1.2]) as Exchange does not support them and I’ve never seen one in the wild.
  • It may not apply all of the backslash escape rules correctly, as this is one area where every document entirely disagrees with every other document.
  • It may not enforce the omission of certain ASCII ranges correctly, as this is another area where the docs suck.

On the plus side:

  • It validates comments and nested comments (comments are apparently deprecated, but I can’t find anywhere that specifically says that this is the case).
  • It correctly enforces at least some of the backslash escape rules.
  • It allows the use of quoted sections (that can include characters like “@”, “(”, etc) and enforces their rules (quotes must be matched; quoted sections must be the only element of the local part or must be delimited by periods).
  • It enforces period rules (periods cannot be the first or last part of the local or domain parts of the address; periods cannot be adjacent except within a quoted section).
  • It enforces the maximum and minimum lengths of each section (up to 64 chars for the local part; up to 254 chars for the entire address; there must be a local and domain part of the address).
  • It allows IDN.
  • Domain names can only contain letters, numbers, periods and hyphens.
  • It passes all of the test cases written by Dominic Sayers except those concerning carriage/line returns and domain name literals, which aren’t relevant to what I’m trying to do.

// .h
@interface SZEmailValidator : NSObject

+ (BOOL)isValid:(NSString *)candidate;

@end

// .m
#import "SZEmailValidator.h"

struct SZEmailParserState {
    BOOL quoted : 1;
    BOOL escaped : 1;
    BOOL domain : 1;
    BOOL dot : 1;
    BOOL followingQuoteBlock : 1;
};

@implementation SZEmailValidator

+ (BOOL)isValid:(NSString *)candidate {

    unsigned int domainPartStart = 0;
    unsigned int commentDepth = 0;
    
    struct SZEmailParserState state;
    
    state.dot = NO;
    state.quoted = NO;
    state.escaped = NO;
    state.followingQuoteBlock = NO;
    state.domain = NO;
    
    for (unsigned int i = 0; i < candidate.length; ++i) {
        unichar character = [candidate characterAtIndex:i];
                
        if (!state.domain) {
            
            // Do not allow characters beyond the ASCII set in the username
            if (character > 126) return NO;
             
            // Do not allow NULL
            if (character == 0) return NO;
            
            // Do not allow LF
            if (character == 10) return NO;
        }
        
        if (i > 253) {
            
            // Do not allow more than 254 characters in the entire address
            return NO;
        }
        
        // The only characters that can follow a quote block are @ and period.
        if (state.followingQuoteBlock) {
            if (character != '@' && character != '.') {
                return NO;
            }
            
            state.followingQuoteBlock = NO;
        }
        
        switch (character) {
            case '@':
                
                if (state.domain) {
                    
                    // @ not allowed in the domain portion of the address
                    return NO;
                    
                } else if (state.quoted) {
                    
                    // Ignore @ signs when quoted
                    
                } else if (state.dot) {
                    
                    // Dots are not allowed as the final character in the local
                    // part
                    return NO;
                    
                } else {
                    
                    // Swapping to the domain portion of the address
                    state.domain = YES;
                    domainPartStart = i + 1;
                    
                    if (i > 64) {
                        
                        // Do not allow more than 64 characters in the local part
                        return NO;
                        
                    }
                }
                
                // No longer in dot/escape mode
                state.dot = NO;
                state.escaped = NO;
                
                break;
                
            case '(':
                
                // Comments only activate when not quoted or escaped
                if (!state.quoted && !state.escaped) {
                    ++commentDepth;
                }
                
                break;
                
            case ')':

                // Comments only activate when not quoted or escaped
                if (!state.quoted && !state.escaped) {
                    
                    if (commentDepth == 0) return NO;
                    
                    --commentDepth;
                }
                
                break;
                
            case '\\':
                
                if (!state.quoted && commentDepth == 0) {
                    
                    // Backslash isn't allowed outside of quote/comment mode
                    return NO;
                }
                    
                // Flip the escape bit to enter/exit escape mode
                state.escaped = !state.escaped;
                
                // No longer in dot mode
                state.dot = NO;
                
                break;
            
            case '"':
                
                if (state.domain && commentDepth == 0) {
                    
                    // quote not allowed in the domain portion of the address
                    // outside of a comment
                    return NO;
                }
                
                if (!state.escaped) {
                    
                    // Quotes are only allowed at the start of the local part,
                    // after a dot or to close an existing quote part
                    if (i == 0 || state.dot || state.quoted) {
                        
                        // Remember that we just left a quote block
                        if (state.quoted) {
                            state.followingQuoteBlock = YES;
                        }
                    
                        // Flip the quote bit to enter/exit quote mode
                        state.quoted = !state.quoted;
                    } else {
                        return NO;
                    }
                }
                
                // No longer in dot/escape mode
                state.dot = NO;
                state.escaped = NO;
                
                break;
            
            case '.':
    
                if (i == 0) {
                    
                    // Dots are not allowed as the first character of the local
                    // part
                    return NO;
                    
                } else if (i == domainPartStart) {
                    
                    // Dots are not allowed as the first character of the domain
                    // part
                    return NO;
                    
                } else if (i == candidate.length - 1) {
                    
                    // Dots are not allowed as the last character of the domain
                    // part
                    return NO;
                }
                
                if (!state.quoted) {
                    
                    if (state.dot) {
            
                        // Cannot allow adjacent dots
                        return NO;
                    } else {
                        
                        // Entering dot mode
                        state.dot = YES;
                    }
                    
                }
                    
                // No longer in escape mode
                state.escaped = NO;

                break;

            case ' ':
            case ',':
            case '[':
            case ']':
            case 1:
            case 2:
            case 3:
            case 4:
            case 5:
            case 6:
            case 7:
            case 8:
            case 9:
            case 11:
            case 13:
            case 15:

                // These characters can only appear when quoted
                if (!state.quoted) {
                    return NO;
                }
                
            // Deliberate fall-through from the cases above into the default
            // handling
            default:
                
                // No longer in dot/escape mode
                state.dot = NO;
                state.escaped = NO;

                // Only allow Unicode letters, numerals, hyphens and periods in
                // the domain part (periods are handled in their own case
                // above).  We use letterCharacterSet because we're supporting
                // internationalised domain names.  We don't have to do anything
                // special with the name; that's up to the email client/server
                // to handle.
                if (state.domain) {
                    if (![[NSCharacterSet letterCharacterSet] characterIsMember:character] &&
                        ![[NSCharacterSet decimalDigitCharacterSet] characterIsMember:character] &&
                        character != '-') {
                        
                        return NO;
                    }
                }
                
                break;
        }
    }
    
    // Do not allow unclosed comments
    if (commentDepth > 0) return NO;
    
    // If we didn't identify a local and a domain part the address isn't valid
    if (!state.domain) return NO;
    if (candidate.length == domainPartStart) return NO;
    if (domainPartStart == 1) return NO;
    
    // Validate domain name components
    NSArray *components = [[candidate substringFromIndex:domainPartStart] componentsSeparatedByString:@"."];
    
    for (NSString *item in components) {
        
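        // Reject empty components; this also guards the characterAtIndex:
        // calls below against an out-of-range exception
        if (item.length == 0) return NO;

        // We can't allow a hyphen as the first or last char in a domain name
        // component
        if ([item characterAtIndex:0] == '-' || [item characterAtIndex:item.length - 1] == '-') {
            return NO;
        }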
        
        // Items must not be longer than 63 chars
        if (item.length > 63) return NO;
    }

    return YES;
}

@end

2009-10-04

Networked PacMan and Embedding Video

Adding video to a blog not hosted on one of the main blog providers is a real pain. WordPress has about a dozen plugins that handle video, most of which haven’t been updated in a while and are tricky to configure. The worst offenders are the plugins that handle anything other than FLV format videos or streamed videos hosted on sites like YouTube. That’s a shame, because actually creating FLV videos is itself far more work than it should be. A simple plugin that played MOV or AVI files would be greatly appreciated.

I eventually decided upon the FLV Embed plugin. It seems to be reasonably up-to-date and is very easy to use. Unlike the YouTube streaming plugins, it lets me keep the video files hosted on my own server.

Once I’d installed the plugin I needed to get my video into FLV format. This, as previously mentioned, is a huge pain. My version of Flash is PPC-only and I don’t want to have to install Rosetta on my nice clean Snow Leopard install, so using Flash is out. FFmpegX depends on a binary that is also PPC-only (and hasn’t been updated in over a year) so that’s out too. All of the other FLV converters for both the Mac and Windows are either crap shareware applications, look like they contain malware, or both.

(As a side note, why is it that all of these crappy shareware programs insist on creating skinned UIs? They all look like the kind of shovelware that gets installed on new PCs that anyone with any sense uninstalls before they even attempt to use the computer.)

Anyway, the only FLV converter I found that was remotely useful was iVideoConverter. It is a shareware app, but it’s got a good UI and it actually works, which is more than I can say for the rest. As it’s just a front-end for ffmpeg, you’d expect nothing less from it. I nearly resorted to using the HTML5 video tag, but since the browser manufacturers refuse to agree on a video standard, it’s rather pointless. (EDIT: After some testing, it seems that Safari will play Ogg Theora - though that may be because I’ve got all sorts of video players installed, including Perian - whilst Firefox refuses to play anything at all, despite me having configured the appropriate MIME types.)

Anyway, the whole purpose of this waste of a weekend was to post a video of a project I wrote a while ago. I created a networked version of PacMan as part of a five-week group project for my master’s degree. The project as it was submitted was rather larger: it had a second game and a database behind it. I’ve stripped out just about everything that I didn’t personally write (barring the XML config file stuff) and the database (simply to make it easier to distribute).

In this version, the first player to connect to the server plays as PacMan; the other players control the ghosts. By default, it supports 2, 4, 6 or 8-player games, but the code itself will support any number of players. Here’s the video of it in action; it shows two networked clients running on the same host:

The game can be downloaded here:

PacMan

This includes the NetBeans project with full source code, plus Java binaries and instructions on how to run them. Note that in order to play the game you need to have JVM 1.5 or later installed.