2012-12-04

More Barcode Disasters

Here’s a follow-up to my last rant about driving licence barcode standards that aren’t standard.

Header

The barcode header is supposed to be “@\n\x1e\rANSI “. If the first nine characters of the decoded data equal this string, we’ve got a valid driving licence barcode. South Carolina’s driving licences have the file separator character (ASCII 0x1c) as the third byte instead of the record separator character (ASCII 0x1e) as defined by the standard.

ZIP Codes

US zip codes come in two formats. The first is the standard 5-digit code, such as “90210”. This was found to be insufficiently accurate, so a “+4” extension is often tagged on the end giving the format “90120+1234”.

In the first version of the DLID spec, the zip code field was 11 characters long. If the zip code didn’t fill the entire 11 characters, the extra places were padded with spaces.

Consider the file format. Each record in the file is split into two parts: an identifier, which is a 3-character header (“DAQ”, “DBC”, etc) that indicates what the data represents; and the data itself (“JOE”, “BLOGGS”, etc). Records are separated by line breaks. If the data has separators, why does the zip code field have a fixed width? Most (all?) of the other fields are variable width.

Assuming there’s a reason for the field to be fixed-width, you should be able to see immediately that the spec is still broken. If all zip codes are at most 10 characters long, why does the field allow for 11 characters? Even 10 characters is too long. If the field is 5 characters long, a parser can infer that it has no +4 extension. If it is 9 characters long a parser can infer that it has the extension and split it up accordingly.

Version 3 of the spec tried to rectify the situation. The field was shortened to 9 characters, but this time zeros were used as padding instead of spaces. The upshot is that every parser must extract the zip and +4 sections by dividing up based on expected data lengths (5 and 4 respectively) and then ditch the extension if it is equal to “0000”. Why not just make the field a variable width? Why not pad it with spaces that can be trimmed without potentially losing a trailing zero in the zip?

There is no documentation as to how to format the field in versions 1 and 2 of the DLID spec. Thus, Colorado just uses the 5-character zip. South Carolina uses both the zip and the extension and smooshes them together into one 9-character string padded with two spaces. Massachusetts includes both sections of the zip separated by a hyphen and pads with one space. Who knows what the other states do.

At version 3 of the spec, the standard embraced the Canadians. Canadians have a 6-character post code that looks just like the UK standard. There is no documentation anywhere as to which padding character is used when representing these post codes or their format, if indeed one is used, nor if the post codes should have their two 3-character sections divided by a space or not.

Names

The 7 versions of the DLID spec include 3 ways of storing names. They started out with a single record that stored a comma-delimited list of names in the format “LAST,FIRST,MIDDLE,…”. Colorado, being unique and special, uses the format “FIRST,MIDDLE,…,LAST”.

Presumably to prevent this foolishness, the standards body changed this in the second version of the spec. This version included a standalone “last name” field and a field for other names in the format “FIRST,MIDDLE,…”. Actually, that’s not strictly true; the documented format is “FIRSTxMIDDLEx…”, where “x” is an undocumented separator. Wisconsin used a space whilst Virginia used a comma.

The fourth version finally seems to have fixed it. Names are divided into three fields: “first”, “last” and “middle”, where “middle” can contain multiple comma-separated names. Documentation at last!

Social Security Numbers

Version 1 of the spec optionally allowed states to include their drivers’ social security numbers on their licences. Careful with that licence, now…

Gender

Version 1 of the spec allowed gender to be expressed using 6 possible values: M, F, 0, 1, 2 and 9. “M” and “F” are self-explanatory. The others are pulled from the ANSI-D20 gender codes, in which the values mean “Unknown”, “Male”, “Female” and “Not specified” respectively. Obviously two ways of representing the same piece of data is better than one. Version 2 dumped all but values “1” and “2”. I imagine that the standards body figured that, if they were going to allow someone to be in control of a 26,000lb vehicle, they should take enough of an interest in the driver to know his or her gender.