2012-12-04

More Barcode Disasters

Here’s a follow-up to my last rant about driving licence barcode standards that aren’t standard.

Header

The barcode header is supposed to be “@\n\x1e\rANSI ” (note the trailing space). If the first nine characters of the decoded data equal this string, we’ve got a valid driving licence barcode. South Carolina’s driving licences, however, have the file separator character (ASCII 0x1c) as the third byte instead of the record separator character (ASCII 0x1e) defined by the standard.
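A tolerant header check might look something like the sketch below. It is only an illustration in Python: the function name `has_valid_header` and the set of tolerated separators are my own inventions, and the only deviation it knows about is the South Carolina one described above.

```python
EXPECTED_HEADER = "@\n\x1e\rANSI "

# Byte three should be the record separator (0x1e). South Carolina uses
# the file separator (0x1c) instead, so both are accepted here.
TOLERATED_SEPARATORS = {"\x1e", "\x1c"}


def has_valid_header(data: str) -> bool:
    """Return True if data starts with the 9-character header.

    The third character may be either 0x1e (standard) or 0x1c
    (South Carolina); every other character must match exactly.
    """
    if len(data) < len(EXPECTED_HEADER):
        return False
    head = data[: len(EXPECTED_HEADER)]
    return (
        head[:2] == EXPECTED_HEADER[:2]
        and head[2] in TOLERATED_SEPARATORS
        and head[3:] == EXPECTED_HEADER[3:]
    )
```

Anything else in the third position is still rejected, so the check stays as strict as it can be while letting the known offender through.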

ZIP Codes

US zip codes come in two formats. The first is the standard 5-digit code, such as “90210”. This was found to be insufficiently accurate, so a “+4” extension is often tagged on the end, giving the format “90210-1234”.

In the first version of the DLID spec, the zip code field was 11 characters long. If the zip code didn’t fill the entire 11 characters, the extra places were padded with spaces.

Consider the file format. Each record in the file is split into two parts: an identifier, which is a 3-character header (“DAQ”, “DBC”, etc) that indicates what the data represents; and the data itself (“JOE”, “BLOGGS”, etc). Records are separated by line breaks. If the data has separators, why does the zip code field have a fixed width? Most (all?) of the other fields are variable width.
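To make the point concrete, here is a rough Python sketch of how the records split apart. It assumes the header and anything else that precedes the records has already been stripped off, and the name `parse_records` and the sample values are made up for illustration.

```python
from typing import Dict


def parse_records(body: str) -> Dict[str, str]:
    """Split line-separated records into {identifier: data}.

    Each record is a 3-character identifier followed by its data.
    The data is returned untouched, padding included.
    """
    records: Dict[str, str] = {}
    for line in body.split("\n"):
        line = line.rstrip("\r")  # tolerate CRLF line breaks
        if len(line) < 3:
            continue  # too short to hold an identifier
        records[line[:3]] = line[3:]
    return records


# Made-up sample data, not taken from a real licence.
sample = "DAQ12345\nDBC1\n"
print(parse_records(sample))
```

Because the line break already marks the end of each record, the parser never needs to know how wide a field is supposed to be.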

Assuming there’s a reason for the field to be fixed-width, you should be able to see immediately that the spec is still broken. If all zip codes are at most 10 characters long, why does the field allow for 11 characters? Even 10 characters is too long, because the separator is redundant. If the field is 5 characters long, a parser can infer that it has no +4 extension. If it is 9 characters long, a parser can infer that it has the extension and split it up accordingly.

Version 3 of the spec tried to rectify the situation. The field was shortened to 9 characters, but this time zeros were used as padding instead of spaces. The upshot is that every parser must extract the zip and +4 sections by dividing up based on expected data lengths (5 and 4 respectively) and then ditch the extension if it is equal to “0000”. Why not just make the field a variable width? Why not pad it with spaces that can be trimmed without potentially losing a trailing zero in the zip?
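The dance every parser has to perform for the version 3 field looks roughly like this. It is a Python sketch with a function name of my own choosing, and it assumes the field has already been extracted from its record.

```python
import re
from typing import Optional, Tuple


def split_v3_zip(field: str) -> Tuple[str, Optional[str]]:
    """Split a 9-digit, zero-padded zip field into (zip, +4 extension).

    The extension is returned as None when it is the "0000" padding.
    """
    if re.fullmatch(r"[0-9]{9}", field) is None:
        raise ValueError(f"expected 9 digits, got {field!r}")
    zip5, plus4 = field[:5], field[5:]
    return zip5, (None if plus4 == "0000" else plus4)
```

Note that the split has to be positional. Stripping trailing zeros instead would mangle any zip that genuinely ends in a zero.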

There is no documentation as to how to format the field in versions 1 and 2 of the DLID spec. Thus, Colorado just uses the 5-character zip. South Carolina uses both the zip and the extension and smooshes them together into one 9-character string padded with two spaces. Massachusetts includes both sections of the zip separated by a hyphen and pads with one space. Who knows what the other states do.
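A parser that copes with the three variants I know about might look like the Python sketch below. The function name and the sample zips are made up, and it makes no promises about states other than the three mentioned above.

```python
import re
from typing import Optional, Tuple

# Five digits, optionally followed by a 4-digit extension that may or
# may not be preceded by a hyphen.
_ZIP_PATTERN = re.compile(r"([0-9]{5})(?:-?([0-9]{4}))?")


def parse_legacy_zip(field: str) -> Tuple[str, Optional[str]]:
    """Parse a space-padded version 1 or 2 zip field into (zip, +4)."""
    match = _ZIP_PATTERN.fullmatch(field.strip(" "))
    if match is None:
        raise ValueError(f"unrecognised zip field: {field!r}")
    return match.group(1), match.group(2)


# Made-up examples of the three layouts, each padded to 11 characters.
print(parse_legacy_zip("80120      "))  # 5-character zip only
print(parse_legacy_zip("294011234  "))  # zip and extension run together
print(parse_legacy_zip("02101-1234 "))  # zip and extension with a hyphen
```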

At version 3 of the spec, the standard embraced the Canadians. Canadians have a 6-character post code that looks just like the UK standard. There is no documentation anywhere as to which padding character, if any, is used when representing these post codes, nor whether their two 3-character sections should be divided by a space.

Names

The 7 versions of the DLID spec include 3 ways of storing names. They started out with a single record that stored a comma-delimited list of names in the format “LAST,FIRST,MIDDLE,…”. Colorado, being unique and special, uses the format “FIRST,MIDDLE,…,LAST”.
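Handling both orderings of the version 1 name field could be sketched in Python like this. The `Name` type, the function name and the `last_name_first` flag are all hypothetical; the caller has to decide which ordering applies, presumably based on the issuing state.

```python
from typing import List, NamedTuple


class Name(NamedTuple):
    first: str
    middle: List[str]
    last: str


def parse_v1_name(field: str, last_name_first: bool = True) -> Name:
    """Parse a comma-delimited version 1 name field.

    last_name_first=True handles "LAST,FIRST,MIDDLE,...";
    last_name_first=False handles "FIRST,MIDDLE,...,LAST".
    """
    parts = [p.strip() for p in field.split(",") if p.strip()]
    if len(parts) < 2:
        raise ValueError("expected at least a first and a last name")
    if last_name_first:
        last, first, middle = parts[0], parts[1], parts[2:]
    else:
        first, middle, last = parts[0], parts[1:-1], parts[-1]
    return Name(first, middle, last)
```

With this, `parse_v1_name("BLOGGS,JOE,ALAN")` and `parse_v1_name("JOE,ALAN,BLOGGS", last_name_first=False)` produce the same result.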

Presumably to prevent this foolishness, the standards body changed this in the second version of the spec. This version included a standalone “last name” field and a field for other names in the format “FIRST,MIDDLE,…”. Actually, that’s not strictly true; the documented format is “FIRSTxMIDDLEx…”, where “x” is an undocumented separator. Wisconsin used a space whilst Virginia used a comma.
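With the separator undocumented, about the best a parser can do is accept both of the ones seen in the wild. A minimal Python sketch, with a function name of my own:

```python
import re
from typing import List


def split_given_names(field: str) -> List[str]:
    """Split a version 2 "other names" field on spaces or commas.

    Caveat: a given name that itself contains a space will be
    split in two, because a space is one of the accepted separators.
    """
    return [name for name in re.split(r"[ ,]+", field.strip()) if name]
```

The caveat in the docstring is the price of guessing: without a documented separator there is no way to tell a two-word name from two names.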

The fourth version finally seems to have fixed it. Names are divided into three fields: “first”, “last” and “middle”, where “middle” can contain multiple comma-separated names. Documentation at last!

Social Security Numbers

Version 1 of the spec optionally allowed states to include their drivers’ social security numbers on their licences. Careful with that licence, now…

Gender

Version 1 of the spec allowed gender to be expressed using 6 possible values: M, F, 0, 1, 2 and 9. “M” and “F” are self-explanatory. The others are pulled from the ANSI-D20 gender codes, in which the values mean “Unknown”, “Male”, “Female” and “Not specified” respectively. Obviously two ways of representing the same piece of data is better than one. Version 2 dumped all but values “1” and “2”. I imagine that the standards body figured that, if they were going to allow someone to be in control of a 26,000lb vehicle, they should take enough of an interest in the driver to know his or her gender.
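Collapsing the six version 1 values into something sane is at least straightforward. This Python sketch uses output labels of my own choosing; the mapping itself is the one described above.

```python
GENDER_CODES = {
    "M": "male",
    "F": "female",
    "0": "unknown",
    "1": "male",
    "2": "female",
    "9": "not specified",
}


def normalise_gender(value: str) -> str:
    """Map any of the six version 1 gender values to a single label."""
    key = value.strip().upper()
    if key not in GENDER_CODES:
        raise ValueError(f"unrecognised gender value: {value!r}")
    return GENDER_CODES[key]
```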

2012-11-30

Parsing US Driving Licence Barcodes

Most states in the US and some Canadian provinces include a PDF417 barcode on the back of their driving licences. The barcode contains a host of information about its owner, such as names, address, height, weight, eye colour, date of birth, etc. There are currently 7 different versions of the standard, which you can download here (click on the “Documentation” tab):

Unfortunately, the standards are full of breathtakingly stupid mistakes. Dates are currently my favourite.

This is the date format used in version 1:

yyyymmdd

That’s one of the ISO-8601 standards for representing a date.

In version 2 they switched to this:

mmddyyyy

That’s the standard US way of representing dates (I like to think of them as “lumpy” dates, because the format goes “medium-small-large”, whereas ISO dates are big-endian). I have no idea why they did this. I presume they got a lot of complaints from Americans who were stumped by the unusual date format whilst decoding the PDF417 barcodes with nothing more sophisticated than their eyes. Any automated parser would naturally re-format the date into the local standard, so they must have been doing it manually. An impressive skill.

In version 3 the Canadians decided to get in on the barcode action. Canadians use the big-endian date format, so the spec now states that date fields can store the dates in one of two ways:

yyyymmdd
mmddyyyy

Any parser needs to check the licence’s country code before it can parse dates. Not only does this version of the spec introduce a new standard, but it contains multiple standards within a single field.
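Put together, a date parser ends up needing both the spec version and the country. The Python sketch below is only that: the function name and the `is_canadian` flag are hypothetical (how the country is read from the barcode isn't covered here), and it assumes versions after 3 behave like version 3.

```python
from datetime import date, datetime


def parse_dlid_date(field: str, spec_version: int, is_canadian: bool = False) -> date:
    """Parse an 8-digit date field according to the rules above.

    Version 1 uses yyyymmdd and version 2 uses mmddyyyy. From version 3
    onwards (an assumption for versions after 3), Canadian licences use
    yyyymmdd and US licences use mmddyyyy.
    """
    if spec_version == 1 or (spec_version >= 3 and is_canadian):
        fmt = "%Y%m%d"
    else:
        fmt = "%m%d%Y"
    return datetime.strptime(field, fmt).date()
```

The same calendar date therefore arrives as “19800704” on a version 1 or Canadian licence and as “07041980” on a version 2 or later US one.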

Wow.

2012-01-10

Blogging with Mercurial from the Rocky Mountains

It’s been a while since the last update, but I do have a good excuse. I’ve upped sticks and moved from dingy Birmingham, in the UK, to sunny Littleton, Colorado, in the USA.

So, who’s hiring?

I’ve decided to start blogging about those experiences, but rather than flood this blog with personal posts I’m making a separate blog. But what blogging platform to use? It’s not going to have much traffic, I don’t want to put any admin time into it, I don’t want comments or tags or categories. All I want is to be able to post text as quickly as possible using Markdown.

It would be great if there were a blogging platform for Mercurial like the one available to GitHub users, but I’ve looked around and haven’t been able to find one. So why not write one?

Here’s my list of requirements:

  • Displays blog posts in reverse chronological order (newest first).
  • Posts are written in Markdown and reformatted into HTML automatically.
  • Posts are stored in a repository on BitBucket.
  • Pushing to the posts repository automatically updates the blog.
  • Allows paging through blog posts.
  • Includes an “Archive” page that lists all blog posts.
  • Has an RSS feed.
  • Maintains a cache of blog posts to minimise queries to BitBucket’s API.
  • Looks a bit like WordPress’ “TwentyEleven” theme, as that’s my favourite theme at the moment.

After a couple of evenings of hacking, here’s the result:

And here’s a demo:

A blog powered by BitBucket and Mercurial. Neat!

I’ve included instructions for getting a BitBlogger instance up and running on AppHarbor, a .NET hosting site that offers a basic freemium option. As BitBlogger doesn’t use any local storage or background events (BitBucket notifies it of changes, rather than BitBlogger needing to poll for updates), it doesn’t need anything above the free package.

I’ve copied bits and pieces of the TwentyEleven theme’s CSS file in order to replicate some of the appearance, so the licence for this application is the GPL rather than my preferred MIT licence. If anyone feels like putting together a simple and very legible theme to replace the current look and feel, I’ll be able to switch it over.