2012-05-01

Email Address Validation For iOS

If you’ve ever needed to validate an email address, chances are you used a regex engine and validated the address’s structure against something like this:

^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$

It works well enough for the average case. But what about this email address?

joe+bloggs@example.com

Not so good. You could modify the regex to support the “+” character, but if you read RFC 822 and RFC 5321 you’ll quickly discover that the structure of an email address isn’t just “alphanumeric@alphanumeric.alphanumeric”. Email addresses include a built-in comment system using parentheses, escape characters, quoted sections, “dot-atoms” and more. A regex to correctly parse the entirety of RFC 822 would be huge, and that’s the older, simpler version of an email address.
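The rejection is easy to reproduce. The sketch below runs the pattern from the top of this post through NSRegularExpression. The function name SZMatchesSimplePattern is made up for this illustration, and the pattern is used exactly as written (lowercase only), so treat it as a demonstration rather than a recommended check.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: returns YES when the whole candidate matches the
// simple pattern quoted at the top of this post.
static BOOL SZMatchesSimplePattern(NSString *candidate) {
    NSString *pattern =
        @"^[_a-z0-9-]+(\\.[_a-z0-9-]+)*@[a-z0-9-]+(\\.[a-z0-9-]+)*(\\.[a-z]{2,4})$";

    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:0
                                                    error:NULL];

    // The pattern is anchored at both ends, so there is either one match
    // covering the whole string or no match at all.
    NSRange whole = NSMakeRange(0, candidate.length);
    return [regex numberOfMatchesInString:candidate options:0 range:whole] == 1;
}
```

Calling SZMatchesSimplePattern(@"joe@example.com") returns YES, while SZMatchesSimplePattern(@"joe+bloggs@example.com") returns NO because “+” is not in the first character class.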

Once you start really digging into the syntax of an email address, you find that:

  • Domain names (the part after the “@”) can now be in character sets other than Latin, so your primitive [a-z][0-9] regex won’t work;
  • The local part of the address (the part before the “@”) can also now be in other character sets, but as this idea is about 2 months old nothing supports it yet;
  • Trying to figure out if Exchange supports internationalised domain names (IDN) or not is basically guesswork;
  • There are multiple standards for email addresses, most of which disagree with each other and most of which appear to be deprecated. Figuring out which is the current standard is like trying to find a particular Skittle in a bag of Skittles.

Pretty much the only fact you’ll be entirely sure of after your research is that regex is a really bad solution for validating email addresses.

My favourite part of RFC 3696 is where it says these addresses are valid:

Fred\ Bloggs@example.com
"Fred Bloggs"@example.com

This makes sense. The backslash is used to escape a single character, whereas the quotes are used to signify that everything between them should be automatically escaped. One’s really a shortcut for the other. Simple.

However, the errata suggest that this is a mistake, and that this is the correct form:

"Fred\ Bloggs"@example.com

Now we have a backslash system embedded within a quoting system. The former makes the latter entirely superfluous. I think this amendment is probably why so many developers just stick with regex: The standard is crap.
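To make the overlap concrete, here is a minimal sketch of how a quoted local part could be decoded, assuming the usual reading that a backslash inside quotes simply stands for the character that follows it. SZUnquotedLocalPart is a hypothetical helper written for this illustration and is not part of the validator.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the text a quoted local part stands for, or
// nil if the input is not wrapped in a matching pair of double quotes.
static NSString *SZUnquotedLocalPart(NSString *quoted) {
    NSUInteger length = quoted.length;

    if (length < 2 ||
        [quoted characterAtIndex:0] != '"' ||
        [quoted characterAtIndex:length - 1] != '"') {
        return nil;
    }

    NSMutableString *result = [NSMutableString string];
    BOOL escaped = NO;

    // Walk the characters between the outer quotes
    for (NSUInteger i = 1; i < length - 1; ++i) {
        unichar character = [quoted characterAtIndex:i];

        if (!escaped && character == '\\') {
            // A backslash escapes the next character and is itself dropped
            escaped = YES;
            continue;
        }

        if (!escaped && character == '"') {
            // An unescaped quote cannot appear inside the quoted section
            return nil;
        }

        [result appendFormat:@"%C", character];
        escaped = NO;
    }

    // A trailing backslash would have escaped the closing quote
    return escaped ? nil : result;
}
```

Under that assumption, both "Fred Bloggs" and "Fred\ Bloggs" (quotes included) decode to the same eleven characters, Fred Bloggs, which is why the backslash adds nothing once the quotes are there.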

So, you have two choices:

  1. Ignore the complexity, do what everyone else does, and use regex regardless;
  2. Write an email address parser.

As I’m crazy, here’s an email address parser written in Objective-C. It has some limitations:

  • It does not allow the local part of the address to be internationalised as nothing supports this yet.
  • It does not allow domain name literals (e.g. joe@[10.0.1.2]) as Exchange does not support them and I’ve never seen one in the wild.
  • It may not apply all of the backslash escape rules correctly, as this is one area where every document entirely disagrees with every other document.
  • It may not enforce the omission of certain ASCII ranges correctly, as this is another area where the docs suck.

On the plus side:

  • It validates comments and nested comments (comments are apparently deprecated, but I can’t find anywhere that specifically says that this is the case).
  • It correctly enforces at least some of the backslash escape rules.
  • It allows the use of quoted sections (that can include characters like “@”, “(”, etc) and enforces their rules (quotes must be matched; quoted sections must be the only element of the local part or must be delimited by periods).
  • It enforces period rules (periods cannot be the first or last part of the local or domain parts of the address; periods cannot be adjacent except within a quoted section).
  • It enforces the maximum and minimum lengths of each section (up to 64 chars for the local part; up to 254 chars for the entire address; there must be a local and domain part of the address).
  • It allows IDN.
  • Domain names can only contain letters, numbers, periods and hyphens.
  • It passes all of the test cases written by Dominic Sayers except those concerning carriage/line returns and domain name literals, which aren’t relevant to what I’m trying to do.

// .h
#import <Foundation/Foundation.h>

@interface SZEmailValidator : NSObject

+ (BOOL)isValid:(NSString *)candidate;

@end

// .m
#import "SZEmailValidator.h"

struct SZEmailParserState {
    BOOL quoted : 1;
    BOOL escaped : 1;
    BOOL domain : 1;
    BOOL dot : 1;
    BOOL followingQuoteBlock : 1;
};

@implementation SZEmailValidator

+ (BOOL)isValid:(NSString *)candidate {

    unsigned int domainPartStart = 0;
    unsigned int commentDepth = 0;
    
    struct SZEmailParserState state;
    
    state.dot = NO;
    state.quoted = NO;
    state.escaped = NO;
    state.followingQuoteBlock = NO;
    state.domain = NO;
    
    for (unsigned int i = 0; i < candidate.length; ++i) {
        unichar character = [candidate characterAtIndex:i];
                
        if (!state.domain) {
            
            // Do not allow characters beyond the ASCII set in the username
            if (character > 126) return NO;
             
            // Do not allow NULL
            if (character == 0) return NO;
            
            // Do not allow LF
            if (character == 10) return NO;
        }
        
        if (i > 253) {
            
            // Do not allow more than 254 characters in the entire address
            return NO;
        }
        
        // The only characters that can follow a quote block are @ and period.
        if (state.followingQuoteBlock) {
            if (character != '@' && character != '.') {
                return NO;
            }
            
            state.followingQuoteBlock = NO;
        }
        
        switch (character) {
            case '@':
                
                if (state.domain) {
                    
                    // @ not allowed in the domain portion of the address
                    return NO;
                    
                } else if (state.quoted) {
                    
                    // Ignore @ signs when quoted
                    
                } else if (state.dot) {
                    
                    // Dots are not allowed as the final character in the local
                    // part
                    return NO;
                    
                } else {
                    
                    // Swapping to the domain portion of the address
                    state.domain = YES;
                    domainPartStart = i + 1;
                    
                    if (i > 64) {
                        
                        // Do not allow more than 64 characters in the local part
                        return NO;
                        
                    }
                }
                
                // No longer in dot/escape mode
                state.dot = NO;
                state.escaped = NO;
                
                break;
                
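            case '(':
                
                // Comments only activate when not quoted or escaped
                if (!state.quoted && !state.escaped) {
                    ++commentDepth;
                }
                
                // An escape only applies to the character that follows it
                state.escaped = NO;
                
                break;
                
            case ')':

                // Comments only close when not quoted or escaped
                if (!state.quoted && !state.escaped) {
                    
                    if (commentDepth == 0) return NO;
                    
                    --commentDepth;
                }
                
                // An escape only applies to the character that follows it
                state.escaped = NO;
                
                break;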
                
            case '\\':
                
                if (!state.quoted && commentDepth == 0) {
                    
                    // Backslash isn't allowed outside of quote/comment mode
                    return NO;
                }
                    
                // Flip the escape bit to enter/exit escape mode
                state.escaped = !state.escaped;
                
                // No longer in dot mode
                state.dot = NO;
                
                break;
            
            case '"':
                
                if (state.domain && commentDepth == 0) {
                    
                    // quote not allowed in the domain portion of the address
                    // outside of a comment
                    return NO;
                }
                
                if (!state.escaped) {
                    
                    // Quotes are only allowed at the start of the local part,
                    // after a dot or to close an existing quote part
                    if (i == 0 || state.dot || state.quoted) {
                        
                        // Remember that we just left a quote block
                        if (state.quoted) {
                            state.followingQuoteBlock = YES;
                        }
                    
                        // Flip the quote bit to enter/exit quote mode
                        state.quoted = !state.quoted;
                    } else {
                        return NO;
                    }
                }
                
                // No longer in dot/escape mode
                state.dot = NO;
                state.escaped = NO;
                
                break;
            
            case '.':
    
                if (i == 0) {
                    
                    // Dots are not allowed as the first character of the local
                    // part
                    return NO;
                    
                } else if (i == domainPartStart) {
                    
                    // Dots are not allowed as the first character of the domain
                    // part
                    return NO;
                    
                } else if (i == candidate.length - 1) {
                    
                    // Dots are not allowed as the last character of the domain
                    // part
                    return NO;
                }
                
                if (!state.quoted) {
                    
                    if (state.dot) {
            
                        // Cannot allow adjacent dots
                        return NO;
                    } else {
                        
                        // Entering dot mode
                        state.dot = YES;
                    }
                    
                }
                    
                // No longer in escape mode
                state.escaped = NO;

                break;

            case ' ':
            case ',':
            case '[':
            case ']':
            case 1:
            case 2:
            case 3:
            case 4:
            case 5:
            case 6:
            case 7:
            case 8:
            case 9:
            case 11:
            case 13:
            case 15:

                // These characters can only appear when quoted
                if (!state.quoted) {
                    return NO;
                }
                
                // Deliberately no break: fall through to the default handling
                
            default:
                
                // No longer in dot/escape mode
                state.dot = NO;
                state.escaped = NO;

                // Only allow Unicode letters, numerals and hyphens in the
                // domain part (periods are handled in their own case above).
                // We use letterCharacterSet because we're supporting
                // internationalised domain names. We don't have to do anything
                // special with the name; that's up to the email client/server
                // to handle.
                if (state.domain) {
                    if (![[NSCharacterSet letterCharacterSet] characterIsMember:character] &&
                        ![[NSCharacterSet decimalDigitCharacterSet] characterIsMember:character] &&
                        character != '-') {
                        
                        return NO;
                    }
                }
                
                break;
        }
    }
    
    // Do not allow unclosed comments
    if (commentDepth > 0) return NO;
    
    // If we didn't identify a local and a domain part the address isn't valid
    if (!state.domain) return NO;
    if (candidate.length == domainPartStart) return NO;
    if (domainPartStart == 1) return NO;
    
    // Validate domain name components
    NSArray *components = [[candidate substringFromIndex:domainPartStart] componentsSeparatedByString:@"."];
    
    for (NSString *item in components) {
        
        // An empty component has no first or last character to inspect
        if (item.length == 0) return NO;
        
        // We can't allow a hyphen as the first or last char in a domain name
        // component
        if ([item characterAtIndex:0] == '-' || [item characterAtIndex:item.length - 1] == '-') {
            return NO;
        }
        
        // Items must not be longer than 63 chars
        if (item.length > 63) return NO;
    }

    return YES;
}

@end
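
As a rough usage sketch, the function below feeds a handful of addresses through the validator and logs the verdict for each. It assumes the class above is compiled into the same target; SZLogSampleResults and the sample addresses are made up for this illustration.

```objective-c
#import <Foundation/Foundation.h>
#import "SZEmailValidator.h"

// Hypothetical helper: logs the validator's verdict for a few addresses.
static void SZLogSampleResults(void) {
    NSArray *samples = @[
        @"joe+bloggs@example.com",       // plus sign in the local part
        @"\"Fred Bloggs\"@example.com",  // quoted section containing a space
        @"joe(comment)@example.com",     // comment in the local part
        @"joe@com",                      // top-level domain only
        @"Fred Bloggs@example.com",      // unquoted space
        @"joe..bloggs@example.com",      // adjacent periods
        @".joe@example.com",             // leading period
        @"joe@-example.com"              // hyphen starts a domain component
    ];

    for (NSString *address in samples) {
        NSLog(@"%@ -> %@", address,
              [SZEmailValidator isValid:address] ? @"valid" : @"invalid");
    }
}
```

Tracing the parser by hand, the first four addresses should come back valid and the last four invalid.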

Comments

Bob on 2013-02-03 at 06:15 said:

Thanks!

Anand Agarwal on 2013-03-27 at 06:35 said:

Thanks for this post. I have one doubt: this validator is allowing the aa@aaa format too. It is treating aa@aaa as a valid email address even though there is no period (.) in the domain part. As far as I know, there should be at least one period in the domain part of a valid email address. Please clarify.

Ant on 2013-03-27 at 16:44 said:

The official email spec (as close as there is to an official spec) says that you can send email to a top-level domain. Joe@com is therefore a valid email address.

Ant on 2013-03-27 at 18:27 said:

If you do want to ensure that you adhere to the popular spec, rather than the official spec, you could add this at line 275 of SZEmailValidator.m:

if (components.count < 2) return NO;

This line will ensure that the email address has at least 2 dot atoms in the domain name part.

Anand Agarwal on 2013-03-28 at 07:02 said:

thanks Ant for clarification

Anand Agarwal on 2013-04-05 at 14:37 said:

Could you please let me know what changes need to be made to validate internationalised email addresses? I mean accepting Unicode characters (e.g. ü) in the email address.

Ant on 2013-04-05 at 17:39 said:

Near the top of the isValid: method are these lines:

// Do not allow characters beyond the ASCII set in the username
if (character > 126) return NO;

Delete that check to allow internationalised characters in the local part of the address (i.e. the bit that comes before the @ symbol). The parser already allows for internationalised domain names (the bit after the @ symbol).

Anand Agarwal on 2013-04-11 at 08:20 said:

Ant, I removed the lines given below:

// Do not allow characters beyond the ASCII set in the username
if (character > 126) return NO;

It started allowing international characters in the local part. You said the parser already allows international characters in the domain part. I checked and it is not allowing them.

if (![[NSCharacterSet letterCharacterSet] characterIsMember:character] &&
    ![[NSCharacterSet decimalDigitCharacterSet] characterIsMember:character] &&
    character != '-') {

    return NO;
}

Please suggest what change I need to make in this code so that it starts allowing international characters in the domain part too.

Ant on 2013-04-13 at 14:11 said:

Just tested it here and it does allow IDN. Here’s an example test:

NSString *address = @"bob@Bücher.ch";
NSLog(@"%@: %@", [SZEmailValidator isValid:address] ? @"Yes" : @"No", address);

If you’re having problems, post the address you’re having problems with and I’ll take a look.