2013-05-28

Diffing NSManagedObjects Part 2

When I last discussed comparing two NSManagedObjects and producing a diff, I said that I’d come back to a few issues later. It turned out to be much later than expected; this post has been floating around on my hard disk for 6 months.

This time around I’ll deal with two of those issues:

  • What do you do if you want to include a related object in the diff if it has its deletion rule set to nullify?
  • What do you do if you want to exclude an attribute or relationship from the diff?

The answers are “whitelist” and “blacklist”.

In the previous post, we used deletion rules to signify object ownership. This works well, but means that we can’t automatically include crucial information from a relationship if the deletion rule indicates a lack of ownership.

For example, imagine we’re a dogfood company and have a Dog class and a Person class:

  • Person (User info: @{ @“id”: @“guid” })

    • Attributes:
      • guid
      • name
      • age
    • Relationships:
      • dog (Dog) (Delete: cascade) (To one)
  • Dog (User info: @{ @“id”: @“guid” })

    • Attributes:
      • guid
      • name
    • Relationships:
      • owner (Person) (Delete: nullify) (To one)

If we delete a person we’ll also delete his dog. We don’t care about the dog if we no longer track the person, presumably because we can’t sell anything directly to a dog. On the other hand, if we delete the dog we don’t necessarily want to delete the person. The person could get a new dog.

If a dog changes owner, it’d be handy if we could represent that information in a diff. Unfortunately, the deletion rule will prevent the automated diff methods from examining this relationship. We need to whitelist this relationship to ensure that it is included in the diff.

The whitelist looks like this:

NSDictionary *whitelistedRelationships = @{
@"Dog": @[ @"owner" ]
};

Each key in the dictionary represents the name of a class. Each value represents an array of properties of that class. In this case, the whitelist says, “If the diff creator encounters an instance of the Dog class, include the owner relationship in the diff regardless of the deletion rule”.

There are two issues here. Firstly, we’re hard-coding information that should really be metadata in the data model. Unfortunately, there’s not much we can do about that if we want to be able to produce a variety of different diffs that can be started in multiple places within the object graph.

The second issue is that we’ve re-introduced a cycle into the graph. The diff creator will repeatedly walk from the Dog to its owner and back, eventually causing a stack overflow. To get around that, we need to introduce the concept of a blacklist. Anything in a blacklist will not be examined. Here’s what the blacklist would look like in this situation:

NSDictionary *blacklistedProperties = @{
@"Owner": @[ @"dog" ]
};

The dictionary format is that same as the blacklist: each key is the name of a class, whilst each value is an array of attribute and relationship names that should be ignored by the diff creator. This particular blacklist says, “Do not traverse the dog relationship of any Owner instances”. This would be counter-productive if we were starting the diff from the owner, due to the direction of the relationship established by the deletion rule, but if we start from the dog it is very useful.

Now that we’ve decided upon workable solutions, we just need to implement them. We’ll start off with a method that can retrieve all of the property names for an NSManagedObject instance that exist in a blacklist or whitelist:

+ (NSArray *)propertyNamesForManagedObject:(NSManagedObject *)managedObject inList:(NSDictionary *)list {

    NSMutableArray *output = [NSMutableArray array];

    for (NSString *className in [list allKeys]) {
        Class class = NSClassFromString(className);

        if ([managedObject isKindOfClass:class]) {
            [output addObjectsFromArray:list[className]];
        }
    }

    return output;
}

We’ve introduced a new metaprogramming technique here: getting a pointer to a metaclass using the NSClassFromString() function. As the name suggests, we can get a class definition from a string representing its name. We can use that to determine the properties of the class.

This function will eventually give us a list of all property names of object managedObject that match a name in the list dictionary. For example, if we passed it an Owner object and the blacklist defined above, we’d get the following output:

@[ @"dog" ]

We know that the dog property of our Owner object is blacklisted and shouldn’t be traversed.

Next we’ll add a little helper method to determine if an array contains a string:

+ (BOOL)isString:(NSString *)string inList:(NSArray *)list {

    for (NSString *item in list) {
        if ([string isEqualToString:item]) {
            return YES;
        }
    }

    return NO;
}

Perhaps a set would have been a better data structure here; I might revisit that.

Now we can modify the diff method so that it checks the blacklist before attempting to examine an attribute or relationship:

...

NSArray *attributeNames = newObject == nil ? [[[oldObject entity] attributesByName] allKeys] : [[[newObject entity] attributesByName] allKeys];

NSArray *blacklist = [self propertyNamesForManagedObject:newObject ? newObject : oldObject inList:blacklistedProperties];

for (NSString *name in attributeNames) {

    if ([self isString:name inList:blacklist]) continue;

    ...

}

We do something similar to ensure that we traverse whitelisted relationships:

...

NSArray *whitelist = [self propertyNamesForManagedObject:newObject ? newObject : oldObject inList:whitelistedRelationships];

for (NSString *name in relationships) {

    NSRelationshipDescription *relationship = [relationships objectForKey:name];

    if (![self isString:name inList:whitelist]) {
        if (relationship.deleteRule != NSCascadeDeleteRule) continue;
    }

    ...
}

It’s a simple check - if the relationship is whitelisted, ignore the deletion rule.

At this point, the code is far too big to post here. I’ll create a BitBucket repository with the diff/patching system as a library with a demo project at some point. The explanation of the patching code will come later.

2012-11-27

Diffing NSManagedObjects

I’m currently working on a system to calculate the differences between two CoreData objects. Why would you want to do that, you ask? In my case, I’m trying to transmit the changes made to the object to a web server using the least amount of data and I don’t want to bother trying to store the changes as some kind of activity log. As discussed previously, I’m storing two copies of my objects: one that represents the last-known server state, and one that is changed by the user.

Here’s an imaginary CoreData model that we’ll use for all examples:

  • Person

    • firstName (string)
    • lastName (string)
    • age (int)
    • address (Address)
  • Address

    • street (string)
    • city (string)
    • state (string)
    • postalCode (string)

The “Person” entity has a related “Address” object, which gives the address at which the person lives.

In my first attempt, I came up with a particularly brittle solution and created the diff manually. The method takes in two parameters (the new and old objects that represent its different states) and computes the diff like this:


NSMutableDictionary *diff = [NSMutableDictionary dictionary];

if (![newObj.firstName isEqualToString:oldObj.firstName]) {
    diff[@"firstName"] = @{ @"old": oldObj.firstName, @"new": newObj.firstName };
}

Each attribute is compared and, if inequalities are found, the new and old values are stored in the dictionary. We can then transmit this minimal amount of information and the server can figure out what changed and how.

This solution is tied to the structure of the data model. Any changes in the data model mean we have to change the diff method. A better way would be to exploit features of the CoreData API that allow us to use a metaprogramming approach that has no foreknowledge of the data model and figure out the whole thing out on the fly.

CoreData has a number of useful features that allow us to do this. Firstly, all NSManagedObject instances are composed of two types of data: attributes and relationships. Attributes are basic properties of an object, such as ints, strings, doubles, etc. Relationships connect different NSManagedObjects together. To demonstrate this on the example model:

  • Person

    • Attributes:
      • firstName (string)
      • lastName (string)
      • age (int)
    • Relationships:
      • address (Address)
  • Address

    • Attributes:
      • street (string)
      • city (string)
      • state (string)
      • postalCode (string)

Each NSManagedObject has an associated “entity” instance that contains metadata about its attributes and relationships. If we wanted to list all of the names of the attributes for a Person instance, we could do this:


NSArray *attributeNames = [[[person entity] attributesByName] allKeys];

for (NSString *name in attributeNames) {
    NSLog(@"%@", name);
}

The output would be as follows:

age
firstName
lastName

We can do the same with relationship names:


NSArray *relationshipNames = [[[person entity] relationshipsByName] allKeys];

for (NSString *name in relationshipNames) {
    NSLog(@"%@", name);
}

Here’s the output:

address

We’re now able to list all of the properties of an NSManagedObject. Armed with this, and adding in a little Objective-C metaprogramming, we can write a function that can identify differences between two managed objects. Assuming we have two valid NSManagedObject pointers, here’s a simple diff function:


- (NSDictionary *)diffNewObject:(NSManagedObject *)newObject withOldObject:(NSManagedObject *)oldObject {
    
    NSMutableDictionary *diff = [NSMutableDictionary dictionary];

    NSArray *attributeNames = [[[newObject entity] attributesByName] allKeys];

    for (NSString *name in attributeNames) {

        SEL selector = NSSelectorFromString(name);

        id newValue = [newObject performSelector:selector];
        id oldValue = [oldObject performSelector:selector];

        if (![newValue isEqual:oldValue]) {
            diff[name] = @{ @"new": newValue, @"old": oldValue };
        }
    }

    return diff;
}

One of the metaprogramming tricks we’re using is “NSSelectorFromString()”, which can be thought of as giving us a function pointer, though that isn’t really at all accurate. Since entity attributes all map to properties of the NSManagedObject instance, we can treat them as selectors and call them using the “performSelector:” method. Once we’ve extracted the new and old values we can compare them and, if they are different, add them to the diff dictionary.

Let’s see what happens if we run it:


Person *person1 = [self createPerson];
person1.firstName = @"Joe";
person1.lastName = @"McJoe";
person1.age = @23;

Person *person2 = [self createPerson];
person2.firstName = @"Bob";
person2.lastName = @"McJoe";
person2.age = @23;

NSDictionary *diff = [self diffNewObject:person1 withOldObject:person2];

NSLog(@"%@", diff);

The output we get is this:

{
    firstName = {
        new = Joe;
        old = Bob;
    };
};

Now the attributes of the entities in our data model can change without the diff method breaking.

The method is still a little brittle, however. What happens if we’re trying to diff two objects of different types that share some of the same properties? What happens if some of the properties are nil? (NSDictionaries cannot store nil values.) What if one of the objects is nil?

Here’s an updated version of the method that accounts for all of these possibilities:


- (NSDictionary *)diffNewObject:(NSManagedObject *)newObject withOldObject:(NSManagedObject *)oldObject {
    
    NSMutableDictionary *diff = [NSMutableDictionary dictionary];

    NSArray *attributeNames = newObject == nil ? [[[oldObject entity] attributesByName] allKeys] : [[[newObject entity] attributesByName] allKeys];

    for (NSString *name in attributeNames) {

        SEL selector = NSSelectorFromString(name);

        id newValue = nil;
        id oldValue = nil;

        if (newObject != nil && [newObject respondsToSelector:selector]) newValue = [newObject performSelector:selector];
        if (oldObject != nil && [oldObject respondsToSelector:selector]) oldValue = [oldObject performSelector:selector];

        newValue = newValue ? newValue : [NSNull null];
        oldValue = oldValue ? oldValue : [NSNull null];

        if (![newValue isEqual:oldValue]) {
            diff[name] = @{ @"new": newValue, @"old": oldValue };
        }
    }

    return diff;
}

There are a few changes:

  • We ensure that we get the attribute list from a non-nil object;
  • We ensure that the objects respond to the selectors before trying to call them;
  • We replace nil values with NSNull.

Unfortunately, finding the diff between two objects’ attributes is the easy part. Producing a diff of the relationships is rather harder. We made a mistake in our object model, so let’s correct that now and see one of the reasons why diffing relationships is hard:

  • Person

    • Attributes:
      • firstName (string)
      • lastName (string)
      • age (int)
    • Relationships:
      • address (Address)
  • Address

    • Attributes:
      • street (string)
      • city (string)
      • state (string)
      • postalCode (string)
    • Relationships:
      • person (Person)

Apple recommend including an inverse of all relationships. Our Person entity has a relationship to the Address entity, so our Address entity should include the inverse relationship back to the Person entity.

Hopefully the problem we’ve introduced is apparent. We’ve created a cycle in what was previously an acyclic graph. If we try and walk the graph from a Person to an Address, we’ll find that we need to walk back to the Person. And back to the Address. And back to the Person. And now we’ve recursed too far and blown the stack.

CoreData includes another feature that can help mitigate this problem. It isn’t a perfect solution, but it should suffice in most situations. Each relationship includes a deletion rule that tells CoreData what to do with related entities each time an entity is deleted. In this case, if we were to delete a Person we’d want their Address to be deleted. However, if we delete an Address (because the person sells his house), we don’t want to delete the person.

We’ll add that metadata to our data model:

  • Person

    • Attributes:
      • firstName (string)
      • lastName (string)
      • age (int)
    • Relationships:
      • address (Address) (Delete: cascade)
  • Address

    • Attributes:
      • street (string)
      • city (string)
      • state (string)
      • postalCode (string)
    • Relationships:
      • person (Person) (Delete: nullify)

We can use the deletion rule to infer ownership of one entity by another. The Person entity can be said to “own” its related Address entity because deleting the Person deletes the Address. The Address entity can be said to be “owned” by its related Person entity because deleting the Address won’t delete the Person. We still have cycles in the graph, but now we can use its deletion rules as direction indicators that tell a walk algorithm which paths it can take through the graph.

Here’s a graph walk algorithm that uses this new information:


- (void)walkGraph:(NSManagedObject *)managedObject {

    NSLog(@"%@", [[managedObject entity] name]);
    
    NSDictionary *relationships = [[managedObject entity] relationshipsByName];

    for (NSString *name in relationships) {

        NSRelationshipDescription *relationship = relationships[name];

        if (relationship.deleteRule != NSCascadeDeleteRule) continue;

        SEL selector = NSSelectorFromString(name);

        NSManagedObject *child = [managedObject performSelector:selector];

        [self walkGraph:child];
    }
}

We’re using a familiar set of metaprogramming tools: NSSelectorFromString(), performSelector: and relationshipsByName. Here’s what it will output if we feed it a fully-populated Person instance:

Person
Address

Note that an alternative would be to maintain a list of walked objects instead of using the deletion rule/ownership approach, but this could introduce its own set of problems.

At this point, we can add a couple of methods and produce a full diff between two Person objects:


- (NSDictionary *)diffNewObject:(NSManagedObject *)newObject withOldObject:(NSManagedObject *)oldObject {
    
    NSDictionary *attributeDiff = [self diffAttributesOfNewObject:newObject withOldObject:oldObject];

    NSDictionary *relationshipsDiff = [self diffRelationshipsOfNewObject:newObject withOldObject:oldObject];

    NSMutableDictionary *diff = [NSMutableDictionary dictionary];

    if (attributeDiff.count > 0) {
        diff[@"attributes"] = attributeDiff;
    }

    if (relationshipsDiff.count > 0) {
        diff[@"relationships"] = relationshipsDiff;
    }

    return diff;
}

- (NSDictionary *)diffRelationshipsOfNewObject:(NSManagedObject *)newObject withOldObject:(NSManagedObject *)oldObject {

    NSMutableDictionary *diff = [NSMutableDictionary dictionary];
    
    NSDictionary *relationships = newObject == nil ? [[oldObject entity] relationshipsByName] : [[newObject entity] relationshipsByName];

    for (NSString *name in relationships) {

        NSRelationshipDescription *relationship = relationships[name];

        if (relationship.deleteRule != NSCascadeDeleteRule) continue;

        SEL selector = NSSelectorFromString(name);

        id newValue = nil;
        id oldValue = nil;

        if (newObject != nil && [newObject respondsToSelector:selector]) newValue = [newObject performSelector:selector];
        if (oldObject != nil && [oldObject respondsToSelector:selector]) oldValue = [oldObject performSelector:selector];

        NSDictionary *relationshipDiff = [self diffNewObject:newValue withOldObject:oldValue];

        if (relationshipDiff.count > 0) {
            diff[name] = relationshipDiff;
        }
    }

    return diff;
}

- (NSDictionary *)diffAttributesOfNewObject:(NSManagedObject *)newObject withOldObject:(NSManagedObject *)oldObject {

    NSMutableDictionary *diff = [NSMutableDictionary dictionary];

    NSArray *attributeNames = newObject == nil ? [[[oldObject entity] attributesByName] allKeys] : [[[newObject entity] attributesByName] allKeys];

    for (NSString *name in attributeNames) {

        SEL selector = NSSelectorFromString(name);

        id newValue = nil;
        id oldValue = nil;

        if (newObject != nil && [newObject respondsToSelector:selector]) newValue = [newObject performSelector:selector];
        if (oldObject != nil && [oldObject respondsToSelector:selector]) oldValue = [oldObject performSelector:selector];

        newValue = newValue ? newValue : [NSNull null];
        oldValue = oldValue ? oldValue : [NSNull null];

        if (![newValue isEqual:oldValue]) {
            diff[name] = @{ @"new": newValue, @"old": oldValue };
        }
    }

    return diff;
}

We’ve added two new methods. The new diffRelationshipsOfNewObject:withOldObject: method expands on our walkGraph: method by checking for nil, the ability to perform a given selector, and by producing a diff. The new diffNewObject:withOldObject: method collates the attribute and relationship diffs into a single dictionary. It is called recursively by the relationship diff method to ensure that the entire object graph is compared.

Although we can now successfully create a full diff of the graph described by our Person entity, we haven’t yet managed to cover all of the possibilities offered by CoreData. Suppose we modify our data model to accommodate people who own multiple houses?

  • Person

    • Attributes:
      • firstName (string)
      • lastName (string)
      • age (int)
    • Relationships:
      • address (Address) (Delete: cascade) (To many)
  • Address

    • Attributes:
      • street (string)
      • city (string)
      • state (string)
      • postalCode (string)
    • Relationships:
      • person (Person) (Delete: nullify) (To one)

Our relationship diff method no longer works. When we retrieve the value of the relationship we receive an NSSet pointer rather than an NSManagedObject pointer and we haven’t written any code to handle that. Even if we write code to iterate over the set and diff individual items, we run into some nasty problems:

  • If we have two addresses and change one, how do we match up the addresses in the two Person objects?
  • How can we tell if an address has been deleted?
  • How can we tell if an address has been added?

It’s clear that our Address entity needs to include an immutable identifier that is unique within the context of its set, but is shared between copies of a Person. NSManagedObjects include an objectID property, but this is unique within the entire dataset so can’t be used. We’ll need to add a new attribute to the Address entity:

  • Address
    • Attributes:
      • guid (string)
      • street (string)
      • city (string)
      • state (string)
      • postalCode (string)
    • Relationships:
      • person (Person) (Delete: nullify) (To one)

I’ve plumped for a GUID, but you could use a bog-standard string or a number.

Now we can identify Address entities by their GUID, but we’ve got another problem. By design, our diff method is entirely ignorant of the structure of our object model. It doesn’t know that it is trying to diff Address objects and therefore can’t know that it should identify them using their “guid” attribute. To get around this, we’ll use yet another meta-feature of CoreData: user info.

Everything in a CoreData model includes a “user info” dictionary. This is a collection of key/value pairs of user-created metadata that can be examined at run time. By “everything”, I mean precisely that: entities, attributes and relationships all contain a “user info” dictionary.

We need the guid property of the Address entity to be identifiable as a unique identifier at run time, so we can exploit the user info dictionary to do this. We’ll add a single key/value pair to the Address entity’s user info that contains “id” as the key and “guid” as the value. At run time, any time we encounter a one-to-many relationship, we’ll examine the members of the related set and locate their “id” user info key. The value will tell us which property contains that entity’s unique identifier, which we can then use to match up objects in the two sets.

Our Address entity now looks like this:

  • Address (User info: @{ @“id”: @“guid” })
    • Attributes:
      • guid (string)
      • street (string)
      • city (string)
      • state (string)
      • postalCode (string)
    • Relationships:
      • person (Person) (Delete: nullify) (To one)

The method for diffing two sets looks like this:


- (NSArray *)diffNewSet:(NSSet *)newSet withOldSet:(NSSet *)oldSet {
    
    NSMutableArray *changes = [NSMutableArray array];

    // Find all items that have been newly created or updated.
    for (NSManagedObject *newItem in newSet) {

        NSString *idAttributeName = newItem.entity.userInfo[@"id"];

        NSAssert(idAttributeName, @"Entities must have an id property set in their user info.");

        id newItemId = [newItem valueForKey:idAttributeName];

        NSManagedObject *oldItem = nil;

        for (NSManagedObject *setItem in oldSet) {
            id setItemId = [setItem valueForKey:idAttributeName];

            if ([setItemId isEqual:newItemId]) {
                oldItem = setItem;
                break;
            }
        }

        NSDictionary *diff = [self diffNewObject:newItem withOldObject:oldItem];

        if (diff) {
            [changes addObject:diff];
        }
    }

    // Find all items that have been deleted.
    for (NSManagedObject *oldItem in oldSet) {

        NSString *idAttributeName = oldItem.entity.userInfo[@"id"];

        NSAssert(idAttributeName, @"Entities must have an id property set in their user info.");

        id oldItemId = [oldItem valueForKey:idAttributeName];

        NSManagedObject *newItem = nil;

        for (NSManagedObject *setItem in newSet) {
            id setItemId = [setItem valueForKey:idAttributeName];

            if ([setItemId isEqual:oldItemId]) {
                newItem = setItem;
                break;
            }
        }

        if (!newItem) {
            NSDictionary *diff = [self diffNewObject:newItem withOldObject:oldItem];

            if (diff) {
                [changes addObject:diff];
            }
        }
    }

    return changes;
}

We’ll create a pair of example sets and see what we get out of the method. Here are our sets:

NewSet = (
    Address {
        guid = "123";
        street = "Somewhere";
        city = "Sometown";
        postalCode = "54532";
    },
    Address {
        guid = "567";
        street = "Elsewhere";
        city = "Another Town";
        postalCode = "222";
    }
};

OldSet = (
    Address {
        guid = "123";
        street = "Somewhere";
        city = "Townsville";
        postalCode = "54532";
    }
);

You should be able to see that we’ve added a new address and changed the existing address’ city attribute. This is the array produced by the method:

(
    {
        attributes: {
            city: {
                old: "Townsville",
                new: "Sometown"
            }
        }
    },
    {
        attributes: {
            guid: {
                old: "<null>",
                new: "567"
            },
            street: {
                old: "<null>",
                new: "Elsewhere"
            },
            city: {
                old: "<null>",
                new: "Another Town"
            },
            postalCode {
                old: "<null>",
                new: "222"
            }
        }
    }
)

Looks good. We can see that one of the array entries contains just the old and new values for “city” as we’d expect. The other entry contains values for all of the Address object’s attributes as we’ve compared it to nil. It didn’t exist previously. The only issue we have is that we can’t tell which Address object the first list entry represents because its guid attribute isn’t included. We can rectify this omission by modifying the diffNewObject:withOldObject: method:


- (NSDictionary *)diffNewObject:(NSManagedObject *)newObject withOldObject:(NSManagedObject *)oldObject {
    
    NSDictionary *attributeDiff = [self diffAttributesOfNewObject:newObject withOldObject:oldObject];

    NSDictionary *relationshipsDiff = [self diffRelationshipsOfNewObject:newObject withOldObject:oldObject];

    NSMutableDictionary *diff = [NSMutableDictionary dictionary];

    if (attributeDiff.count > 0) {
        diff[@"attributes"] = attributeDiff;
    }

    if (relationshipsDiff.count > 0) {
        diff[@"relationships"] = relationshipsDiff;
    }

    if (diff.count > 0) {
        diff[@"entityName"] = newObject ? newObject.entity.name : oldObject.entity.name;

        NSString *idAttributeName = newObject ? newObject.entity.userInfo[@"id"] : oldObject.entity.userInfo[@"id"];

        if (idAttributeName) {
            id itemId = newObject ? [newObject valueForKey:idAttributeName] : [oldObject valueForKey:idAttributeName];

            if (itemId) {
                diff[idAttributeName] = itemId;
            }
        }
    }

    return diff;
}

We’re fetching the unique identifier attribute and adding its value to the diff. We’ve also added the entity’s name, which may help when dealing with the diff once it has been transmitted and received and needs to be parsed. The output from our previous pair of sets now looks like this:

(
    {
        attributes: {
            city: {
                old: "Townsville",
                new: "Sometown"
            }
        },
        entityName = "Address",
        guid: "123"
    },
    {
        attributes: {
            guid: {
                old: "<null>",
                new: "567"
            },
            street: {
                old: "<null>",
                new: "Elsewhere"
            },
            city: {
                old: "<null>",
                new: "Another Town"
            },
            postalCode {
                old: "<null>",
                new: "222"
            }
        },
        entityName = "Address",
        guid: "567"
    }
)

The final piece of the puzzle is to plug this set diffing method into our relationship diff method. Fortunately, the relationship description class includes an “isToMany” property that will allow us to use the correct diffing method for a given relationship. Here’s the final set of methods required for diffing a pair of managed objects:


- (NSDictionary *)diffNewObject:(NSManagedObject *)newObject withOldObject:(NSManagedObject *)oldObject {
    
    NSDictionary *attributeDiff = [self diffAttributesOfNewObject:newObject withOldObject:oldObject];

    NSDictionary *relationshipsDiff = [self diffRelationshipsOfNewObject:newObject withOldObject:oldObject];

    NSMutableDictionary *diff = [NSMutableDictionary dictionary];

    if (attributeDiff.count > 0) {
        diff[@"attributes"] = attributeDiff;
    }

    if (relationshipsDiff.count > 0) {
        diff[@"relationships"] = relationshipsDiff;
    }

    if (diff.count > 0) {
        diff[@"entityName"] = newObject ? newObject.entity.name : oldObject.entity.name;

        NSString *idAttributeName = newObject ? newObject.entity.userInfo[@"id"] : oldObject.entity.userInfo[@"id"];

        if (idAttributeName) {
            id itemId = newObject ? [newObject valueForKey:idAttributeName] : [oldObject valueForKey:idAttributeName];

            if (itemId) {
                diff[idAttributeName] = itemId;
            }
        }
    }

    return diff;
}

- (NSDictionary *)diffRelationshipsOfNewObject:(NSManagedObject *)newObject withOldObject:(NSManagedObject *)oldObject {

    NSMutableDictionary *diff = [NSMutableDictionary dictionary];
    
    NSDictionary *relationships = newObject == nil ? [[oldObject entity] relationshipsByName] : [[newObject entity] relationshipsByName];

    for (NSString *name in relationships) {

        NSRelationshipDescription *relationship = relationships[name];

        if (relationship.deleteRule != NSCascadeDeleteRule) continue;

        SEL selector = NSSelectorFromString(name);

        id newValue = nil;
        id oldValue = nil;

        if (newObject != nil && [newObject respondsToSelector:selector]) newValue = [newObject performSelector:selector];
        if (oldObject != nil && [oldObject respondsToSelector:selector]) oldValue = [oldObject performSelector:selector];

        if (relationship.isToMany) {

            NSArray *changes = [self diffNewSet:newValue withOldSet:oldValue];

            if (changes.count > 0) {
                diff[name] = changes;
            }

        } else {

            NSDictionary *relationshipDiff = [self diffNewObject:newValue withOldObject:oldValue];

            if (relationshipDiff.count > 0) {
                diff[name] = relationshipDiff;
            }
        }
    }

    return diff;
}

- (NSDictionary *)diffAttributesOfNewObject:(NSManagedObject *)newObject withOldObject:(NSManagedObject *)oldObject {

    NSMutableDictionary *diff = [NSMutableDictionary dictionary];

    NSArray *attributeNames = newObject == nil ? [[[oldObject entity] attributesByName] allKeys] : [[[newObject entity] attributesByName] allKeys];

    for (NSString *name in attributeNames) {

        SEL selector = NSSelectorFromString(name);

        id newValue = nil;
        id oldValue = nil;

        if (newObject != nil && [newObject respondsToSelector:selector]) newValue = [newObject performSelector:selector];
        if (oldObject != nil && [oldObject respondsToSelector:selector]) oldValue = [oldObject performSelector:selector];

        newValue = newValue ? newValue : [NSNull null];
        oldValue = oldValue ? oldValue : [NSNull null];

        if (![newValue isEqual:oldValue]) {
            diff[name] = @{ @"new": newValue, @"old": oldValue };
        }
    }

    return diff;
}

- (NSArray *)diffNewSet:(NSSet *)newSet withOldSet:(NSSet *)oldSet {
    
    NSMutableArray *changes = [NSMutableArray array];

    // Find all items that have been newly created or updated.
    for (NSManagedObject *newItem in newSet) {

        NSString *idAttributeName = newItem.entity.userInfo[@"id"];

        NSAssert(idAttributeName, @"Entities must have an id property set in their user info.");

        id newItemId = [newItem valueForKey:idAttributeName];

        NSManagedObject *oldItem = nil;

        for (NSManagedObject *setItem in oldSet) {
            id setItemId = [setItem valueForKey:idAttributeName];

            if ([setItemId isEqual:newItemId]) {
                oldItem = setItem;
                break;
            }
        }

        NSDictionary *diff = [self diffNewObject:newItem withOldObject:oldItem];

        if (diff.count > 0) {
            [changes addObject:diff];
        }
    }

    // Find all items that have been deleted.
    for (NSManagedObject *oldItem in oldSet) {

        NSString *idAttributeName = oldItem.entity.userInfo[@"id"];

        NSAssert(idAttributeName, @"Entities must have an id property set in their user info.");

        id oldItemId = [oldItem valueForKey:idAttributeName];

        NSManagedObject *newItem = nil;

        for (NSManagedObject *setItem in newSet) {
            id setItemId = [setItem valueForKey:idAttributeName];

            if ([setItemId isEqual:oldItemId]) {
                newItem = setItem;
                break;
            }
        }

        if (!newItem) {
            NSDictionary *diff = [self diffNewObject:newItem withOldObject:oldItem];

            if (diff.count > 0) {
                [changes addObject:diff];
            }
        }
    }

    return changes;
}

That’s it: diff created! At this point, if you find that you don’t like the way old and new values are grouped together, or want to change any other aspect of the way the diff is produced, you should have enough information about how to use metaprogramming with NSManagedObjects to be able to change things around.

There are still two unsolved problems that I’ll tackle in another post:

  • What do you do if you want to include a related object in the diff if it has its deletion rule set to nullify?
  • What do you do if you want to exclude an attribute or relationship from the diff?

Once those are solved, there’s another big one: how do you apply the diff to an object? I’ll tackle that in another post, too.

2012-06-29

iOS Data Synchronisation Strategies

I’m currently writing an iOS app that is essentially a front-end to a SQL database. Users see data formatted into an attractive hierarchical layout and can enter information using the usual set of lists, pickers and textboxes. However, what makes this app unusual is the requirement that it be usable regardless of whether or not the device has an internet connection. Data can be pulled from the server when the user has an internet connection and can be edited even if the connection drops. When the connection resumes, the device sends the updates to the server and fetches any other changes made.

Immediately this raises all sorts of questions. The really, really big question is this: How does the system resolve conflicts? What happens if two users try to change the same information at the same time? What happens if a user makes changes on a device without a connection, makes conflicting changes on a second device with a connection, and then tries to sync the first device?

Here’s another mindbending requirement: Users can never be expected to manually merge data. When you consider that Apple is trying to hide the filesystem because the average user can’t cope with the concept of hierarchies, this makes sense. How can someone who doesn’t understand a simple folder hierarchy be expected to perform a 3-way merge?

After putting some thought into the problem, I came up with three possible solutions.

Last Write Wins

This is the easiest solution to implement and the most likely to result in data loss. When a device sends its local changes to the server it simply overwrites anything stored there.

Consider this scenario:

  • A user makes some trivial changes to the data on his iPhone.
  • He switches off the phone.
  • He spends a week making extensive changes to the data on his iPad.
  • He switches on his iPhone.
  • His week of changes are entirely overwritten with the data from the iPhone.

The server’s database already exists and cannot have its schema altered, and unfortunately it doesn’t support versioning. Once the data is overwritten it is gone.

Checkin/Checkout

This is the TFS model of working. If I want to edit some data (which can be thought of as a document), I need to check it out first. The document is locked to me and no-one else can edit it in the meantime. Edits are therefore serialised so there’s no chance of conflicting edits being made.

In order to support this, each device must have a unique identifier. Checking a document out to a user isn’t specific enough, because a user could have two devices (as per the “last write wins” scenario) and make conflicting edits on both. As Apple no longer allow apps to access the iOS device’s unique identifier, each installation of the app must generate its own unique ID and store it on the device. This allows a document to be checked out by a specific user on a specific device.

But what if a user leaves his phone at home and needs to checkout the document on a different device? We’ll have to amend the system so that checkouts can be overridden. That creates a new problem: what do we do with documents at checkin time that have had their checkout overridden and are therefore subject to conflicting edits? We have two choices: overwrite everything on the server and lose all changes made on the other device, or pull down the server data and lose everything on this device. We’re losing data again.

Even if we’re happy to accept the possibility of lost data (at least we can blame the users for ignoring the lock warnings) there’s another scenario we have to deal with. What happens if a user has a document stored on his device and wants to edit it but doesn’t have an internet connection? The device can’t contact the server to obtain the lock. Do we allow the edits and hope that no-one else grabs the lock before we get a connection back? What if someone else updates the document and releases the lock before that happens? We won’t know that the document has changed and we lose data.

Checkin/checkout is clearly a bad model:

  • Obtaining a lock without a connection is impossible and any workaround will lead to lost data;
  • Not allowing editing without a lock will prevent the app being used without an internet connection;
  • Allowing locks to be overridden will lead to lost data;
  • Not allowing locks to be overridden will lead to severe usage limitations.

Distributed Version Control

My reference to TFS in the “checkin/checkout” model should suggest my thought process so far: It’s essentially a distributed version control problem. We have a central repository and multiple clients that will:

  • Pull the latest state;
  • Change their data offline;
  • Push back to the server.

Unlike a DVCS, we have two big limitations:

  • The server doesn’t store a history;
  • Merges must be entirely automatic.

It’s important that the clients do as little work as possible in resolving conflicts. It’s possible that clients for other platforms will get written, and their programmers won’t want to re-implement a bunch of merging code that should have been on the server in the first place.

How can you tell a server to merge changes from a client if the server has no idea what its data looked like when the client last performed a pull?

This is my solution:

  • Client pulls data from server.
  • Client stores two copies of the data: One is the “pristine” server state and is immutable; one will be used for editing.
  • When the client pushes, it sends both the pristine and edited states of the data.
  • The server receives the data and compares its current state, the pristine state and the edited state of the data.
  • If the pristine and edited data matches, no changes have been made and the data should not be altered regardless of the current state.
  • If the pristine and edited data doesn’t match, the current data is overwritten with the edited state.
  • If the edited data matches the current data, no changes are made.
  • The resulting dataset is sent back to the client.
  • The client updates its local data with the data received from the server.

Note that, unlike a text document, the data in question can be broken down into discrete pieces. For example, it could contain a person’s name and address, which in turn would be broken down into first name, last name, street, county, post code, etc. Changing any element of the address would change the meaning of the entire address, so any single change would cause all fields to be overwritten with the client’s data. However, changing the address does not imply that the person’s name should change, so that would be examined separately from the address and updated appropriately.

Data that hasn’t been changed by the client won’t overwrite data that has been changed by another client. Data that has been changed by the client will overwrite any other changes. The system automatically merges where there are no conflicts and resolves conflicting edits via “last write wins”.

Other Thoughts

There doesn’t seem to be a foolproof way of ordering overwrites such that the most recently changed data ends up as the final version. I could make changes on my phone, switch it off, make more changes on my iPad and then switch my phone back on. My phone’s older data becomes the canonical version. I could try using a timestamp, but there’s no guarantee that those are correct. Lamport clocks won’t help because, as far as they are concerned, the two edits happened simultaneously.

The problem can be generalised from being considered as a DVCS problem to a distributed database problem, which opens up some more potential research. Reading up on distributed databases led me to the CAP theorem, which states that you can’t have immediate consistency of data if your database is always available (even if the device has no internet connection) and is split into several partitions (ie. a central SQL instance and a local CoreData instance). That means conflicts and merging are inevitable, and the way around it is “eventual consistency”. The disparate datastores will eventually synchronise and will eventually all have the same data; in the meantime, the absolute truth is fractured into the various stores and can only be determined by considering the entire cloud of devices participating in the system.

I installed and played with CouchDB for a while, which quickly supplanted MongoDB as my new favourite NoSQL database. Its approach to handling conflicts during data replication between nodes? Push back to the application and let it deal with the problem. It seems there is no “correct” way to handle merge conflicts automatically. My merging system with its “last write wins” bodge is, I think, the best solution to the problem given the constraints.