2012-06-29

iOS Data Synchronisation Strategies

I’m currently writing an iOS app that is essentially a front-end to a SQL database. Users see data formatted into an attractive hierarchical layout and can enter information using the usual set of lists, pickers and textboxes. However, what makes this app unusual is the requirement that it be usable regardless of whether or not the device has an internet connection. Data can be pulled from the server when the user has an internet connection and can be edited even if the connection drops. When the connection resumes, the device sends the updates to the server and fetches any other changes made.

Immediately this raises all sorts of questions. The really, really big question is this: How does the system resolve conflicts? What happens if two users try to change the same information at the same time? What happens if a user makes changes on a device without a connection, makes conflicting changes on a second device with a connection, and then tries to sync the first device?

Here’s another mind-bending requirement: Users can never be expected to manually merge data. When you consider that Apple is trying to hide the filesystem because the average user can’t cope with the concept of hierarchies, this makes sense. How can someone who doesn’t understand a simple folder hierarchy be expected to perform a 3-way merge?

After putting some thought into the problem, I came up with three possible solutions.

Last Write Wins

This is the easiest solution to implement and the most likely to result in data loss. When a device sends its local changes to the server it simply overwrites anything stored there.

Consider this scenario:

  • A user makes some trivial changes to the data on his iPhone.
  • He switches off the phone.
  • He spends a week making extensive changes to the data on his iPad.
  • He switches on his iPhone.
  • His week of changes is entirely overwritten with the data from the iPhone.

The server’s database already exists and cannot have its schema altered, and unfortunately it doesn’t support versioning. Once the data is overwritten it is gone.

Checkin/Checkout

This is the TFS model of working. If I want to edit some data (which can be thought of as a document), I need to check it out first. The document is locked to me and no-one else can edit it in the meantime. Edits are therefore serialised so there’s no chance of conflicting edits being made.

In order to support this, each device must have a unique identifier. Checking a document out to a user isn’t specific enough, because a user could have two devices (as per the “last write wins” scenario) and make conflicting edits on both. As Apple no longer allow apps to access the iOS device’s unique identifier, each installation of the app must generate its own unique ID and store it on the device. This allows a document to be checked out by a specific user on a specific device.
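
As a rough illustration only (the defaults key is arbitrary, and a 2012 app would really do this in Objective-C with NSUserDefaults and CFUUID), generating and persisting that per-installation identifier might look something like this in Swift:

import Foundation

// Sketch: return a stable per-installation identifier, generating and
// persisting one the first time it is requested. The key name is an
// arbitrary example, not part of the real app.
func installationIdentifier() -> String {
    let key = "installationIdentifier"
    let defaults = UserDefaults.standard
    if let existing = defaults.string(forKey: key) {
        return existing
    }
    let fresh = UUID().uuidString
    defaults.set(fresh, forKey: key)
    return fresh
}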

But what if a user leaves his phone at home and needs to check out the document on a different device? We’ll have to amend the system so that checkouts can be overridden. That creates a new problem: what do we do with documents at checkin time that have had their checkout overridden and are therefore subject to conflicting edits? We have two choices: overwrite everything on the server and lose all changes made on the other device, or pull down the server data and lose everything on this device. We’re losing data again.

Even if we’re happy to accept the possibility of lost data (at least we can blame the users for ignoring the lock warnings) there’s another scenario we have to deal with. What happens if a user has a document stored on his device and wants to edit it but doesn’t have an internet connection? The device can’t contact the server to obtain the lock. Do we allow the edits and hope that no-one else grabs the lock before we get a connection back? What if someone else updates the document and releases the lock before that happens? We won’t know that the document has changed and we lose data.

Checkin/checkout is clearly a bad model:

  • Obtaining a lock without a connection is impossible and any workaround will lead to lost data;
  • Not allowing editing without a lock will prevent the app being used without an internet connection;
  • Allowing locks to be overridden will lead to lost data;
  • Not allowing locks to be overridden will lead to severe usage limitations.

Distributed Version Control

My reference to TFS in the “checkin/checkout” model should suggest my thought process so far: It’s essentially a distributed version control problem. We have a central repository and multiple clients that will:

  • Pull the latest state;
  • Change their data offline;
  • Push back to the server.

Unlike a DVCS, we have two big limitations:

  • The server doesn’t store a history;
  • Merges must be entirely automatic.

It’s important that the clients do as little work as possible in resolving conflicts. It’s possible that clients for other platforms will get written, and their programmers won’t want to re-implement a bunch of merging code that should have been on the server in the first place.

How can you tell a server to merge changes from a client if the server has no idea what its data looked like when the client last performed a pull?

This is my solution:

  • Client pulls data from server.
  • Client stores two copies of the data: One is the “pristine” server state and is immutable; one will be used for editing.
  • When the client pushes, it sends both the pristine and edited states of the data.
  • The server receives the data and compares its current state, the pristine state and the edited state of the data.
  • If the pristine and edited data matches, no changes have been made and the data should not be altered regardless of the current state.
  • If the pristine and edited data doesn’t match, the current data is overwritten with the edited state.
  • If the edited data matches the current data, no changes are made.
  • The resulting dataset is sent back to the client.
  • The client updates its local data with the data received from the server.

Note that, unlike a text document, the data in question can be broken down into discrete pieces. For example, it could contain a person’s name and address, which in turn would be broken down into first name, last name, street, county, post code, etc. Changing any element of the address would change the meaning of the entire address, so any single change would cause all fields to be overwritten with the client’s data. However, changing the address does not imply that the person’s name should change, so that would be examined separately from the address and updated appropriately.
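
To make those comparison rules concrete, here is a sketch of the merge for a single atomic group of fields - an address, say. It is illustrative only (the Atom alias and merge function are mine, written in Swift for brevity, not real server code):

typealias Atom = [String: String]

// Three-way comparison for one atomic group of fields. "pristine" is the
// state the client originally pulled, "edited" is the client's current
// state, and "current" is whatever the server holds now.
func merge(pristine: Atom, edited: Atom, current: Atom) -> Atom {
    if edited == pristine {
        // The client didn't touch this atom, so the server's copy stands
        // and another client's changes survive.
        return current
    }
    if edited == current {
        // The client's edit matches what the server already holds.
        return current
    }
    // The client changed the atom and it differs from the server's copy:
    // last write wins, so the edited state overwrites.
    return edited
}

Running this per atom rather than per dataset is what allows my changed address and someone else’s changed name to be merged automatically instead of one clobbering the other.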

Data that hasn’t been changed by the client won’t overwrite data that has been changed by another client. Data that has been changed by the client will overwrite any other changes. The system automatically merges where there are no conflicts and resolves conflicting edits via “last write wins”.

Other Thoughts

There doesn’t seem to be a foolproof way of ordering overwrites such that the most recently changed data ends up as the final version. I could make changes on my phone, switch it off, make more changes on my iPad and then switch my phone back on. My phone’s older data becomes the canonical version. I could try using a timestamp, but there’s no guarantee that device clocks are correct. Lamport clocks won’t help because, as far as they’re concerned, the two edits are concurrent and there’s no way to say which one should win.

The problem can be generalised from a DVCS problem to a distributed database problem, which opens up some more potential research. Reading up on distributed databases led me to the CAP theorem, which says that when a data store is spread across nodes that can become partitioned from one another (i.e. a central SQL instance and a local Core Data store on a device that may have no internet connection), you can’t have both constant availability and immediate consistency. Since the app has to stay usable offline, conflicts and merging are inevitable, and the way around it is “eventual consistency”. The disparate datastores will eventually synchronise and will eventually all have the same data; in the meantime, the absolute truth is fractured across the various stores and can only be determined by considering the entire cloud of devices participating in the system.

I installed and played with CouchDB for a while, which quickly supplanted MongoDB as my new favourite NoSQL database. Its approach to handling conflicts during data replication between nodes? Push back to the application and let it deal with the problem. It seems there is no “correct” way to handle merge conflicts automatically. My merging system with its “last write wins” bodge is, I think, the best solution to the problem given the constraints.

Comments

Jeff on 2012-07-16 at 10:46 said:

The problem with comparing version-control systems like TFS or SVN with the data synchronisation problem is that you need to be very careful to understand what a “document” is. That is, what is the entity that you are going to manage distinct versions of?

In the traditional file-based version control systems, each file represents a single atomic unit whose contents can presumably be merged by a user performing visual inspection.

In a complex data model, particularly one that wants to implement undo, it is not the enclosing database that is the version controlled document. Rather, you want each of your underlying tables (classes) to be version controlled, and if you want to do it properly, you want each row (entity) within each table to be version controlled. However, your model edits need to be atomic transactions since changes to one entity may only be valid if they are taken in concert with corresponding changes to another.

Rarely do version control systems allow you to combine the merging of two or more distinct files as one atomic operation. However, any “distributed database” requires exactly this, due to the inter-relationships between anything but the most simplistic data models. While it is easy to do something like, say, a Contacts database where every separate record is really just an instance of the same class, and thus the data is a single table with unrelated rows, in practice introducing structure like “Groups” is where the complexity comes in.

Little wonder that iTunes does not have a separate database record for “Album” or “Artist” but implies their existence by looking at common values in the “Track” attributes. If you take a look under the covers, you can see that the Track entity can represent a music track, a movie, even a book, and it has a bunch of attributes which are only appropriate to the particular track type. Fundamentally, the data model is extremely denormalised to make one particular part of the problem easier, but at the cost of some features.

Syncing changes in a data model where relationships may be manipulated from either end, where user 1 only edits record 1 and user 2 only edits record 2, but record 1 depends on the contents of record 2, is where the complexity comes in.

You also end up needing to implement some form of tombstones in your data model so that “deleted rows” live on long enough to sync the deletion requests back to wherever the “master” information is. Taking the Git model of not having a specific master is harder, and falls back on the fact that different versions of a file are tied together by yet more metadata; you also need to give it more information so that it can resolve the ambiguous situations.

(By the way, TFS does not mandate that you lock things to edit them. We use it every day in a team of 10+ developers and none of us lock the files we check out - all that the lock does is reduce the chances of needing to merge later on. TFS also has the ability to work offline, apply a manual comparison process to compare your workspace with what the server thinks your workspace should have, and generate appropriate checkout requests after the event - again, in this case, you need to differentiate a few ambiguous cases by hand, and no one solution will be correct for 100% of the situations you can create.)

Ant on 2012-07-16 at 14:33 said:

There’s lots to consider, but fortunately I’m dealing with a fairly simple set of data. There aren’t any complex relationships between different areas of a dataset. I’m also extremely limited by the backend architecture, which is inflexible and has no concept of versioning.

My current strategy is to produce a JSON file that looks something like this:

{
    "id:"123123123",
    "changeset":
    {
        "customer":
        {
            "firstName":
            {
                "old":"Joe",
                "new":Bob"
            }
        },
        "groceries":
        [
            {
                "id":1,
                "old":
                {
                    "name":"Beans",
                    "quantity":"1 tin"
                },
                "new":
                {
                    "name":"Beans",
                    "quantity":"1 tin"
                }
            }
        ]
    }
}

In this case, the data represents a customer’s buying habits (the app itself has nothing to do with this, but it’s a simple example). The JSON document can be produced by comparing the “pristine” copy of the data with the edited copy. It can store:

  • Individual fields that have changed (“firstName”);
  • Groups of fields that are dependent and therefore presented as a unit (“name” and “quantity”);
  • Deleted items (their new value will be “null”);
  • New items (the old value will be “null”);
  • Lists and singletons.

That file gets transmitted to the server, which can figure out which fields have conflicting edits (by comparing the “old” value with its current value) and deal with them appropriately (it may be that last write wins or, in some cases, the server’s current value might win). Non-conflicting edits just overwrite.
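
A rough sketch of that resolution step, per field (the function, the ConflictPolicy idea and the use of Swift are purely illustrative; the real server is far less obliging):

enum ConflictPolicy { case lastWriteWins, serverWins }

// Resolve one old/new pair from the changeset against the server's current
// value. nil stands in for the JSON null used for deletions and for newly
// created values.
func resolve(old: String?, new: String?, current: String?,
             policy: ConflictPolicy) -> String? {
    if current == old {
        // No conflict: the server still holds the value the client started
        // from, so the client's edit (or deletion) applies cleanly.
        return new
    }
    // Someone else changed the field since the client pulled it.
    switch policy {
    case .lastWriteWins: return new      // the client's edit overwrites
    case .serverWins:    return current  // the server's value is kept
    }
}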

Doing so much work in the client smells like an anti-pattern, but when the server is as dumb as it is there isn’t much choice but to have smarter clients. On the other hand, this system can create a new record simply by transmitting the subset of fields that have been given values, as opposed to transmitting an entire dataset mostly consisting of nulls.

Denormalisation is a good idea. I’m actually denormalising the data to an extent as I produce the JSON document, but again that’s mostly because of the way the server has been designed.

I’ve seen TFS’ ability to handle a non-locking editing workflow. It mostly seems to result in people ignoring the “remote files have changed” dialog and then spending a few hours figuring out what they overwrote and merging things back together again. In the meantime, the rest of the developers grumble and everything in TeamCity turns red. You can reasonably argue that TFS did warn them, and that it’s their own fault, but on the other hand: Mercurial. Or Git.

Jeff on 2012-07-17 at 11:29 said:

I have to say, it sounds like you aren’t using TFS correctly. We have lots of people working on top of one another, and there is never a need to “figure out what you overwrote”. That is, unless you insist on working à la Git and just forcing the read-only bit off manually rather than checking files out. Visual Studio can be configured to auto-checkout as you try to change files, so there’s really no serious overhead.

We spent ten minutes writing a script that does a “get latest” of the various TFS projects that we combine into our multi-million-line product; everyone types ‘tfsupdate’ once a day and their workspaces are kept up to date - it lets them know at that point that they need to merge, and Beyond Compare does a spectacular job at the 3-way merge, making it all really simple.

With respect to your old/new images, why not just send an MD5 of the original object’s properties, rather than the entire old image? Or are you proposing to resolve collisions at the “Property Within Entity” level, in which case you again end up with the “inter-related data” issue? I.e. if one app has changed the name from Beans to Corn, and the other has changed the quantity from 1 tin to 3 tins, that is surely a collision that requires manual intervention even though technically they didn’t edit “the same value”.

Ant on 2012-07-18 at 14:24 said:

The idea is that the diff schema supports the resolution of both individual properties within entities and sets of properties. In the corn/beans/1 tin/3 tins example, changing either property does indeed change the meaning of the entire entity, so both fields should be updated simultaneously regardless of whether one or both fields have actually been changed. That’s why the JSON schema transmits dictionaries containing both “name” and “quantity” as the “old” and “new” values: together they represent an atomic piece of data.

On the other hand, I could fix the spelling of the customer’s first name and someone else could change his date of birth. These two things aren’t dependent on each other even though they are properties of the same entity, so changing one shouldn’t overwrite the other (unless I’m trying to represent an entirely different customer by changing someone else’s details, but then all of the data is wrong and I’m not really using the software in the right way).

Ideally I’d just present the user with a list of conflicts via a simple merge UI, but that’s not an option. Last-write-wins is the only viable merge strategy available to me, so by trying to divide the data up into the smallest logical atoms I’m hoping to reduce the amount of data loss via unnecessary overwrites.

MD5 hashes are a good idea, but for the most part the data consists of numbers, dates and GUIDs, so the hashes would typically be larger than the values of the properties they represented. For data atoms consisting of several fields it would make sense, though. Great suggestion!
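
Hashing an atom might look something like this sketch (the sorted-key canonical form is my own invention, and CryptoKit’s Insecure.MD5 is used purely as a stand-in for whatever MD5 implementation is to hand):

import CryptoKit
import Foundation

// Hash an atomic group of fields so the changeset can carry a digest of the
// pristine values instead of the values themselves. The canonical form must
// be identical on every client for the hashes to be comparable.
func digest(of atom: [String: String]) -> String {
    let canonical = atom.keys.sorted()
        .map { "\($0)=\(atom[$0]!)" }
        .joined(separator: "|")
    let hash = Insecure.MD5.hash(data: Data(canonical.utf8))
    return hash.map { String(format: "%02x", $0) }.joined()
}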

As for TFS, I’m fortunate not to have to use it at all. Microsoft’s unrelenting focus on Windows means they never bothered to make a proper Mac client (I don’t count the Eclipse plugin they bought and promptly abandoned as a proper client) so I use a cross-platform DVCS instead.