I’m currently writing an iOS app that is essentially a front-end to a SQL database. Users see data formatted into an attractive hierarchical layout and can enter information using the usual set of lists, pickers and textboxes. However, what makes this app unusual is the requirement that it be usable regardless of whether or not the device has an internet connection. Data can be pulled from the server when the user has an internet connection and can be edited even if the connection drops. When the connection resumes, the device sends the updates to the server and fetches any other changes made.
Immediately this raises all sorts of questions. The really, really big question is this: How does the system resolve conflicts? What happens if two users try to change the same information at the same time? What happens if a user makes changes on a device without a connection, makes conflicting changes on a second device with a connection, and then tries to sync the first device?
Here’s another mindbending requirement: Users can never be expected to manually merge data. When you consider that Apple is trying to hide the filesystem because the average user can’t cope with the concept of hierarchies, this makes sense. How can someone who doesn’t understand a simple folder hierarchy be expected to perform a 3-way merge?
After putting some thought into the problem, I came up with three possible solutions.
Last Write Wins
This is the easiest solution to implement and the most likely to result in data loss. When a device sends its local changes to the server it simply overwrites anything stored there.
Consider this scenario:
- A user makes some trivial changes to the data on his iPhone.
- He switches off the phone.
- He spends a week making extensive changes to the data on his iPad.
- He switches on his iPhone.
- His week of changes are entirely overwritten with the data from the iPhone.
The server’s database already exists and cannot have its schema altered, and unfortunately it doesn’t support versioning. Once the data is overwritten it is gone.
This is the TFS model of working. If I want to edit some data (which can be thought of as a document), I need to check it out first. The document is locked to me and no-one else can edit it in the meantime. Edits are therefore serialised so there’s no chance of conflicting edits being made.
In order to support this, each device must have a unique identifier. Checking a document out to a user isn’t specific enough, because a user could have two devices (as per the “last write wins” scenario) and make conflicting edits on both. As Apple no longer allow apps to access the iOS device’s unique identifier, each installation of the app must generate its own unique ID and store it on the device. This allows a document to be checked out by a specific user on a specific device.
But what if a user leaves his phone at home and needs to checkout the document on a different device? We’ll have to amend the system so that checkouts can be overridden. That creates a new problem: what do we do with documents at checkin time that have had their checkout overridden and are therefore subject to conflicting edits? We have two choices: overwrite everything on the server and lose all changes made on the other device, or pull down the server data and lose everything on this device. We’re losing data again.
Even if we’re happy to accept the possibility of lost data (at least we can blame the users for ignoring the lock warnings) there’s another scenario we have to deal with. What happens if a user has a document stored on his device and wants to edit it but doesn’t have an internet connection? The device can’t contact the server to obtain the lock. Do we allow the edits and hope that no-one else grabs the lock before we get a connection back? What if someone else updates the document and releases the lock before that happens? We won’t know that the document has changed and we lose data.
Checkin/checkout is clearly a bad model:
- Obtaining a lock without a connection is impossible and any workaround will lead to lost data;
- Not allowing editing without a lock will prevent the app being used without an internet connection;
- Allowing locks to be overridden will lead to lost data;
- Not allowing locks to be overridden will lead to severe usage limitations.
Distributed Version Control
My reference to TFS in the “checkin/checkout” model should suggest my thought process so far: It’s essentially a distributed version control problem. We have a central repository and multiple clients that will:
- Pull the latest state;
- Change their data offline;
- Push back to the server.
Unlike a DVCS, we have two big limitations:
- The server doesn’t store a history;
- Merges must be entirely automatic.
It’s important that the clients do as little work as possible in resolving conflicts. It’s possible that clients for other platforms will get written, and their programmers won’t want to re-implement a bunch of merging code that should have been on the server in the first place.
How can you tell a server to merge changes from a client if the server has no idea what its data looked like when the client last performed a pull?
This is my solution:
- Client pulls data from server.
- Client stores two copies of the data: One is the “pristine” server state and is immutable; one will be used for editing.
- When the client pushes, it sends both the pristine and edited states of the data.
- The server receives the data and compares its current state, the pristine state and the edited state of the data.
- If the pristine and edited data matches, no changes have been made and the data should not be altered regardless of the current state.
- If the pristine and edited data doesn’t match, the current data is overwritten with the edited state.
- If the edited data matches the current data, no changes are made.
- The resulting dataset is sent back to the client.
- The client updates its local data with the data received from the server.
Note that, unlike a text document, the data in question can be broken down into discrete pieces. For example, it could contain a person’s name and address, which in turn would be broken down into first name, last name, street, county, post code, etc. Changing any element of the address would change the meaning of the entire address, so any single change would cause all fields to be overwritten with the client’s data. However, changing the address does not imply that the person’s name should change, so that would be examined separately from the address and updated appropriately.
Data that hasn’t been changed by the client won’t overwrite data that has been changed by another client. Data that has been changed by the client will overwrite any other changes. The system automatically merges where there are no conflicts and resolves conflicting edits via “last write wins”.
There doesn’t seem to be a foolproof way of ordering overwrites such that the most recently changed data ends up as the final version. I could make changes on my phone, switch it off, make more changes on my iPad and then switch my phone back on. My phone’s older data becomes the canonical version. I could try using a timestamp, but there’s no guarantee that those are correct. Lamport clocks won’t help because, as far as they are concerned, the two edits happened simultaneously.
The problem can be generalised from being considered as a DVCS problem to a distributed database problem, which opens up some more potential research. Reading up on distributed databases led me to the CAP theorem, which states that you can’t have immediate consistency of data if your database is always available (even if the device has no internet connection) and is split into several partitions (ie. a central SQL instance and a local CoreData instance). That means conflicts and merging are inevitable, and the way around it is “eventual consistency”. The disparate datastores will eventually synchronise and will eventually all have the same data; in the meantime, the absolute truth is fractured into the various stores and can only be determined by considering the entire cloud of devices participating in the system.
I installed and played with CouchDB for a while, which quickly supplanted MongoDB as my new favourite NoSQL database. Its approach to handling conflicts during data replication between nodes? Push back to the application and let it deal with the problem. It seems there is no “correct” way to handle merge conflicts automatically. My merging system with its “last write wins” bodge is, I think, the best solution to the problem given the constraints.