2013-07-10

Git

When I first made the switch from Subversion to Mercurial, I tried out most of the available DVCS options: Git, Mercurial, Fossil, Monotone, Darcs and Bazaar. I wanted something that was easy to use, powerful and cross-platform. Only Mercurial and Fossil met the criteria. Fossil’s weird file system handling and lack of built-in Creole or Markdown support in its wiki eventually killed off my enthusiasm for its unique feature set, leaving Mercurial.

Mercurial is possibly the greatest development tool ever created. It completely altered the way I write code and made me immeasurably more productive. Its command line UI is a work of art. Every command is obvious and works exactly as you’d expect. Any chance I get I promote it, especially to the folks who still think that TFS is a really neat idea.

When I was evaluating DVCSes, Git had two main failings:

  • Windows support was second-rate;
  • The UI was designed by Linux programmers and made absolutely no sense.

Though the official Windows client is still an afterthought, Windows support is getting better. Microsoft themselves are adding Git support to TFS and Visual Studio; considering their attitude to Linus Torvald’s other work, this is deeply ironic.

The UI hasn’t improved in the last few years. It still has counter-intuitive commands that are overloaded to perform multiple actions that have little relation to each other. Users are still required to be experts in Git’s data structures in order to use the tool effectively.

Despite these failings, Git has become the de-facto DVCS. It is vastly more popular than any of the others I looked at, including the dreamlike Mercurial.

Microsoft aren’t the only company to have integrated Git into their tools. Xcode 4 uses Git as its default source control system, and Xcode 5 looks to be basing a number of very enticing new features around it. Though I don’t want to, it’s time to add Git to my toolbox.

Here’s my initial Git session:

git init test
vim test.md
git add .
git commit -m "Initial commit."
vim test.md
git commit -m "Edited text."

Working well so far. Now let’s try branching.

git checkout HEAD~1

Hmm. You are in ‘detached HEAD’ state.

vim test.md
git add .
git commit -m "Branch edit."
git checkout master

Warning: you are leaving 1 commit behind, not connected to any of your branches. That’s unexpected. I have no idea what it means. Mercurial allows multiple heads on the same branch; doesn’t Git?

git log --oneline

e48b1bf Edited text. 6739f7f Initial commit.

Two commits. I made three. Yikes.

At this point I decided that the only way I was going to learn to use Git properly was if I forgot everything I knew about Mercurial and approach Git as an entirely new tool. I’m up to chapter 3.6 of the Git book and it’s making more sense. The command aliasing function goes some way to alleviating some of the pain of the insane UI, it has some exceptionally cool features, and suddenly I find that I’m looking forward to using it.

As a test run, I’ve migrated the Super Foul Egg iOS source code to GitHub:

2012-06-29

iOS Data Synchronisation Strategies

I’m currently writing an iOS app that is essentially a front-end to a SQL database. Users see data formatted into an attractive hierarchical layout and can enter information using the usual set of lists, pickers and textboxes. However, what makes this app unusual is the requirement that it be usable regardless of whether or not the device has an internet connection. Data can be pulled from the server when the user has an internet connection and can be edited even if the connection drops. When the connection resumes, the device sends the updates to the server and fetches any other changes made.

Immediately this raises all sorts of questions. The really, really big question is this: How does the system resolve conflicts? What happens if two users try to change the same information at the same time? What happens if a user makes changes on a device without a connection, makes conflicting changes on a second device with a connection, and then tries to sync the first device?

Here’s another mindbending requirement: Users can never be expected to manually merge data. When you consider that Apple is trying to hide the filesystem because the average user can’t cope with the concept of hierarchies, this makes sense. How can someone who doesn’t understand a simple folder hierarchy be expected to perform a 3-way merge?

After putting some thought into the problem, I came up with three possible solutions.

Last Write Wins

This is the easiest solution to implement and the most likely to result in data loss. When a device sends its local changes to the server it simply overwrites anything stored there.

Consider this scenario:

  • A user makes some trivial changes to the data on his iPhone.
  • He switches off the phone.
  • He spends a week making extensive changes to the data on his iPad.
  • He switches on his iPhone.
  • His week of changes are entirely overwritten with the data from the iPhone.

The server’s database already exists and cannot have its schema altered, and unfortunately it doesn’t support versioning. Once the data is overwritten it is gone.

Checkin/Checkout

This is the TFS model of working. If I want to edit some data (which can be thought of as a document), I need to check it out first. The document is locked to me and no-one else can edit it in the meantime. Edits are therefore serialised so there’s no chance of conflicting edits being made.

In order to support this, each device must have a unique identifier. Checking a document out to a user isn’t specific enough, because a user could have two devices (as per the “last write wins” scenario) and make conflicting edits on both. As Apple no longer allow apps to access the iOS device’s unique identifier, each installation of the app must generate its own unique ID and store it on the device. This allows a document to be checked out by a specific user on a specific device.

But what if a user leaves his phone at home and needs to checkout the document on a different device? We’ll have to amend the system so that checkouts can be overridden. That creates a new problem: what do we do with documents at checkin time that have had their checkout overridden and are therefore subject to conflicting edits? We have two choices: overwrite everything on the server and lose all changes made on the other device, or pull down the server data and lose everything on this device. We’re losing data again.

Even if we’re happy to accept the possibility of lost data (at least we can blame the users for ignoring the lock warnings) there’s another scenario we have to deal with. What happens if a user has a document stored on his device and wants to edit it but doesn’t have an internet connection? The device can’t contact the server to obtain the lock. Do we allow the edits and hope that no-one else grabs the lock before we get a connection back? What if someone else updates the document and releases the lock before that happens? We won’t know that the document has changed and we lose data.

Checkin/checkout is clearly a bad model:

  • Obtaining a lock without a connection is impossible and any workaround will lead to lost data;
  • Not allowing editing without a lock will prevent the app being used without an internet connection;
  • Allowing locks to be overridden will lead to lost data;
  • Not allowing locks to be overridden will lead to severe usage limitations.

Distributed Version Control

My reference to TFS in the “checkin/checkout” model should suggest my thought process so far: It’s essentially a distributed version control problem. We have a central repository and multiple clients that will:

  • Pull the latest state;
  • Change their data offline;
  • Push back to the server.

Unlike a DVCS, we have two big limitations:

  • The server doesn’t store a history;
  • Merges must be entirely automatic.

It’s important that the clients do as little work as possible in resolving conflicts. It’s possible that clients for other platforms will get written, and their programmers won’t want to re-implement a bunch of merging code that should have been on the server in the first place.

How can you tell a server to merge changes from a client if the server has no idea what its data looked like when the client last performed a pull?

This is my solution:

  • Client pulls data from server.
  • Client stores two copies of the data: One is the “pristine” server state and is immutable; one will be used for editing.
  • When the client pushes, it sends both the pristine and edited states of the data.
  • The server receives the data and compares its current state, the pristine state and the edited state of the data.
  • If the pristine and edited data matches, no changes have been made and the data should not be altered regardless of the current state.
  • If the pristine and edited data doesn’t match, the current data is overwritten with the edited state.
  • If the edited data matches the current data, no changes are made.
  • The resulting dataset is sent back to the client.
  • The client updates its local data with the data received from the server.

Note that, unlike a text document, the data in question can be broken down into discrete pieces. For example, it could contain a person’s name and address, which in turn would be broken down into first name, last name, street, county, post code, etc. Changing any element of the address would change the meaning of the entire address, so any single change would cause all fields to be overwritten with the client’s data. However, changing the address does not imply that the person’s name should change, so that would be examined separately from the address and updated appropriately.

Data that hasn’t been changed by the client won’t overwrite data that has been changed by another client. Data that has been changed by the client will overwrite any other changes. The system automatically merges where there are no conflicts and resolves conflicting edits via “last write wins”.

Other Thoughts

There doesn’t seem to be a foolproof way of ordering overwrites such that the most recently changed data ends up as the final version. I could make changes on my phone, switch it off, make more changes on my iPad and then switch my phone back on. My phone’s older data becomes the canonical version. I could try using a timestamp, but there’s no guarantee that those are correct. Lamport clocks won’t help because, as far as they are concerned, the two edits happened simultaneously.

The problem can be generalised from being considered as a DVCS problem to a distributed database problem, which opens up some more potential research. Reading up on distributed databases led me to the CAP theorem, which states that you can’t have immediate consistency of data if your database is always available (even if the device has no internet connection) and is split into several partitions (ie. a central SQL instance and a local CoreData instance). That means conflicts and merging are inevitable, and the way around it is “eventual consistency”. The disparate datastores will eventually synchronise and will eventually all have the same data; in the meantime, the absolute truth is fractured into the various stores and can only be determined by considering the entire cloud of devices participating in the system.

I installed and played with CouchDB for a while, which quickly supplanted MongoDB as my new favourite NoSQL database. Its approach to handling conflicts during data replication between nodes? Push back to the application and let it deal with the problem. It seems there is no “correct” way to handle merge conflicts automatically. My merging system with its “last write wins” bodge is, I think, the best solution to the problem given the constraints.

2010-08-26

Farewell to SourceForge

SourceForge has been Woopsi’s source code host since October 2007. They’ve been a great host. They offer some excellent features, and haven’t charged me a thing for nearly 3 years of diligently caring for my code. However, both technology and “social” coding techniques have improved since I adopted Subversion. I’ve found modern distributed version control systems to be faster and more flexible than Subversion, whilst the forking capabilities provided by sites such as GitHub and BitBucket make collaborating with other coders incredibly easy.

A request from Lakedaemon that I move Woopsi to a DVCS finally convinced me that it was time to move on, so I’ve switched from Subversion to Mercurial. Woopsi and its associated sub projects are now hosted on BitBucket:

The SourceForge page remains in place as it’s impossible to close a SourceForge site that has code committed to it. It now includes a notice indicating that the project has moved.