2012-12-07

Voting For Eggs

The iOS version of Super Foul Egg is still underway. Since the last post I’ve fixed all of the coding problems I identified and added the pause function, but there’s no visible button for that yet. There’s still no exit game function and I haven’t yet replaced the “press any key” bitmaps, either.

Instead, I’ve been working on adding back in a feature I removed: two-player mode. Where’s the fun in SFE if you can’t play against a real person? I took out the feature because it’s far too complicated to try to track and differentiate between two users interacting with the same touchscreen. To add it back in I’ve been playing with GameKit.

GameKit is a very simple networking framework that allows iOS devices to communicate over Bluetooth or Wi-Fi. It announces the availability of a peer via Bonjour, automatically assembles a network of peers with no user intervention, and then allows multicast messaging between them. Having written a network library for games myself, I know how tricky this is to get right, and I’m impressed at how easy GameKit is to use.

Although GameKit takes care of the messaging system admirably, there’s still the problem of co-ordinating the behaviour of multiple instances of the same game across a network. The control system was easy (there’s a sketch of the receiving side in code after this list):

  • Every time the player moves an egg, transmit the move over the network;
  • Receive the message on the peer;
  • Send the move as a notification to the notification centre;
  • Receive the notification in the grid representing the remote player;
  • Perform the move.
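
A minimal sketch of the receiving half of that relay, written in Swift for brevity (the real app would be Objective-C) and assuming the GameKit send and decode plumbing exists elsewhere:

    import Foundation

    // Hypothetical decoded form of the "move" message sent over GameKit.
    struct MoveMessage {
        let direction: Int   // e.g. -1 = left, 1 = right
    }

    let remoteMoveNotification = Notification.Name("RemoteMove")

    // Called by the (assumed) GameKit receive handler once the raw bytes have
    // been decoded; it simply forwards the move to the notification centre.
    func didReceiveMove(_ move: MoveMessage) {
        NotificationCenter.default.post(name: remoteMoveNotification,
                                        object: nil,
                                        userInfo: ["direction": move.direction])
    }

    // The grid representing the remote player listens for the notification
    // and performs the move.
    final class RemoteGrid {
        private var observer: NSObjectProtocol?

        init() {
            observer = NotificationCenter.default.addObserver(
                forName: remoteMoveNotification, object: nil, queue: .main) { note in
                if let direction = note.userInfo?["direction"] as? Int {
                    print("Moving remote player's egg: \(direction)")
                }
            }
        }
    }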

I disabled the drop timer for the second player’s grid and instead rely on a “drop egg” network message, which keeps the two devices in sync. Fortunately, GameKit has a messaging mode that guarantees messages are transmitted and received in the order they were sent.

The difficult part has been choosing the next pair of eggs. Eggs are chosen randomly. How can two devices be made to choose the same random pair of eggs?

I came up with a few solutions for this. Disregarding the option of using a pseudorandom progression for the sequence of eggs (which would have removed the need for any network communication but would also have made the game extremely predictable), the first was to allow network peers to choose their own eggs. Every time they choose an egg, they send a message to the network: “I chose red”. All peers add a red egg to their next-egg lists. They only need to choose an egg when they reach the end of the next-egg list.

This would work in most situations, but it falls apart if two or more peers try to choose an egg at the same time. Peer 1 would choose red; peer 2 would choose blue. Peer 1 and peer 2 now disagree on what the next egg will be. Other peers will have either red or blue as their next egg depending on which message reached them first.

Despite the race condition this is a temptingly simple approach. Unfortunately, the most likely time for a disagreement to occur is when the game first starts up and every peer chooses its first set of eggs. Everyone disagrees; the game is a shambles.

The second approach was to designate one of the peers as a server and the others as its clients. Any time a client requires an egg it sends a message to the server: “I need a new egg”. The server responds to all of its clients: “The next egg is red”. All peers add a red egg to their lists. If the server needs a new egg it just broadcasts the next-egg message without being prompted.

One handy feature of GameKit is the way it automatically chooses a randomised ID for each peer in the network. The server can be chosen by ordering all peer IDs numerically and selecting either the first or the last in the list. As long as every peer uses the same algorithm, no network traffic is required to select a server.
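
Assuming the peer IDs arrive as plain strings (a simplification of what GameKit actually provides), the election might look something like this:

    // Every peer runs the same function over the same set of peer IDs, so every
    // peer agrees on the "server" without exchanging any messages.
    func electServer(from peerIDs: [String]) -> String? {
        return peerIDs.sorted().last
    }

    // All three devices independently pick "c91f".
    let server = electServer(from: ["7b30", "c91f", "0a44"])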

However, I was dissuaded from using this approach because it requires two sets of code for all network-related functionality. One set has to perform server actions, whilst the other has to perform client actions. It would be better if all network participants were peers and ran the same code.

Thinking about the problem some more led me to the phrase “reaching consensus in a distributed system”, which took me straight back to my distributed systems class at university: this is a solved problem. I needed to implement a voting system.

The algorithm works like this:

  • Peer 1 needs a new egg.
  • Peer 1 sends out a message to all peers: “I vote for a red egg for vote 1”.
  • Peer 2 receives the message.
  • Peer 2 responds to all peers with the message: “I vote for a blue egg for vote 1”.
  • Peer 3 receives the message.
  • Peer 3 responds to all peers with the message: “I vote for a red egg for vote 1”.

As each peer votes, all participants in the network store each peer’s vote along with the peer ID and the vote number. Once the number of votes collected for that vote number is equal to the number of peers, all peers have voted. All peers use the same algorithm for selecting the “winning” vote. The network arrives at consensus.

It’s important to note that each peer will only vote on a given vote number once. Otherwise, each peer would respond to every incoming message, resulting in a never-ending, exponential cascade of voting.

To choose the winning vote, I initially planned to sum the votes for each egg colour and select the colour with the most votes. In the case of a tie, I’d choose the vote from the peer with the highest ID. I then realised that I could greatly simplify the algorithm by using the American approach to an election: ignore the popular vote and just use the single vote from the peer with the highest ID. At last I have a system that works.
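
A rough sketch of that bookkeeping, in Swift rather than the Objective-C the game would actually use, with made-up types standing in for the real network messages:

    // One vote: which peer cast it, for which round, and for which colour.
    struct Vote {
        let peerID: String
        let voteNumber: Int
        let colour: String
    }

    final class EggBallot {
        private var votes: [Int: [Vote]] = [:]   // vote number -> votes collected so far
        private var votedOn = Set<Int>()          // rounds this peer has already voted in
        let peerCount: Int

        init(peerCount: Int) {
            self.peerCount = peerCount
        }

        // Returns true exactly once per vote number, enforcing the "each peer
        // votes only once per round" rule that prevents the voting cascade.
        func claimVote(on voteNumber: Int) -> Bool {
            guard !votedOn.contains(voteNumber) else { return false }
            votedOn.insert(voteNumber)
            return true
        }

        // Store an incoming vote (our own included). Once every peer has voted,
        // the round is decided: ignore the popular vote and take the colour
        // chosen by the peer with the highest ID.
        func record(_ vote: Vote) -> String? {
            votes[vote.voteNumber, default: []].append(vote)
            guard let round = votes[vote.voteNumber], round.count == peerCount else {
                return nil
            }
            return round.max { $0.peerID < $1.peerID }?.colour
        }
    }

Each peer would call claimVote(on:) before broadcasting its own vote, then feed every vote message, its own included, into record(_:) until it returns a winner.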

The next step is to ensure that the next-egg list is never empty: if the list is empty, gameplay has to halt while voting takes place, which can take a few seconds. Other than that, I’m going to simplify the garbage egg placement algorithm. It currently fills the columns with the fewest eggs in them first, which is predictable across all peers, but given two columns with the same number of eggs it chooses between them at random. I could implement a messaging system for garbage egg placement, but it is far easier to make the algorithm deterministic when running a network game.
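
For instance, the tie could be broken by always taking the lowest column index, which every peer computes identically (the column heights below are invented):

    // Pick the column for the next garbage egg: the emptiest column, breaking
    // ties by the lowest index rather than a random choice.
    func garbageColumn(heights: [Int]) -> Int {
        let lowest = heights.min()!               // the grid always has columns
        return heights.firstIndex(of: lowest)!    // first match = lowest index wins ties
    }

    // Columns 1 and 3 both hold two eggs; every peer deterministically picks column 1.
    let target = garbageColumn(heights: [4, 2, 3, 2, 5])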

2012-06-29

iOS Data Synchronisation Strategies

I’m currently writing an iOS app that is essentially a front-end to a SQL database. Users see data formatted into an attractive hierarchical layout and can enter information using the usual set of lists, pickers and textboxes. However, what makes this app unusual is the requirement that it be usable regardless of whether or not the device has an internet connection. Data can be pulled from the server when the user has an internet connection and can be edited even if the connection drops. When the connection resumes, the device sends the updates to the server and fetches any other changes made.

Immediately this raises all sorts of questions. The really, really big question is this: How does the system resolve conflicts? What happens if two users try to change the same information at the same time? What happens if a user makes changes on a device without a connection, makes conflicting changes on a second device with a connection, and then tries to sync the first device?

Here’s another mindbending requirement: Users can never be expected to manually merge data. When you consider that Apple is trying to hide the filesystem because the average user can’t cope with the concept of hierarchies, this makes sense. How can someone who doesn’t understand a simple folder hierarchy be expected to perform a 3-way merge?

After putting some thought into the problem, I came up with three possible solutions.

Last Write Wins

This is the easiest solution to implement and the most likely to result in data loss. When a device sends its local changes to the server it simply overwrites anything stored there.

Consider this scenario:

  • A user makes some trivial changes to the data on his iPhone.
  • He switches off the phone.
  • He spends a week making extensive changes to the data on his iPad.
  • He switches on his iPhone.
  • His week of changes is entirely overwritten with the iPhone’s older data.

The server’s database already exists and cannot have its schema altered, and unfortunately it doesn’t support versioning. Once the data is overwritten it is gone.
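
Reduced to a sketch, the whole strategy is a blind replacement; the dictionary here is just a stand-in for the real records:

    // "Last write wins": whatever the device pushes replaces the server's copy.
    // Any changes made from another device since this one last pulled are lost.
    func push(clientRecord: [String: String], into serverStore: inout [String: String]) {
        serverStore = clientRecord
    }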

Checkin/Checkout

This is the TFS model of working. If I want to edit some data (which can be thought of as a document), I need to check it out first. The document is locked to me and no-one else can edit it in the meantime. Edits are therefore serialised so there’s no chance of conflicting edits being made.

In order to support this, each device must have a unique identifier. Checking a document out to a user isn’t specific enough, because a user could have two devices (as per the “last write wins” scenario) and make conflicting edits on both. As Apple no longer allow apps to access the iOS device’s unique identifier, each installation of the app must generate its own unique ID and store it on the device. This allows a document to be checked out by a specific user on a specific device.
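
Generating and persisting that per-installation ID is only a few lines of work; a sketch along these lines, with a made-up UserDefaults key:

    import Foundation

    // Each installation creates a UUID on first run and keeps it in UserDefaults,
    // standing in for the device identifier Apple no longer exposes.
    func installationID() -> String {
        let key = "installationID"   // hypothetical key
        if let existing = UserDefaults.standard.string(forKey: key) {
            return existing
        }
        let fresh = UUID().uuidString
        UserDefaults.standard.set(fresh, forKey: key)
        return fresh
    }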

But what if a user leaves his phone at home and needs to checkout the document on a different device? We’ll have to amend the system so that checkouts can be overridden. That creates a new problem: what do we do with documents at checkin time that have had their checkout overridden and are therefore subject to conflicting edits? We have two choices: overwrite everything on the server and lose all changes made on the other device, or pull down the server data and lose everything on this device. We’re losing data again.

Even if we’re happy to accept the possibility of lost data (at least we can blame the users for ignoring the lock warnings) there’s another scenario we have to deal with. What happens if a user has a document stored on his device and wants to edit it but doesn’t have an internet connection? The device can’t contact the server to obtain the lock. Do we allow the edits and hope that no-one else grabs the lock before we get a connection back? What if someone else updates the document and releases the lock before that happens? We won’t know that the document has changed and we lose data.

Checkin/checkout is clearly a bad model:

  • Obtaining a lock without a connection is impossible and any workaround will lead to lost data;
  • Not allowing editing without a lock will prevent the app being used without an internet connection;
  • Allowing locks to be overridden will lead to lost data;
  • Not allowing locks to be overridden will lead to severe usage limitations.

Distributed Version Control

My reference to TFS in the “checkin/checkout” model should suggest my thought process so far: It’s essentially a distributed version control problem. We have a central repository and multiple clients that will:

  • Pull the latest state;
  • Change their data offline;
  • Push back to the server.

Unlike a DVCS, we have two big limitations:

  • The server doesn’t store a history;
  • Merges must be entirely automatic.

It’s important that the clients do as little work as possible in resolving conflicts. It’s possible that clients for other platforms will get written, and their programmers won’t want to re-implement a bunch of merging code that should have been on the server in the first place.

How can you tell a server to merge changes from a client if the server has no idea what its data looked like when the client last performed a pull?

This is my solution:

  • Client pulls data from server.
  • Client stores two copies of the data: One is the “pristine” server state and is immutable; one will be used for editing.
  • When the client pushes, it sends both the pristine and edited states of the data.
  • The server receives the data and compares its current state, the pristine state and the edited state of the data.
  • If the pristine and edited data matches, no changes have been made and the data should not be altered regardless of the current state.
  • If the pristine and edited data doesn’t match, the current data is overwritten with the edited state.
  • If the edited data matches the current data, no changes are made.
  • The resulting dataset is sent back to the client.
  • The client updates its local data with the data received from the server.

Note that, unlike a text document, the data in question can be broken down into discrete pieces. For example, it could contain a person’s name and address, which in turn would be broken down into first name, last name, street, county, post code, etc. Changing any element of the address would change the meaning of the entire address, so any single change would cause all fields to be overwritten with the client’s data. However, changing the address does not imply that the person’s name should change, so that would be examined separately from the address and updated appropriately.
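
Here’s roughly how that per-field-group comparison could look, with a person’s name and address as the two groups and plain strings standing in for the real records:

    // Three-way comparison for one field group: if the client didn't touch it,
    // the server's current value survives; if the client changed it, the client wins.
    func merge(current: String, pristine: String, edited: String) -> String {
        return pristine == edited ? current : edited
    }

    // The client edited the address but not the name, while another device
    // changed the name in the meantime. Both changes survive the merge.
    let name    = merge(current: "Joanne Bloggs",   // changed by another device
                        pristine: "Jo Bloggs",
                        edited: "Jo Bloggs")        // untouched by this client
    let address = merge(current: "1 Old Street",
                        pristine: "1 Old Street",
                        edited: "2 New Street")     // edited by this client
    // name == "Joanne Bloggs", address == "2 New Street"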

Data that hasn’t been changed by the client won’t overwrite data that has been changed by another client. Data that has been changed by the client will overwrite any other changes. The system automatically merges where there are no conflicts and resolves conflicting edits via “last write wins”.

Other Thoughts

There doesn’t seem to be a foolproof way of ordering overwrites such that the most recently changed data ends up as the final version. I could make changes on my phone, switch it off, make more changes on my iPad and then switch my phone back on: my phone’s older data becomes the canonical version. I could try using a timestamp, but there’s no guarantee that device clocks are accurate. Lamport clocks won’t help because, as far as they are concerned, the two edits happened concurrently.

The problem can be generalised from a DVCS problem to a distributed database problem, which opens up some more potential research. Reading up on distributed databases led me to the CAP theorem, which states that a data store split across several partitions (i.e. a central SQL instance and a local Core Data instance on each device) cannot be both always available (even when the device has no internet connection) and immediately consistent. That means conflicts and merging are inevitable, and the way around it is “eventual consistency”: the disparate data stores will eventually synchronise and will eventually all hold the same data; in the meantime, the absolute truth is fractured across the various stores and can only be determined by considering the entire cloud of devices participating in the system.

I installed and played with CouchDB for a while, which quickly supplanted MongoDB as my new favourite NoSQL database. Its approach to handling conflicts during data replication between nodes? Push back to the application and let it deal with the problem. It seems there is no “correct” way to handle merge conflicts automatically. My merging system with its “last write wins” bodge is, I think, the best solution to the problem given the constraints.