CouchDB Implements a Fundamental Algorithm

2009/10/25 18:49:03 +0000

We're seeing a lot of action in the key/value map/reduce world lately. On the one hand this is because simpler stuff scales more simply, and on the other because key/value and B-Tree stores map cleanly to some fundamental algorithms.

At the heart of CouchDB's value proposition is incremental, peer-to-peer replication. All copies of a database are complete and can function independently at all times. Data can be shared with other nodes via replication, which can be triggered or run in continuous mode.

Replication is damn simple. Most of the "hard" work is done at CouchDB write time. Did I mention how simple CouchDB writes are? The gist of it is that for document storage, CouchDB keeps two indexes. One is by document id -- this is used for most lookups, and to enforce uniqueness of document ids within a database. The other index is by local update sequence. Each time a database is updated, the update is tagged with a monotonically increasing integer. This is a just local sequence specific to the local database, so it doesn't have to be replicated or shared.

The by-sequence index drives CouchDB's incremental replication, map reduce views, and external indexers. How? When you start the replicator, it traverses the by-sequence index to find the original updates. This means the target database will see the updates in roughly the same order that the source database saw them. When the replicator gets to the end of the by-sequence index, it knows it has seen the complete state of the current database. The replicator finishes by storing the source db's final sequence number into a replication history document on the target db.

The next time replication begins, the replicator can pull the high-water mark from the history document. The new replication can pick up from the last sequence number, so it doesn't have to start from the top again.

In primitive form, you can start to see how this technique allows for incremental peer-based replication, as any node can track its status vis-a-via any other node, and bring itself up to date, just by traversing the by-sequence index. Incremental map views are just like replication, except instead of updates being copied verbatim to the view index file, they are filtered and transformed using JavaScript functions. For an added bonus, compaction can be thought of as local replication to an empty file.

Optimized

In practice, CouchDB makes a few optimization to the above technique. The big one is that the by-sequence index is sparse. This means if a single document is updated over and over again, replication will only transfer the current state, not all the intermediate states, saving on bandwidth, storage and processing costs.

Another optimization is that before an update is transfered, it is compared to the content of the target database. If the target already has that update (perhaps from replication with a different peer) it doesn't need to be transfered.

Lockless

When you sit down to start implementing something that exhibits these properties, there is a challenge you run into as soon as you start to think about concurrency. View query range-scans, replication, document listings, reduce queries -- these are all potentially long-running processes that depend on the consistency of an index over multiple operations.

What's more, think about the writer: it's got to update the by-sequence and the by-id indexes together in an all-or-nothing way. (The view engine also updates multiple B-Tree indexes atomically as well.)

Your standard database toolkit will tempt you to reach for a familiar abstraction layer: transactional isolation. If you can ensure that only one process can access the B-Tree at any time, then you don't have to worry about writes altering the data-structure while reads are in progress.

Homey don't play that: transactions are a form of locking, and I un-friended locking years ago, for posting too many lame updates to my log files.

To get to the core of how CouchDB provides these properties in a lockless way requires understanding the append-only file format, but I'll give you the basic picture: Each db file has a single writer process and multiple reader processes. Readers can proceed independently of the writer, getting a consistent snapshot of the data, even as changes are being made.

In a nutshell, since we only ever append to the file, we never touch anything that's already on disk. Each time we commit to the file, we rewrite the portion of the index that is invalidated by the changed documents. For geeks: this means we update O(log n) B-Tree nodes on each update. The last thing we write to the file for each commit is the B-Tree root (there's also a signed header written to ensure reliability.)

Each reader grabs the file, and finds the last header / B-Tree root. From there it can traverse the by-seq or the by-id index (or in the case of a view file, any of the map indexes, or the back-reference index). If the writer updates the file, it will go unnoticed by any existing readers, as they will not have pointers to the new portion of the file. In this way CouchDB gives readers long-lived consistent access to indexes without locking or transactions.

Original?

CouchDB is not the first implementation of an append-only B-Tree, but it may be unique that it has the ability to interleave multiple indexes on disk in an append-only way. This interleaving is what gives it the ability to update the by-sequence and by-id indexes together atomically, without transactions.

The overall effect of this design is to maximize for reliability and concurrency. There is no way to corrupt the data, as we never touch what we've already written to disk. Erlang can support as many simultaneous readers as resources allow, which we've found in practice to be in the tens of thousands.

If you're the type to have names for these techniques, let me know in the comments. I wouldn't be surprised to find CouchDB described in a dusty-old algorithms book somewhere.

The fact that CouchDB's API is such a thin wrapper around a fundamental algorithm, makes me think it is time to start talking and thinking about alternate implementations: C++, JavaScript, Ruby

Comment on this post