Fixing HTML 5 Storage
The NoSQL movement has raised a flag about possible inclusion of SQL in the HTML 5 standard. Many of these people are fans of simpler key/value APIs. I'm guilty of getting up on stage with a white-on-black slide that says "No SQL in HTML 5."
I've since changed my mind. Have all the SQL in HTML 5 you want. Seriously, go for it.
SQL in HTML 5
There are two storage APIs in the HTML 5 spec: SQL Storage and Local Storage. They are under two different parts of the spec, and they are both being implemented by browsers, so it looks like they are both going to be with us for a while. I'll leave it up to the HTML 5 crew to decide what to do about standardizing SQL. They say:
"The working group is currently debating whether SQL is the right abstraction for this API."
The problem with standardizing it (as I understand it) boils down to, on the one hand, documenting every nook and cranny of SQLite, and on the other hand, expecting vendors to follow a storage spec with so much surface area and potential for mishandling. Worst case scenario: every vendor uses a different SQL engine with its own quirks and differences.
Local Storage
It turns out that localStorage as featured in the HTML5 spec is very nearly rich enough to support a real implementation of the CouchDB API. Jeremy Orlow pointed me to the localStorage interface on the public-webapps mailing list.
Local storage is also on the right track as far as usage scenarios:
It's also fairing well with implementations in most of the major browsers. Some people have suggested that the existence of all these implementations will make it harder to change the spec, but I think the changes I'm suggesting are sensible and only enhance the existing use cases.
The current implementations are targeting megabytes of data, not gigabytes, but fixing that will come part and parcel with the B-tree API I'm suggesting. But first, another rallying cry:
No locks in HTML5 localStorage!
There a troubling part of the HTML 5 spec about the concurrency model. Specifically the phrase "obtain the storage mutex" and the other requirements about independent execution of scripts give me an instant fright. As a programmer I really don't want to be handling locks, especially in a web environment. As I've discovered through my database experience, there's no reason to use locks in a storage layer.
Locking access to the data store is one way to provide safe concurrent access, but it starts to fall over as soon as you have more than a few scripts accessing the same Storage object. We have no locks in CouchDB, accomplished by using multi version concurrency control, a storage technique popularized by the PostgreSQL database. The MVCC approach would seem to be a simpler and more humane solution to the problem of concurrent storage usage.
What it takes to build a Couch
When I started talking publicly about the HTML 5 storage problem I had, like many, not taken a full look at the problem. I decided to sit back and think about the big picture.
I know I want the browser to be able to interoperate with CouchDB via replication. I don't mind if page-authors have to include a CouchDB compatibility library to do this. I just want the underlying system to be able to support non-trivial amounts of data, and to be useable in a CouchDB-like fashion.
As a matter of fact, I think SQLite could be shoehorned into this role, in a pinch. However, it wouldn't be very fun, is my guess. Maybe it wouldn't be so bad. Who knows... actually, it's starting to sound like masochistic fun. Back to work, Chris!
So what really, is the bare-minimum to support Couch? At the heart of our storage engine is a B-tree index that supports efficient range traversal as well as lookup by id. Here's an example of what the raw storage API could look like in a browser. (Hint, there's more to it than just this but that comes later.)
JS pseudocode
var btree = new WebStore("dbname", <optional collation function definition>);
btree["mydocid"] = {"some":"json"};
btree.forEach(function(key, value) {
// in order traversal
})
btree.forEach(function(key, value) {
// reverse order traversal
}, false)
btree.forEach("startkey", function(key, value) {
// in order traversal, starting from "startkey"
// use throw() to stop traversal
})
btree.forEach("endkey", function(key, value) {
// reverse order traversal, starting from "endkey"
// use throw() to stop traversal
}, false)
// delete a btree
WebStore.drop("dbname");
This should be enough to implement CouchDB style storage, as well as map-reduce views, on top of the raw API. The idea of implementing CouchDB in the browser is really cool. Luckily, Atul Varma from Mozilla has already started a similar project, called BrowserCouch.
As I hinted above, there's one thing missing from this story. Accounting for concurrent access from multiple windows, frames, and tabs is crucial. The only sensible way I know of to allow long-running tasks in multiple tabs to access the same store is using multi-version concurrency control (MVCC) to provide whoever-saves-first-wins semantics. This way the frames can even use localStorage as a communications method between them.
Native MVCC
Thus, we need to incorporate native MVCC. Without MVCC concurrent users would not know if they were overwriting each others changes. MVCC is necessary in order for the B-tree to be safe for concurrent access from multiple windows or tabs in the same browser. Actually– I started to add this to the API and the reality of MVCC token management lead me pretty quickly to something that looks almost exactly like the CouchDB API:
var btree = new WebStore("dbname", <optional collation function definition>);
var mvcc_rev = btree.save({_id : "mydocid", some : "json"});
var doc = btree.get("mydocid");
// we need to provide the latest _rev for the save to succeed.
btree.save({
_id : "mydocid",
_rev : mvcc_rev,
some : "other json"
});
doc = btree.get("mydocid");
doc.some == "other json";
// missing the latest _rev, so the save will fail
btree.save({_id : "mydocid", some : "optimisic locking"});
// has the latest rev so save succeeds
doc.multi_version_concurrency_control = "ftw!";
btree.save(doc);
I tried hard not to go all the way to the CouchDB API, but everything else I tried looked like a code Turducken. The problem is how else to manage the {id, rev} tuple but in the document itself. If you try to pass the _id and _rev around explicitly, you end up with this Turducken:
[id, mvcc_rev, doc] = btree.get("mydocid");
btree.save(id, mvcc_rev, {some : "other json"});
First of all that's weird because JavaScript doesn't even have Erlang-style pattern matching, so the first line is a syntax error, or results in global variables, or both. But even if we could pattern match, that's icky code. The real option is more like this:
resp = btree.get("mydocid");
resp.doc.some = "extra typing";
btree.save(resp.id, resp.mvcc_rev, resp.doc);
Which you have to admit is still a duck stuffed in a turkey. What we want is this:
doc = btree.get("mydocid");
doc.some = "relaxing storage";
btree.save(doc);
This turns out to be the CouchDB model, so I guess that's what we want in the browser. Thanks, concurrency!
Where to go from here
Ian Hickson suggests that I follow these recommendations for contributing to the HTML5 spec. The next step based on what I know so far is to talk with the browser vendors and convince them to work on compatible implementations.
It's no secret that some people at Mozilla think Couch is pretty cool, and WebKit has a strong localStorage implementation as well. Hopefully those projects will see something interesting in what I've said here.
There's also the matter of implementing CouchDB on top of a local B-tree, which is not a trivial task. Some of the harder questions are about how to replicate with remote CouchDB instances. The nice thing about this is CouchDB's test suite is already written in JavaScript so that should help in implementing an all-JavaScript CouchDB.
