Reflections on CouchDB 1.0

2010/07/14 14:05:31 +0000

On the occasion of CouchDB hitting 1.0, I figured I'd take the time to reminisce about how I got interested in it, how I joined the project, and some fun digressions along the way.

I first heard of CouchDB in October 2007 thanks to Jacob Kaplan-Moss's oft-quoted blog post. I was intrigued by the idea of an HTTP and JSON database (what else could you need?) but I was deep in the middle of my Rails-based startup. Even though I was feeling acute pain from my efforts to model the messy world of free-range MP3s in Active Record, I didn't quite realize that the document model could work for me. "I'll have to remember this for later," I thought, and went back to my 4th-normal-form contortions.

It wasn't until a few months later, when we started to iterate on the requirements for Grabbit, my music startup (with Greg Borenstein, creator of robots), that I considered diving into CouchDB. We moved from a del.icio.us-like model where users would post a page URL to us, and we'd process the page and turn it into a playable Ajax playlist, to a more Google-like model, where we started actively crawling the web to find pages that would make good playlists.

It turns out that spidering the web into Postgres "doesn't scale." By this I mean we were only able to shove about 100 pages an hour into our database, and as the tables grew, query times were growing worse than linearly. By the end of the old architecture, we started caching denormalized JSON representations of our data (needed for API calls) on the main object tables. At some point we even crossed the line into updating that JSON in place.

I'm sure I could have buckled down and optimized the Postgres backend, but it turned out to be easier to rearchitect. We didn't have to rewrite, as most of our interesting code (aside from the internecine ActiveRecord models) was in the parsers, which we could reuse, storing their output as JSON documents in CouchDB instead of 30-something relational tables.

The new architecture would be every engineer's dream (circa early 2008): we used Hadoop and Nutch to spider the web, and then ran our Ruby and Hpricot-based parsers to convert web pages to JSON, which we stored in CouchDB. CouchDB's map-reduce was the engine for queries.
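For readers who haven't used it, the map-reduce query model can be sketched in a few lines. The Python below is a toy in-memory imitation, not CouchDB's API (real CouchDB views are JavaScript functions stored in design documents), and the document fields (`blog`, `tracks`) are made up for illustration rather than taken from Grabbit's actual schema:

```python
from collections import defaultdict

# Hypothetical parser output: one JSON-like document per crawled page.
docs = [
    {"_id": "page-1", "blog": "example-a", "tracks": ["one.mp3", "two.mp3"]},
    {"_id": "page-2", "blog": "example-b", "tracks": ["three.mp3"]},
    {"_id": "page-3", "blog": "example-a", "tracks": ["four.mp3"]},
]


def map_doc(doc):
    """Map step: emit (key, value) pairs for a single document."""
    yield doc["blog"], len(doc["tracks"])


def reduce_values(values):
    """Reduce step: combine all values emitted under one key."""
    return sum(values)


def query(documents):
    """Run map over every document, group by key, then reduce each group."""
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_doc(doc):
            groups[key].append(value)
    return {key: reduce_values(values) for key, values in sorted(groups.items())}


tracks_per_blog = query(docs)
print(tracks_per_blog)  # {'example-a': 3, 'example-b': 1}
```

The real thing builds these indexes incrementally and keeps them on disk, which this sketch doesn't attempt; it only shows the shape of a query.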

Spidering speed increased by three orders of magnitude, and the codebase was significantly simpler. We were able to re-enable features we'd removed due to slow queries blocking the database. We even implemented the recommendation engine we'd always planned on. There was only one catch: the music industry.

Warning: One paragraph non-sequitur on the futility of the music industry.

We pitched our technically impressive prototype to Rob Hayes and Christine Herron of First Round Capital. To their credit, they didn't just stare blankly at us, the way VCs who aren't gonna invest do in stories; they got down to brass tacks and asked the hard questions. We were pitching a recommendation engine to help bands know which MP3 blogs to promote themselves to. Rob did some back-of-the-envelope math: if there are (optimistically) 5,000 PR firms out there who are in a position to pay for this, we'd have to charge them hundreds, if not thousands, of dollars a month in order to get the kind of return VCs are looking for. Alternatively, there are roughly 20 million bands on MySpace. If 10% of them were willing to pay $10 / month for the service, we'd be banking $240 million a year, which is a nice company, but not quite VC-sized, especially if these are our blue-sky numbers. Thank you Rob for the reality check! (If you are doing a music startup, don't give up, but do run realistic numbers.)
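Rob's envelope math is easy to check. Here is a quick sketch in Python that uses only the figures quoted above (the variable names are mine):

```python
# Back-of-the-envelope revenue check using the figures from the pitch meeting.
bands_on_myspace = 20_000_000  # "roughly 20 million bands"
paying_percent = 10            # blue-sky assumption: 10% become paying users
price_per_month = 10           # dollars per band per month

# Integer arithmetic keeps the result exact.
paying_bands = bands_on_myspace * paying_percent // 100
annual_revenue = paying_bands * price_per_month * 12

print(f"{paying_bands:,} paying bands -> ${annual_revenue:,} per year")
# 2,000,000 paying bands -> $240,000,000 per year
```

Drop the conversion rate to a less optimistic 1% and the same sum gives $24 million a year, which is why blue-sky numbers deserve suspicion.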

One good side effect of building this system was that it happened to be in good condition for a demo when Jan Lehnardt was in Portland for OSCON. I think the impressiveness of my recommendation engine (all Ajax and Couch, with a little bit of Merb thrown in for caching) played a big role in my becoming a CouchDB committer.

Now that we've had that brief digression about why you shouldn't go into the music industry, I should tell you a little bit about myself. I've never been the kind of programmer who's pleased about complexity or other markers of how hard I'm working. I get flummoxed by the simplest things, like compiling php5 for dreamhost, remembering how to deal with mod-rewrite, or making sure my Mongrels stay alive despite memory leaks.

At some point I realized that CouchDB offered a chance to simplify all of that. When I started with the project, Couch served JSON (and binary attachments) over HTTP. Because it's written in Erlang, it has an uncanny knack for never dying. In all the time I was hammering it from my Hadoop cluster, I never once had to think about CouchDB's reliability. It just worked.

Then I had a eureka moment: if CouchDB is just a web server, why can't it be my only web server? Why deal with keeping Mongrel alive when I could just run one unkillable program and that'd be that?

I'd like to say the CouchApp idea was born fully-formed all at once, but it wasn't like that. I got my commit bit in September 2008, after writing a big patch that touched most of CouchDB (to simplify the internal JSON term format), and the first thing I did was stupid.

I was smitten with the idea that CouchDB could be an application server, so I spearheaded some code to allow the JavaScript engine that ships with CouchDB to act as a traditional app server. Users could write entire JavaScript applications that would be triggered via HTTP requests to CouchDB, and could make queries against CouchDB (or other web services), construct responses, and alter the database contents. Imagine Ruby on Rails but fronted by an Erlang web server that also happened to be a database. This approach was powerful -- too powerful.

I didn't yet understand the extent of CouchDB's ambitions. It's one thing to write a web server framework meant to be used and deployed by developers. In that context, you can give absolute flexibility, and require that developers understand the security implications of the code they write (no open proxies please!). But when you're writing code that will be deployed to servers, desktops, mobile phones, and even household appliances, you have to be much more cognizant of the security risks and trade-offs.

The action.js patch was too flexible for CouchDB's use-case, so I quickly reverted it, after Damien explained the context to me. Since then, I've noticed that a fair proportion of new contributors have a crazy idea up their sleeves that needs taming before they are ready to be stewards of the project. The tamed version of action.js turned into the externals protocol, which was completed with a lot of help from Paul Davis, and is now used to power CouchDB-Lucene among other extensions.

So that's the story of my first big failed patch. (I've reverted a few things since then, but nothing so major.)

I'll skim over 2009 -- for me it was mostly about speaking at all the conferences I could, working on CouchApp (action.js without the fatal flaws), and getting to know the CouchDB community better.

Probably the CouchDB highlight of 2009 was CouchHack at Damien's house in Asheville, NC. It was there that I learned that not only is Damien a damn smart developer, he's someone I'd hang out with even if we weren't hacking code together. It was also that week that Damien, Jan and I started laying plans to create the startup that would become Couchio (aka Relaxed, Inc).

Since I started working on the CouchApp model, we've added a whole suite of RESTful capabilities to CouchDB, aimed at making it suitable for deployment on port 80. The show and list functions can transform JSON documents and views into HTML output (or any other format). The motivation here is that without link-following, you can't be RESTful.

There have been a lot of other notable improvements to CouchDB and the ecosystem in the last 18 months. Not just in performance, security, and ease of use, but also in the things you can do with it.

Now that we've reached 1.0, the vision we've been working on is a reality. It feels really, really good to be at this point. But now would be a horrible time to stop. There is a lot more to do in the realm of performance and usability improvements. More importantly, we have a lot of work ahead of us showing the world just how powerful replicated local data is, and how simple CouchDB can be to use.

Couchio is in the process of writing professional documentation, creating a web-hosted CouchDB service, sponsoring the first CouchDB user conference (CouchCamp), and readying CouchDB for deployment to Android and other mobile platforms.