CouchDB MapReduce example: word count
Note: Updated for CouchDB trunk as of 1/15/09
One of the classic Hadoop MapReduce tutorials counts words in a text corpus. Word counts are a great way to teach the fundamentals of MapReduce, and there are plenty of free books on Project Gutenberg.
To follow along at home, check out CouchRest from GitHub:
git clone git://github.com/jchris/couchrest.git
Also install the gem:
sudo gem install couchrest
I’ve included the example code as well as three books in the “examples” directory.
The short version is:
cd couchrest
ruby examples/word_count/word_count.rb #loads the books into couchdb
ruby examples/word_count/word_count_views.rb #creates the design document
ruby examples/word_count/word_count_query.rb #runs the query
The last step could take a few minutes (and you may have to rerun it if Ruby times out). But eventually you’ll get some happy output.
To rerun the queries (it is also fun to edit the script and play with the params):
ruby examples/word_count/word_count_query.rb
The initial reduction can take about 5 minutes on an average MacBook, so the Ruby script will probably time out and fail the first time. Go get some coffee. When you come back, run it again. Once the reduce has run, queries should be nearly instantaneous.
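If you would rather not rerun the script by hand, the "run it again" step can be wrapped in a small retry helper. This is only a sketch: `with_retries` is a hypothetical helper, not part of CouchRest, and it assumes the failure surfaces as `Timeout::Error`. The real script may raise a different exception class, in which case the `rescue` line would need to change.

```ruby
require "timeout"

# Hypothetical helper: run the block, and run it again if it times out.
# `attempts` and `wait` are illustrative defaults, not tuned values.
def with_retries(attempts: 3, wait: 0)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue Timeout::Error
    # Give up and re-raise once the attempts are used up.
    raise if tries >= attempts
    sleep wait
    retry
  end
end

# Usage with a simulated slow query that only succeeds on the third try.
result = with_retries(attempts: 5) do |try|
  raise Timeout::Error, "simulated timeout" if try < 3
  "done after #{try} tries"
end
puts result
```

In practice you would put the view query inside the block and choose a `wait` of a minute or so, since the first reduce takes a while.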
The code teaches the fundamentals of CouchDB view functions, collation order, and reduce query params, and provides some helpful output while doing so.
The upshot is that you can now query for the count of any word, in one of the three indexed books, or in all three. And those queries are fast!
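To make the idea concrete, here is a minimal in-memory sketch of the same map, sort, and reduce steps in plain Ruby. It does not talk to CouchDB and is not the code in the examples directory. The sample documents, the `[word, title]` key shape, and the helper names are all assumptions made for illustration, and Ruby's array ordering only roughly stands in for CouchDB's collation rules.

```ruby
# Hypothetical documents standing in for the books.
DOCS = [
  { "title" => "book-a", "text" => "the cat saw the dog" },
  { "title" => "book-b", "text" => "the dog barked" }
]

# Map step: emit one ([word, title], 1) row for every word in a document.
def map_doc(doc)
  doc["text"].downcase.scan(/\w+/).map { |word| [[word, doc["title"]], 1] }
end

# Reduce step: sum the emitted values.
def reduce_values(values)
  values.sum
end

# The "index": every emitted row, sorted by key.
ROWS = DOCS.flat_map { |doc| map_doc(doc) }.sort_by { |key, _| key }

# Group rows by the first `level` elements of the key, then reduce each
# group. This is similar in spirit to a grouped reduce query.
def grouped_counts(rows, level)
  rows.group_by { |key, _| key.first(level) }
      .transform_values { |group| reduce_values(group.map { |_, value| value }) }
end

per_word = grouped_counts(ROWS, 1)          # counts across all books
per_word_and_book = grouped_counts(ROWS, 2) # counts within each book

puts per_word[["the"]]                    # prints 3
puts per_word_and_book[["the", "book-a"]] # prints 2
```

Putting the word first in the key means all rows for a word sit next to each other once sorted, which is why one index can answer both the "one book" and the "all books" questions.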