litany against fear

coding with spice ¤ by nick quaranto

Using Berkeley DB and Ruby for Large Data Sets Notes

published 20 Nov 2008

Part of my series of notes from the Professional Ruby Conference. See them all here.

This is a talk by Matt Bauer from Pedal Brain.

Large amount of data in their system with cycling. Tires, speed, wind, etc. 46 data points. 29 collected every 5 seconds. 1.7 TRILLION data points (50 GB!) with over 5000 concurrent requests. Everything from GPS location to power. Devices are only periodically connected, so data must be captured quickly and latency must be low. Realtime reporting of where cyclists are with this system. Ability to do reports on many different levels. Typical report is 40k-220k rows.

They settled on lighttpd, Merb, and Sleepycat Software’s BerkeleyDB. It’s fantastic: open source and highly concurrent. Can scale to 256 terabytes! Fully featured DB too…ACID, locking, and more. What’s special: it’s embedded, written in C with many bindings. No SQL or schema, and it doesn’t interpret keys or data: both are just bytes to it. Like sqlite, but way more battle tested.
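Since there is no schema, the application has to decide how a record turns into bytes. Here is a minimal sketch of one way to do that with Ruby's built-in `Array#pack` and `String#unpack`. The record layout and the helper names (`encode_key`, `encode_sample`, and friends) are made up for illustration; the talk did not show how Pedal Brain encodes its data.

```ruby
# Hypothetical record layout, not from the talk:
#   key  = rider id (32-bit big-endian) + Unix timestamp (32-bit big-endian)
#   data = speed in km/h (big-endian double) + power in watts (16-bit big-endian)

def encode_key(rider_id, timestamp)
  [rider_id, timestamp].pack("NN")
end

def decode_key(bytes)
  bytes.unpack("NN")
end

def encode_sample(speed_kmh, power_watts)
  [speed_kmh, power_watts].pack("Gn")
end

def decode_sample(bytes)
  bytes.unpack("Gn")
end

key  = encode_key(42, 1_227_139_200)
data = encode_sample(38.5, 270)

puts key.bytesize   # 8 bytes: two 32-bit integers
puts data.bytesize  # 10 bytes: one double plus one 16-bit integer
```

Fixed-width big-endian integers are a handy choice for keys because comparing the packed strings byte by byte gives the same order as comparing the numbers, which matters once the keys live in a sorted access method.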

Two data stores. Concurrent data store: no recoverability or transaction capabilities. Transactional data store: fully recoverable and complete ACID feature set. So it’s very powerful…but is it good through Ruby? YES! Showing off quite a few tests about how blazingly fast it is. No marshalling data over the wire or through a socket so it’s very fast.

Very difficult to handle though. All sorts of special constraints that can be customized. Two bindings let Ruby talk to Berkeley. Guy Decoux wrote the original, which is pretty slow and built for a much older Ruby version. The newer one, by Dan Janowski, is an exact implementation of the C API and quite stable. Only three test cases though.

Guy’s bindings let you feel like the data is literally an object in Ruby. Dan’s way is a lot like C. A bit messy and uses bit flags, but it’s a direct translation from C anyway. There’s only one table with BerkeleyDB, but different types of access:


  • BTree: sorted, balanced tree. Ordered insertion results in full pages.

  • Hash: extended linear hashing

  • Queue: fixed length records

  • Recno: stable record numbers
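The practical difference between the sorted and hashed access methods is that a sorted store can answer range queries in key order, while a hash only answers exact lookups. The toy class below is an in-memory stand-in written for this post to show the idea. It is not BerkeleyDB code and `SortedStore` is a made-up name.

```ruby
# Toy in-memory stand-in for a sorted (BTree-style) store. Not BerkeleyDB code.
class SortedStore
  def initialize
    @keys = []   # all keys, kept in sorted order
    @data = {}   # key => value
  end

  def put(key, value)
    unless @data.key?(key)
      # Binary search for the first existing key >= the new one, insert before it.
      index = @keys.bsearch_index { |k| k >= key } || @keys.size
      @keys.insert(index, key)
    end
    @data[key] = value
  end

  # Exact lookup, the only thing a plain hash store can do.
  def get(key)
    @data[key]
  end

  # Cursor-style scan: every pair with first <= key <= last, in key order.
  def range(first, last)
    start = @keys.bsearch_index { |k| k >= first } || @keys.size
    pairs = []
    @keys[start..-1].each do |k|
      break if k > last
      pairs << [k, @data[k]]
    end
    pairs
  end
end

store = SortedStore.new
store.put("rider1:0010", "speed=31")
store.put("rider2:0005", "speed=28")
store.put("rider1:0005", "speed=30")

# All samples for rider1, oldest first, regardless of insertion order.
store.range("rider1:", "rider1:~").each { |key, value| puts "#{key} #{value}" }
```

With keys shaped like "rider:timestamp", one range scan pulls back everything for a single rider in time order, which is the kind of query a report needs.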

Variable-length keys and data, and a single key or data item can be up to 4 gigs. Damn. Transactions and cursors are in there too and can be done completely through Ruby code. Indexes are in too, but they require two separate database connections.
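To see why an index needs a second database: the primary store maps a primary key to the record, and the index is a second store that maps some field of the record back to primary keys. A rough sketch with two plain Ruby hashes standing in for the two databases follows. `IndexedStore` and its methods are invented for illustration and are not the API of either binding.

```ruby
# Toy illustration of a secondary index kept in a second store.
# Plain Ruby hashes stand in for the two databases; this is not BerkeleyDB code.
class IndexedStore
  # The block derives the secondary key from a record.
  def initialize(&key_extractor)
    @primary = {}                                # primary key => record
    @secondary = Hash.new { |h, k| h[k] = [] }   # secondary key => primary keys
    @key_extractor = key_extractor
  end

  def put(primary_key, record)
    # Drop any stale index entry before overwriting the record.
    delete(primary_key) if @primary.key?(primary_key)
    @primary[primary_key] = record
    @secondary[@key_extractor.call(record)] << primary_key
  end

  def delete(primary_key)
    record = @primary.delete(primary_key)
    return unless record
    @secondary[@key_extractor.call(record)].delete(primary_key)
  end

  def get(primary_key)
    @primary[primary_key]
  end

  # Look up records through the index instead of the primary key.
  def find_by_secondary(secondary_key)
    @secondary.fetch(secondary_key, []).map { |pk| @primary[pk] }
  end
end

rides = IndexedStore.new { |record| record[:rider] }
rides.put(1, { rider: "anna", km: 42 })
rides.put(2, { rider: "ben",  km: 18 })
rides.put(3, { rider: "anna", km: 65 })

p rides.find_by_secondary("anna").map { |r| r[:km] }
```

The fiddly part is keeping both stores in step on every write, which is why `put` removes the old index entry first. In the C API the library does that bookkeeping once a secondary database is associated with its primary.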

The future: combine the Ruby feel of Guy’s code with Dan’s C codebase, improve tests, more marshaling options. Support for newer versions of the DB and moving to GitHub soon.

Databases != SQL or Entity/Relationship. Just be happy and use Ruby! Use the right tool for the job too, whether it’s BerkeleyDB, sqlite, MySQL, or whatever.


