Tuesday, April 13, 2010

Experimenting with Databases

I've been playing around with various Open-Source databases for a little weekend database project.

Last weekend was really rainy. April usually feels like summer here, but this year is an el Nino year so we're getting a lot of rain... I shouldn't complain though, as the rain turns into great powder in Tahoe and we've had a great ski season!!

But I didn't go skiing last weekend... For my database project I had two buckets of data to work with:

Bucket (A)
- low volume of data (35000 entries)
- structured data with a stable structure
- low update and access rate (a few thousands a day)
- need flexible query and offline analysis capability

Bucket (B)
- higher volume (500000+ entries)
- unstructured text
- lots of reads/writes, no deletes
- need super fast key-based lookups for interactive access

I looked at the following SQL and NoSQL databases:
- BerkeleyDB
- Apache Cassandra
- Apache CouchDb
- MongoDB
- PostgreSQL
- TinyCDB
- Tokyo Cabinet

There has been a lot of buzz about SQL vs NoSQL databases recently so I wanted to try both types of databases.

I quickly eliminated the following products as their licenses were not permissive enough for me and not compatible with the Apache license:
- BerkeleyDB (viral as code using it must be made available in source form)
- Tokyo Cabinet (LGPL)
- MongoDB (AGPL)

Then I tried the remaining products. I say 'tried' instead of 'tested' as I didn't have much time for a scientific test approach and was mostly trying to get a feel for each database. I just used the simple tools provided out of the box with each product and didn't spend time optimizing their configuration.

I used data from bucket (B), hoping that the products that performed well with (B) would fit the bill for bucket (A) as well: 580000 text entries, keys ranging from 3 to around 200 bytes, same characteristics for values, total 137 Mb

I tried to insert the 580000 entries one by one, then did 100000 random gets.

Here are my first observations:

Apache Cassandra
I used binary release 0.5.1. Cassandra was easy to setup and get started with, although I got a little lost in their Wiki.

580000 inserts with cassandra-cli took 15mn, used 100% CPU, quite a bit of network I/O and 520Mb of disk space.

Average speed:
- insert 1.55ms
- get 1.58ms

Apache CouchDb
I built version 0.11.0 from source and had to work through issues with different levels of libmozjs on Ubuntu. Getting started took 10mns, although I'd have liked a real client instead of just CURL.

580000 inserts with curl took 124 mn, used 100% CPU, a lot of network I/O and... 4.5Gb of disk space!

Average speed:
- insert 12.82ms
- get n/a... couldn't measure gets as the CouchDb view creation just kept hanging.

I also didn't understand how to create CouchDb views right away, and could have used a better 'Getting started with views doc'.

Version 9.0alpha4 was easy to build from source, setup and get started. I picked version 9.0 as it's supposed to have new high availability + replication features that I may want to try later...

580000 inserts with psql took 85 mn, used mostly disk I/O (didn't use much CPU at all, which was good) and 243Mb of disk space.

Average speed:
- insert 8.79ms
- get/select (with no index) 173ms

After creating an index, which took 40sec:
- get/select 0.085ms

I built version from source and got started in about 5 mns.

20000 inserts with sqlite3 took very long, using mostly disk I/O (I lost patience and killed it after 15mn.)

Average speed:
- insert 45ms
- get n/a... didn't measure as I had given up loading the database

I think the problem was that sqlite3 does an fsync for each write. Their doc describes how to not fsync, but then I can't afford losing my data... and I won't be able to batch inserts either.

That one is a different animal... I used tinycdb version 0.77. It's really tiny (core code under 1400 lines of C) and easy to build from source, with a tiny doc, but very simple to use.

TinyCDB is a 'constant' database. You can't modify it after you've added entries to it and saved the database file. To update the database, you need to rebuild it from scratch with the new values. It's so fast though, that doing that is not necessarily a problem...

580000 inserts with the cdb command took 2.25 sec and 150Mb of disk space

Average speed:
- insert 0.00379ms
- get 0.00683ms

That's about 1 to 2 orders of magnitude faster than the other databases!

With these initial observations, I'm tempted to experiment further with PostgreSQL for bucket (A) and TinyCDB for bucket (B).

Both packages are proven and stable technologies too. PostgreSQL has been deployed in production for years, and TinyCDB is used in production in well known DNS and mail packages like djbdns, fastforward, mess822, qmail etc.

I'll probably have to put together a simple cascading scheme to handle updates with TinyCDB (a small TinyCDB db containing recent and/or frequent updates, periodically flushed into a bigger one, for example), but it shouldn't be too hard to do.

I'll try to post an update here after another weekend or two playing with this...


Philippe said...

Very nice!
tinycdb (and cdb that that inspired it) is incredible in its simplicity and completely imbeatable speed. The fact it is read only is almost a non-issue given recreating a db is freaking fast!

Ritesh said...

Thanks for your wonderful comparison. I am trying to learn more about TinyCDB. I have never worked before CDB. Do you have any good pointers

Jean-Sebastien Delfino said...

You can start there:
http://www.corpit.ru/mjt/tinycdb.html. There's a few more pointers on that page. Then just download and read the code, it's not big and pretty easy to follow.

The postings on this site are my own and don’t necessarily represent positions, strategies or opinions of my employer IBM.