May 7, 2010
Choosing a key-value storage system (Cassandra vs. Voldemort)
Motivation
At Medallia, a key component of our system currently works with an open source relational db. Since this component mainly queries the db entries by key, we want to try to switch to a key-value storage system and take advantage of several benefits provided by such a system, including distributed replication, load balancing, and failover. One of our objectives is to re-architect this component in a way that will allow us to achieve horizontal scalability, that among other things will help us alleviate the high disk storage requirements we currently have.
Recently we took the time to look into this (and other technological improvements too, exciting times at Medallia right now!), and we reviewed several options. To make a long story short, we ended up with two finalists, Apache Cassandra and Project Voldemort.
High level comparison
Project Voldemort
While not an exhaustive list, these are the most relevant pros and cons we identified when reviewing both stores:
- Pros
- Simpler API
- Persistency based on Berkley DB, a mature and well-known key/value db
- Uses Vector Clocks instead of simple timestamps. It doesn’t need the nodes (or clients) clocks to be synchronized
- Cons
- No built-in support for “multiple data center”-aware routing (meaning there must be 1 copy of each key in at least one data center)
Apache Cassandra
- Pros
- Broader range of systems in production (Facebook, Twitter, Digg, Rackspace)
- Richer API which supports values with a dynamic column structure. The columns can evolve independently, meaning that you can update one column without reading the whole structure.
- Optimized for writes (by design)
- Configurable consistency level (specified on each request)
- Cons
- File format is still in development, changes to the internal structure are likely to happen. Due to the flexibility it provides, the file format is more complex and harder to reason with, especially in terms of performance
- Requires Clock Synch (NTP) (for nodes and clients)
- Reads are more disk-intensive than competitors
- Doesn’t support client conflict resolution, so the latest update always wins
Performance Tests
To our surprise this was the only link we’ve found that compares the performance for both projects – thus we decided to write this post to share our research. We used the vpork test framework, which we modified to suit our needs by upgrading the client code to the latest versions, adding a warm-up phase, and adding rewrite capabilities. These are the results of our tests:
Setup:
- Versions
- Voldemort v0.80.1
- Cassandra 0.6.0-beta3
- Boxes: 3 similar nodes with the following spec:
- 4GB maximum heap size
- Replication parameters: N=3 (replicas for each entry), R=2 (nodes to wait for on each read), W=2 (nodes to block for on each write)
- 8 processors on each server (Intel(R) Xeon(R) CPU E5504 @ 2.00GHz)
- 1TB disk space (Seagate ST31000340NS, 7200 RPM, 32MB cache)
- Persistence parameters
- Voldemort (default values)
- key-serializer: string
- value-serializer: identity (byte array)
- persistence engine=bdb (Berkley DB)
- bdb.cache.size=1536MB
- Cassandra
- ColumnFamily definition: CompareWith=”BytesType” RowsCached=”10000″
- ReplicationFactor=3
- Partitioner=org.apache.cassandra.dht.RandomPartitioner
- ConcurrentReads=16
- ConcurrenWrites=32
- Voldemort (default values)
- Tests
- Client Threads: 40
- Initial load: 5 million records – records present before starting each test
- WarmUp: 20K records – initial writes before measuring time for each test
- Number of operations per test: 500K
We ran tests for 4 different write-rewrite-read configurations. A write is equivalent to a put operation with a new record (non-existing key). A rewrite is a put operation with an existing key. A read is a get operation on an existing key. These are the configurations we tested:
- 50% Write 50% Read
- 10% Write 40% Rewrite 50% Read
- 50% Rewrite 50% Read
- 90% Rewrite 10% Read
We ran all the tests for two different value sizes, 15 and 1.5 KB. Even though we evaluated different options, for our current needs, the last one with a 15 KB data entry was the most representative scenario.
The first pair of charts shows the latency, or average time it takes a read or write operation to complete in each case. Lower values are better. As expected, Cassandra write (and rewrite) times were consistently faster than Voldemort, while read times varied a bit depending on the scenario but were more or less the same in general.
|
|
The second pair of charts shows the maximum time in the best 99% of cases; again lower values are better:
|
|
On the front-end, we have a write-back cache which means that write operations don’t affect the user experience. On the other hand, read operations are directly related to page loads. That’s why we were concerned about the peak for Cassandra read in the last scenario for 15KB. We ran some further tests to measure the 99.9% and 99.99% percentiles and the difference was even greater: 5050 ms for Cassandra and 748 ms for Voldemort in the first case, and 9176 ms against 1129 ms in the second case. This huge difference was a key decision factor for us.
Finally, these two charts show the general throughput in terms of operations (read or write) per second. In this case higher values are better:
|
|
Notes:
- Cassandra commit log and data folder are supposed to be placed at different disks to improve performance, we tested with both on the same disk.
Issues found while testing:
- Voldemort client.put(K key, V value) (not the one that takes a Version object) throws ObsoleteVersionException if called with the same key from different threads. The javadoc states “Associated (sic) the given value to the key, clobbering any existing values stored for the key. “, so this was not expected.
And the winner is …
I think there is no clear winner, in general terms. The best option depends on many factors that each company has to evaluate. My preference changed a few times during the review and tests.
Having said that, we had to choose one, and we decided to go with Project Voldemort. The main reasons were the simplicity, better versioning control, persistency layer maturity, and latency predictability.
We are currently developing the new solution, and it will take some time before we can put it in production, but we wanted to share our preliminary results with everyone who is considering one of these two options, so they’ll have one more tool at the time of the decision.
We’ll keep you posted on how it goes.
Diego Erdody
Lead Software Engineer
Other useful articles comparing different key-value stores:
- http://blog.endpoint.com/2010/03/nosql-live-dynamo-derivatives-cassandra.html
- http://www.vineetgupta.com/2010/01/nosql-databases-part-1-landscape.html
- http://arstechnica.com/business/data-centers/2010/02/-since-the-rise-of.ars/2
- http://www.rackspacecloud.com/blog/2009/11/09/nosql-ecosystem/
- http://bhavin.directi.com/tag/memcached/
- http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/
These are interesting results thanks for sharing. But why cassandra 0.6.0beta3 and voldemort 0.80.1? 0.6.1 and 0.80.2 have been out for a while.
Also, would you share your configurations for both projects? e.g. row/key cache size in cassandra’s case, bdb transactions and serializers for voldemort.
Also, you may be interested in a similar benchmark suite released by Yahoo! at
http://github.com/brianfrankcooper/YCSB
while it does not cover voldemort yet (probably because it also benchmarks scan queries) it would be interesting to see it added

Comment by riffraff on May 8, 2010 at 1:47 amThanks for the interesting benchmarks!
I find it rather surprising, looking at the graphs and in the absense of other decision criteria, that you decided to use Voldemort.
Out of the write latency results, 15/16 are in favour of Cassandra.
For the read latency, 7/18 are in favour of Cassandra.
For the throughput, Cassandra beats Voldemort in 7/8 cases, mostly by significant amounts.
I do understand that you were concerned about 1 specific case, the last 15KB scenario, as this would show up most significantly on your frontend. On the other hand, the throughput for this case favours Cassandra by almost 3:1?
What about doing 99.9% and 99.99% for the other cases? I would find that interesting.
Thanks!
Benjamin
P.S. I’ve worked with Voldemort for the past 6 months, so feel free to ask if you need help…

Comment by Benjamin Nortier on May 8, 2010 at 2:31 amThank you for publishing this. What I’m surprised not to see is any information about any of the settings you used. Because Voldemort uses BDB-JE under the hood (well, if that is what you choose, and it looks like you did), then there are literally *dozens* of settings just for BDB-JE. Maybe half a dozen or them or so are *critical* for performance and control how often BDB’s log files are merged, how often various cleaner threads run, how often buffers are flushed to disk, etc. — all big factors when it comes to performance.
Maybe you can share that, too, so people know what you tested.
It would also be interesting to capture CPU and disk IO and such, while vpork is running, so people can understand what is going on there – do both C and V keep the CPU equally busy? How much disk IO does each of them create? etc.
Finally, do you think consulting with experts/developers from each project and have them help ensure that their solution is ideally tuned for your particular environment, data, and usage pattern?

Comment by Otis Gospodnetic on May 8, 2010 at 4:02 amI’m very concerned on how badly titled your article is. Cassandra isn’t a key/value storage its column-family based!!!!

Comment by TuAno on May 8, 2010 at 11:36 amSpeaking from experience, putting the commit log and the data folder on different disks vastly improves cassandra performance. The lack of seeks on the commit log disk vastly improves write times while the lack on “constant writing” vastly improves read times.
This was suggested for a reason.

Comment by Brady on May 8, 2010 at 1:22 pmThis is an interesting start to performance testing these systems, but raises many more questions than it answers. I am disappointed you chose not to investigate the enormous, unexplained spreads in performance for either system tested, nor to attempt to adjust tuning parameters to improve any of the metrics reported.

Comment by Benjamin Black on May 8, 2010 at 1:40 pmHow does all this compare to, say, MySql, Postgresql, SQLite3, and Oracle?
Otherwise, why do I want to support such a niche beastie?

Comment by Steve on May 8, 2010 at 2:26 pmthis benchmark is flawed big time. As someone mentioned earlier, Cassandra is NOT a key/value DB. Maybe Redis vs Voldemort would’ve been a better comparison.

Comment by anonymous on May 8, 2010 at 2:54 pmHi Diego, I find your article extremely useful and interesting for my line of work. Keep up with the great quality!!! Looking forward to your next post.

Comment by artemis on May 8, 2010 at 3:43 pmThanks for the comments! I’ll update the post with the specific persistency parameters we used, shortly.
For the people saying that Cassandra is not a key-value storage system, let me quote the Cassandra site: “Cassandra: A highly scalable, eventually consistent, distributed, structured key-value store. ”
The common factor is that you have to access the values by key, regardless of the content/structure.

Comment by Diego Erdody on May 8, 2010 at 4:01 pmA comparison to a conventional DB would be interesting. This numbers look very disappointing to me. I expect a conventional DB to perform much better in this scenario…

Comment by hagrid on May 9, 2010 at 12:27 amI did some testing on cassandra, and maybe I didn’t have it set up right or something, but I generated 1 million pseudo-random data entries and stored them, them retrieved them and found my two-node cassandra cluster failed to store all the data correctly. I also had problems when I took a node offline with the client connections.
I would advise people using any of these data storage systems to do their own testing with their own sample data, to measure performance, resilience and reliability.
Current favourite is Mongo, but YMMV!

Comment by Paul M on May 9, 2010 at 7:54 amWow, we still have that “key-value store” part on the website? We should fix that.
Without seeing more details, my guess would be that (a) you need to increase memtable size for 15KB values; the default will have you flushing every 1-3s here, which is too fast, and (b) high worst-case latency may indicate JVM pauses from GC, meaning your heap is too small, your GC options need tweaking, or both. (We’re working on better default GC options for 0.6.2.)
Are you going to publish your vpork-based code?
/cassandra developer

Comment by Jonathan Ellis on May 9, 2010 at 8:31 amYour replication settings suggest this test was conducted against a cluster of boxes.
1. How big was the cluster?
2. How did you target your reads and writes? (Per-thread round-robin? Global round-robin? Random? All against one host?)
3. Was there a timeout on the throughput test resulting in failover to a new host? (This would improve latency significantly for the client in the 99% case.)

Comment by David Schoonover on May 10, 2010 at 6:43 am“We used the vpork test framework, which we modified to suit our needs by upgrading the client code to the latest versions, adding a warm-up phase, and adding rewrite capabilities.”
I looked for a fork on github, can you please post your modified vpork?

Comment by Ian Kallen on May 12, 2010 at 12:27 pmI wonder if the rtesults are normalized to 1 server. If not, should I assume that there wwere 3 servers (as replication factor was 3)?

Comment by Israel Ziv on May 13, 2010 at 2:35 amI’ve updated the post with the persistence parameters in each case. I think it answers most of the questions. Thanks!

Comment by Diego Erdody on May 24, 2010 at 12:53 am