Project Voldemort vs Cassandra for write heavy apps


Sean

unread,
Jan 31, 2011, 1:30:56 AM1/31/11
to project-voldemort
There seems to be a consensus that the Bigtable model (HBase/Hypertable)
is good for range queries, and the Dynamo model (Cassandra/Voldemort) is
good for writes. OK, let's discuss from this consensus:

For write-heavy apps, is there any benchmark comparing Project Voldemort
and Cassandra? I suppose the consistency model and DHT routing are
probably similar in these two systems, so does performance come down
mostly to the per-node storage engine (BDB vs. SSTables)?

Is there any theoretical or empirical comparison? Or benchmark results?

Alex Feinberg

unread,
Jan 31, 2011, 3:23:13 PM1/31/11
to project-...@googlegroups.com
Hi Sean,

By default, Voldemort uses BerkeleyDB Java Edition as the storage
engine. BerkeleyDB Java Edition actually uses a log-structured B+Tree,
which is the same design principle as Log Structured Merge Trees used
by SSTables in BigTable/Cassandra. If you'd like to learn more, I
suggest reading the BerkeleyDB Java Edition Architecture white paper
(http://www.oracle.com/go/?&Src=4945225&Act=7) from Oracle/Sleepycat.
If you'd like to understand log-structured systems in general, Mendel
Rosenblum's Ph.D. dissertation is a good start:
http://www.eecs.berkeley.edu/~brewer/cs262/LFS.pdf (the paper is about
a file system, but in the grand scheme of things file systems and
databases are remarkably similar).

In terms of actual benchmarks, here's one:
http://blog.medallia.com/2010/05/choosing_a_keyvalue_storage_sy.html

Both Voldemort and Cassandra are also supported by YCSB (the Yahoo!
Cloud Serving Benchmark). We provide a slightly modified version of
YCSB with Voldemort as the performance tool:
https://github.com/voldemort/voldemort/wiki/Performance-Tool

As far as I recall, writes are slightly faster in Cassandra and reads
are slightly faster in Voldemort. At least with version 0.6 and
earlier, I believe that, out of the box, the performance impact of log
compaction is somewhat less visible in Voldemort than in Cassandra (of
course, it entirely depends on your environment and configuration in
both cases).

HBase and Hypertable also use LSM trees and, in a normal scenario,
also have very high write performance. I am not very familiar with
Hypertable, but HBase is also able to do fast range scans. There are
some very interesting applications built to leverage the write
performance and data model of HBase, e.g., OpenTSDB.

The key difference between Dynamo and BigTable is the behaviour in a
failure scenario: in the case of BigTable, when a node responsible for
a partition goes down, there is a period during which read and write
availability is lost until another node takes over. Using WAL
shipping (which I believe either is supported or may be supported by
HBase in the future), it's possible to achieve high availability for
reads, and there's ongoing work to minimize the "transition" period for
a failed node down to a few seconds (presently, if I am correct, it's
around 1-2 minutes?). Once this is done, it would mean that upon
failure, you would see latency spikes as the clients retry writes until
one succeeds. The advantage of this is the ability to do more atomic
operations: e.g., to implement a counter in Voldemort, you have to use
an "optimistic lock" with a vector clock (see the applyUpdate() method
in the StoreClient interface), but this can be done atomically in HBase.
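To illustrate the optimistic-lock pattern, here's a hand-rolled Python
sketch of the read-modify-write retry loop (my own illustration of the
idea, not Voldemort's actual StoreClient API; the version counter stands
in for a vector clock):

```python
class ObsoleteVersionError(Exception):
    """Raised when a concurrent writer updated the key first."""

class VersionedStore:
    """Toy stand-in for a versioned key-value store."""

    def __init__(self):
        self.data = {}               # key -> (value, version)

    def get(self, key):
        return self.data.get(key, (None, 0))

    def put(self, key, value, expected_version):
        _, current = self.get(key)
        if current != expected_version:      # lost the race: reject the write
            raise ObsoleteVersionError()
        self.data[key] = (value, current + 1)

def increment(store, key, retries=5):
    """Optimistic counter increment: read, compute, conditionally write, retry."""
    for _ in range(retries):
        value, version = store.get(key)
        try:
            store.put(key, (value or 0) + 1, version)
            return
        except ObsoleteVersionError:
            continue                 # someone else wrote first; re-read and retry
    raise RuntimeError("too much contention")
```

Under contention the loop simply re-reads and retries, which is the same
shape an applyUpdate()-style helper wraps for you; an atomic counter in
HBase avoids the round trips entirely.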

Of course, keep in mind that I'm talking about the architecture here;
the implementation details change.

Thanks,
- Alex


Sean

unread,
Feb 1, 2011, 1:47:13 AM2/1/11
to project-voldemort
I see a vast performance difference in a published benchmark. Can
someone give their thoughts on it?




John Cohen

unread,
Feb 1, 2011, 11:49:04 AM2/1/11
to project-...@googlegroups.com
Hi All,

A benchmark is just one part of the overall system consideration that
you have to evaluate based on your business goals. Another is how easy
or difficult certain things are with a particular persistence layer:
for example, backup/restore (point-in-time recovery), the "tooling"
you may need to debug or inspect your system, usability, etc.

Consider more than just benchmarks in your evaluation,
thanks,
-john

Tatu Saloranta

unread,
Feb 1, 2011, 2:01:00 PM2/1/11
to project-...@googlegroups.com
On Tue, Feb 1, 2011 at 8:49 AM, John Cohen <john.ja...@gmail.com> wrote:

Fully agreed -- there is sometimes a tendency to overestimate the value
of performance. And even within performance there are multiple
trade-offs between basic latency (~= maximum speed) and throughput
(~= scalability with concurrency).

Beyond backups/restores, the ability to seamlessly add/remove nodes is
often important, and something not many distributed systems handle
ideally (how to bootstrap, whether config files need to be manually
modified, etc.).

With respect to Voldemort, there seem to be developments to create
high(er)-performance mysql/innodb-based backends, which could give
significant performance gains and possibly allow better backup
functionality as well (since RDBMSs typically have well-developed
functionality in this area). The ability to plug in different backends
is nice if and when backends do have different trade-offs.
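To illustrate why pluggable backends are cheap to support, here's a toy
Python sketch (my own illustration; the SQLite backend just stands in
for an RDBMS-style backend like the mysql/innodb one mentioned above):
the store only needs get/put, so very different engines are drop-in
peers behind the same interface.

```python
import sqlite3

class DictBackend:
    """In-memory backend: fastest, but no durability or backup story."""

    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

class SqliteBackend:
    """RDBMS-style backend: slower puts, but inherits the database's tooling."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")

    def put(self, key, value):
        self.conn.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, value))

    def get(self, key):
        row = self.conn.execute(
            "SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        return row[0] if row else None
```

Anything written against the get/put contract works unchanged with
either backend, which is exactly what makes swapping trade-offs
(speed vs. backup functionality) practical.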

-+ Tatu +-

Alex Feinberg

unread,
Feb 1, 2011, 3:57:52 PM2/1/11
to project-...@googlegroups.com
Hi Sean,

There is a large difference in the 99th-percentile time with 1.5 kb
values (and improving 99th-percentile outliers is something that's on
our roadmap), but the average write latency is only 3 ms higher in
Voldemort's case with a 50/50 read/write workload (while overall
throughput is higher for Voldemort than Cassandra). In the 50% write,
10% re-write, 50% read workload, average throughput is only slightly
higher in Cassandra's case.

Average throughput is indeed clearly higher in Cassandra's case for a
majority-write (90%) workload. I should note that, in addition to the
difference between SSTables and BDB JE, another reason the write
throughput is so much higher in Cassandra's case is that in Voldemort,
writes are conditional at the storage engine layer: we fetch the old
version, compare vector clocks, and only write the newer value if the
old clock happened before the new one (that explains why re-write
behaviour is comparable -- the old value is still in the BDB cache or
the operating system page cache).
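For anyone unfamiliar with the comparison being made, here's a
textbook-style Python sketch of that conditional write (my own
illustration, not Voldemort's code): clocks are maps of node-id to
counter, and the put is applied only if the stored clock happened
before the incoming one.

```python
def happened_before(a, b):
    """True if clock a precedes clock b: a <= b on every node, and a != b."""
    if a == b:
        return False
    return all(a.get(node, 0) <= b.get(node, 0) for node in set(a) | set(b))

def conditional_put(store, key, value, clock):
    """Write only if the stored version is an ancestor of the new one."""
    old = store.get(key)             # the extra read that slows writes down
    if old is None or happened_before(old["clock"], clock):
        store[key] = {"value": value, "clock": clock}
        return True
    return False                     # concurrent or older clock: reject
```

That initial store.get() is the per-write cost Cassandra's unconditional
append avoids; it's cheap on re-writes only because the old value is
usually still in cache.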

Unfortunately the benchmark doesn't publish the 95th-percentile numbers,
which are more realistic for both Cassandra and Voldemort. I'd imagine
Cassandra's surprisingly high 99th-percentile write latency with 15 kb
values is due to garbage collection (incremental scans can take up to a
few dozen milliseconds, while a full GC can easily take up to a second).

I'd suggest using the YCSB tool to compare the systems back to back
with a workload similar to your environment; YCSB does print
95th-percentile latency numbers. Take a look at the configuration
options in terms of tuning BDB and JVM parameters (the JVM GC settings
will likely apply to Cassandra as well):

http://project-voldemort.com/configuration.php

You'll also want to do your test with more than a 4GB heap: this lets
you set a higher BDB cache size without running into memory pressure
due to garbage collection. We use an 18GB heap and a 10GB BDB cache in
production at LinkedIn (overall, the recommendation from the BDB folks
is to use no more than 2/3 of your JVM heap for the BDB cache). A
larger cache means better read performance (important in read and
write-after-read scenarios).

You'll also want to look at:

1) How gracefully these systems degrade as your data:memory ratio changes
2) How they behave with different probability distributions in
requests and thus cache locality (does the hot portion of your data
fit into RAM?)
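Point (2) is easy to demonstrate with a toy Python experiment
(illustrative numbers only, nothing Voldemort- or Cassandra-specific):
the same LRU cache gets a far better hit rate under a skewed request
distribution, where the hot set fits in "RAM", than under a uniform one.

```python
import random
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache tracking only hit/miss, not values."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def access(self, key):
        hit = key in self.data
        if hit:
            self.data.move_to_end(key)           # refresh recency
        else:
            self.data[key] = True
            if len(self.data) > self.capacity:
                self.data.popitem(last=False)    # evict least recently used
        return hit

def hit_rate(keys, capacity):
    cache = LRUCache(capacity)
    return sum(cache.access(k) for k in keys) / len(keys)

rng = random.Random(42)
KEYSPACE, REQUESTS = 10_000, 50_000
uniform = [rng.randrange(KEYSPACE) for _ in range(REQUESTS)]
# Skewed workload: 90% of requests hit a 1% "hot" subset of the keys.
skewed = [rng.randrange(100) if rng.random() < 0.9 else rng.randrange(KEYSPACE)
          for _ in range(REQUESTS)]
```

With a cache of 500 keys (5% of the keyspace), the uniform workload hits
roughly at the cache:data ratio, while the skewed one hits almost every
hot-set request; that gap is what "does the hot portion fit into RAM?"
is really asking.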

You'll also want to look at using BDB 4.1.7 (which features improved
performance). It's on my radar to upgrade Voldemort to this version of
BDB, but I still haven't had a chance to fully evaluate it; there is a
bdb417 branch on GitHub which you can use to test.

Ultimately, Cassandra's storage engine (and the way it is coupled to
the rest of the system) is an advantage for Cassandra when it comes to
writes. However, you should see how it performs with *your specific*
workload, and look at throughput as well as average and 95th-percentile
latency, to see what performance will look like for you.

As others have stated, the feature differences are (IMO) the more
important difference: when the hot portion of your data fits into
memory, you can easily saturate the network with either system.

Thanks,
- Alex
