ScyllaDB - C++14 based Cassandra drop-in replacement

Vitaly Davidovich

unread,

Sep 22, 2015, 2:57:13 PM9/22/15

to mechanical-sympathy

If you ever wondered what Cassandra perf would be like if written in native code with perf in mind, this would probably be it: http://scylladb.com

From the same guys at Cloudius Systems that develop Seastar and OSv.

sent from my phone

Greg Young

unread,

Sep 22, 2015, 6:29:16 PM9/22/15

to mechanica...@googlegroups.com

When I see a 4x io performance increase eg writes I tend to lean towards different trade offs on durability. Are there any white papers on this?

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

Studying for the Turing test

Vitaly Davidovich

unread,

Sep 22, 2015, 6:41:14 PM9/22/15

to mechanical-sympathy

You should read (if you haven't yet) the architecture section on that site and also the explanation/setup/analysis of the benchmark itself. AFAICT, they do not sacrifice durability. Tests were done on a pretty beefy machine (e.g. 48 lcore machine, 4x RAID SSD drives, and 10Gb kernel-bypassed NIC (via DPDK)) - they have full specs there.

sent from my phone

Richard Warburton

unread,

Sep 23, 2015, 1:14:44 PM9/23/15

to mechanica...@googlegroups.com

I thought their benchmarking process was interesting. They deserve credit for transparency though and using Cassandra's load tests.

They've used a kernel bypass library in their case and benchmarked on hardware that is supported by said library. It's a shame there are no comparisons with what happens when you use a nic which is virtualized or a nic without dpdk support. My understanding is that dpdk works but falls back to using pcap in this case.

They could also have benchmarked on some hardware with Java kernel bypass on Cassandra.

It's kind of hard to tell otherwise the extent to which the improvements are dpdk or scylladb.

Vitaly Davidovich

unread,

Sep 23, 2015, 1:43:25 PM9/23/15

to mechanical-sympathy

They've used a kernel bypass library in their case and benchmarked on hardware that is supported by said library. It's a shame there are no comparisons with what happens when you use a nic which is virtualized or a nic without dpdk support. My understanding is that dpdk works but falls back to using pcap in this case.

Seastar supports using kernel networking stack, so it would've been an interesting datapoint to see Scylla perf using that instead of DPDK entirely. Presumably anyone interested in this type of performance from a single machine will not be running on a hypervisor with virtualized networking to begin with.

They could also have benchmarked on some hardware with Java kernel bypass on Cassandra.

How? Cassandra uses standard JDK networking APIs as far as I know. To fully take advantage of kernel bypass, they'd likely need to rearchitect their internals and have a JNI wrapper on top of the userland driver.

It's kind of hard to tell otherwise the extent to which the improvements are dpdk or scylladb.

It's safe to say that 10x improvement isn't *purely* due to C++. Rather, better internal design that takes advantage of hardware, more control over memory (layout and alloc/dealloc), efficient abstractions/code, etc that's enabled by using C++.

Georges Gomes

unread,

Sep 23, 2015, 3:21:49 PM9/23/15

to mechanica...@googlegroups.com

They could also have benchmarked on some hardware with Java kernel bypass on Cassandra.

How? Cassandra uses standard JDK networking APIs as far as I know. To fully take advantage of kernel bypass, they'd likely need to rearchitect their internals and have a JNI wrapper on top of the userland driver.

>> Solarflare (for example) provide kernel bypass drivers for Linux.

Called open-onload, you run "onload java -cp..." and it takes over the network calls.

Works very well.

If my memory is correct, I used to get 5us roundtrip from Java-to-Java versus 35us with a standard NIC.

Cheers

GG

Vitaly Davidovich

unread,

Sep 23, 2015, 3:33:00 PM9/23/15

to mechanical-sympathy

Solarflare (for example) provide kernel bypass drivers for Linux.
Called open-onload, you run "onload java -cp..." and it takes over the network calls.

Yes, but:

1) They'd need a machine with a solarflare NIC; given they're DPDK users I suspect they already had the necessary kit. Does AWS or any other cloud provider have machines with solarflare cards?

2) The code would still travel through the JDK nio stuff; while having onload intercept syscalls would help, I'm not sure how much in the grand scheme of things. Typically, you'd want to also not use the JDK networking APIs and write your own.

3) To fully take advantage of higher rx/tx throughput, you'd need to change the internal system architecture.

So while using onload would physically work, I'm not sure how much would be gleaned? Instead, I think it'd be interesting (and likely easier) to benchmark Scylla using kernel networking instead of bypass.

Greg Young

unread,

Sep 23, 2015, 7:13:54 PM9/23/15

to mechanica...@googlegroups.com

not responding to this in particular but most of the write throughput
I would hope to be limited by IO. Is there a link to a more specific
whitepaper on benchmarks?

Vitaly Davidovich

unread,

Sep 23, 2015, 7:45:27 PM9/23/15

to mechanical-sympathy

Well, i/o throughput continues to increase - we already have SSDs capable of 3GB/s reads and 2GB/s writes; I suspect this will continue to improve. At some point, what has historically been i/o bound may become cpu bound.

I'm not aware of any whitepapers for the benchmark other than pages on their site (http://www.scylladb.com/technology/cassandra-vs-scylla-benchmark/ and http://www.scylladb.com/2015/09/22/watching_scylla_serve_1m/). It looks like they're sustaining about 1.1-1.3M writes/s equaling ~500MB/s written to disk. There's also an informative page on benchmarking scylla: http://www.scylladb.com/kb/benchmarking-scylla/. In general, there are actually some nice tidbits of info in their Knowledgebase section.

Benedict Elliott Smith

unread,

Sep 23, 2015, 11:38:06 PM9/23/15

to mechanica...@googlegroups.com

It looks like they're sustaining about 1.1-1.3M writes/s equaling ~500MB/s written to disk

I'm actually surprised at this. The write workload operated over only 1M unique rows, or only ~250Mb (including overhead) of unique data. Naively one might assume that makes sense, as everything needs to be written twice, but since the duplicate data is resolved in memtables, only the commit log should need to be written with regularity. With just 8Gb of offheap memtable space, Cassandra would flush only 1/10th of the data it received (and compact this data once), resulting in around 300Mb/s. Why they were writing 500Mb/s with over 80Gb of memory available to their memtables I'm not sure.

I was also a little surprised by their cache graphs. With 10M rows, each consisting of: 5 * (34 bytes value + 2 bytes id) + 1 * 10 bytes, almost 70Gb was occupied, an overhead of almost 40x, or over 1K per 34 byte string. If I counted the noughts correctly.

As to performance, I'm not at all surprised Cassandra does not win this particular scenario - we've had tickets filed to improve our threading and networking models in order to better serve these workloads for a long time. They just aren't seen as important as other improvements, as really very few users want an in-memory single node cluster serving such trivial queries. Those that do probably use Redis. We periodically perform a cost/benefit analysis on this work (it's a common point of discussion) and put it to the end of a long queue. As Vitaly says, though, there is a pretty strong pace of hardware improvements which may necessitate it bumping further up the queue in the near future.

I will note I'm a little surprised at the throughput of Cassandra on their benchmark, as we can push 250-300k/s through an m3.8xlarge EC2 node for this same workload, but I don't really want to dissect it, I just thought that - as the most recently culpable mutilator of cassandra-stress - the extra context I could provide might be of value to the community.

DISCLAIMER: I'm an individual with thoughts and volition of my own, who only happens to be a committer to Apache Cassandra.

Vitaly Davidovich

unread,

Sep 24, 2015, 12:23:56 AM9/24/15

to mechanical-sympathy

Actually, I just noticed that this page, http://www.scylladb.com/technology/cassandra-vs-scylla-benchmark/, states that scylla used kernel networking in this benchmark.

sent from my phone

Vitaly Davidovich

unread,

Sep 24, 2015, 12:32:40 AM9/24/15

to mechanical-sympathy

Benedict, they have a Google group ("ScyllaDB users") - might be interesting to pose your benchmark questions to them there?

As for your comment regarding most users not caring about/wanting a single node cluster ... I'm sure you'd agree that single node performance still matters even if real deployment has a full cluster.

sent from my phone

Benedict Elliott Smith

unread,

Sep 24, 2015, 12:45:29 AM9/24/15

to mechanica...@googlegroups.com

I would rather not get involved a flamewar (even a polite one is, at best, terribly time consuming), I just wanted to provide some context for what stress was actually doing, to better inform discussion.

Certainly local node performance is important, but that's very different to single node, since cluster behaviour is very different when replication is involved. However, either way, I simply don't encounter anyone even pushing 100k/s/node, except simple benchmarks we've run ourselves. Most workloads involve more complex data models, where these inefficiencies simply do not have a chance to exhibit. As such we tend to focus on those other areas. We will no doubt reach a point where hardware and improvements in those areas shift the common bottleneck to where this benchmark is, but that's not typical at the moment, from the data I am exposed to.

Vladimir Rodionov

unread,

Sep 26, 2015, 3:11:05 PM9/26/15

to mechanical-sympathy

Single server performance matters of course, but features matter the most: data durability, tolerance to failures, MTTR, replication, inter DC support, atomic counters (kidding :))

ScyllaDB should compare itself with Aerospike, not with Cassandra.

-Vlad

Rajiv Kurian

unread,

Sep 26, 2015, 3:41:28 PM9/26/15

to mechanical-sympathy

Exactly. Even if you have something intercept your read(), write(), accept() calls and provide an alternate implementation with a DMA engine and a better TCP/IP stack, you are still handicapped by the BSD socket API call, which makes you do system calls for each socket. Most of the high performance network stacks that I read about use better batch API calls which let you amortize the system call. For example when it comes to UDP, the Linux system calls (not a user space implementation) of recvmmsg() and sendmmsg() help a lot in C/C++. And there is no way (without JNI) to access these calls to access these calls. Of course this is not to bring down the easy gains available in Java from just loading the SolarFlare driver.

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
Studying for the Turing test

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

Greg Young

unread,

Sep 26, 2015, 4:11:23 PM9/26/15

to mechanica...@googlegroups.com

D is normally the blocker.

> --
> You received this message because you are subscribed to the Google Groups
> "mechanical-sympathy" group.
> To unsubscribe from this group and stop receiving emails from it, send an

> email to mechanical-symp...@googlegroups.com.

Reply all

Reply to author

Forward