New ScyllaDB benchmarks


Avi Kivity

Oct 14, 2015, 12:53:27 PM
to mechanical-sympathy
Hello mechanical sympathizers,

Following up on the ScyllaDB benchmark that was discussed here a few weeks ago, here are the results for the corrected single-node benchmark [1] and new results for a 3-node cluster [2]. TL;DR: the single node is up from 1.3M ops/s to 1.8M ops/s, and the clustered benchmark shows excellent clustering behavior.

I'll be happy to answer questions about the benchmark or the architecture.

Vitaly Davidovich

Oct 14, 2015, 1:07:45 PM
to mechanical-sympathy
You guys are doing good work.  The seastar and scylla code is very clean too.

Have you tried these benchmarks with dpdk networking? Curious what that config would yield.


Avi Kivity

Oct 14, 2015, 1:12:49 PM
to mechanical-sympathy
Thanks.  We haven't run the dpdk benchmarks yet, because that code needs some more work for load-balancing, and frankly, the performance is awesome enough already.

I estimate between a 20% and 50% improvement.  When you have more clients and connections, the improvement increases, since TCP can batch fewer messages per connection, so the per-message overhead of the kernel stack grows.

Vitaly Davidovich

Oct 14, 2015, 1:29:18 PM
to mechanical-sympathy

Yup, fair enough.

How are you currently handling load balancing? If some connections are more loaded than others, do you do anything dynamic? Or once a connection is accepted on a shard, does it stay there till close?

Also, I noticed you're using a custom CPU-local memory allocator with deferred reclaim.  If you do need to send a message between two engines, with one of them allocating, does the receiving engine send back a message to tell the sender to delete the allocation?

How does Scylla handle memory exhaustion/OOM?

sent from my phone


Vitaly Davidovich

Oct 14, 2015, 1:36:56 PM
to mechanical-sympathy
Also, how are you finding AIO/direct IO support in xfs? Is it fully async, or do you hit stalls from time to time on submission?

Avi Kivity

Oct 14, 2015, 1:37:05 PM
to mechanical-sympathy

On Wednesday, October 14, 2015 at 8:29:18 PM UTC+3, Vitaly Davidovich wrote:

Yup, fair enough.

How are you currently handling load balancing? If some connections are more loaded than others, do you do anything dynamic? Or once a connection is accepted on a shard, does it stay there till close?


Connections do not move (they are balanced on accept).

We'll load-balance individual requests if cores are unbalanced, but this is not yet committed. About half the work is self-balancing anyway, because the data itself is sharded.
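To make that concrete, the owning shard is just a function of the partition key, roughly like this (an illustrative sketch with made-up names, not the actual Scylla code, which uses its own token/hash scheme):

#include <functional>
#include <string>

// Requests spread across shards as long as the keys do, because the owning
// shard is a pure function of the partition key.
unsigned owning_shard(const std::string& partition_key, unsigned shard_count) {
    return std::hash<std::string>{}(partition_key) % shard_count;
}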

 

Also, I noticed you're using a custom CPU-local memory allocator with deferred reclaim.  If you do need to send a message between two engines, with one of them allocating, does the receiving engine send back a message to tell the sender to delete the allocation?


Yes, every message has a completion response, so we delete on the origin core; and if we send back data, we wrap it in a foreign_ptr<> so it knows to delete itself on the right core.
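Roughly, the pattern looks like this (a hand-written sketch, not the actual code; request, reply and process() are stand-ins, and the exact headers and namespace depend on the Seastar version):

#include <seastar/core/smp.hh>
#include <seastar/core/sharded.hh>
#include <memory>

struct request { int payload; };   // stand-in types for illustration
struct reply   { int result; };
static reply process(const request& r) { return reply{r.payload * 2}; }

seastar::future<seastar::foreign_ptr<std::unique_ptr<reply>>>
send_request(unsigned target_shard, std::unique_ptr<request> req) {
    auto* raw = req.get();
    return seastar::smp::submit_to(target_shard, [raw] {
        // Runs on target_shard; it must not free 'raw', since that memory
        // belongs to the origin shard's allocator.
        auto r = std::make_unique<reply>(process(*raw));
        // foreign_ptr<> remembers the shard that allocated the reply, so its
        // destructor bounces the deletion back here if it runs elsewhere.
        return seastar::make_foreign(std::move(r));
    }).finally([req = std::move(req)] {
        // The completion runs back on the origin core, so 'req' is deleted
        // by the allocator that created it.
    });
}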
 

How does Scylla handle memory exhaustion/OOM?


We reclaim memory from the cache.
 


Avi Kivity

Oct 14, 2015, 2:07:07 PM
to mechanical-sympathy
On Wednesday, October 14, 2015 at 8:36:56 PM UTC+3, Vitaly Davidovich wrote:
Also, how are you finding AIO/direct IO support in xfs? Is it fully async, or do you hit stalls from time to time on submission?


It's fully async if you avoid a few things (like writing beyond EOF; writing at EOF is OK).  In general it works better than I expected.
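For anyone curious what that pattern looks like outside Seastar, here is a bare-bones toy (my own sketch, not Seastar code; Linux only, link with -laio). The file is pre-sized with fallocate so the write never extends EOF, and the submission should then not block:

#include <libaio.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <cstring>

int main() {
    int fd = open("data.bin", O_CREAT | O_RDWR | O_DIRECT, 0644);
    posix_fallocate(fd, 0, 1 << 20);        // pre-size the file: the write below stays within EOF

    io_context_t ctx = nullptr;
    io_setup(128, &ctx);                    // kernel AIO context, queue depth 128

    void* buf = nullptr;
    posix_memalign(&buf, 4096, 4096);       // O_DIRECT requires aligned buffers and offsets
    memset(buf, 'x', 4096);

    iocb cb;
    iocb* cbs[1] = { &cb };
    io_prep_pwrite(&cb, fd, buf, 4096, 0);  // 4 KiB write at offset 0, inside the allocated range
    io_submit(ctx, 1, cbs);                 // should return without blocking in this case

    io_event ev;
    io_getevents(ctx, 1, 1, &ev, nullptr);  // reap the completion (a real server polls instead)

    io_destroy(ctx);
    close(fd);
    free(buf);
    return 0;
}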

 

Rajiv Kurian

Oct 15, 2015, 2:14:11 PM
to mechanical-sympathy
Thanks for sharing. Seastar seems like a great little library. I like how high-level the code is for something that does so much under the covers. I'll be sure to give it a try. Is there any documentation on the networking stack (DPDK or kernel)?

Avi Kivity

Oct 16, 2015, 2:58:13 AM
to mechanical-sympathy
Indeed, the ability to create high-level abstractions while retaining low-level control is what makes C++ suitable for high-performance servers.  I'm surprised that, given this group's charter, most of the discussion is about Java and ways to work around the JVM's abstraction penalty.

What kind of documentation are you looking for?  The API is documented in http://docs.seastar-project.org/master/group__networking-module.html.

Francesco Nigro

Oct 16, 2015, 12:07:07 PM
to mechanical-sympathy

Vitaly Davidovich

Oct 16, 2015, 12:41:37 PM
to mechanical-sympathy
"Low level" java is ugly, error prone, and still doesn't yield sufficient control.  Hotspot code generation isn't as tight as GCC/LLVM.  You still pay for GC write barriers even if you don't GC.  You still get random safepoints (i.e. GuaranteedSafepointInterval induced ones).  Deoptimization/profiling induced stalls.  Expensive FFI.  JIT warmup problems.  Etc. There are benefits to it, of course, but pushing performance is swimming upstream.


Francesco Nigro

Oct 16, 2015, 12:56:37 PM
to mechanical-sympathy
I'm completely ignorant about the differences between GCC- and JVM-compiled code, but I understand that a difference exists. In my everyday job I cannot choose other languages, so for me there is no choice. I don't know whether it is the same for the other members of this group... or whether there are other reasons that lead them to choose Java instead of C++ (or Rust?)...

Vitaly Davidovich

Oct 16, 2015, 1:14:44 PM
to mechanical-sympathy
The JVM is a good platform and Java is a good ecosystem for being productive fairly quickly.  Personally, I don't think it's a good platform for building systems-level software, like databases (distributed or not), or other software that should make as much use of the hardware it's running on as possible.  It's "good enough" for a lot of people, though.

I encourage anyone to read this comment by Todd Lipcon (I believe I've seen him on this list as well, so maybe he'll chime in): https://news.ycombinator.com/item?id=10298024


Todd Lipcon

Oct 16, 2015, 1:26:59 PM
to mechanica...@googlegroups.com

Hey! That's me! :) Yep, I continue to be interested in the low-level Java techniques discussed on this list despite being a C++ coder most recently, and a lot of the topics apply in C++ as well.

I don't have a whole lot to add to what I wrote in the comment above, except that the point made in Francesco's link above is pretty valid about integration with ecosystem libraries. Java has a whole lot of libraries and they tend to "play nice" together. With C++ it can be a bit harder to get different libraries from different vendors to agree with each other on the way things should be done. For a low-level systems project like a database storage engine (my current project), C++ works pretty well. I'd still choose Java for most "applications".

-Todd

Vitaly Davidovich

Oct 16, 2015, 1:30:12 PM
to mechanical-sympathy
Hey! That's me! :) 

Wow, that was fast! :) 

Vitaly Davidovich

Oct 16, 2015, 1:36:57 PM
to mechanical-sympathy
Java has a whole lot of libraries and they tend to "play nice" together

This is definitely true so long as you're not doing low level java; most libraries don't give a crap about allocations, indirections, etc.  This is part of the issue: if you're, e.g., avoiding the GC, a lot of that library ecosystem goes out the window, including the JDK.


Rajiv Kurian

Oct 16, 2015, 2:02:19 PM
to mechanical-sympathy
Thanks for the link. The API documentation is good enough to get started. Also found the wiki at https://github.com/scylladb/seastar/wiki very useful.

How would you compare using the host OS TCP stack through Seastar vs straight up using epoll?

Vitaly Davidovich

Oct 16, 2015, 9:51:44 PM
to mechanical-sympathy

Avi,

Where's the code that balances connections on accept? There's no accept thread; is each shard registering an epoll notification for accept? I noticed SO_REUSEPORT is disabled with a comment about it causing load imbalance - did you guys figure out why?

One other tidbit I found while looking is that submitting to an smp queue doesn't push the message until at least 16 messages are on it (batch_size).  Is there a reason for the delay? Are you trying to amortize the cost of working with the spsc queue?

How does backpressure work here? If the queue is full, the messages continue to be appended to a local deque.  What happens if another shard is stalled for a long period of time and a shard continues enqueueing? Do you get a bad_alloc and handle it somewhere by doing the cache reclaim you mentioned?

Also, what is the devops story beyond collectd? Any builtin alerting? Or is the idea to use collectd and provide custom alerting on top of that?

sent from my phone


Avi Kivity

Oct 17, 2015, 12:04:35 AM
to mechanica...@googlegroups.com
On 10/16/2015 09:02 PM, Rajiv Kurian wrote:
Thanks for the link. The API documentation is good enough to get started. Also found the wiki at https://github.com/scylladb/seastar/wiki very useful.

How would you compare using the host OS TCP stack through Seastar vs straight up using epoll?

When using the host stack, Seastar is just a wrapper.  The problem with epoll programming is that it "flattens" your program: every time you run out of input you have to fall back to the main loop.  Seastar allows you to do event-driven programming while pretending the code runs in multiple fibers.
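For example, a trivial echo-style handler reads top to bottom even though every step may suspend (sketch only; handle_connection is a made-up name, and the headers and seastar:: namespace vary between Seastar versions):

#include <seastar/core/seastar.hh>
#include <seastar/core/future-util.hh>
#include <seastar/core/do_with.hh>
#include <seastar/net/api.hh>

seastar::future<> handle_connection(seastar::connected_socket s) {
    auto out = s.output();
    auto in = s.input();
    return seastar::do_with(std::move(s), std::move(out), std::move(in),
            [] (auto& s, auto& out, auto& in) {
        return seastar::repeat([&out, &in] {
            return in.read().then([&out] (seastar::temporary_buffer<char> buf) {
                if (!buf) {
                    // Connection closed: stop the loop.
                    return seastar::make_ready_future<seastar::stop_iteration>(
                        seastar::stop_iteration::yes);
                }
                // Echo the data back, then loop for more input.
                return out.write(std::move(buf)).then([&out] {
                    return out.flush();
                }).then([] {
                    return seastar::stop_iteration::no;
                });
            });
        }).then([&out] {
            return out.close();
        });
    });
}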



Avi Kivity

Oct 17, 2015, 12:16:55 AM
to mechanica...@googlegroups.com
On 10/17/2015 04:51 AM, Vitaly Davidovich wrote:

Avi,

Where's the code that balances connections on accept? There's no accept thread; is each shard registering an epoll notification for accept?




I noticed SO_REUSEPORT is disabled with a comment about it causing load imbalance - did you guys figure out why?


No, we didn't chase it down.


One other tidbit I found while looking is that submitting to an smp queue doesn't push the message until at least 16 messages are on it (batch_size).  Is there a reason for the delay? Are you trying to amortize the cost of working with the spsc queue?


Exactly.  The queue will be flushed if we reach the polling loop regardless of how many messages are queued, so latency is bounded.
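The shape of it is roughly this (a simplified stand-in, not the real smp queue code):

#include <vector>
#include <cstddef>

// Items are staged in a plain vector and only pushed onto the SPSC ring when
// the batch threshold is reached, or when the poll loop calls flush(); that
// amortises the queue's atomic operations over several messages while keeping
// latency bounded.
template <typename Item, typename SpscQueue>
class batched_sender {
    static constexpr std::size_t batch_size = 16;   // the threshold mentioned above
    std::vector<Item> _pending;
    SpscQueue& _q;
public:
    explicit batched_sender(SpscQueue& q) : _q(q) {}
    void submit(Item item) {
        _pending.push_back(std::move(item));
        if (_pending.size() >= batch_size) {
            flush();
        }
    }
    void flush() {                                   // also called from the poll loop
        for (auto& item : _pending) {
            _q.push(std::move(item));
        }
        _pending.clear();
    }
};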


How does backpressure work here? If the queue is full, the messages continue to be appended to a local deque.  What happens if another shard is stalled for a long period of time and a shard continues enqueueing? Do you get a bad_alloc and handle it somewhere by doing the cache reclaim you mentioned?


Shards poll at least every two milliseconds, so any queuing will be limited.  Concurrency at the originating shard is limited, so if the remote shard really stalls, the originating shard will queue all the work it can and then idle.
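The limiting itself is the usual semaphore pattern, something like this (a sketch with invented names, not the real code):

#include <seastar/core/semaphore.hh>
#include <seastar/core/smp.hh>

// One unit per outstanding cross-shard request: if the remote shard stalls,
// get_units() stops resolving and this shard goes idle instead of queueing
// an unbounded amount of work.
seastar::semaphore outstanding_requests{128};   // illustrative limit

seastar::future<> send_bounded(unsigned target_shard) {
    return seastar::get_units(outstanding_requests, 1).then([target_shard] (auto units) {
        return seastar::smp::submit_to(target_shard, [] {
            // ... the remote work goes here ...
        }).finally([units = std::move(units)] {
            // unit released once the remote shard has completed the request
        });
    });
}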


Also, what is the devops story beyond collectd? Any builtin alerting? Or is the idea to use collectd and provide custom alerting on top of that?


There are no built-in alerts; those are all done on the collectd server side.  In practice we're more on the dev side of devops; we use collectd for performance monitoring, not alerts.  No doubt the alerting side will grow in the future.

Gleb Natapov

Oct 17, 2015, 4:54:52 AM
to mechanical-sympathy


On Saturday, October 17, 2015 at 7:16:55 AM UTC+3, Avi Kivity wrote:
On 10/17/2015 04:51 AM, Vitaly Davidovich wrote:
 

I noticed SO_REUSEPORT is disabled with a comment about it causing load imbalance - did you guys figure out why?


No, we didn't chase it down.

I thought we had a pretty good idea. The problem is the same as with our DPDK code: just like the RSS hashing that DPDK uses to distribute connections, SO_REUSEPORT hashes the TCP tuple, and with a low number of connections that does not produce a good distribution. cassandra-stress opens a relatively low number of connections no matter what parallelism is configured.
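For reference, the per-shard listener setup in question is roughly the following (simplified): every shard binds its own socket to the same port, and the kernel then picks a socket per incoming connection by hashing the TCP 4-tuple, which is exactly why a handful of client connections can land very unevenly.

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <cstdint>

int make_shard_listener(uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    // All shards set SO_REUSEPORT and bind to the same port; the kernel
    // distributes new connections among the sockets by hashing the 4-tuple.
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port = htons(port);
    bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(fd, 128);
    return fd;   // each shard calls this once and accepts on its own fd
}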
 

Nitsan Wakart

Oct 18, 2015, 6:28:18 AM
to mechanica...@googlegroups.com
Hi Avi,
Results are very impressive, well done.
I would caution against measuring latency using cassandra-stress, as it suffers from the good old coordinated omission issue in reporting latencies. Fixing this measurement issue is likely to make the Cassandra results look worse in my experience; I'm not sure what difference it will make to your DB. Note that there is a corrected fork of cassandra-stress to be found here: https://github.com/LatencyUtils/cassandra-stress2
This repo is a bit behind the Cassandra repo, but a merge should not be a huge effort. 
Regards,
Nitsan

Avi Kivity

Oct 18, 2015, 8:32:40 AM
to mechanical-sympathy, nit...@yahoo.com
Thanks for the compliments and the tip. Am I right in assuming that the results change only if you have ~1sec latencies (or somewhat below), since c-s measures at 1 second intervals?

If this is correct it's likely to have no effect on ScyllaDB, since it does not suffer from GC.

Are you planning to feed the patches back upstream?

Nitsan Wakart

Oct 18, 2015, 2:42:17 PM
to mechanica...@googlegroups.com
Hi Avi,
I assume you are referring to the results published on the ScyllaDB website:

E.g:
type,      total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
total,        137667,  137639,  137639,  137639,     0.7,     0.5,     1.5,     2.9,    31.2,    39.7,    1.0,  0.00000,      0,      0,       0,       0,       0,       0
total,        272878,  122242,  122242,  122242,     0.8,     0.6,     1.0,     3.7,    62.4,    87.5,    2.1,  0.05522,      0,      0,       0,       0,       0,       0
total,        405124,  126801,  126801,  126801,     0.8,     0.6,     0.9,     2.4,    41.3,   253.1,    3.1,  0.03926,      0,      0,       0,       0,       0,       0
total,        539316,  132259,  132259,  132259,     0.7,     0.6,     0.9,     1.8,    10.3,   104.6,    4.2,  0.02947,      0,      0,       0,       0,       0,       0
total,        681514,  136313,  136313,  136313,     0.7,     0.6,     0.8,     2.2,    30.7,   446.4,    5.2,  0.02445,      0,      0,       0,       0,       0,       0
Measurements are logged every 1 second, but the rate of operations per second is quite high (and each one is measured). I would say at a guess that the load is well beyond what Cassandra can comfortably handle (in your particular setup) as I'd expect the rate of operations per second to be relatively constant for this kind of test.
While your solution may not suffer from GC induced issues, it is hard to say upfront that it will have no occasional hiccups induced by the OS or your internal data structure management schedules. A common cause for non-GC related hiccups is page-faults (at least from my experience with benchmarking Cassandra).
We've had related discussions with folks from Data Stax and they are aware of the issue and the suggested solution, I'm not sure what their plans are for merging the solution in or providing their own.
Regards,
Nitsan




Avi Kivity

Oct 18, 2015, 2:51:39 PM
to mechanica...@googlegroups.com
On 10/18/2015 09:39 PM, 'Nitsan Wakart' via mechanical-sympathy wrote:
Measurements are logged every 1 second, but the rate of operations per second is quite high (and each one is measured). I would say at a guess that the load is well beyond what Cassandra can comfortably handle (in your particular setup) as I'd expect the rate of operations per second to be relatively constant for this kind of test.


Well, it is relatively constant, isn't it?  Hovering around 130K ops/sec with low 95 percentile latency.  It doesn't seem to be beyond what Cassandra can do, nor is it particularly high for such a large machine.



While your solution may not suffer from GC induced issues, it is hard to say upfront that it will have no occasional hiccups induced by the OS or your internal data structure management schedules. A common cause for non-GC related hiccups is page-faults (at least from my experience with benchmarking Cassandra).

ScyllaDB does not rely on or suffer from page faults.  All memory is allocated up-front, and data is read from and written to disk with O_DIRECT; the OS page cache is bypassed.

In DPDK mode even the host networking stack is bypassed.


We've had related discussions with folks from Data Stax and they are aware of the issue and the suggested solution, I'm not sure what their plans are for merging the solution in or providing their own.

I assume you're talking about cassandra-stress now.  You should just post your patches on the Cassandra dev mailing list, or open a ticket and attach them there.  Cassandra is not just DataStax, and it's a pity to let your work bitrot on some forgotten tree.

Nitsan Wakart

Oct 18, 2015, 3:49:06 PM
to mechanica...@googlegroups.com


"Well, it is relatively constant, isn't it? Hovering around 130K ops/sec with low 95 percentile latency. It doesn't seem to be beyond what Cassandra can do, nor is it particularly high for such a large machine."
The low 95%ile, measured with coordinated omission, can be a long, long way off.
I expect a stable system under test for latency to sustain the load with variation that reflects the latency SLA (e.g. 99%ile < X millis under a load of Y ops/s). So if it is reasonable for the number of requests per second to go from ~110K to ~170K, and we are to assume some sort of stable request rate is intended, that would mean 60K-odd requests failed to happen and were delayed to the next second. A correct measurement would reflect this delay.
Let's assume Cassandra can manage 200K messages per second at full tilt; the 30K leftover messages will then take up 150ms of the next measurement section. This means all messages in that second would be delayed by that amount before their processing can begin.
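In code terms, the fix is to take latency from the intended send time on a fixed schedule rather than from the moment the loader finally got around to sending. A toy version of the idea (not the cassandra-stress fork itself; issue_request() is a stand-in):

#include <chrono>
#include <cstdint>
#include <vector>

static void issue_request() { /* stand-in for one blocking request to the server */ }

std::vector<int64_t> measure(int64_t target_ops_per_sec, int64_t n_ops) {
    using clock = std::chrono::steady_clock;
    const auto interval = std::chrono::nanoseconds(1000000000LL / target_ops_per_sec);
    std::vector<int64_t> latencies_ns;
    auto intended = clock::now();
    for (int64_t i = 0; i < n_ops; ++i) {
        intended += interval;                 // fixed schedule, independent of the server
        while (clock::now() < intended) {}    // wait until the intended send time
        issue_request();
        auto done = clock::now();
        // Latency is charged from the intended send time, so a stalled server
        // also pays for the requests it pushed into later intervals.
        latencies_ns.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(done - intended).count());
    }
    return latencies_ns;
}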
I note that your test script ('tools/bin/cassandra-stress write duration=5m -mode native cql3 -rate threads=100 -node 1.1.1.1') does not actually specify an expected load (limit is not set)? Am I missing something?
Here's an example of the variance in the log:
type, total ops, op/s, pk/s, row/s, mean, med, .95, .99, .999, max, time
total, 25890995, 124323, 124323, 124323, 0.8, 0.6, 0.9, 2.2, 10.6, 304.8, 178.9
total, 26036739, 142969, 142969, 142969, 0.7, 0.6, 0.9, 1.6, 8.2, 38.0, 179.9
total, 26165350, 127025, 127025, 127025, 0.8, 0.7, 0.9, 1.8, 8.8, 328.3, 181.0
total, 26341065, 172907, 172907, 172907, 0.6, 0.4, 1.0, 2.6, 7.5, 155.7, 182.0
total, 26498033, 136568, 136568, 136568, 0.7, 0.6, 0.8, 2.0, 8.0, 268.8, 183.1
total, 26625554, 126019, 126019, 126019, 0.8, 0.7, 1.0, 2.1, 12.1, 92.2, 184.1
total, 26730683, 103757, 103757, 103757, 0.9, 0.6, 1.1, 3.9, 63.0, 180.8, 185.1
total, 26878909, 144735, 144735, 144735, 0.7, 0.6, 0.9, 2.3, 6.7, 119.0, 186.2

total, 27039773, 161654, 161654, 161654, 0.6, 0.6, 0.8, 1.3, 5.0, 27.9, 187.2

Avi Kivity

Oct 19, 2015, 3:05:27 AM
to mechanica...@googlegroups.com


On 10/18/2015 10:46 PM, 'Nitsan Wakart' via mechanical-sympathy wrote:
> I note that your test script ('tools/bin/cassandra-stress write duration=5m -mode native cql3 -rate threads=100 -node 1.1.1.1') does not actually specify an expected load (limit is not set)?

I guess it's making it go as fast as it can, relying on the fact that
the loader is weaker than the server. Probably not good for a latency
benchmark.

Are there tools for generating hockey-stick graphs? Like SPEC SFS does.

c-s is lacking in many ways: coordinated runs and aggregation of multiple loaders, scaling the key space with throughput, and better output.

ymo

Oct 21, 2015, 9:32:09 AM
to mechanical-sympathy
Hi Avi

Great stuff ))) ... I was wondering how easy it is to bypass the CQL compiler interface altogether and read/write directly without using CQL? I have an array of key-value data that I want to read/write directly without going through CQL. How easy is it, and where should I start?

Regards 

Avi Kivity

Oct 22, 2015, 2:38:46 AM
to mechanica...@googlegroups.com
It's very easy since it already exists: the Cassandra Thrift protocol.  It's disabled by default (since the implementation is incomplete) but you can enable it from the command line (--start-rpc).

It's not as heavily tuned as the cql transport (still uses std::string and std::shared_ptr instead of seastar's thread-unsafe versions) but is still plenty fast.


ymo

Oct 22, 2015, 4:06:09 PM
to mechanical-sympathy
Hi Avi.

I was thinking of coming in from below the Thrift interface. Basically, I'd have my own transport, which is via shared memory now. Is there an API or a "recommended" way of doing that?

Regards

Avi Kivity

Oct 23, 2015, 8:46:24 AM
to mechanica...@googlegroups.com
Look at storage_proxy.  But this is not a stable API and so could change from day to day.

Lifey

Oct 24, 2015, 12:37:51 PM
to mechanical-sympathy
Very nice!
Does the latency distribution remain the same as in the previous benchmark?

Avi Kivity

Oct 25, 2015, 3:12:37 AM
to mechanica...@googlegroups.com
We need to redo the latency tests.  As Nitsan explained here, cassandra-stress is susceptible to the "Coordinated Omission" problem.

We will produce a latency-throughput profile for both Scylla and Cassandra, and I will post here when the tests are ready. I expect the gap between Scylla and Cassandra to grow, both because coordinated omission has a smaller effect on Scylla and because latency spikes are magnified on a cluster.

