Also, how are you finding AIO/direct I/O support in XFS? Is it fully async, or do you hit stalls from time to time on submission?
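To make "stalls on submission" concrete, here is a minimal sketch of the Linux AIO + O_DIRECT submission path using plain libaio, not Seastar's API; the file path, sizes and alignment are illustrative (build with -laio). io_submit() is the step that can quietly block the calling thread when the filesystem has to do synchronous work (block allocation, unwritten-extent conversion, inode locking), which is the kind of stall being asked about.

    // Minimal Linux AIO + O_DIRECT read, showing where submission can stall.
    // Build with: g++ -O2 aio_sketch.cc -laio
    #include <libaio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        // O_DIRECT bypasses the page cache; buffer, offset and length must be aligned.
        int fd = open("/var/tmp/testfile", O_RDONLY | O_DIRECT);   // path is a placeholder
        if (fd < 0) { perror("open"); return 1; }

        void* buf = nullptr;
        if (posix_memalign(&buf, 4096, 4096) != 0) return 1;

        io_context_t ctx = 0;
        if (io_setup(128, &ctx) < 0) { perror("io_setup"); return 1; }

        struct iocb cb;
        struct iocb* cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, 4096, 0);

        // "Asynchronous" submission: this call can still block the caller if the
        // filesystem needs to do synchronous work (allocate blocks, take locks).
        if (io_submit(ctx, 1, cbs) < 0) { perror("io_submit"); return 1; }

        // Completion really is asynchronous; a reactor would poll this instead of blocking.
        struct io_event ev[1];
        io_getevents(ctx, 1, 1, ev, nullptr);

        io_destroy(ctx);
        close(fd);
        free(buf);
        return 0;
    }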
On Wed, Oct 14, 2015 at 1:29 PM, Vitaly Davidovich <vit...@gmail.com> wrote:
Yup, fair enough.
How are you currently handling load balancing? If some connections are more loaded than others, do you do anything dynamic? Or once a connection is accepted on a shard, does it stay there till close?
Also, I noticed you're using a custom CPU-local memory allocator with deferred reclaim. If you do need to send a message between two engines, with one of them allocating, does the receiver engine send back a message to tell the sender to delete the allocation?
How does Scylla handle memory exhaustion/OOM?
sent from my phone
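To make the allocator question concrete: one common pattern with per-CPU pools is exactly what's being guessed at above. Memory must be freed on the shard that allocated it, so the receiving shard hands the pointer back over a cross-shard channel instead of freeing it locally. A reduced sketch follows, with purely illustrative names (submit_to, shard_free) rather than Seastar's actual API, and a mutex-protected inbox standing in for a lock-free per-shard queue:

    #include <cstdlib>
    #include <functional>
    #include <mutex>
    #include <queue>

    constexpr unsigned n_shards = 4;

    // Stand-in for a per-shard task queue; a real engine would use lock-free
    // SPSC rings and run the tasks from its event loop.
    struct shard_inbox {
        std::mutex mtx;
        std::queue<std::function<void()>> tasks;
    };
    shard_inbox inboxes[n_shards];

    // Ask another shard to run a task (here: the deferred free).
    void submit_to(unsigned shard, std::function<void()> task) {
        std::lock_guard<std::mutex> g(inboxes[shard].mtx);
        inboxes[shard].tasks.push(std::move(task));
    }

    // Stand-in for the CPU-local pool's free; must only run on the owning shard.
    void shard_free(void* p) { std::free(p); }

    struct message {
        unsigned owner_shard;   // shard whose allocator produced the payload
        char*    payload;
    };

    // Receiver side: after consuming the payload, return the allocation to its
    // owner instead of freeing it from a foreign thread.
    void on_message_consumed(const message& m, unsigned this_shard) {
        if (m.owner_shard == this_shard)
            shard_free(m.payload);                                    // local fast path
        else
            submit_to(m.owner_shard, [p = m.payload] { shard_free(p); });
    }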
On Oct 14, 2015 1:12 PM, "Avi Kivity" <a...@cloudius-systems.com> wrote:
Thanks. We haven't run the DPDK benchmarks yet, because that code needs some more work for load balancing, and frankly, the performance is awesome enough already.
I estimate a 20%-50% improvement. When you have more clients and connections, the improvement increases, since TCP can batch fewer messages per connection.
On Wednesday, October 14, 2015 at 8:07:45 PM UTC+3, Vitaly Davidovich wrote:
You guys are doing good work. The Seastar and Scylla code is very clean too. Have you tried these benchmarks with DPDK networking? Curious what that config would yield.
On Wed, Oct 14, 2015 at 12:53 PM, Avi Kivity <a...@cloudius-systems.com> wrote:
Hello mechanical sympathizers,
Following up on the ScyllaDB benchmark that was discussed here a few weeks ago, here are the results for the corrected single-node benchmark [1] and new results for a 3-node cluster [2]. TL;DR: single node is up from 1.3M/s to 1.8M/s, and the clustered benchmark shows excellent clustering behavior.
I'll be happy to answer questions about the benchmark or the architecture.
This could be a good reason:
http://vanillajava.blogspot.it/2013/07/c-like-java-for-low-latency.html?m=1
I'm completely ignorant about the differences between GCC- and JVM-compiled code, but I understand that a difference exists. In my everyday job I cannot choose other languages, so for me there is no choice. I don't know if it is the same for the other members of this group... or whether there are other reasons that lead them to choose Java instead of C++ (or Rust?)...
The JVM is a good platform and Java is a good ecosystem for being productive fairly quickly. Personally, I don't think it's a good platform for building systems-level software, like databases (distributed or not), or other software that should make as much use of the hardware it's running on as possible. It's "good enough" for a lot of people, though.
I encourage anyone to read this comment by Todd Lipcon (I believe I've seen him on this list as well, so maybe he'll chime in): https://news.ycombinator.com/item?id=10298024
Hey! That's me! :)
Java has a whole lot of libraries and they tend to "play nice" together
Thanks for the link. The API documentation is good enough to get started. Also found the wiki at https://github.com/scylladb/seastar/wiki very useful.
How would you compare using the host OS TCP stack through Seastar vs straight up using epoll?
On Thursday, October 15, 2015 at 11:58:13 PM UTC-7, Avi Kivity wrote:
Indeed, the ability to create high-level abstractions while retaining low-level control is what makes C++ suitable for high-performance servers. I'm surprised that with this group's charter most of the discussion is about Java, and ways to work around the JVM's abstraction penalty.
What kind of documentation are you looking for? The API is documented in http://docs.seastar-project.org/master/group__networking-module.html.
On Thursday, October 15, 2015 at 9:14:11 PM UTC+3, Rajiv Kurian wrote:
Thanks for sharing. Seastar seems like a great little library. I like how high-level the code is for something that does so much behind the covers. I'll be sure to give it a try. Is there any documentation on the networking stack (DPDK or kernel)?
Avi,
Where's the code that balances connections on accept? There's no accept thread; is each shard registering an epoll notification for accept?
I noticed SO_REUSEPORT is disabled with a comment about it causing load imbalance - did you guys figure out why?
One other tidbit I found while looking: submitting to an SMP queue doesn't push the message until at least 16 messages are on it (batch_size). Is there a reason for the delay? Are you trying to amortize the cost of working with the SPSC queue?
How does backpressure work here? If the queue is full, the messages continue to be appended to a local deque. What happens if another shard is stalled for a long period of time and a shard continues enqueueing? Do you get a bad_alloc and handle it somewhere by doing the cache reclaim you mentioned?
Also, what is the devops story beyond collectd? Any built-in alerting? Or is the idea to use collectd and provide custom alerting on top of that?
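The batching and backpressure being described can be sketched independently of the actual Seastar code. Below is a reduced, illustrative version of the pattern (the names, ring size and the batch_size of 16 are taken from the question, not from the Scylla sources): the producer parks messages in a local deque and only touches the shared SPSC ring once a batch has accumulated or at an explicit flush; if the ring is full, the deque simply keeps growing, which is where the backpressure concern comes in.

    #include <array>
    #include <atomic>
    #include <cstddef>
    #include <deque>
    #include <utility>

    // Tiny single-producer/single-consumer ring with monotonically increasing
    // head/tail counters indexed modulo the capacity.
    template <typename T, std::size_t Capacity>
    class spsc_ring {
        std::array<T, Capacity> slots_;
        std::atomic<std::size_t> head_{0}, tail_{0};
    public:
        bool full() const {
            return tail_.load(std::memory_order_relaxed) -
                   head_.load(std::memory_order_acquire) == Capacity;
        }
        void push(T v) {                    // producer only; caller checks full() first
            std::size_t t = tail_.load(std::memory_order_relaxed);
            slots_[t % Capacity] = std::move(v);
            tail_.store(t + 1, std::memory_order_release);
        }
        bool pop(T& out) {                  // consumer only
            std::size_t h = head_.load(std::memory_order_relaxed);
            if (h == tail_.load(std::memory_order_acquire)) return false;
            out = std::move(slots_[h % Capacity]);
            head_.store(h + 1, std::memory_order_release);
            return true;
        }
    };

    // Producer-side wrapper: batch writes to amortize the cost of touching the
    // shared ring; overflow just accumulates locally until the peer drains.
    template <typename T>
    class batched_sender {
        static constexpr std::size_t batch_size = 16;   // mirrors the question
        spsc_ring<T, 1024>& ring_;
        std::deque<T> pending_;
    public:
        explicit batched_sender(spsc_ring<T, 1024>& r) : ring_(r) {}
        void send(T msg) {
            pending_.push_back(std::move(msg));
            if (pending_.size() >= batch_size) flush();
        }
        void flush() {                       // also called once per event-loop iteration
            while (!pending_.empty() && !ring_.full()) {
                ring_.push(std::move(pending_.front()));
                pending_.pop_front();
            }
            // Whatever is left stays in pending_, an unbounded deque unless the
            // sender adds its own throttle (the backpressure question above).
        }
    };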
On 10/17/2015 04:51 AM, Vitaly Davidovich wrote:
I noticed SO_REUSEPORT is disabled with a comment about it causing load imbalance - did you guys figure out why?
No, we didn't chase it down.
Hi Avi,
I assume you are referring to the results published on the ScyllaDB website, e.g.:

type, total ops, op/s, pk/s, row/s, mean, med, .95, .99, .999, max, time, stderr, errors, gc: #, max ms, sum ms, sdv ms, mb
total, 137667, 137639, 137639, 137639, 0.7, 0.5, 1.5, 2.9, 31.2, 39.7, 1.0, 0.00000, 0, 0, 0, 0, 0, 0
total, 272878, 122242, 122242, 122242, 0.8, 0.6, 1.0, 3.7, 62.4, 87.5, 2.1, 0.05522, 0, 0, 0, 0, 0, 0
total, 405124, 126801, 126801, 126801, 0.8, 0.6, 0.9, 2.4, 41.3, 253.1, 3.1, 0.03926, 0, 0, 0, 0, 0, 0
total, 539316, 132259, 132259, 132259, 0.7, 0.6, 0.9, 1.8, 10.3, 104.6, 4.2, 0.02947, 0, 0, 0, 0, 0, 0
total, 681514, 136313, 136313, 136313, 0.7, 0.6, 0.8, 2.2, 30.7, 446.4, 5.2, 0.02445, 0, 0, 0, 0, 0, 0

Measurements are logged every second, but the rate of operations per second is quite high (and each one is measured). I would say, at a guess, that the load is well beyond what Cassandra can comfortably handle (in your particular setup), as I'd expect the rate of operations per second to be relatively constant for this kind of test.
While your solution may not suffer from GC-induced issues, it is hard to say upfront that it will have no occasional hiccups induced by the OS or your internal data-structure management schedules. A common cause of non-GC-related hiccups is page faults (at least from my experience benchmarking Cassandra).
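As a side note (not something either side of the thread claims to do): when page faults are the suspected source of hiccups, a common mitigation while benchmarking is to lock and pre-touch the process's memory so the faults are taken during warm-up rather than under load. A minimal example:

    #include <sys/mman.h>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    int main() {
        // Lock current and future mappings into RAM (needs a suitable
        // RLIMIT_MEMLOCK or CAP_IPC_LOCK); after this, the kernel will not
        // page this process's memory out.
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            return 1;
        }
        // Pre-touch the working set so first-touch faults happen here, during
        // warm-up, not while the latency-sensitive work is running.
        std::vector<char> arena(1UL << 30);   // 1 GiB; size to the workload
        std::memset(arena.data(), 1, arena.size());
        // ... run the latency-sensitive work ...
        return 0;
    }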
We've had related discussions with folks from DataStax, and they are aware of the issue and the suggested solution; I'm not sure what their plans are for merging the solution in or providing their own.