Good alternatives to Netty without GC impact


bob.tw...@gmail.com

May 13, 2015, 7:37:34 AM5/13/15
to mechanica...@googlegroups.com

We are looking for Netty alternatives to implement a low-latency TCP server. I came across this library on Reddit:


Does anyone have any experience with CoralReactor?

What other libraries, commercial or open-source, would you suggest as a Netty alternative, preferably without GC impact?




Gil Tene

May 13, 2015, 5:30:29 PM5/13/15
to mechanica...@googlegroups.com
I don't have much to contribute on where CoralReactor and Netty stand in comparison to each other. But I do have some stuff to contribute on measurement techniques and on what is actually being compared... I was a bit bored sitting in my hotel room, so thanks for providing me with some amusement. Here are some observations:

Apples to Apples? This appears to be a comparison of an apple at rest with a clockwork orange under stress:

1. The coralreactor client code recycles a single ByteBuffer pre-allocated and pre-initialized (with 'x's) during construction, and doesn't allocate any new buffer during sends. In contrast, the netty client code allocates a new ByteBuf in each sendMsg() call, and initializes each message with 'x's in each sendMsg() call. Why is the netty client implementation NOT doing similar recycling of a single ByteBuf created at construction time?? 

2. Similarly, the netty client and server code both go to great lengths to convert to and use NIO ByteBuffers instead of netty's nice and fast ByteBuf: instead of using ByteBuf.getLong() directly in the channelRead() method, it hops through extracting a ByteBuffer, going through two method calls (that are not there in the coralreactor client), and hopping into a coralreactor-styled handleMessage() call to process the byte buffer to a... ByteBuffer.getLong(). [Why didn't they put a sleep in there while they were at it?] The coralreactor implementation, in contrast, handles its input directly in the buffer form it came in, and doesn't have any s̶l̶e̶e̶p̶s̶ extra and unneeded conversion steps in the code. I bet that if you wrote the coralreactor client and server code to convert to netty ByteBufs before it processed stuff, and allocate a new ByteBuf for each send, while having the netty code use (and reuse) ByteBufs, the roles and behavior numbers would roughly reverse...
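The buffer-handling asymmetry in point 1 can be sketched in plain java.nio terms. This is not the actual benchmark code (all names here are hypothetical); it just shows the two styles being compared:

```java
import java.nio.ByteBuffer;

public class SendStyles {

    static final int MSG_SIZE = 256;

    // CoralReactor-client style: one buffer allocated and filled with 'x's
    // once, at construction time, and recycled on every send.
    static final ByteBuffer REUSED = ByteBuffer.allocateDirect(MSG_SIZE);
    static {
        for (int i = 0; i < MSG_SIZE; i++) REUSED.put((byte) 'x');
    }

    static ByteBuffer sendRecycled() {
        REUSED.clear();   // reset position/limit; the contents are already 'x's
        return REUSED;    // hand the SAME buffer to the channel every time
    }

    // The benchmark's netty-client style: allocate AND re-initialize
    // a fresh buffer on every single send.
    static ByteBuffer sendAllocating() {
        ByteBuffer fresh = ByteBuffer.allocate(MSG_SIZE);
        for (int i = 0; i < MSG_SIZE; i++) fresh.put((byte) 'x');
        fresh.flip();
        return fresh;
    }
}
```

The second variant pays for an allocation plus a 256-byte fill on every message, which is exactly the per-send overhead the coralreactor side avoids.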

Measurement:

3. The latency measured is one-way latency from client to server, measured using System.nanoTime() on each. WTF? [System.nanoTime() can't be safely used this way. Even on the same box on the same day.]

But mainly, those black-box CoralReactor Benchmarker classes do not inspire confidence:

4. To start with, why is there a custom Benchmarker built (just for netty) under com.coralblocks.nettybenchmarks.util.Benchmarker? And why is only the netty code using it for measurement? Why are the two tests not using the same benchmarker (com.coralblocks.coralbits.bench.Benchmarker)?

5. But mainly the confidence is highly degraded by output lines like "99.999% = [avg: 21.146 micros, max: 91.416 micros]". 99.999%'iles don't have averages. The 99.999%'ile is the 99.999%'ile. period. over each period. period. If you want some more ranting discussion on the subject of averaging percentiles you can find it here. When someone reports percentile averages coming out of a black box (whose code you can't find or read to understand how it makes up its numbers), you have to assume the black box is running on crystal meth.

I could keep going and point to coordinated omission, and explain that percentiles are meaningless when measured this way, but I think there are enough nails in this coffin already.

ricardo.dan...@gmail.com

May 14, 2015, 1:49:29 AM5/14/15
to mechanica...@googlegroups.com
Hi Gil,

I am one of the developers of CoralReactor. We appreciate your feedback. We believe the best way to improve a benchmark is to hear criticism from smart people and offer our arguments. We have some clients that are also clients of Azul and from their feedback we believe our components run well in the Zing VM. Please see my comments below:


On Wednesday, May 13, 2015 at 5:30:29 PM UTC-4, Gil Tene wrote:
I don't have much to contribute on where CoralReactor and Netty stand in comparison to each other. But I do have some stuff to contribute on measurement techniques and on what is actually being compared... I was a bit bored sitting in my hotel room, so thanks for providing me with some amusement. Here are some observations:

Apples to Apples? This appears to be a comparison of an apple at rest with a clockwork orange under stress:

1. The coralreactor client code recycles a single ByteBuffer pre-allocated and pre-initialized (with 'x's) during construction, and doesn't allocate any new buffer during sends. In contrast, the netty client code allocates a new ByteBuf in each sendMsg() call, and initializes each message with 'x's in each sendMsg() call. Why is the netty client implementation NOT doing similar recycling of a single ByteBuf created at construction time??  

2. Similarly, the netty client and server code both go to great lengths to convert to and use NIO ByteBuffers instead of netty's nice and fast ByteBuf: instead of using ByteBuf.getLong() directly in the channelRead() method, it hops through extracting a ByteBuffer, going through two method calls (that are not there in the coralreactor client), and hopping into a coralreactor-styled handleMessage() call to process the byte buffer to a... ByteBuffer.getLong(). [Why didn't they put a sleep in there while they were at it?] The coralreactor implementation, in contrast, handles its input directly in the buffer form it came in, and doesn't have any s̶l̶e̶e̶p̶s̶ extra and unneeded conversion steps in the code. I bet that if you wrote the coralreactor client and server code to convert to netty ByteBufs before it processed stuff, and allocate a new ByteBuf for each send, while having the netty code use (and reuse) ByteBufs, the roles and behavior numbers would roughly reverse...


We did not purposely choose to write that Netty benchmark to make it slower, as your response might suggest. Netty makes it hard to re-use things and forces you to do reference counting on its ByteBufs. I may be mistaken here, but I don't think there is an easy/natural way to write a Netty benchmark using the techniques you described, techniques that, as you noticed, are incorporated from the ground up in CoralReactor.

Anyone that dislikes the quality of that Netty benchmark code is encouraged to make it better, and that's exactly the reason why we included the full Netty benchmark source code in our article. The Benchmark class is exactly the same for both tests. They were only placed in different packages to make it easier to distribute the Netty code without any Coral Blocks dependencies. If you or anyone can come up with a better Netty benchmark code that outputs better latency numbers, then that would be a great contribution to the Netty / Low-Latency community. Our personal opinion is that using ByteBuf from Netty is not a good idea, making things not only slower but more complex. Are you aware of any benchmarks / comparisons that show that Netty's ByteBuf is faster / better than java.nio.ByteBuffer? I am asking because our benchmarks suggest exactly the opposite.

Again, if you or anyone can write a simple Netty benchmark that measures latency and performs around 2 micros per 256-byte message over TCP one-way (or round-trip if you prefer), I would be more than happy to run it on the same machine I am currently running the CoralReactor benchmarks and post the results here. I would also be willing to write the equivalent of the Netty benchmark code using CoralReactor and present the code and the numbers here for a comparison. A simple ping-pong benchmark test for measuring latency should be a simple program for any network library.

We are making available for download the complete Netty benchmark we used, which was already listed in the article, including the Benchmark class that was missing. You can download it from here: http://www.coralblocks.com/NettyBench.zip. Refer to the README.txt file included for the complete command lines on how to execute the client and the server.

The reasons why we are confident that CoralReactor is much faster are:

1. We are getting very positive feedback from our clients. Like you, they are skeptical and prefer to do their own independent benchmarks and run CoralReactor and CoralFIX on their own environments to come up with their own latency numbers. That's a good thing and we fully encourage them to do that during their free full version trial. Fortunately they have been reporting numbers closer to the ones from our own benchmarks.

2. CoralReactor is single-threaded by design, from the ground up. There is only one pinned selector thread doing all operations, re-using and pooling all objects. That does not mean you can't add a second selector thread to scale your architecture, but that's completely different from adding a second thread sharing state with other threads. When that happens you start having to use multithreading techniques that introduce not only complexity but a lot of latency.

3. CoralReactor produces zero garbage. That's zero, not little garbage. We wrote a super-optimized NIO reactor and rewrote the EPoll selector implementation for Linux, optimizing and cleaning it to the last bit for performance and zero garbage creation. That allows for the development of ultra-low-latency servers and clients with very little variance.

4. We are using Java as a syntax language and avoiding the JDK completely, at least the classes that do not perform well or produce garbage. We provide libraries and tools for our clients (CoralBits) so that they can do the same.

5. CoralReactor makes it much easier (and that's a subjective matter but we have been receiving positive feedback from clients about simplicity) to write asynchronous, non-blocking, single-threaded network clients and servers, TCP and UDP including broadcast and multicast.

 
Measurement:

3. The latency measured is one-way latency from client to server, measured using System.nanoTime() on each. WTF? [System.nanoTime() can't be safely used this way. Even on the same box on the same day.]

We have found System.nanoTime() to be fairly reliable and monotonic on the same Linux box without NTP servers. Moreover, System.nanoTime() is being used in both the Netty and CoralReactor benchmarks, so it should influence/affect both benchmarks equally. We have also used native RDTSC as a timestamper and the numbers measured were very similar.
 

But mainly, those black-box CoralReactor Benchmarker classes do not inspire confidence:

As mentioned above, these classes are the same and we are providing the source code together with the netty benchmark source code for download.
 

4. To start with, why is there a custom Benchmarker built (just for netty) under com.coralblocks.nettybenchmarks.util.Benchmarker? And why is only the netty code using it for measurement? Why are the two tests not using the same benchmarker (com.coralblocks.coralbits.bench.Benchmarker)?

Explained above, but for completeness: "The Benchmark class is exactly the same for both tests. They were only in different packages to make it easier to distribute the Netty code without any CoralBlocks dependencies." Source code for the Benchmarker class will be provided from now on.
 

5. But mainly the confidence is highly degraded by output lines like "99.999% = [avg: 21.146 micros, max: 91.416 micros]". 99.999%'iles don't have averages. The 99.999%'ile is the 99.999%'ile. period. over each period. period. If you want some more ranting discussion on the subject of averaging percentiles you can find it here. When someone reports percentile averages coming out of a black box (whose code you can't find or read to understand how it makes up its numbers), you have to assume the black box is running on crystal meth.

Perhaps when you see the code from the Benchmarker class, this will become clearer. We are storing every measurement in a sorted list, then calculating the percentiles on top of it. For example, 99.999%'ile means: if you take the 99.999% best measurements of the whole dataset, you will find that the average is X and the max time (biggest outlier) is Y. That's important because your average might be great but you might have some terrible outliers hidden in there. By presenting the worst outlier you can at least have an idea of the worst-case scenario for your latency, up to the 99.999%'ile, without having to calculate the standard deviation. Our opinion is that average and worst outlier, up to a percentile, give enough information to evaluate latency / performance.
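Assuming the Benchmarker works as described above (sort everything, then report stats over the best p%), the two competing notions can be sketched as follows. The class and method names are hypothetical, not the actual Benchmarker code:

```java
public class Percentiles {

    // The conventional p'th percentile: the single value below which
    // (roughly) p% of the sorted samples fall.
    static long percentile(long[] sorted, double p) {
        int idx = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(idx, 0)];
    }

    // The Benchmarker-style report as described: the average of ONLY the
    // best p% of samples, silently discarding the entire tail above p.
    static double avgOfBest(long[] sorted, double p) {
        int n = (int) (p / 100.0 * sorted.length);
        double sum = 0;
        for (int i = 0; i < n; i++) sum += sorted[i];
        return sum / n;
    }
}
```

With samples 1..100 (already sorted), the conventional 90th percentile is 90, while the "avg of the best 90%" is 45.5, a number that excludes the worst 10% of results entirely, which is exactly Gil's objection.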
 

I could keep going and point to coordinated omission, and explain that percentiles are meaningless when measured this way, but I think there are enough nails in this coffin already.

Thanks for your feedback. Even if it can sometimes be interpreted by some as harsh, we respect it and understand that this is just your personal style. Hopefully the arguments I presented above will offer some balance to this great discussion.

Gil Tene

May 14, 2015, 3:53:15 AM5/14/15
to mechanica...@googlegroups.com


On Thursday, May 14, 2015 at 12:49:29 AM UTC-5, ricardo.dan...@gmail.com wrote:
Hi Gil,

I am one of the developers of CoralReactor. We appreciate your feedback. We believe the best way to improve a benchmark is to hear criticism from smart people and offer our arguments. We have some clients that are also clients of Azul and from their feedback we believe our components run well in the Zing VM. Please see my comments below:

Thanks for the compliments above. And while my response to the tests themselves and the measurement techniques may be harsh, I hope that it is clear that I have no position (good or bad) on how CoralReactor stands in relation to netty when it comes to performance and latency behavior, since I have no numbers, results, or first-hand experience to base such a position on.

My main criticism of the results posted is that (for the enumerated reasons) I don't believe that they provide numbers or results to base any comparison on. 
 
On Wednesday, May 13, 2015 at 5:30:29 PM UTC-4, Gil Tene wrote:
I don't have much to contribute on where CoralReactor and Netty stand in comparison to each other. But I do have some stuff to contribute on measurement techniques and on what is actually being compared... I was a bit bored sitting in my hotel room, so thanks for providing me with some amusement. Here are some observations:

Apples to Apples? This appears to be a comparison of an apple at rest with a clockwork orange under stress:

1. The coralreactor client code recycles a single ByteBuffer pre-allocated and pre-initialized (with 'x's) during construction, and doesn't allocate any new buffer during sends. In contrast, the netty client code allocates a new ByteBuf in each sendMsg() call, and initializes each message with 'x's in each sendMsg() call. Why is the netty client implementation NOT doing similar recycling of a single ByteBuf created at construction time??  

2. Similarly, the netty client and server code both go to great lengths to convert to and use NIO ByteBuffers instead of netty's nice and fast ByteBuf: instead of using ByteBuf.getLong() directly in the channelRead() method, it hops through extracting a ByteBuffer, going through two method calls (that are not there in the coralreactor client), and hopping into a coralreactor-styled handleMessage() call to process the byte buffer to a... ByteBuffer.getLong(). [Why didn't they put a sleep in there while they were at it?] The coralreactor implementation, in contrast, handles its input directly in the buffer form it came in, and doesn't have any s̶l̶e̶e̶p̶s̶ extra and unneeded conversion steps in the code. I bet that if you wrote the coralreactor client and server code to convert to netty ByteBufs before it processed stuff, and allocate a new ByteBuf for each send, while having the netty code use (and reuse) ByteBufs, the roles and behavior numbers would roughly reverse...


We did not purposely choose to write that Netty benchmark to make it slower, as your response might suggest. Netty makes it hard to re-use things and forces you to do reference counting on its ByteBufs. I may be mistaken here, but I don't think there is an easy/natural way to write a Netty benchmark using the techniques you described, techniques that, as you noticed, are incorporated from the ground up in CoralReactor.

I'll leave it up to Norman or someone else that is more netty-wise than me to provide an example of how they would actually implement the logical equivalent to the CoralReactor-style client and server in netty-style. But my impression from reading examples and actual code that uses netty is that people either manage and recycle their ByteBuf in a way that is similar to how you do it for ByteBuffer, or that they use the higher level abstractions like context and channels, and operations like write, flush, and writeAndFlush at those levels, letting the underlying layers actually manage the buffers. You can look at their documentation examples (e.g. http://netty.io/4.0/xref/io/netty/example/worldclock/WorldClockServerHandler.html and http://netty.io/4.0/xref/io/netty/example/worldclock/WorldClockClientHandler.html) to see what typical netty clients and servers would be doing for simple request/response scenario like the one you construct in the benchmark.

If you think that the idiomatic netty way of doing things is slow, or significantly slower than CoralReactor's, use it and measure it.
 
Anyone that dislikes the quality of that Netty benchmark code is encouraged to make it better, and that's exactly the reason why we included the full Netty benchmark source code in our article.

The simple observation of the tests so far is that there is no way to tell if the netty numbers are 10x as bad because the extra work *your* code does in the netty client and server cases is 10x as expensive, or because netty itself is actually 10x as expensive to use when used according to their instructions, documentation, and examples.

It's up to you to fix that by changing your netty-based client and server. I doubt anyone else will do that for you. 
 
The Benchmark class is exactly the same for both tests. They were only placed in different packages to make it easier to distribute the Netty code without any Coral Blocks dependencies.

It's good to know that the two tests used identical benchmark code. It was impossible to know without the source, but now that you'll be providing the source, we can see for ourselves. I'd still recommend that you have both tests use the exact same Benchmark class (the package-isolated netty one you created). It keeps things simple and avoids guesswork.
 
If you or anyone can come up with a better Netty benchmark code that outputs better latency numbers, then that would be a great contribution to the Netty / Low-Latency community.
 
Our personal opinion is that using ByteBuf from Netty is not a good idea, making things not only slower but more complex.

That's a valid opinion and an interesting hypothesis. So far, there are no numbers (in this discussion or the referenced material) that show it to be true (or false).
 
Are you aware of any benchmarks / comparisons that show that Netty's ByteBuf is faster / better than java.nio.ByteBuffer? I am asking because our benchmarks suggest exactly the opposite.

I am not aware of benchmarks that suggest that ByteBufs are faster or better than java.nio.ByteBuffer. But I'm also not aware of any that show the opposite. The benchmarks that started this discussion certainly don't show that.

I have no idea how much allocation (if any) is involved when you do it the idiomatic netty way. But I find it hard to believe that common netty applications written by performance-minded people would be allocating and initializing a new buffer per message. Not even regular NIO or non-NIO socket users do that. That's as close to worst-case coding as you can come.

As such, I think that so far we are looking at an apples (idiomatic for the library tested, non-allocating) to oranges (non-idiomatic for the library tested, artificially allocating for no reason) comparison, with results dominated by those qualities. If you want a credible comparison to base your speed comparison claims on, I'd suggest you fix your benchmark and measure again. I'll be happy to review it either privately or publicly to give input on the measurement technique and scenario comparison.
 
Again if you or anyone can write a simple Netty benchmark that measures latency and performs around 2 micros per 256-byte message over TCP one-way (or round-trip if you prefer) I would be more than happy to run it on the same machine I am currently running the CoralReactor benchmarks and post the results here.

I don't really know what the netty numbers will look like. Maybe someone else can post some. I'd say that what we have from your tests so far is some basis for establishing some of CoralReactor's numbers for the above (see the Coordinated Omission note below about not trusting results with multiple 9s in them). But there is no basis for establishing netty's.
 
I would also be willing to write the equivalent of the Netty benchmark code using CoralReactor and present the code and the numbers here for a comparison. A simple ping-pong benchmark test for measuring latency should be a simple program for any network library.

We are making available for download the complete Netty benchmark we used which was already listed in the article, including the Benchmark class that was missing. You can download it from there: http://www.coralblocks.com/NettyBench.zip. Refer to the README.txt file included for the complete command lines on how to execute the client and the server.

The reasons why we are confident on CoralReactor being much faster are:

1. We are getting very positive feedback from our clients. Like you, they are skeptical and prefer to do their own independent benchmarks and run CoralReactor and CoralFIX on their own environments to come up with their own latency numbers. That's a good thing and we fully encourage them to do that during their free full version trial. Fortunately they have been reporting numbers closer to the ones from our own benchmarks.

2. CoralReactor is single-threaded by design, from the ground up. There is only one pinned selector thread doing all operations, re-using and pooling all objects. That does not mean you can't add a second selector thread to scale your architecture, but that's completely different from adding a second thread sharing state with other threads. When that happens you start having to use multithreading techniques that introduce not only complexity but a lot of latency.

Those are interesting qualities. They probably produce good behavior characteristics. But as far as claims that compare those characteristics to netty's probably-also-good ones, actual measurement comparing your preferred way of writing to CoralReactor with well-written netty code would be the best way to support them.
 
3. CoralReactor produces zero garbage. That's zero, not little garbage. We wrote a super-optimized NIO reactor and rewrote the EPoll selector implementation for Linux, optimizing and cleaning it to the last bit for performance and zero garbage creation. That allows for the development of ultra-low-latency servers and clients with very little variance.

That's a good quality. But the effects of zero garbage won't show up in a 1,000,000-message test on a single connection, as not even a single newgen collection will normally trigger during such a test on a normally configured HotSpot setup. A much longer test would probably be needed to establish the effects of GC (you'll want a generously sized newgen, and at least 10s of GC events).
 
4. We are using Java as a syntax language and avoiding the JDK completely, at least the classes that do not perform well or produce garbage. We provide tools for our clients (CoralBits) so that they can do the same

5. CoralReactor makes it much easier (and that's a subjective matter but we have been receiving positive feedback from clients about simplicity) to write asynchronous, non-blocking, single-threaded network clients and servers, TCP and UDP including broadcast and multicast.

Those are also good qualities, but I don't see why they help you feel that you are much faster. More elegant, simpler, and more consistent (if you have measurements to support the "more" part) I can see, and maybe the cumulative side effect of those things give you some speed. But it's not a direct result of avoiding garbage and being easier to write to.
 
Measurement:

3. The latency measured is one-way latency from client to server, measured using System.nanoTime() on each. WTF? [System.nanoTime() can't be safely used this way. Even on the same box on the same day.]

We have found System.nanoTime() to be fairly reliable and monotonic on the same Linux box without NTP servers. Moreover, System.nanoTime() is being used in both the Netty and CoralReactor benchmarks, so it should influence/affect both benchmarks equally. We have also used native RDTSC as a timestamper and the numbers measured were very similar.

This observation was not claiming bias. Just wrong/dangerous measurement. The sort that can introduce noise or bias that may look like a signal but isn't. It would be just as bad for measuring netty, CoralReactor, or an idle loop. 

Using System.nanoTime() to compare against numbers collected in another process is specifically "discouraged" in the documentation. Specifically: "...This method can only be used to measure elapsed time and is not related to any other notion of system or wall-clock time. The value returned represents nanoseconds since some fixed but arbitrary origin time (perhaps in the future, so values may be negative). The same origin is used by all invocations of this method in an instance of a Java virtual machine; other virtual machine instances are likely to use a different origin....". It doesn't get more specific than that.

You may be lucky, and your nanoTime in two processes may be aligned in the version of the JDK and OS you are using on the day of the week you did the test in. But I've seen JDK versions that intentionally add a random (generated at process start) base number to nanoTime to discourage the exact practice you are attempting to use.

RDTSC is similarly tricky. You'd need to verify that your RDTSCs are actually synchronized across all the cores involved in the test (which they are on some systems). But I've seen multiple systems in the wild where one core (usually thread 0 of core 0) is skewed from the rest due to interesting choices made by the BIOS at boot time...

The reliable way to test the thing you want to test is to measure round-trip latency such that the two nanoTimes used for comparison are collected in the same JVM. It would provide you with data that is just as useful. Your test already does a round trip, and already carries the client's original timestamp back to the client in the server response, so not much needs to change to fix this issue.
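A minimal sketch of the round-trip approach, using plain blocking sockets rather than either library (so purely illustrative, not the benchmark code): both nanoTime() calls happen in the client JVM, so the arbitrary per-process origin cancels out of the subtraction.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class RoundTrip {

    public static long[] measure(int messages) throws IOException {
        try (ServerSocket server = new ServerSocket(0)) {
            // Echo server: sends each 8-byte payload back untouched.
            Thread echo = new Thread(() -> {
                try (Socket s = server.accept();
                     InputStream in = s.getInputStream();
                     OutputStream out = s.getOutputStream()) {
                    byte[] buf = new byte[8];
                    while (readFully(in, buf)) out.write(buf);
                } catch (IOException ignored) { }
            });
            echo.start();

            long[] rtts = new long[messages];
            try (Socket client = new Socket("127.0.0.1", server.getLocalPort())) {
                client.setTcpNoDelay(true);
                DataOutputStream out = new DataOutputStream(client.getOutputStream());
                DataInputStream in = new DataInputStream(client.getInputStream());
                for (int i = 0; i < messages; i++) {
                    long t0 = System.nanoTime();       // clock started...
                    out.writeLong(t0);
                    out.flush();
                    in.readLong();                     // ...and stopped in the SAME JVM
                    rtts[i] = System.nanoTime() - t0;  // per-process origin cancels out
                }
            }
            return rtts;
        }
    }

    private static boolean readFully(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) return false;
            off += n;
        }
        return true;
    }
}
```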

 

But mainly, those black-box CoralReactor Benchmarker classes do not inspire confidence:

As mentioned above, these classes are the same and we are providing the source code together with the netty benchmark source code for download.

I'd recommend you use the same class to test the CoralBlocks case, to make sure people don't have to guess about potential variations.
 
4. To start with, why is there a custom Benchmarker built (just for netty) under com.coralblocks.nettybenchmarks.util.Benchmarker? And why is only the netty code using it for measurement? Why are the two tests not using the same benchmarker (com.coralblocks.coralbits.bench.Benchmarker)?

Explained above, but for completeness: "The Benchmark class is exactly the same for both tests. They were only in different packages to make it easier to distribute the Netty code without any CoralBlocks dependencies." Source code for the Benchmarker class will be provided from now on.
 

5. But mainly the confidence is highly degraded by output lines like "99.999% = [avg: 21.146 micros, max: 91.416 micros]". 99.999%'iles don't have averages. The 99.999%'ile is the 99.999%'ile. period. over each period. period. If you want some more ranting discussion on the subject of averaging percentiles you can find it here. When someone reports percentile averages coming out of a black box (whose code you can't find or read to understand how it makes up its numbers), you have to assume the black box is running on crystal meth.

Perhaps when you see the code from the Benchmarker class, this will become more clear. We are storing every measurement in a sorted list, then calculating the percentiles on top of it. For example 99.999%'ile means: If you take the 99.999% best measurements of the whole dataset, you will find that the average is X and the max time (biggest outlier) is Y. That's important because your average might be great but you might have some terrible outliers hidden in there. By presenting the worst outlier you can at least have an idea of the worst case scenario for your latency, up to the 99.999%'ile, without having to calculate the standard deviation. Our opinion is that average and worst outlier, up to a percentile, gives enough information to evaluate latency / performance.

Ok. While that's a different way to think of "average", I'd suggest dropping it altogether. I doubt that anyone reading those results would have guessed that the above is what "average" meant in your output anyway.

What you call the Max of the 99.999% is the 99.999%'lie (and what anyone reading a number next to "99.999%" thinks you are talking about).

What you call the average is meaningless at best. But it is actually damaging/distracting because people tend to read it and pay more attention to it than to other numbers. What it actually means is "the average of the good things, without considering the bad things". Averages on latencies are meaningless enough on their own when they actually include all results. A selective average of selectively picked data (skewed towards good behavior) is even more so.
 

I could keep going and point to coordinated omission, and explain that percentiles are meaningless when measured this way, but I think there are enough nails in this coffin already.

Thanks for your feedback. Even if it can sometimes be interpreted by some as harsh, we respect it and understand that this is just your personal style. Hopefully the arguments I presented above will offer some balance to this great discussion.

Once you correct the issues I noted (use idiomatic netty code, use round-trip measurement for reliable nanoTime, use the same class to measure both cases, drop average measurements/reporting for percentiles), you'll have a new basis, but probably a couple more challenges.

Here is the harder problem to consider: The test (for both netty and CoralReactor) is significantly affected by Coordinated Omission. The effect is bigger the higher the actual outliers are, but it is there even for your CoralReactor side of the test. For example: you have a recorded max time of ~67usec during the test, with an average time of ~2usec. This means that when that max time occurred, ~33 test measurement opportunities were skipped, and the system behavior during that "blip" was significantly (by a factor of ~33x) under-reported. And that's if only a single blip occurred. What if 10 blips of a magnitude similar to the max occurred? (They would not have shown up in any other data, like the 99.999% max time you currently report.) So at least 33 and up to 330 bad results are missing from the test. That's potentially 0.33% of the data missing (all of which would have been higher latencies), which makes the 99.999%, 99.99%, and 99.9% numbers all basically bogus.
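One way to estimate what the blip hid is to back-fill the skipped measurement opportunities from the recorded data. The following is a simplified sketch of that idea (in the spirit of HdrHistogram's corrected-copy facility, but not its actual implementation; names here are made up):

```java
import java.util.ArrayList;
import java.util.List;

public class CoCorrection {

    // For every recorded value longer than the expected sending interval,
    // synthesize the measurements the stalled sender never got to take:
    // a request issued one interval later would have waited (v - interval),
    // the next one (v - 2*interval), and so on.
    static List<Long> correct(long[] recorded, long expectedInterval) {
        List<Long> out = new ArrayList<>();
        for (long v : recorded) {
            out.add(v);
            for (long missed = v - expectedInterval;
                 missed >= expectedInterval;
                 missed -= expectedInterval) {
                out.add(missed);
            }
        }
        return out;
    }
}
```

Plugging in the numbers above: correcting a recorded [2, 67] with a 2-unit expected interval synthesizes 32 extra samples (65, 63, ..., 3), roughly the ~33 skipped opportunities the single 67usec blip represents — and every one of them lands in the slow tail.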

This problem is extremely common in measurement systems of all sorts (e.g. I'm dedicating half of a full-day workshop to it tomorrow). So you are not alone. While it is fairly easy to avoid in most systems (e.g. by starting the clock when a message was supposed to be sent out per a test plan, and not when the test code happened to get around to actually sending it), it is harder to do this well for lower-latency round trips (say, those below 100-200usec) because it is hard to get wakeup times that are precise enough for that. While it is possible to do an estimated post-correction based on the recorded data (e.g. using something like copyIntoCorrectedForCoordinatedOmission() in HdrHistogram for post-processing your raw benchmark data), I would recommend something else in your case: since you can't use a sleep() variant to clock your sends, use a spinloop instead. E.g. if you choose to send 50K messages per second, you are supposed to send a message every 20usec. If the time has not arrived yet to send a message, spin until it does, and then send it. You can find an example of how to compute the expected send time using a pacer here: https://github.com/LatencyUtils/cassandra-stress2/blob/trunk/tools/stress/src/org/apache/cassandra/stress/StressAction.java#L373
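The spin-loop pacing described above might look like the sketch below (the linked StressAction code is the real reference; this class and its names are illustrative only):

```java
public class SpinPacer {

    private final long intervalNanos;
    private long nextSendTime;

    public SpinPacer(long messagesPerSecond) {
        this.intervalNanos = 1_000_000_000L / messagesPerSecond;
        this.nextSendTime = System.nanoTime();
    }

    // Busy-spin until the scheduled send time, then return that scheduled
    // time. Latency should be measured from when the message was SUPPOSED
    // to go out, not from when the test code got around to sending it.
    public long waitForNextSendTime() {
        long scheduled = nextSendTime;
        while (System.nanoTime() < scheduled) {
            // spin: no sleep(), so no coarse scheduler wakeup jitter
        }
        nextSendTime = scheduled + intervalNanos;  // pace off the plan, not off "now"
        return scheduled;
    }
}
```

Note that the next send time advances from the schedule, not from the current clock: if one send runs late, the following sends stay on plan instead of silently shifting the whole test plan to hide the stall.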
 
If you don't want to spend the time to avoid Coordinated Omission in your test (yet), I'd suggest you stop reporting high percentiles until you do. Stick to reporting the median and the max. Maybe the 90%'lie. But don't use anything with multiple nines without fixing this first, because people will read that as "this is the chance of seeing a result this big in a random sample". Without fixing the problem, your numbers actually mean "this is the chance of seeing a result this big if I only look at good results with fast responses, with a tiny fraction of the slower responses randomly added to make people feel like the non-perfect stuff wasn't completely filtered away"...

Peter Booth

May 17, 2015, 12:20:35 PM5/17/15
to mechanica...@googlegroups.com
Bob,

1. Do you already have a netty based server that is your baseline? 

2. Do you already have benchmark data of this netty based server?

3. Do you have a specific SLA/performance goal that you are trying to meet?


You can see some comparisons of netty with its peers in the web framework space at:

https://www.techempower.com/benchmarks/#section=data-r10&hw=peak&test=plaintext

These TechEmpower benchmarks are open-sourced, independent microbenchmarks that are really focused on the high-traffic web world, which is quite different from low-latency trading. They aren't perfect, but they are open, use realistic hardware, and have over a year's history.

Netty and Coral Blocks are very different things, so it's hard to compare them.

Can you be more specific about the issues you have with netty?

Peter