Performance results of Disruptor Tests

228 views
Skip to first unread message

jorge docampo carro

unread,
Mar 24, 2014, 6:15:24 AM3/24/14
to lmax-di...@googlegroups.com
Hi all,

As part of an study, i'm measuring the performance of the Disruptor with the test provided with the source code, but i'm having some troubles understanding the nature of the tests and getting unexpected results i expect you can clarify for me.

First of all, i wanted to get the performance results for Disruptor that are showed in github (https://github.com/LMAX-Exchange/disruptor/wiki/Performance-Results). When I opened and search in the source code for those test,  I couldn't find any document explaining the package hierarchy and the nature of the test, but i assume that the results posted on github are the ones under the packages com.lmax.disruptor.queue and com.lmax.disruptor.sequenced.

Working on the assumption that my choice is correct and executing the tests, i get the following results for classes OneToOneQueueThroughputTest and OneToOneSequencedThroughputTest, after pinning them to specific cores of the CPU (Dual socket Intel Xeon E5-2660 Sandy Bridge):

Starting Queue tests
Run 0, BlockingQueue=3,494,060 ops/sec
Run 1, BlockingQueue=3,624,501 ops/sec
Run 2, BlockingQueue=3,671,071 ops/sec
Run 3, BlockingQueue=3,805,175 ops/sec
Run 4, BlockingQueue=3,824,091 ops/sec
Run 5, BlockingQueue=3,762,227 ops/sec
Run 6, BlockingQueue=3,796,507 ops/sec

Starting Disruptor tests
Run 0, Disruptor=46,232,085 ops/sec
Run 1, Disruptor=40,816,326 ops/sec
Run 2, Disruptor=49,358,341 ops/sec
Run 3, Disruptor=46,598,322 ops/sec
Run 4, Disruptor=46,576,618 ops/sec
Run 5, Disruptor=49,504,950 ops/sec
Run 6, Disruptor=49,554,013 ops/sec

Which comparing with the github submitted results are better, but comparing with the results I found on this blog (http://bad-concurrency.blogspot.com.es/2012/07/disruptor-v3-faster-hopefully.html) are almost two times slower. But regarding this issue, i think the results are pretty good and I wanted to know if i'm doing the correct comparison between classes.

The real trouble i get is when i compare the results of the classes PingPongQueueLatencyTest and PingPongSequencedLatencyTest. The results i get do not match the results from github, even if i perform cpu binding, despite the fact that the results submitted on github were obtained without it as they claim.

Queue PingPong:

Without CPU binding:

PingPongQueueLatencyTest run 0 BlockingQueue 
#[Mean   =          6.2295, StdDeviation   =      120.1229]
#[Max     =   17671.1680, Total count     =    182200356]

PingPongQueueLatencyTest run 1 BlockingQueue 
#[Mean   =         5.8534, StdDeviation   =        65.7652]
#[Max     =    8415.2320, Total count     =    197990795]

PingPongQueueLatencyTest run 2 BlockingQueue 
#[Mean   =          5.0861, StdDeviation   =        36.0260]
#[Max     =    3637.1200, Total count      =    181156206]


taskset 0-7:

PingPongQueueLatencyTest run 0 BlockingQueue
#[Mean    =        18.8887, StdDeviation   =       493.1312]
#[Max      =   40284.1600, Total count     =    181431589]

PingPongQueueLatencyTest run 1 BlockingQueue
#[Mean    =        22.5575, StdDeviation   =       498.5724]
#[Max      =   29723.6480, Total count     =    181499795]

PingPongQueueLatencyTest run 2 BlockingQueue
#[Mean    =        43.7720, StdDeviation   =       805.3321]
#[Max      =   35377.1520, Total count     =    181796565]

------------------------------------------------------------------------------------

Disruptor PingPong:

Without CPU binding:

PingPongSequencedLatencyTest run 0 Disruptor
#[Mean    =       1.2853, StdDeviation   =       7.0978]
#[Max     =     283.2640, Total count    =     30188405]

PingPongSequencedLatencyTest run 1 Disruptor
#[Mean    =       1.3400, StdDeviation   =       7.2893]
#[Max     =     288.1120, Total count    =     30184346]

PingPongSequencedLatencyTest run 2 Disruptor
#[Mean    =       1.3658, StdDeviation   =       7.3930]
#[Max     =     313.9840, Total count    =     30180547]


taskset 0-7:

PingPongSequencedLatencyTest run 0 Disruptor
#[Mean    =       0.7346, StdDeviation   =       5.7340]
#[Max     =     414.8800, Total count    =     30242574]

PingPongSequencedLatencyTest run 1 Disruptor
#[Mean    =       0.7633, StdDeviation   =       6.4105]
#[Max     =     450.0960, Total count    =     30183025]

PingPongSequencedLatencyTest run 2 Disruptor
#[Mean    =       0.8277, StdDeviation   =      14.0454]
#[Max     =    2438.0160, Total count    =     30196642]


Reviewing the code, i stepped into the instruction:  histogram.recordValue(t1 - t0, pauseTimeNs); and looking for the definition of the function on the source code of HdrHistogram, i get:
     * To compensate for the loss of sampled values when a recorded value is larger than the expected
     * interval between value samples, Histogram will auto-generate an additional series of decreasingly-smaller
     * (down to the expectedIntervalBetweenValueSamples) value records.

My question is: Why do you auto-generate additional series of values when the latency is bigger than 1000ns? 
I changed this instruction for the simple one that only records the values, and the results i get are:

Queue PingPong:

Without CPU binding:

PingPongQueueLatencyTest run 0 BlockingQueue 
#[Mean    =       7.1360, StdDeviation   =       5.5498]
#[Max     =    8796.1600, Total count    =     30000001]

PingPongQueueLatencyTest run 1 BlockingQueue 
#[Mean    =       6.9327, StdDeviation   =       4.6201]
#[Max     =    8462.8480, Total count    =     30000001]

PingPongQueueLatencyTest run 2 BlockingQueue 
#[Mean    =       6.7062, StdDeviation   =       4.7566]
#[Max     =    7864.3200, Total count    =     30000001]


taskset 0-7:

PingPongQueueLatencyTest run 0 BlockingQueue
#[Mean    =       6.7305, StdDeviation   =      13.4514]
#[Max     =   39100.4160, Total count    =     30000001]

PingPongQueueLatencyTest run 1 BlockingQueue
#[Mean    =       6.6670, StdDeviation   =      15.2062]
#[Max     =   38625.2800, Total count    =     30000001]

PingPongQueueLatencyTest run 2 BlockingQueue
#[Mean    =       6.6649, StdDeviation   =      20.9045]
#[Max     =   32719.8720, Total count    =     30000001]


------------------------------------------------------------------------------------

Disruptor PingPong:

Without CPU binding:

PingPongSequencedLatencyTest run 0 Disruptor
#[Mean    =       1.0289, StdDeviation   =       0.8593]
#[Max     =     322.6560, Total count    =     30000001]

PingPongSequencedLatencyTest run 1 Disruptor
#[Mean    =       1.0397, StdDeviation   =       0.8429]
#[Max     =     327.5680, Total count    =     30000001]

PingPongSequencedLatencyTest run 2 Disruptor
#[Mean    =       1.0365, StdDeviation   =       0.7842]
#[Max     =     288.4000, Total count    =     30000001]


taskset 0-7:

PingPongSequencedLatencyTest run 0 Disruptor
#[Mean    =       0.5200, StdDeviation   =       0.6922]
#[Max     =    1800.7680, Total count    =     30000001]

PingPongSequencedLatencyTest run 1 Disruptor
#[Mean    =       0.5055, StdDeviation   =       0.5834]
#[Max     =     318.7840, Total count    =     30000001]

PingPongSequencedLatencyTest run 2 Disruptor
#[Mean    =       0.5127, StdDeviation   =       0.5935]
#[Max     =     319.2160, Total count    =     30000001]


On this run, i appreciate that the results of the Queue with cpu binding are more consistent and believable, and the results of the Disruptor are globally better. But despite of this, the results compared with the ones submitted on github are too different.

With the Queue test i executed the average latency is around 6-7 milliseconds, and on the github page are over 32 milliseconds.

With the Disruptor test i executed the average latency is more than 500 nanoseconds in the best run, even with cpu binding, and on github the results are of 52 nanoseconds.

I wish you can help me understand if i'm taking the measures correctly, and the reason of auto-generating values on the PingPong test, as well the nature of the rest of the test on the source code. Any help will be appreciated.

Michael Barker

unread,
Mar 24, 2014, 4:44:00 PM3/24/14
to lmax-di...@googlegroups.com
Hi Jorge,

See answers inline.

Working on the assumption that my choice is correct and executing the tests, i get the following results for classes OneToOneQueueThroughputTest and OneToOneSequencedThroughputTest, after pinning them to specific cores of the CPU (Dual socket Intel Xeon E5-2660 Sandy Bridge):
...
Which comparing with the github submitted results are better, but comparing with the results I found on this blog (http://bad-concurrency.blogspot.com.es/2012/07/disruptor-v3-faster-hopefully.html) are almost two times slower. But regarding this issue, i think the results are pretty good and I wanted to know if i'm doing the correct comparison between classes.

The 2x performance difference is probably due to the differences in hardware, I'm running an Ivy Bride i7-3770.  OS and JVM version can also be a factor.
 
The real trouble i get is when i compare the results of the classes PingPongQueueLatencyTest and PingPongSequencedLatencyTest. The results i get do not match the results from github, even if i perform cpu binding, despite the fact that the results submitted on github were obtained without it as they claim.

The results that are on the Github page were some very early results using version 1.0 of the Disruptor, also there were a number of problems (primarily coordinated omission, which I will get to later) with the latency test that we addressed in the PingPongQueueLatencyTest and the original latency test has been deleted.  I have a long standing task to refresh the Wiki and haven't reached the performance tests section yet.  The biggest problem with trying to compare your own results to those run by someone else is there will be a lot a variation due to hardware and OS set up.  In reality the only performance test results that matter are the ones that you get in your own environment.
 
Reviewing the code, i stepped into the instruction:  histogram.recordValue(t1 - t0, pauseTimeNs); and looking for the definition of the function on the source code of HdrHistogram, i get:
     * To compensate for the loss of sampled values when a recorded value is larger than the expected
     * interval between value samples, Histogram will auto-generate an additional series of decreasingly-smaller
     * (down to the expectedIntervalBetweenValueSamples) value records.

My question is: Why do you auto-generate additional series of values when the latency is bigger than 1000ns? 

This is done to compensate for a a typical problem when benchmarking latency called "Coordinated Omission", there a fairly detailed discussion of it on the mechanical sympathy mailing list (https://groups.google.com/d/topic/mechanical-sympathy/icNZJejUHfE/discussion).  Coordinated omission in latency tests occurs when you get latency spike that is higher than the specified delta between events.  Imagine testing at 1000 events/sec which all returned with a latency of 0.5ms, except for 1 event that took 1s.  If the test is set up such that the load generator will stall for the same period of time as the system under test i.e. 1s in this case, then that coordination between the load generation and the system under test will result in the load generator not generating (omitting) 1000 requests.  The result being that your percentile results will look really good, e.g. 99.99% (assuming > 10000 requests) would show 0.5ms, despite the fact that the system stalled for an entire second.  Backfilling those values allows the measurement to compensate for that behaviour and give more realistic results.

On this run, i appreciate that the results of the Queue with cpu binding are more consistent and believable, and the results of the Disruptor are globally better. But despite of this, the results compared with the ones submitted on github are too different.

With the Queue test i executed the average latency is around 6-7 milliseconds, and on the github page are over 32 milliseconds.

Improvements in hardware and JVM implementation are likely to explain the difference as well as the tests being quite different.
 
With the Disruptor test i executed the average latency is more than 500 nanoseconds in the best run, even with cpu binding, and on github the results are of 52 nanoseconds.

The original test attempted to calculate the latency in a way that resulted in a value that was too low.  Also the PingPong test measures a round trip, where as the original test was just measure a one-way transition and didn't take into account coordinated omission.

Mike.

jorge docampo carro

unread,
Mar 26, 2014, 11:58:23 AM3/26/14
to lmax-di...@googlegroups.com
Hi Mike

Thank you for the answers, i understand that my measures are correct as they depend on the hardware and JVM version, but there's still a point i want to remarc after doing some tests:
 
 
Reviewing the code, i stepped into the instruction:  histogram.recordValue(t1 - t0, pauseTimeNs); and looking for the definition of the function on the source code of HdrHistogram, i get:
     * To compensate for the loss of sampled values when a recorded value is larger than the expected
     * interval between value samples, Histogram will auto-generate an additional series of decreasingly-smaller
     * (down to the expectedIntervalBetweenValueSamples) value records.

My question is: Why do you auto-generate additional series of values when the latency is bigger than 1000ns? 

This is done to compensate for a a typical problem when benchmarking latency called "Coordinated Omission", there a fairly detailed discussion of it on the mechanical sympathy mailing list (https://groups.google.com/d/topic/mechanical-sympathy/icNZJejUHfE/discussion).  Coordinated omission in latency tests occurs when you get latency spike that is higher than the specified delta between events.  Imagine testing at 1000 events/sec which all returned with a latency of 0.5ms, except for 1 event that took 1s.  If the test is set up such that the load generator will stall for the same period of time as the system under test i.e. 1s in this case, then that coordination between the load generation and the system under test will result in the load generator not generating (omitting) 1000 requests.  The result being that your percentile results will look really good, e.g. 99.99% (assuming > 10000 requests) would show 0.5ms, despite the fact that the system stalled for an entire second.  Backfilling those values allows the measurement to compensate for that behaviour and give more realistic results.

With your explanation and looking on the mechanical sympathy mailing list and i have a global understanding of the problem, so i understand that the interval between samples should be aprox. the mean value of all the raw data, which is fine to be 1 microsecond for Disruptor but it seems way to smaller for the queue version. Using a interval value of 7 microseconds (aprox the mean value of the results without auto-generated values) i'll get much more stable results on consecutive executions of the Latency test, from values of 8.14, 24.82 and 41.41 microseconds to values of 7.15, 7.82 and 8.03 microseconds, as well as reducing the stdDeviation.

I was wondering if my assumption is correct or there may be another explanation for the high variability of the latency results of the queue version on the code.

Thanks,
Jorge

Michael Barker

unread,
Mar 28, 2014, 4:30:17 AM3/28/14
to lmax-di...@googlegroups.com
 
With your explanation and looking on the mechanical sympathy mailing list and i have a global understanding of the problem, so i understand that the interval between samples should be aprox. the mean value of all the raw data, which is fine to be 1 microsecond for Disruptor but it seems way to smaller for the queue version. Using a interval value of 7 microseconds (aprox the mean value of the results without auto-generated values) i'll get much more stable results on consecutive executions of the Latency test, from values of 8.14, 24.82 and 41.41 microseconds to values of 7.15, 7.82 and 8.03 microseconds, as well as reducing the stdDeviation.

1µs as the interval rate is not defined by mean latency, but instead by the input rate.  As we are pushing 1M ops/sec the only value that makes sense as the expected interval rate is 1µs.

Mike.
Reply all
Reply to author
Forward
0 new messages