disruptor.net perfomance test - too slow

328 views
Skip to first unread message

Oleg Vazhnev

unread,
Nov 16, 2012, 2:51:18 AM11/16/12
to lmax-di...@googlegroups.com
I'm using disruptor.net port of disruptor. I have significant perfomance problems and I don't know why:

- am I just measure perfomance wrong way?
- am I use disruptor wrong way?
- .net port of disruptor is so bad?

Probably someone can help me. In my real application I've measured that using Disruptor I spent ~50 microseconds in average to deliver one message.
The code below is very simplified version of my real application. The difference is that in real appplication i have several consumers and I do pass "reference" instead of "int".

However even simplified test below produced such output:

average = 4 minimum = 3 0-5 = 3429, 5-10 = 428, 10-30 = 848, >30 = 295

4 microsecond for Disruptor is HUGE delay. I expect it always to be much less than 1 microsecond. Waiting for suggestions and hints (also probably someone can test the same code with Java version, it would be easy to port I assume :)

using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Disruptor;
using Disruptor.Dsl;

namespace DisruptorTest
{
    public sealed class ValueEntry
    {
        public int Value { get; set; }

        public ValueEntry()
        {
            //   Console.WriteLine("New ValueEntry created");
        }
    }

    public class ValueAdditionHandler : IEventHandler<ValueEntry>
    {
        public void OnNext(ValueEntry data, long sequence, bool endOfBatch)
        {

            long microseconds = Program.sw[data.Value].ElapsedTicks / (Stopwatch.Frequency / (1000L * 1000L));
            Program.results[data.Value] = microseconds;
        }
    }

    class Program
    {
        public const int length = 10000;
        public static Stopwatch[] sw = new Stopwatch[length];
        public static long[] results = new long[length];

        private static readonly int _ringSize = 1048576;  // Must be multiple of 2

        public static int factorial;

        static void Main(string[] args)
        {
            for (int i = 0; i < length; i++)
            {
                sw[i] = Stopwatch.StartNew();
            }

            var disruptor = new Disruptor.Dsl.Disruptor<ValueEntry>(() => new ValueEntry(), _ringSize, TaskScheduler.Default);

            disruptor.HandleEventsWith(new ValueAdditionHandler());

            var ringBuffer = disruptor.Start();

            for (int i = 0; i < length; i++)
            {
                var valueToSet = i;
                long sequenceNo = ringBuffer.Next();

                ValueEntry entry = ringBuffer[sequenceNo];

                entry.Value = valueToSet;

                sw[i].Restart();
                ringBuffer.Publish(sequenceNo);

                // let's simulate some work
                factorial = 1;
                for (int j = 1; j < 100000; j++)
                {
                    factorial *= j;
                }
            }

            // be absolutely sure that all events are delivered
            Thread.Sleep(1000);

            long average = 0;
            long minimum = 10000000000;
            int firstFive = 0;
            int fiveToTen = 0;
            int tenToThirty = 0;
            int moreThenThirty = 0;

            // count only second half because first half might be too slow by some "start-up" reason
            for (int i = length / 2; i < length; i++)
            {
                average += results[i];
                if (results[i] < minimum)
                {
                    minimum = results[i];
                }
                if (results[i] < 5)
                {
                    firstFive++;
                }
                else if (results[i] < 10)
                {
                    fiveToTen++;
                }
                else if (results[i] < 30)
                {
                    tenToThirty++;
                }
                else
                {
                    moreThenThirty++;
                }
            }
            average /= (length - 100);
            Console.WriteLine("average = {0} minimum = {1} 0-5 = {2}, 5-10 = {3}, 10-30 = {4}, >30 = {5}", average, minimum, firstFive, fiveToTen, tenToThirty, moreThenThirty);
        }
    }
}

Oleg Vazhnev

unread,
Nov 16, 2012, 2:56:04 AM11/16/12
to lmax-di...@googlegroups.com
Bug in code

average /= (length - 100);
should be
average /= (length / 2)

Everything else is still correct.

New output (average a little bit more):
average = 6 minimum = 3 0-5 = 4040, 5-10 = 625, 10-30 = 191, >30 = 144

Olivier

unread,
Nov 16, 2012, 4:39:49 AM11/16/12
to lmax-di...@googlegroups.com, lmax-di...@googlegroups.com
Could you describe the hardware used?
Have you tried to run the latency perf test in the perf test suite, in release mode, without debugger attached?

Olivier
--
 
 

Oleg Vazhnev

unread,
Nov 16, 2012, 5:03:01 AM11/16/12
to lmax-di...@googlegroups.com
The results are for my laptop Lenovo W530 (using ivy bridge processor).

I've just also tried to run my test on HP DL360p Gen8 (2 * Xeon E5-2640), the results are:
average = 4 minimum = 4 0-5 = 2579, 5-10 = 2388, 10-30 = 32, >30 = 1

This is in Release build with no debugger attached.
No I didn't tried to run perf tests, probaly I should.
Do you think that my test is correct and the problem is with hardware or sotware?

Oleg Vazhnev

unread,
Nov 16, 2012, 8:25:41 AM11/16/12
to lmax-di...@googlegroups.com
Adding perf tests result (not sure if this is important).
So the question is still why my original test is so slow.

E:\Oleg\documents\Disruptor-perfRunner-1.0.0.0>Disruptor.PerfTests.exe 0 0 1
Name: OLEGPC
Manufacturer: LENOVO
Model: 2436CTO

Operating System: Microsoft Windows 8 Pro with Media Center
 - Version: 6.2.9200
 - ServicePack: 0

Number of Processors: 1
 - Name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
 - Description: Intel64 Family 6 Model 58 Stepping 9
 - ClockSpeed: 2601 Mhz
 - Number of cores: 4
 - Number of logical processors: 8
 - Hyperthreading: ON

Memory: 15977 MBytes
 - L1Cache: 64 KBytes
 - L2Cache: 256 KBytes
 - L3Cache: 6144 KBytes

Scenario=All, Implementation=All, Runs=1
UniCast1P1CDisruptorPerfTest:24 125 452
UniCast1P1CBlockingCollectionPerfTest:3 076 923
UniCast1P1CDataflowPerfTest:6 814 310
MultiCast1P3CDisruptorPerfTest:16 583 747
MultiCast1P3CBlockingCollectionPerfTest:993 048
MultiCast1P3CDataflowPerfTest:
Pipeline3StepDisruptorPerfTest:26 560 424
Pipeline3StepBlockingCollectionPerfTest:1 136 363
Pipeline3StepDataflowPerfTest:1 474 926
Sequencer3P1CDisruptorPerfTest:15 612 802
Sequencer3P1CBlockingCollectionPerfTest:2 671 415
Sequencer3P1CDataflowPerfTest:
DiamondPath1P3CDisruptorPerfTest:26 455 026
DiamondPath1P3CBlockingCollectionPerfTest:1 231 527
DiamondPath1P3CDataflowPerfTest:883 392
Pipeline3StepLatencyDisruptorPerfTest : Histogram{min=-21, max=547745, mean=137.
69, 99%=128, 99.99%=131072, [2=15673064, 4=0, 8=0, 16=0, 32=0, 64=0, 128=4202065
, 256=10405, 512=8974, 1024=8795, 2048=11327, 4096=12602, 8192=17833, 16384=2459
1, 32768=11927, 65536=7315, 131072=9802, 262144=553, 524288=688, 1048576=59, 209
7152=0, 4194304=0, 8388608=0, 16777216=0, 33554432=0, 67108864=0, 134217728=0, 2
68435456=0, 536870912=0, 1073741824=0, 9223372036854775807=0]}
2       15673064
4       0
8       0
16      0
32      0
64      0
128     4202065
256     10405
512     8974
1024    8795
2048    11327
4096    12602
8192    17833
16384   24591
32768   11927
65536   7315
131072  9802
262144  553
524288  688
1048576 59
2097152 0
4194304 0
8388608 0
16777216        0
33554432        0
67108864        0
134217728       0
268435456       0
536870912       0
1073741824      0
9223372036854775807     0
Pipeline3StepLatencyBlockingCollectionPerfTest : Histogram{min=505, max=1877227,
 mean=1822.45, 99%=4096, 99.99%=524288, [2=0, 4=0, 8=0, 16=0, 32=0, 64=0, 128=0,
 256=0, 512=29626, 1024=675878, 2048=278179, 4096=6773, 8192=1650, 16384=1129, 3
2768=1225, 65536=1220, 131072=2286, 262144=1592, 524288=422, 1048576=10, 2097152
=10, 4194304=0, 8388608=0, 16777216=0, 33554432=0, 67108864=0, 134217728=0, 2684
35456=0, 536870912=0, 1073741824=0, 9223372036854775807=0]}
2       0
4       0
8       0
16      0
32      0
64      0
128     0
256     0
512     29626
1024    675878
2048    278179
4096    6773
8192    1650
16384   1129
32768   1225
65536   1220
131072  2286
262144  1592
524288  422
1048576 10
2097152 10
4194304 0
8388608 0
16777216        0
33554432        0
67108864        0
134217728       0
268435456       0
536870912       0
1073741824      0
9223372036854775807     0
Pipeline3StepLatencyDataflowPerfTest : Histogram{min=9223372036854775807, max=0,
 mean=0, 99%=0, 99.99%=0, [2=0, 4=0, 8=0, 16=0, 32=0, 64=0, 128=0, 256=0, 512=0,
 1024=0, 2048=0, 4096=0, 8192=0, 16384=0, 32768=0, 65536=0, 131072=0, 262144=0,
524288=0, 1048576=0, 2097152=0, 4194304=0, 8388608=0, 16777216=0, 33554432=0, 67
108864=0, 134217728=0, 268435456=0, 536870912=0, 1073741824=0, 92233720368547758
07=0]}
2       0
4       0
8       0
16      0
32      0
64      0
128     0
256     0
512     0
1024    0
2048    0
4096    0
8192    0
16384   0
32768   0
65536   0
131072  0
262144  0
524288  0
1048576 0
2097152 0
4194304 0
8388608 0
16777216        0
33554432        0
67108864        0
134217728       0
268435456       0
536870912       0
1073741824      0
9223372036854775807     0
UniCast1P1CBatchDisruptorPerfTest:120 627 261
UniCast1P1CBatchBlockingCollectionPerfTest:2 861 230
UniCast1P1CBatchDataflowPerfTest:8 080 808

Olivier

unread,
Nov 16, 2012, 1:00:01 PM11/16/12
to lmax-di...@googlegroups.com, lmax-di...@googlegroups.com
I had a closer look to your code:
1. There is no warm up phase so you're measuring JIT time
2. You're not configuring the disruptor to use the right claim strategy, it should be the single threaded one since you have only one producer
3. You're doing computations on the hot path, you should just get raw measures and compute once the test is complete (for instance store stopwatch.GetTimestamp).
4. Your array used to store stopwatches is accessed by both threads in sequence, the first one writing, the other one reading which likely cause lots of false sharing. 
5. Your final measures include the instrumentation (stopwatch calls do not come for free). To do this properly you would have to measure the time spent to instrument and subtract it from your final results.

Hope this helps..

Olivier
--
 
 

Oleg Vazhnev

unread,
Nov 16, 2012, 1:19:14 PM11/16/12
to lmax-di...@googlegroups.com
Thank you Olivier!

1. Even in real application while measuring Disruptor I do see that it takes 2-3 times longer to process data comparing to BlockingCollection. As real application works for "several hours" I'm sure i do not measure JIT there, so I do think that JIT likely is not the reason why my test above is so slow.
3. Agree, I better to store Stopwatch.ElapsedTicks, however my computations are trivial and can not affect test result significantly.
4. Ok, but i do not know how to make things better... How can I measure delivery of the message from one thread to another without accessing Stopwatch from different threads?
5. Agree, but this can not significantly affect measurement. Stopwatch is able to measure with "microsecond precision" without introducing significant overhead.

2. Can this slowdown disruptor in 10-100 times? I would appreciate If you or someone else can "point" me to a "right" example that covers my needs.

Thanks,
 Oleg
Reply all
Reply to author
Forward
0 new messages