Disruptor in log4j2, disappointing multi-producer performance


Remko Popma

Mar 15, 2013, 12:00:54 PM
to lmax-di...@googlegroups.com
I am working on a module for log4j2 that provides asynchronous logging based on the Disruptor. (https://issues.apache.org/jira/browse/LOG4J2-163)
When running performance tests I was surprised to find that with multiple producers (threads that log messages), throughput of the log4j (ArrayBlockingQueue-based) AsynchAppender was much higher than the Disruptor throughput.

Is this to be expected with multiple producers?

On Solaris 10 (64bit) with JDK1.7.0_06, 4-core Xeon X5570 CPU @2.93GHz with hyperthreading switched on (8 virtual cores):
Single Thread
DisruptorLogger throughput: 1,268,737 ops/sec. latency(ns): avg=808.8 99% < 1024.0 99.99% < 131072.0 (50000 samples)
AsynchAppender  throughput:   857,867 ops/sec. latency(ns): avg=892.2 99% < 3686.4 99.99% < 29491.2 (50000 samples)

Two Threads
DisruptorLogger throughput: 536,877 ops/sec. latency(ns): avg=2351.9 99% < 10649.6 99.99% < 222822.4 (100000 samples)
AsynchAppender  throughput: 444,856 ops/sec. latency(ns): avg=1926.4 99% < 8192.0 99.99% < 183500.8 (100000 samples)

Four Threads
DisruptorLogger throughput: 61,183 ops/sec. latency(ns): avg=3015.3 99% < 10649.6 99.99% < 5236326.4 (200000 samples)
AsynchAppender throughput: 183,071 ops/sec. latency(ns): avg=4270.5 99% < 32768.0 99.99% < 327680.0 (199998 samples)

Eight Threads
DisruptorLogger throughput: 27,097 ops/sec. latency(ns): avg=3430.8 99% < 16128.0 99.99% < 18419712.0 (399966 samples)
AsynchAppender throughput: 111,550 ops/sec. latency(ns): avg=3817.5 99% < 39731.2 99.99% < 4449894.4 (399994 samples)


On Windows 7 (64bit) with JDK1.7.0_11, 2-core Intel i5-3317u CPU @1.70GHz with hyperthreading switched on (4 virtual cores):
Single Thread
DisruptorLogger throughput: 838,069 ops/sec. latency(ns): avg=2064.6 99% < 4096.0 99.99% < 996147.2 (4492 samples)
AsynchAppender  throughput: 613,668 ops/sec. latency(ns): avg=2506.0 99% < 14745.6 99.99% < 996147.2 (2765 samples)

Two Threads
DisruptorLogger throughput: 606,741 ops/sec. latency(ns): avg=1433.8 99% < 5324.8 99.99% < 235929.6 (9690 samples)
AsynchAppender  throughput: 458,478 ops/sec. latency(ns): avg=1497.9 99% < 7372.8 99.99% < 196608.0 (10000 samples)

Four Threads
DisruptorLogger throughput: 54,436 ops/sec. latency(ns): avg=3573.5 99% < 32051.2 99.99% < 1395916.8 (20000 samples)
AsynchAppender throughput: 229,912 ops/sec. latency(ns): avg=4627.6 99% < 13619.2 99.99% < 589824.0 (20000 samples)

Eight Threads
DisruptorLogger throughput: 34,308 ops/sec. latency(ns): avg=7705.5 99% < 20736.0 99.99% < 2340454.4 (40000 samples)
AsynchAppender throughput: 165,455 ops/sec. latency(ns): avg=5167.7 99% < 13670.4 99.99% < 1238220.8 (40000 samples)


I suspect that I am doing something wrong somewhere...
First, could there be a problem with my performance test?

public static void singleThreadPerfTest() throws InterruptedException {
    // JVM warmup
    Histogram warmupHist = createHistogram();
    for (int i = 0; i < 10; i++) {
        perfTest(200000, warmupHist);
    }
    // wait for buffers to drain
    Thread.sleep(10000);

    // test
    Histogram histogram = createHistogram();
    perfTest(100000, histogram);
    printReport(histogram);
}

private static void perfTest(int lines, Histogram histogram) {
    if (throughput) {
        runThroughputTest(lines, histogram);
    } else {
        runLatencyTest(histogram);
    }
}

public static void multiThreadPerfTest(int threadCount) throws InterruptedException {
    final Histogram warmupHist = createHistogram();
    Runnable run1 = new Runnable() {
        public void run() {
            for (int i = 0; i < 10; i++) {
                perfTest(200000, warmupHist);
            }
        }
    };
    Thread thread1 = new Thread(run1);
    Thread thread2 = new Thread(run1);
    thread1.start();
    thread2.start();
    thread1.join();
    thread2.join();
    
    // wait for buffers to drain
    Thread.sleep(10000);
    
    // test
    final Histogram[] histograms = new Histogram[threadCount];
    for (int i = 0; i < histograms.length; i++) {
        histograms[i] = createHistogram();
    }
    final int LINES = 100000;
    Thread[] threads = new Thread[threadCount];
    for (int i = 0; i < threads.length; i++) {
        final Histogram histogram = histograms[i];
        threads[i] = new Thread() {
            public void run() {
                perfTest(LINES, histogram);
            }
        };
    }
    for (Thread thread : threads) {
        thread.start();
    }
    for (Thread thread : threads) {
        thread.join();
    }
    for (Histogram histogram : histograms) {
        printReport(threadCount, histogram);
    }
}

private static void runThroughputTest(int lines, Histogram histogram) {
    Logger logger = LogManager.getLogger(PerfTest.class); // look up logger outside the timed section
    long s1 = System.nanoTime();
    for (int j = 0; j < lines; j++) {
        logger.info(LINE500);
    }
    long s2 = System.nanoTime();
    long opsPerSec = (1000L * 1000L * 1000L * lines) / (s2 - s1);
    histogram.addObservation(opsPerSec);
}

private static void runLatencyTest(Histogram histogram) {
    long nanoTimeCost = calcNanoTimeCost();
    Logger logger = LogManager.getLogger(PerfTest.class);
    final int SAMPLES = 10000;
    for (int i = 0; i < SAMPLES; i++) {
        long s1 = System.nanoTime();
        logger.info(LINE500);
        long s2 = System.nanoTime();
        long value = s2 - s1 - nanoTimeCost;
        if (value > 0) {
            histogram.addObservation(value);
        }
        // wait 1 microsec
        final long PAUSE_NANOS = 1000;
        long pauseStart = System.nanoTime();
        while (PAUSE_NANOS > (System.nanoTime() - pauseStart)) {
            // busy spin
        }
    }
}

Sam Barker

Mar 15, 2013, 12:05:28 PM
to lmax-di...@googlegroups.com
What sort of system are you running these tests on? Depending on the wait strategies in use, you may see relatively poor performance from the Disruptor if you have fewer cores than threads.

Sam




Sam Barker

Mar 15, 2013, 12:07:24 PM
to lmax-di...@googlegroups.com
Re-reading, I see the core counts. Though I would still look at the wait and claim strategies you're using.

Remko Popma

Mar 15, 2013, 12:41:05 PM
to lmax-di...@googlegroups.com
I am using the constructor Disruptor(factory, ringBufferSize, executor) (in disruptor-2.10.4),
so that would be the default: MultiThreadedClaimStrategy and BlockingWaitStrategy.
BusySpinWaitStrategy is probably not a good idea in a logging library, which leaves SleepingWaitStrategy and YieldingWaitStrategy...
I can give those a try...

Michael Barker

Mar 15, 2013, 5:37:31 PM
to lmax-di...@googlegroups.com
I would try out version 3.0.0.beta3. The biggest change was an
improved multiproducer algorithm.

We use the Disruptor for logging. Our preference is for
BlockingWaitStrategy in CI and other resource constrained/virtualised
environments and SleepingWaitStrategy in production. The advantage of
the SleepingWaitStrategy is that it has very low impact on the
publishing thread which is what you want from a logging system. You
can tolerate the additional latency for actually getting the message
logged. Note that the SleepingWaitStrategy performs quite poorly on
Windows. I wouldn't recommend the BusySpin or Yielding strategies for
logging.
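The low impact on the publishing thread comes from the consumer-side wait loop backing off instead of spinning hard. Below is a rough stdlib sketch of the idea behind SleepingWaitStrategy; this is an illustration, not the Disruptor source, and `cursor` here is just an AtomicLong standing in for the ring buffer's published sequence:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

public class BackoffWait {
    // Wait until the published cursor reaches the requested sequence:
    // busy-spin briefly, then yield, then park. Cheap for the publisher,
    // at the cost of slightly slower wake-up for the waiting consumer.
    public static long waitFor(AtomicLong cursor, long sequence) {
        int retries = 200;
        long available;
        while ((available = cursor.get()) < sequence) {
            if (retries > 100) {
                retries--;                 // busy spin
            } else if (retries > 0) {
                retries--;
                Thread.yield();            // give up the core
            } else {
                LockSupport.parkNanos(1L); // sleep between checks
            }
        }
        return available;
    }

    public static void main(String[] args) {
        AtomicLong cursor = new AtomicLong(5);
        System.out.println(waitFor(cursor, 3)); // already available: prints 5
    }
}
```

Only the waiting thread degrades from spinning to yielding to parking; the publishing thread never pays for this loop, which is the property Mike describes wanting from a logging system.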

Mike.

Hander

Mar 15, 2013, 8:54:09 PM
to lmax-di...@googlegroups.com
What would happen if you have a lot of publisher threads logging events? That would mean a lot of contention on a single disruptor for logging onto a single file.

I was thinking of having a single disruptor for each publisher thread and a pool of shared pinned consumers "polling" each disruptor. Would that be a stupid idea using the disruptor? Obviously, the big problem would be out-of-order events if you look at the event timestamp (assuming it was timestamped on the publisher side), but the events would still be ordered per publisher... Any ideas for reducing the publishing contention on a single disruptor?

rgrds

Michael Barker

Mar 16, 2013, 5:53:08 PM
to lmax-di...@googlegroups.com
> What would happen if you have a lot of publisher threads logging events?
> That would mean a lot of contention on a single disruptor for logging
> onto a single file.
>
> I was thinking of having a single disruptor for each publisher thread and
> a pool of shared pinned consumers "polling" each disruptor. Would that be
> a stupid idea using the disruptor? Obviously, the big problem would be
> out-of-order events if you look at the event timestamp (assuming it was
> timestamped on the publisher side), but the events would still be ordered
> per publisher... Any ideas for reducing the publishing contention on a
> single disruptor?

This approach has been mentioned before and will work well in certain
scenarios, but for a generalised logger it may not be the best approach.
Consider a web server that could have 1000s of threads: creating over
1000 instances of the Disruptor (depending on the size of the buffer)
could give you a number of other performance issues, such as high GC
cost as each newly allocated Disruptor with its large number of
preallocated objects needs to be promoted.

Remko Popma

Mar 17, 2013, 12:25:44 AM
to lmax-di...@googlegroups.com
Thank you all for your kind advice!

I found an issue with my tests: I was causing the queue/ringbuffer to overflow, which was not what I was trying to measure.

After reducing the number of events to fit the queue size, I got these results on Windows 7 (64bit) with JDK1.7.0_11, 2-core Intel i5-3317u CPU @1.70GHz with hyperthreading switched on (4 virtual cores):

disruptor-2.10.4
T  ABQueue  BlockWt  SleepWt    YieldWt
-  -------  -------  -------  ---------
1  571,000  941,000  921,000  1,772,000
2  539,000  547,000  642,000    767,000
4  266,000  239,000  392,000    498,000
8  144,000  146,000  187,000    181,000

disruptor-3.0.0-beta3
T  ABQueue    BlockWt  SleepWt  YieldWt
-  -------  ---------  -------  -------
1  577,000  1,009,000  897,000  966,000
2  547,000    728,000  584,000  656,000
4  258,000    459,000  391,000  401,000
8  161,000    294,000  524,000  328,000

Where T=ThreadCount, ABQueue=ArrayBlockingQueue, and the other columns are WaitStrategies.
The numbers are throughput in events/second per thread, average of 5 test runs.

Both disruptor 2.10 and 3.0 show higher throughput than ArrayBlockingQueue in all but one case.
Also, as promised, disruptor 3.0 has higher throughput than 2.10 with multiple producers. Great work, many thanks!


As to my original problem (where my queue/ringbuffer was too small in the multi-threaded scenarios),
I am speculating: could the higher contention in ArrayBlockingQueue work to "throttle" the producers?
With the disruptor, once the ringbuffer is full, performance becomes very close to synchronous loggers (not shown), that is, limited by disk I/O speed.
It is interesting to see that ArrayBlockingQueue stayed at about 10 times disk I/O speed. 
Does that mean that with 8 threads each producing 100,000 events you cannot fill an ABQ with 256K slots?

I'll do some more testing with scenarios where eventCount > queueSize.

Meanwhile, thanks again everyone for your help and advice.

Michael Barker

Mar 17, 2013, 4:29:24 AM
to lmax-di...@googlegroups.com
> I am speculating, could the higher contention in ArrayBlockingQueue work to
> "throttle" the producers?
> With the disruptor, once the ringbuffer is full, performance becomes very
> close to synchronous loggers (not shown), that is, limited by disk I/O
> speed.
> It is interesting to see that ArrayBlockingQueue stayed at about 10 times
> disk I/O speed.
> Does that mean that with 8 threads producing 100,000 events you cannot fill
> a ABQ queue with 256K slots?

It's possible. I'd probably add something that can periodically
report the remainingCapacity of the queue and disruptor as it
processes events.
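For the ArrayBlockingQueue side, a minimal stdlib sketch of that kind of probe might look like the following. (In a real test you would sample from a separate thread while the appender drains the queue; here nothing consumes, so the arithmetic is deterministic. The method name is made up for illustration.)

```java
import java.util.concurrent.ArrayBlockingQueue;

public class CapacityProbe {
    // Offer `events` items into a bounded queue, reporting remainingCapacity
    // every `sampleEvery` offers; returns the final remaining capacity.
    public static int fillAndProbe(int capacity, int events, int sampleEvery) {
        ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        for (int i = 0; i < events; i++) {
            queue.offer("event-" + i); // offer() drops instead of blocking when full
            if (i % sampleEvery == 0) {
                System.out.println("after " + i + " events, remainingCapacity="
                        + queue.remainingCapacity());
            }
        }
        return queue.remainingCapacity();
    }

    public static void main(String[] args) {
        // 256K slots and 100,000 events per producer, as in the tests above.
        int remaining = fillAndProbe(256 * 1024, 100_000, 25_000);
        // With nothing draining: 262,144 - 100,000 = 162,144 slots left.
        System.out.println("final remainingCapacity=" + remaining);
    }
}
```

If the sampled remainingCapacity never approaches zero while the consumer is draining, the producers are indeed being throttled (or outpaced) before the queue can fill.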

How are you writing to the output? Are you doing synchronised file
writes? Have you utilised the batching capability built into the
Disruptor?

Mike.

Remko Popma

Mar 17, 2013, 4:51:56 AM
to lmax-di...@googlegroups.com
> It's possible. I'd probably add something that can periodically report the remainingCapacity of the queue and disruptor as it processes events.

Not sure what you mean. You mean report capacity during the performance test, or use remainingCapacity to somehow throttle events in the producers (not sure how)?

> How are you writing to the output? Are you doing synchronised file
> writes? Have you utilised the batching capability built into the
> disruptor?

Log4j appenders write to files in a synchronized block, but with both ArrayBlockingQueue and Disruptor there is only one I/O thread, so I didn't think this would be a problem.
I am batching events, flushing on endOfBatch.
(https://issues.apache.org/jira/browse/LOG4J2-164)

Michael Barker

Mar 17, 2013, 5:11:06 AM
to lmax-di...@googlegroups.com
>> It's possible. I'd probably add something that can periodically report the remainingCapacity of the queue and disruptor as it processes events.
>
> Not sure what you mean. You mean report capacity during the performance test, or use remainingCapacity to somehow throttle events in the producers (not sure how)?

Just suggesting an option to try and prove your theory about ABQ
throttling publishers to the point where they are unable to fill the
queue.

>> How are you writing to the output? Are you doing synchronised file
>> writes? Have you utilised the batching capability built into the
>> disruptor?
>
> Log4j appenders write to files in a synchronized block, but with both ArrayBlockingQueue and Disruptor there is only one I/O thread, so I didn't think this would be a problem.

By synchronised I meant sync'd to disk, e.g. using
FileDescriptor.sync() or using a RandomAccessFile in "direct" mode.

> I am batching events, flushing on endOfBatch.
> (https://issues.apache.org/jira/browse/LOG4J2-164

Are you also flushing periodically? E.g. after a certain amount of
data is written. Failing to do so can make throughput rates a bit
jumpy as the batches get larger when the consumer gets behind. This
is because by default the Sequence is not updated until the batch
completes. The SequenceReportingEventHandler interface is designed to
support this. It passes the Sequence object into the handler, so that
the handler can update the Sequence's value after a flush that isn't
triggered by an end of batch event.

Mike.

Remko Popma

Mar 18, 2013, 7:30:59 PM
to lmax-di...@googlegroups.com

> Just suggesting an option to try and prove your theory about ABQ
> throttling publishers to the point where they are unable to fill the
> queue.

Good idea, thanks! I'll give that a try.

> By synchronised I meant sync'd to disk, e.g. using 
> FileDescriptor.sync() or using a RandomAccessFile in "direct" mode. 

Not doing any of those.

> Are you also flushing periodically?  E.g. after a certain amount of 
> data is written.  Failing to do so can make throughput rates a bit 
> jumpy as the batches get larger when the consumer gets behind.  This 
> is because by default the Sequence is not updated until the batch 
> completes.  The SequenceReportingEventHandler interface is designed to 
> support this.  It passes the Sequence object into the handler, so that 
> the handler can update the Sequence's value after a flush that isn't 
> triggered by an end of batch event. 

There are two FileAppenders: one uses a BufferedOutputStream with
the default 8K buffer size, and the other uses a 256K ByteBuffer
and a RandomAccessFile. Buffers are flushed on endOfBatch, or
when the buffer is full.
What is the advantage of updating the Sequence's value after a
flush that happened in the middle of a batch?
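As a simplified, hypothetical model of that flush policy (not the actual LOG4J2-164 appender code), the per-event logic amounts to:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class BatchingWriter {
    private final byte[] buffer;
    private final OutputStream out;
    private int pos;

    public BatchingWriter(OutputStream out, int bufferSize) {
        this.out = out;
        this.buffer = new byte[bufferSize];
    }

    // Called once per event by the single I/O thread: buffer the line,
    // flush when the buffer fills or the batch ends.
    public void onEvent(String line, boolean endOfBatch) throws IOException {
        byte[] bytes = line.getBytes(StandardCharsets.UTF_8);
        if (bytes.length >= buffer.length) { // oversized event: write through
            flush();
            out.write(bytes);
        } else {
            if (pos + bytes.length > buffer.length) {
                flush();
            }
            System.arraycopy(bytes, 0, buffer, pos, bytes.length);
            pos += bytes.length;
        }
        if (endOfBatch) {
            flush();
        }
    }

    public void flush() throws IOException {
        out.write(buffer, 0, pos);
        pos = 0;
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        BatchingWriter writer = new BatchingWriter(sink, 16);
        writer.onEvent("hello", false); // buffered, nothing written yet
        writer.onEvent("world", true);  // endOfBatch -> flush
        System.out.println(sink);       // prints helloworld
    }
}
```

Note that with this policy a backlogged consumer produces ever-larger batches between flushes, which is exactly the situation where a periodic mid-batch flush (plus sequence update) helps.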

Michael Barker

Mar 18, 2013, 8:17:27 PM
to lmax-di...@googlegroups.com
> What is the advantage of updating the Sequence's value after a
> flush that happened in the middle of a batch?

If you have a burst of traffic and the consumer gets quite a large
backlog of events, there is a higher probability of the producer
running into a case where the buffer is completely full. Example:

Imagine you have a 1K buffer. You have a burst of events such that the
producer has reached sequence 900 while the consumer is around 300.
Message rates then fall back to normal, which (for argument's sake) is
about 1/3 of the maximum rate the consumer can run at, so the consumer
starts catching up, say to around 672. If the consumer doesn't publish
periodic updates to its sequence, the producer will see the old value
of 300 until the batch completes, so it believes only 424 slots
(300 + 1024 - 900) are free rather than the nearly 800 that actually
are. A further burst can then exhaust the buffer's apparent free
space, and the producer will either need to drop the incoming events
or block the thread (which will probably cause back pressure or
message loss elsewhere in the system). If the consumer does publish
periodic updates (e.g. once every 50 messages), the producer would see
a value somewhere around ~650 and would still have available space in
the buffer to put new events.

Periodic updates allow the producer to make more regular progress.
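The free-slot arithmetic behind this can be written down as a tiny model (an illustrative helper, not Disruptor code): a producer may claim at most bufferSize slots beyond the consumer sequence it last observed.

```java
public class StaleSequenceDemo {
    // Slots the producer believes are free, given the consumer sequence
    // it can currently see (which may be stale if the consumer only
    // publishes its sequence at batch end).
    public static long freeSlots(long producerSeq, long observedConsumerSeq,
                                 int bufferSize) {
        return observedConsumerSeq + bufferSize - producerSeq;
    }

    public static void main(String[] args) {
        int bufferSize = 1024;
        long producer = 900;

        // Consumer sequence published only at batch end: producer still sees 300.
        System.out.println(freeSlots(producer, 300, bufferSize)); // 424

        // Consumer publishes every ~50 events and has actually reached ~650.
        System.out.println(freeSlots(producer, 650, bufferSize)); // 774
    }
}
```

The gap between the two results is exactly the headroom that periodic sequence updates buy the producer.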

Mike.