I'm working to make log4j garbage-free in steady running state (LOG4J2-1270). I'm now working on documentation, specifically, performance metrics that show some of the trade-offs for users:
- throughput of garbage-free vs "classic" logging, for both synchronous and asynchronous logging, as a function of the number of log event producer threads
- response time of garbage-free vs "classic" logging, for both synchronous and asynchronous logging, as a function of the number of log event producer threads
I would like feedback from the experts on this list on whether my measurement approach makes sense.

Throughput

Async logging uses the Disruptor, which has a fixed ringbuffer size. As a result, throughput is "bimodal": one level of performance as long as there is space in the ringbuffer, and a very different level after backpressure kicks in. When the ringbuffer is full we are essentially measuring the throughput of the appender (formatting + IO).

Problem: I would like to use JMH to compare async logging throughput in garbage-free mode to "classic" mode. Unfortunately I haven't found a way to control the number of iterations; the JMH throughput test iterations always exceed the ringbuffer size, so the test ends up measuring synchronous logging throughput.

One solution is to split this into two measurements: async logging with a no-op appender (essentially measuring only how fast log events can be enqueued), and synchronous logging separately (synchronous logging goes to the appender directly, without a Disruptor). The synchronous result shows what throughput will be after backpressure kicks in. JMH can be used for both tests, and on the log4j manual page we can show both graphs with throughput results for an increasing number of threads.

I feel a bit like I'm cheating by measuring async logging with a no-op appender, but I can't think of any other way... What do you think?

Response time

I would like to use HdrHistogram to measure and graph the response time distribution. Is it possible to test with JMH and collect the results in HdrHistogram? I haven't found a way to do this, so I've given up on that approach and am writing a custom latency test. I would really like feedback on this custom test from people on this list: https://github.com/apache/logging-log4j2/blob/master/log4j-core/src/test/java/org/apache/logging/log4j/core/async/perftest/SimpleLatencyTest.java
Gil kindly presented a list of latency testing DOs and DON'Ts in a previous posting. The post recommends measuring latency at various loads, like 5%, 10%, 20%, 40% of the maximum load. Several things I'm still trying to work out here:
- Basic question: how do I determine the maximum load? Is this the throughput number we got from the throughput test above?
- What mechanism should I use to reduce the load? Busy spin? So if max throughput = 5,000,000 ops/sec, and target throughput = 5% (250,000 ops/sec), then you need to spin for 4 micros between each log event?
- I was thinking of measuring at various numbers of log event producer threads. Is this a good idea at all? More threads = more load, but it becomes harder to work out how to reduce the load. Would it be enough to multiply the idle time between events by the number of threads?
And again, with async logging we have the issue that the ringbuffer fills up if the load is too high. So again we have the bimodal effect where we intend to measure the response time of async logging but end up measuring synchronous logging when going too fast. I'm not sure the previous solution (a no-op appender) is a good idea here: most (90%) of the memory is allocated in the downstream part (the appender), and since I want to compare garbage-free to "classic" logging I'm worried that excluding this part will skew the latency results.

Thoughts?

Remko Popma
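[Editor's illustration] The enqueue-only throughput measurement described above can be approximated outside of JMH. This is a rough stand-in, not the actual Log4j harness: a bounded queue plays the role of the Disruptor ringbuffer, and a discarding consumer plays the role of the no-op appender. Class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

public class NoOpEnqueueThroughput {

    /**
     * Measures how many events per second the given number of producer threads
     * can enqueue while a no-op consumer drains and discards events.
     */
    public static long measureOpsPerSec(int producerThreads, long durationMillis)
            throws InterruptedException {
        BlockingQueue<String> buffer = new ArrayBlockingQueue<>(256 * 1024);
        AtomicLong enqueued = new AtomicLong();
        AtomicBoolean running = new AtomicBoolean(true);

        // No-op "appender": drains events and throws them away.
        Thread consumer = new Thread(() -> {
            while (running.get() || !buffer.isEmpty()) {
                buffer.poll();
            }
        });
        consumer.start();

        List<Thread> producers = new ArrayList<>();
        for (int i = 0; i < producerThreads; i++) {
            Thread t = new Thread(() -> {
                long count = 0;
                while (running.get()) {
                    if (buffer.offer("log event")) {
                        count++;
                    }
                }
                enqueued.addAndGet(count);
            });
            producers.add(t);
            t.start();
        }

        Thread.sleep(durationMillis);
        running.set(false);
        for (Thread t : producers) t.join();
        consumer.join();

        return enqueued.get() * 1000 / durationMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("1 thread:  " + measureOpsPerSec(1, 500) + " ops/sec");
        System.out.println("4 threads: " + measureOpsPerSec(4, 500) + " ops/sec");
    }
}
```

This only shows the shape of the experiment (vary the thread count, report ops/sec); JMH's forking, warmup, and blackhole handling are exactly what this toy version lacks.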
--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
On Thu, Apr 21, 2016 at 7:44 AM, Remko Popma <remko...@gmail.com> wrote:
> Response time: I would like to use HdrHistogram to measure and graph the response time distribution. Is it possible to test with JMH and collect the results in HdrHistogram?

Perhaps you can use JLBH (Java Latency Benchmark Harness) that is based on JMH, from Peter Lawrey. I have not tried it myself but read about it here: https://vanilla-java.github.io/2016/04/02/Microservices-in-the-Chronicle-World-Part-5.html

Regards,
Patrik
Patrik Nordwall
Akka Tech Lead
You can throttle the logging/producer threads with Blackhole.consumeCPU to show behaviour for less than all-out producers.

> Problem: I would like to use JMH to compare async logging throughput in garbage-free mode to "classic" mode. Unfortunately I haven't found a way to control the number of iterations; the JMH throughput test iterations always exceed the ringbuffer size and it ends up measuring synchronous logging throughput.
+1 Date/Time formatter
What you can do is:
- keep a buffer with the formatted date/time
- when a new log event arrives, compare the last time and the new time:
- if the time is the same at the second level -> just update the millisecond part.
- if the same at the minute level -> just update the seconds part.
- etc...
- otherwise, format the whole timestamp.
This is enough to significantly boost formatting, because consecutive log events are usually very close together in time.
And it's easy to do GC-free by formatting into a byte[] (or similar) instead of going through String :)
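[Editor's illustration] A minimal sketch of the caching idea above, for a hypothetical "HH:mm:ss.SSS" pattern only: the "HH:mm:ss" prefix is formatted once per second, and within the same second only the three millisecond digits are patched into a reused char buffer. (Single-threaded; the GC-free variant would write into a byte[] rather than return a String.)

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

/**
 * Caches the formatted "HH:mm:ss" part of a timestamp and patches only the
 * millisecond digits when consecutive timestamps fall within the same second.
 * Not thread-safe: intended as a per-thread or single-threaded sketch.
 */
public class CachedTimeFormatter {
    private final SimpleDateFormat secondsFormat = new SimpleDateFormat("HH:mm:ss");
    private final char[] buffer = new char[12]; // holds "HH:mm:ss.SSS"
    private long cachedSecond = Long.MIN_VALUE;

    public CachedTimeFormatter(TimeZone tz) {
        secondsFormat.setTimeZone(tz);
    }

    public String format(long epochMillis) {
        long second = epochMillis / 1000;
        if (second != cachedSecond) {
            // Slow path: format the whole "HH:mm:ss" prefix, once per second.
            String prefix = secondsFormat.format(new Date(epochMillis));
            prefix.getChars(0, 8, buffer, 0);
            buffer[8] = '.';
            cachedSecond = second;
        }
        // Fast path: patch only the three millisecond digits.
        int millis = (int) (epochMillis % 1000);
        buffer[9] = (char) ('0' + millis / 100);
        buffer[10] = (char) ('0' + (millis / 10) % 10);
        buffer[11] = (char) ('0' + millis % 10);
        return new String(buffer);
    }
}
```

The same trick extends to the minute/hour/date levels as the list above describes; each coarser field just moves the boundary between the cached prefix and the patched suffix.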
Hope this helps
GG
On 24 April 2016 at 10:00, Remko Popma <remko...@gmail.com> wrote:
> Nitsan, for the throughput test I am looking for a way to only measure a fixed number of iterations (up to the ringbuffer size). Can the Blackhole.consumeCPU API help with that?

As an alternative to Blackhole.consumeCPU, it is sometimes useful to call Thread.yield(). This allows many more threads to interleave, plus it adds system calls and possible context switching into the mix, which can be more realistic in some situations. It's good to try a number of techniques and see how they play out.
I've often found date formatting in loggers to be one of the dominant factors. Have you considered writing a custom date formatter for the most common format? This has provided a manyfold improvement for me when I must support a logging API.
On Sunday, April 24, 2016 at 7:13:18 PM UTC+9, Martin Thompson wrote:
> As an alternative to Blackhole.consumeCPU, it is sometimes useful to call Thread.yield(). This will allow for many more threads interleaving, plus it adds system calls and possible context switching into the mix, which can be more realistic in some situations.

Martin, the problem I encounter when trying to measure async logging throughput is that there is one level of performance as long as there is space in the ringbuffer, and a very different level after backpressure kicks in. When the ringbuffer is full we are essentially measuring the throughput of the appender (formatting + IO).

This becomes a problem when I try to measure throughput with JMH. I would prefer using JMH to rolling my own performance test, but JMH exceeds the ringbuffer size... For the async case, perhaps JMH is simply not the right tool for what I'm trying to do? (Or perhaps I don't understand how the consumeCPU() or Thread.yield() APIs can help with this.)
On 24 April 2016 at 11:47, Remko Popma <remko...@gmail.com> wrote:
> This becomes a problem when I try to measure throughput with JMH. I would prefer using JMH to rolling my own performance test, but JMH exceeds the ringbuffer size... For the async case, perhaps JMH is simply not the right tool for what I'm trying to do? (Or perhaps I don't understand how the consumeCPU() or Thread.yield() APIs can help with this.)

You are right in that it does not address that issue directly. It can help by yielding the CPU so the flusher threads can do their work in writing the page cache down to the block devices.

I find these types of approaches tend to have at least three modes:

1. Bursting into the application buffer (the Disruptor in this case). You need to size this for your typical bursts.
2. Writing from the application buffer to a file that is in the page cache. This is heavily influenced by the amount of memory, vm.swappiness, etc.
3. When the page cache is overwhelmed, writing to real storage.

These three can be combined/conflated depending on whether memory mapping and/or synchronous IO is used. It is useful to benchmark each component of the solution, then combine the components and bench the aggregate.

When benching the aggregate I often find JMH to be limiting. Producer-to-consumer scenarios, having a fixed number of iterations in a run, or targeting a throughput rate don't tend to fit naturally. Maybe it is my limitation and I don't easily see how JMH works in these scenarios.
A test harness for aggregate components needs to be able to run at different throughput rates, optionally introduce bursts, and measure response time for all operations and not just sample.
To get back to your question: how about mocking out the writing to disk in the appender? Then set the time per iteration to a suitably high number and measure that way. You will have issues with Coordinated Omission with JMH because it really only tracks service time, not response time. You could add GC logging to see how "lumpy" the throughput is in classic mode vs garbage-free. However, the GC is not likely to behave in a microbenchmark the way it does in a real application with a complex heap of different objects.
> Gil kindly presented a list of latency testing DOs and DON'Ts in a previous posting. The post recommends measuring latency at various loads, like 5%, 10%, 20%, 40% of the maximum load. Several things I'm still trying to work out here:

> 1. Basic question: How to determine the maximum load? Is this the throughput number we got from the throughput test above?

The maximum load I refer to is the one from the throughput test above, with the caveat that if you vary the number of logging threads, each thread count may have a different "maximum load" to test against. Remember to also try an attempted load that is greater than the achieved maximum load, to verify that the measurements are correct.

> 2. What mechanism to use to reduce the load? Busy spin? So if max throughput = 5,000,000 ops/sec, and target throughput = 5% (250,000 ops/s), then you need to spin for 4 micros between each log event?

Don't think of it as reducing the load. Think of it as pacing your load. The attempted pace could be faster than you are actually able to achieve, and that's fine (even recommended as a test point).

For an example of how you might pace a thread, you can look at (and lift) the Pacer class we put in cassandra-stress2, and how it is used. That specific pacer implementation uses a loop with LockSupport.parkNanos(delta) until the "right time" is actually reached. But you can do the same with a spin loop, or [preferably] a hybrid combo: spin if the wait is short; otherwise parkNanos until "a bit before" the time actually needed, and spin for the rest. The benefit of the spinning variants is that they will start each operation closer to when it actually needs to start, so that the wakeup time (which ends up delaying the actual start compared to where it was supposed to be) does not count against the latency.

And it is VERY important that you measure your response time based on what the expected start time for the operation was, and NOT based on when the operation actually started. The Pacer class example above provides an expected start time for a constant-throughput-per-thread model. This is probably sufficient for what you are looking to do, but more complex models are possible (e.g. random or Poissonian distribution of start times, modeled spikes in activity, playback of actual recorded arrival times, etc.). The key thing is that the Pacer should know when each operation is supposed to start; the response time is the delta between that (expected) start time and the actual completion time. [The service time is the time between the actual start time of the operation and the actual completion time, and it's easy to track it too for comparison. The cassandra-stress2 code linked above does just that.]
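[Editor's illustration] A stripped-down sketch of the pacing idea described above. This is not the actual cassandra-stress2 Pacer; it implements the constant-throughput-per-thread model with the hybrid park-then-spin wait, and the spin threshold is an assumed value. The crucial property is that expected start times are derived from the initial start time, never from when operations actually ran.

```java
import java.util.concurrent.locks.LockSupport;

/**
 * Paces one thread at a fixed throughput. Expected start times are computed
 * from the initial start time, NOT from actual operation start times, so
 * responseTime = completion - expectedStart includes queueing delay (avoiding
 * coordinated omission). serviceTime = completion - actualStart can be
 * tracked alongside for comparison.
 */
public class Pacer {
    // Assumed threshold: park if the deadline is further out than this, spin otherwise.
    private static final long SPIN_THRESHOLD_NANOS = 100_000;

    private final long intervalNanos;
    private final long initialStartTime;
    private long operationsStarted;

    public Pacer(double opsPerSec) {
        this.intervalNanos = (long) (1_000_000_000.0 / opsPerSec);
        this.initialStartTime = System.nanoTime();
    }

    /** Expected start time of the next operation, on the fixed schedule. */
    public long expectedNextStartNanos() {
        return initialStartTime + operationsStarted * intervalNanos;
    }

    /** Blocks until the next operation is due; returns its expected start time. */
    public long acquire() {
        long expected = expectedNextStartNanos();
        long now;
        while ((now = System.nanoTime()) < expected) {
            long wait = expected - now;
            if (wait > SPIN_THRESHOLD_NANOS) {
                // Park to "a bit before" the deadline, then spin the remainder.
                LockSupport.parkNanos(wait - SPIN_THRESHOLD_NANOS);
            }
        }
        operationsStarted++;
        return expected;
    }
}
```

A measuring loop would then do: `long expected = pacer.acquire(); doOperation(); long responseTime = System.nanoTime() - expected;` and record responseTime in an HdrHistogram. Note that if the system falls behind, acquire() returns immediately with past-due expected times, which is exactly what makes the backlog visible in the response-time numbers.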
> I feel a bit like I'm cheating by measuring async logging with a no-op appender, but I can't think of any other way... What do you think?

The throughput of the log system will have to account for the actual formatting and I/O. The main reason to use async vs. sync is to reduce the latency observed in logging threads. So I don't think it's cheating for the async perf tests to be measuring expected latency instead of throughput.

BTW, having a full buffer degrade the latency of logging calls seems bad. Past versions of log4j had an async appender with a "non-blocking" mode, where events were discarded if the buffer was full. Metrics were recorded for the number of discarded events, and extra log events were synthesized so that the actual log output would include summaries for the events that had to be thrown away.

I would think you could put async logging into such a mode (using RingBuffer#tryNext instead of RingBuffer#next) to do a throughput test that does not get bottlenecked by the appender.
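[Editor's illustration] The drop-on-full idea can be sketched with a bounded queue standing in for the ring buffer; here offer() plays the role the Disruptor's RingBuffer#tryNext would play (tryNext fails rather than blocks when the buffer is full). The class is hypothetical, not a Log4j API.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Non-blocking enqueue: events are discarded when the buffer is full, and the
 * drops are counted so a summary event can later be synthesized in the output.
 */
public class DroppingLogQueue {
    private final BlockingQueue<String> buffer;
    private final AtomicLong dropped = new AtomicLong();

    public DroppingLogQueue(int capacity) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
    }

    /** Returns true if the event was enqueued, false if discarded. Never blocks. */
    public boolean tryLog(String event) {
        if (buffer.offer(event)) {
            return true;
        }
        dropped.incrementAndGet();
        return false;
    }

    /** Drains one event; a background appender thread would call this in a loop. */
    public String poll() {
        return buffer.poll();
    }

    /** Events discarded so far; reportable as a synthesized summary log event. */
    public long droppedCount() {
        return dropped.get();
    }
}
```

With this mode, a producer-side throughput test measures enqueue-or-drop rate and never degrades into measuring the appender, at the cost of lost events under overload.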
Gil, I understand your point that response time is the real observable latency. The reason why (correct me if I'm wrong) is that a single spike in service time can delay many subsequent responses.

I still have trouble understanding what I'm seeing in my own testing.

Service time: max always less than 400 microseconds, even under loads of 1.4 million messages/sec. I'm pretty happy with this result. (Although I don't understand where these 260-350us spikes come from; they are suspiciously regular, too. Still, this is only ~30 spikes in 20+ million messages.)

But we shouldn't look at this; instead we should look at response time. Response time (for the SAME tests as above): single-digit millis from 1 million msg/sec, 20+ millis from 1.2 million msg/sec.

How can these response times be caused by queuing when service time behaviour is pretty much constant even when increasing the load? Or, phrased differently: how can ~30 service-time "spikes" of 260-350 microseconds add up to 20+ millis of response time delays?
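[Editor's illustration] The back-of-envelope arithmetic behind this puzzle, using the numbers from the message above (illustrative only): a stall queues up arrivals, and when the arrival rate is close to the service rate, only the small surplus capacity is available to drain that backlog, so the backlog (and the response time of everything stuck behind it) persists far longer than the stall itself.

```java
/**
 * Back-of-envelope queueing arithmetic: how long the backlog from a brief
 * stall lingers when arrivals nearly saturate the service rate.
 */
public class StallBacklogMath {

    /** Events that arrive while service is stalled for the given duration. */
    public static double backlog(double arrivalRatePerSec, double stallSeconds) {
        return arrivalRatePerSec * stallSeconds;
    }

    /** Time to drain that backlog using only the surplus service capacity. */
    public static double drainSeconds(double backlogEvents,
                                      double serviceRatePerSec,
                                      double arrivalRatePerSec) {
        return backlogEvents / (serviceRatePerSec - arrivalRatePerSec);
    }

    public static void main(String[] args) {
        double arrivals = 1_200_000;  // 1.2M msg/sec offered load
        double service = 1_400_000;   // roughly the observed max throughput
        double stall = 300e-6;        // one 300-microsecond service-time spike

        double events = backlog(arrivals, stall);
        double drain = drainSeconds(events, service, arrivals);

        System.out.printf("backlog after one stall: %.0f events%n", events);
        System.out.printf("time to drain backlog:   %.4f s%n", drain);
        // One 300us stall queues ~360 events; with only 200k/sec of surplus
        // capacity the backlog takes ~1.8ms to drain. Stalls that recur before
        // the previous backlog has drained compound into tens of milliseconds.
    }
}
```

So constant-looking service times and 20+ ms response times are compatible: each spike's cost is amplified by the ratio of arrival rate to surplus capacity, and overlapping backlogs stack.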
A simpler fix would be to instantiate a new Pacer for each LatencyTest run. I.e. make a new Pacer here: https://github.com/apache/logging-log4j2/blob/master/log4j-core/src/test/java/org/apache/logging/log4j/core/async/perftest/ResponseTimeTest.java#L170
On Sunday, May 1, 2016 at 3:03:41 AM UTC+9, Gil Tene wrote:
> A simpler fix would be to instantiate a new Pacer for each LatencyTest run. I.e. make a new Pacer here: https://github.com/apache/logging-log4j2/blob/master/log4j-core/src/test/java/org/apache/logging/log4j/core/async/perftest/ResponseTimeTest.java#L170

Just to double-check: you mean the simpler fix is to instantiate the Pacer immediately before entering the do/while loop (lines 169-171), use the same single Pacer instance in all invocations of the runLatencyTest() method for that thread, and never explicitly call Pacer.setInitialStartTime() once we enter the do/while loop. Is that correct?