I'm working to make log4j garbage-free in steady running state (LOG4J2-1270). I'm now working on documentation, specifically, performance metrics that show some of the trade-offs for users:
- throughput of garbage-free vs "classic" logging, for both synchronous and asynchronous logging, as a function of the number of log event producer threads
- response time of garbage-free vs "classic" logging, for both synchronous and asynchronous logging, as a function of the number of log event producer threads
I would like feedback from the experts on this list on whether my measurement approach makes sense.

Throughput

Async logging uses the Disruptor, which has a fixed ringbuffer size. As a result, throughput is "bimodal": one level of performance as long as there is space in the ringbuffer, and a very different level after backpressure kicks in. When the ringbuffer is full we are essentially measuring the throughput of the appender (formatting + IO).

Problem: I would like to use JMH to compare async logging throughput in garbage-free mode to "classic" mode. Unfortunately I haven't found a way to control the number of iterations; the JMH throughput test iterations always exceed the ringbuffer size, so the test ends up measuring synchronous logging throughput.

One solution is to split this into two measurements: async logging with a no-op appender (essentially measuring only how fast log events can be enqueued), and synchronous logging separately (synchronous logging goes to the appender directly, without a Disruptor). The synchronous result shows what throughput will be after backpressure kicks in. JMH can be used for both tests, and on the log4j manual page we can show both graphs with throughput results for an increasing number of threads.

I feel a bit like I'm cheating by measuring async logging with a no-op appender, but I can't think of any other way... What do you think?

Response time

I would like to use HdrHistogram to measure and graph the response time distribution. Is it possible to test with JMH and collect the results in HdrHistogram? I haven't found a way to do this, so I've given up on that approach and am writing a custom latency test. I would really like feedback on this custom test from people on this list: https://github.com/apache/logging-log4j2/blob/master/log4j-core/src/test/java/org/apache/logging/log4j/core/async/perftest/SimpleLatencyTest.java
Gil kindly presented a list of latency testing DOs and DON'Ts in a previous posting. The post recommends measuring latency at various loads, like 5%, 10%, 20%, 40% of the maximum load. Several things I'm still trying to work out here:
- Basic question: how do I determine the maximum load? Is this the throughput number we got from the throughput test above?
- What mechanism should I use to reduce the load? Busy spin? So if max throughput = 5,000,000 ops/sec, and target throughput = 5% (250,000 ops/sec), then you need to spin for 4 micros between each log event?
- I was thinking of measuring at various numbers of log event producer threads. Is this a good idea at all? More threads = more load, but it becomes harder to work out how to reduce the load. Would it be enough to multiply the idle time between events by the number of threads?
And again, with async logging we have the issue that the ringbuffer fills up if the load is too high. So again we have the bimodal effect where we intend to measure the response time of async logging but end up measuring synchronous logging when going too fast. I'm not sure the previous solution (a no-op appender) is a good idea here: most (90%) of the memory is allocated in the downstream part (the appender), and since I want to compare garbage-free to "classic" logging I'm worried that excluding this part will skew the latency results.

Thoughts?

Remko Popma
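[Editor's illustration] The enqueue-only throughput measurement described above can be approximated outside of JMH. This is a rough stand-in, not the actual Log4j harness: a bounded queue plays the role of the Disruptor ringbuffer, and a discarding consumer plays the role of the no-op appender. Class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

public class NoOpEnqueueThroughput {

    /**
     * Measures how many events per second the given number of producer threads
     * can enqueue while a no-op consumer drains and discards events.
     */
    public static long measureOpsPerSec(int producerThreads, long durationMillis)
            throws InterruptedException {
        BlockingQueue<String> buffer = new ArrayBlockingQueue<>(256 * 1024);
        AtomicLong enqueued = new AtomicLong();
        AtomicBoolean running = new AtomicBoolean(true);

        // No-op "appender": drains events and throws them away.
        Thread consumer = new Thread(() -> {
            while (running.get() || !buffer.isEmpty()) {
                buffer.poll();
            }
        });
        consumer.start();

        List<Thread> producers = new ArrayList<>();
        for (int i = 0; i < producerThreads; i++) {
            Thread t = new Thread(() -> {
                long count = 0;
                while (running.get()) {
                    if (buffer.offer("log event")) {
                        count++;
                    }
                }
                enqueued.addAndGet(count);
            });
            producers.add(t);
            t.start();
        }

        Thread.sleep(durationMillis);
        running.set(false);
        for (Thread t : producers) t.join();
        consumer.join();

        return enqueued.get() * 1000 / durationMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("1 thread:  " + measureOpsPerSec(1, 500) + " ops/sec");
        System.out.println("4 threads: " + measureOpsPerSec(4, 500) + " ops/sec");
    }
}
```

This only shows the shape of the experiment (vary the thread count, report ops/sec); JMH's forking, warmup, and blackhole handling are exactly what this toy version lacks.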
--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
On Thu, Apr 21, 2016 at 7:44 AM, Remko Popma <remko...@gmail.com> wrote:
> Response time: I would like to use HdrHistogram to measure and graph the response time distribution. Is it possible to test with JMH and collect the results in HdrHistogram?

Perhaps you can use JLBH (Java Latency Benchmark Harness) that is based on JMH, from Peter Lawrey. I have not tried it myself but read about it here: https://vanilla-java.github.io/2016/04/02/Microservices-in-the-Chronicle-World-Part-5.html

Regards,
Patrik
Patrik Nordwall
Akka Tech Lead
You can throttle the logging/producer threads with Blackhole.consumeCPU to show behaviour for less than all-out producers.

> Problem: I would like to use JMH to compare async logging throughput in garbage-free mode to "classic" mode. Unfortunately I haven't found a way to control the number of iterations; the JMH throughput test iterations always exceed the ringbuffer size and it ends up measuring synchronous logging throughput.
+1 Date/Time formatter
What you can do is:
- keep a buffer with the formatted date/time
- when a new log event arrives, compare the last time and the new time:
- if the time is the same at the second level -> just update the millisecond part.
- if the same at the minute level -> just update the seconds part.
- etc...
- otherwise, format the whole timestamp.
This is enough to significantly boost formatting, because consecutive log events are usually very close together in time.
And it's easy to do GC-free by formatting into a byte[] (or similar) instead of going through String :)
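[Editor's illustration] A minimal sketch of the caching idea above, for a hypothetical "HH:mm:ss.SSS" pattern only: the "HH:mm:ss" prefix is formatted once per second, and within the same second only the three millisecond digits are patched into a reused char buffer. (Single-threaded; the GC-free variant would write into a byte[] rather than return a String.)

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

/**
 * Caches the formatted "HH:mm:ss" part of a timestamp and patches only the
 * millisecond digits when consecutive timestamps fall within the same second.
 * Not thread-safe: intended as a per-thread or single-threaded sketch.
 */
public class CachedTimeFormatter {
    private final SimpleDateFormat secondsFormat = new SimpleDateFormat("HH:mm:ss");
    private final char[] buffer = new char[12]; // holds "HH:mm:ss.SSS"
    private long cachedSecond = Long.MIN_VALUE;

    public CachedTimeFormatter(TimeZone tz) {
        secondsFormat.setTimeZone(tz);
    }

    public String format(long epochMillis) {
        long second = epochMillis / 1000;
        if (second != cachedSecond) {
            // Slow path: format the whole "HH:mm:ss" prefix, once per second.
            String prefix = secondsFormat.format(new Date(epochMillis));
            prefix.getChars(0, 8, buffer, 0);
            buffer[8] = '.';
            cachedSecond = second;
        }
        // Fast path: patch only the three millisecond digits.
        int millis = (int) (epochMillis % 1000);
        buffer[9] = (char) ('0' + millis / 100);
        buffer[10] = (char) ('0' + (millis / 10) % 10);
        buffer[11] = (char) ('0' + millis % 10);
        return new String(buffer);
    }
}
```

The same trick extends to the minute/hour/date levels as the list above describes; each coarser field just moves the boundary between the cached prefix and the patched suffix.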
Hope this helps
GG
On 24 April 2016 at 10:00, Remko Popma <remko...@gmail.com> wrote:
> Nitsan, for the throughput test I am looking for a way to only measure a fixed number of iterations (up to the ringbuffer size). Can the Blackhole.consumeCPU API help with that?

As an alternative to Blackhole.consumeCPU, it is sometimes useful to call Thread.yield(). This allows many more threads to interleave, plus it adds system calls and possible context switching into the mix, which can be more realistic in some situations. It's good to try a number of techniques and see how they play out.
I've often found date formatting in loggers to be one of the dominant factors. Have you considered writing a custom date formatter for the most common format? This has provided a manyfold improvement for me when I must support a logging API.
On Sunday, April 24, 2016 at 7:13:18 PM UTC+9, Martin Thompson wrote:
> As an alternative to Blackhole.consumeCPU, it is sometimes useful to call Thread.yield(). This will allow for many more threads interleaving, plus it adds system calls and possible context switching into the mix, which can be more realistic in some situations.

Martin, the problem I encounter when trying to measure async logging throughput is that there is one level of performance as long as there is space in the ringbuffer, and a very different level after backpressure kicks in. When the ringbuffer is full we are essentially measuring the throughput of the appender (formatting + IO).

This becomes a problem when I try to measure throughput with JMH. I would prefer using JMH to rolling my own performance test, but JMH exceeds the ringbuffer size... For the async case, perhaps JMH is simply not the right tool for what I'm trying to do? (Or perhaps I don't understand how the consumeCPU() or Thread.yield() APIs can help with this.)
On 24 April 2016 at 11:47, Remko Popma <remko...@gmail.com> wrote:
> This becomes a problem when I try to measure throughput with JMH. I would prefer using JMH to rolling my own performance test, but JMH exceeds the ringbuffer size... For the async case, perhaps JMH is simply not the right tool for what I'm trying to do? (Or perhaps I don't understand how the consumeCPU() or Thread.yield() APIs can help with this.)

You are right in that it does not address that issue directly. It can help by yielding the CPU so the flusher threads can do their work in writing the page cache down to the block devices.

I find these types of approaches tend to have at least three modes:

1. Bursting into the application buffer (the Disruptor in this case). You need to size this for your typical bursts.
2. Writing from the application buffer to a file that is in the page cache. This is heavily influenced by the amount of memory, vm.swappiness, etc.
3. When the page cache is overwhelmed, writing to real storage.

These three can be combined/conflated depending on whether memory mapping and/or synchronous IO is used. It is useful to benchmark each component of the solution, then combine the components and bench the aggregate.

When benching the aggregate I often find JMH to be limiting. Producer-to-consumer scenarios, having a fixed number of iterations in a run, or targeting a throughput rate don't tend to fit naturally. Maybe it is my limitation and I don't easily see how JMH works in these scenarios.
A test harness for aggregate components needs to be able to run at different throughput rates, optionally introduce bursts, and measure response time for all operations and not just sample.
To get back to your question: how about mocking out the writing to disk in the appender? Then set the time per iteration to a suitably high number and measure that way. You will have issues with Coordinated Omission with JMH because it really only tracks service time, not response time. You could add GC logging to see how "lumpy" the throughput is in classic mode vs garbage-free. However, the GC is not likely to behave in a microbenchmark the way it does in a real application with a complex heap of different objects.
> Gil kindly presented a list of latency testing DOs and DON'Ts in a previous posting. The post recommends measuring latency at various loads, like 5%, 10%, 20%, 40% of the maximum load. Several things I'm still trying to work out here:

> 1. Basic question: How to determine the maximum load? Is this the throughput number we got from the throughput test above?

The maximum load I refer to is the one from the throughput test above, with the caveat that if you vary the number of logging threads, each thread count may have a different "maximum load" to test against. Remember to also try an attempted load that is greater than the achieved maximum load, to verify that the measurements are correct.

> 2. What mechanism to use to reduce the load? Busy spin? So if max throughput = 5,000,000 ops/sec, and target throughput = 5% (250,000 ops/s), then you need to spin for 4 micros between each log event?

Don't think of it as reducing the load. Think of it as pacing your load. The attempted pace could be faster than you are actually able to achieve, and that's fine (even recommended as a test point).

For an example of how you might pace a thread, you can look at (and lift) the Pacer class we put in cassandra-stress2, and how it is used. That specific pacer implementation uses a loop with LockSupport.parkNanos(delta) until the "right time" is actually reached. But you can do the same with a spin loop, or [preferably] a hybrid combo: spin if the wait is short; otherwise parkNanos until "a bit before" the time actually needed, and spin for the rest. The benefit of the spinning variants is that they will start each operation closer to when it actually needs to start, so that the wakeup time (which ends up delaying the actual start compared to where it was supposed to be) does not count against the latency.

And it is VERY important that you measure your response time based on what the expected start time for the operation was, and NOT based on when the operation actually started. The Pacer class example above provides an expected start time for a constant-throughput-per-thread model. This is probably sufficient for what you are looking to do, but more complex models are possible (e.g. random or Poissonian distribution of start times, modeled spikes in activity, playback of actual recorded arrival times, etc.). The key thing is that the Pacer should know when each operation is supposed to start; the response time is the delta between that (expected) start time and the actual completion time. [The service time is the time between the actual start time of the operation and the actual completion time, and it's easy to track it too for comparison. The cassandra-stress2 code linked above does just that.]
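[Editor's illustration] A stripped-down sketch of the pacing idea described above. This is not the actual cassandra-stress2 Pacer; it implements the constant-throughput-per-thread model with the hybrid park-then-spin wait, and the spin threshold is an assumed value. The crucial property is that expected start times are derived from the initial start time, never from when operations actually ran.

```java
import java.util.concurrent.locks.LockSupport;

/**
 * Paces one thread at a fixed throughput. Expected start times are computed
 * from the initial start time, NOT from actual operation start times, so
 * responseTime = completion - expectedStart includes queueing delay (avoiding
 * coordinated omission). serviceTime = completion - actualStart can be
 * tracked alongside for comparison.
 */
public class Pacer {
    // Assumed threshold: park if the deadline is further out than this, spin otherwise.
    private static final long SPIN_THRESHOLD_NANOS = 100_000;

    private final long intervalNanos;
    private final long initialStartTime;
    private long operationsStarted;

    public Pacer(double opsPerSec) {
        this.intervalNanos = (long) (1_000_000_000.0 / opsPerSec);
        this.initialStartTime = System.nanoTime();
    }

    /** Expected start time of the next operation, on the fixed schedule. */
    public long expectedNextStartNanos() {
        return initialStartTime + operationsStarted * intervalNanos;
    }

    /** Blocks until the next operation is due; returns its expected start time. */
    public long acquire() {
        long expected = expectedNextStartNanos();
        long now;
        while ((now = System.nanoTime()) < expected) {
            long wait = expected - now;
            if (wait > SPIN_THRESHOLD_NANOS) {
                // Park to "a bit before" the deadline, then spin the remainder.
                LockSupport.parkNanos(wait - SPIN_THRESHOLD_NANOS);
            }
        }
        operationsStarted++;
        return expected;
    }
}
```

A measuring loop would then do: `long expected = pacer.acquire(); doOperation(); long responseTime = System.nanoTime() - expected;` and record responseTime in an HdrHistogram. Note that if the system falls behind, acquire() returns immediately with past-due expected times, which is exactly what makes the backlog visible in the response-time numbers.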
> I feel a bit like I'm cheating by measuring async logging with a no-op appender, but I can't think of any other way... What do you think?

The throughput of the log system will have to account for the actual formatting and I/O. The main reason to use async vs. sync is to reduce the latency observed in logging threads. So I don't think it's cheating for the async perf tests to be measuring expected latency instead of throughput.

BTW, having a full buffer degrade the latency of logging calls seems bad. Past versions of log4j had an async appender with a "non-blocking" mode, where events were discarded if the buffer was full. Metrics were recorded for the number of discarded events, and extra log events were synthesized so that the actual log output would include summaries for the events that had to be thrown away.

I would think you could put async logging into such a mode (using RingBuffer#tryNext instead of RingBuffer#next) to do a throughput test that does not get bottlenecked by the appender.
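[Editor's illustration] The drop-on-full idea can be sketched with a bounded queue standing in for the ring buffer; here offer() plays the role the Disruptor's RingBuffer#tryNext would play (tryNext fails rather than blocks when the buffer is full). The class is hypothetical, not a Log4j API.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Non-blocking enqueue: events are discarded when the buffer is full, and the
 * drops are counted so a summary event can later be synthesized in the output.
 */
public class DroppingLogQueue {
    private final BlockingQueue<String> buffer;
    private final AtomicLong dropped = new AtomicLong();

    public DroppingLogQueue(int capacity) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
    }

    /** Returns true if the event was enqueued, false if discarded. Never blocks. */
    public boolean tryLog(String event) {
        if (buffer.offer(event)) {
            return true;
        }
        dropped.incrementAndGet();
        return false;
    }

    /** Drains one event; a background appender thread would call this in a loop. */
    public String poll() {
        return buffer.poll();
    }

    /** Events discarded so far; reportable as a synthesized summary log event. */
    public long droppedCount() {
        return dropped.get();
    }
}
```

With this mode, a producer-side throughput test measures enqueue-or-drop rate and never degrades into measuring the appender, at the cost of lost events under overload.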
Gil, I understand your point that response time is the real observable latency. The reason why (correct me if I'm wrong) is that a single spike in service time can delay many subsequent responses.

I still have trouble understanding what I'm seeing in my own testing.

Service time: max always less than 400 microseconds, even under loads of 1.4 million messages/sec. I'm pretty happy with this result. (Although I don't understand where these 260-350us spikes come from; they are suspiciously regular, too. Still, this is only ~30 spikes in 20+ million messages.)

But we shouldn't look at this; instead we should look at response time. Response time (for the SAME tests as above): single-digit millis from 1 million msg/sec, 20+ millis from 1.2 million msg/sec.

How can these response times be caused by queuing when service time behaviour is pretty much constant even when increasing the load? Or, phrased differently: how can ~30 service-time "spikes" of 260-350 microseconds add up to 20+ millis of response time delays?
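[Editor's illustration] The back-of-envelope arithmetic behind this puzzle, using the numbers from the message above (illustrative only): a stall queues up arrivals, and when the arrival rate is close to the service rate, only the small surplus capacity is available to drain that backlog, so the backlog (and the response time of everything stuck behind it) persists far longer than the stall itself.

```java
/**
 * Back-of-envelope queueing arithmetic: how long the backlog from a brief
 * stall lingers when arrivals nearly saturate the service rate.
 */
public class StallBacklogMath {

    /** Events that arrive while service is stalled for the given duration. */
    public static double backlog(double arrivalRatePerSec, double stallSeconds) {
        return arrivalRatePerSec * stallSeconds;
    }

    /** Time to drain that backlog using only the surplus service capacity. */
    public static double drainSeconds(double backlogEvents,
                                      double serviceRatePerSec,
                                      double arrivalRatePerSec) {
        return backlogEvents / (serviceRatePerSec - arrivalRatePerSec);
    }

    public static void main(String[] args) {
        double arrivals = 1_200_000;  // 1.2M msg/sec offered load
        double service = 1_400_000;   // roughly the observed max throughput
        double stall = 300e-6;        // one 300-microsecond service-time spike

        double events = backlog(arrivals, stall);
        double drain = drainSeconds(events, service, arrivals);

        System.out.printf("backlog after one stall: %.0f events%n", events);
        System.out.printf("time to drain backlog:   %.4f s%n", drain);
        // One 300us stall queues ~360 events; with only 200k/sec of surplus
        // capacity the backlog takes ~1.8ms to drain. Stalls that recur before
        // the previous backlog has drained compound into tens of milliseconds.
    }
}
```

So constant-looking service times and 20+ ms response times are compatible: each spike's cost is amplified by the ratio of arrival rate to surplus capacity, and overlapping backlogs stack.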
A simpler fix would be to instantiate a new Pacer for each LatencyTest run. I.e. make a new Pacer here: https://github.com/apache/logging-log4j2/blob/master/log4j-core/src/test/java/org/apache/logging/log4j/core/async/perftest/ResponseTimeTest.java#L170
On Sunday, May 1, 2016 at 3:03:41 AM UTC+9, Gil Tene wrote:
> A simpler fix would be to instantiate a new Pacer for each LatencyTest run. I.e. make a new Pacer here: https://github.com/apache/logging-log4j2/blob/master/log4j-core/src/test/java/org/apache/logging/log4j/core/async/perftest/ResponseTimeTest.java#L170

Just to double-check: you mean the simpler fix is to instantiate the Pacer immediately before entering the do/while loop (lines 169-171), use the same single Pacer instance in all invocations of the runLatencyTest() method for that thread, and never explicitly call Pacer.setInitialStartTime() once we enter the do/while loop. Is that correct?