Coordinated Omission


Gil Tene

unread,
Aug 3, 2013, 4:10:58 PM8/3/13
to
I've been harping for a while now about a common measurement-technique problem I call "Coordinated Omission", which can often render percentile data useless. You can find examples of me talking about this, with some detailed explanation of the problem, in my "How Not to Measure Latency" talk (the Coordinated Omission part starts at around 33:50).

I believe that this problem occurs extremely frequently in test results, but it's usually hard to deduce its existence purely from the final data reported. But every once in a while, I see test results where the data provided is enough to demonstrate the huge percentile-misreporting effect of Coordinated Omission based purely on the summary report.

I ran into just such a case in Attila's cool posting about log4j2's truly amazing performance, so I decided to avoid polluting his thread with an elongated discussion of how to compute 99.9%'ile data, and started this topic here. That thread should really be about how cool log4j2 is, and I'm certain that it really is cool, even after you correct the measurements.

Attila's results are posted at http://logging.apache.org/log4j/2.x/manual/async.html#Performance, and while they demonstrate vastly superior throughput and latency behavior compared to other loggers (including log4j, obviously), I see an issue with the reported numbers for the 99.99% latencies (and probably for the 99%). This gripe probably applies to how the LMAX Disruptor numbers are reported for 99.99% as well, but there I don't have enough data within what's posted to prove it.

Basically, I think that the 99.99% observation computation is wrong, and demonstrably (using the data posted in the graphs) exhibits the classic "Coordinated Omission" measurement problem I've been preaching about. This test is not alone in exhibiting this, and there is nothing to be ashamed of when you find yourself making this mistake. I only figured it out after doing it myself many, many times, and then I noticed that everyone else seems to also be doing it but most of them haven't yet figured it out. In fact, I run into this issue so often in percentile reporting and load testing that I'm starting to wonder if coordinated omission is there in 99.9% of latency tests ;-)

Here are the basic assumptions, and the basic math:

Assumptions:

1. I am *assuming* that in the graphs titled "Async Logging Latency Histogram" and "Async Logging Max Latency of 99.99% of Observations", the latency being measured is the latency of the call to the logger. I.e. the time that the thread spent inside the call (as opposed to a time measured between the call and some asynchronous event recorded concurrently with the thread's execution). This assumption is key to the following math, so if it's wrong, the rest of the below is basically male cow dung.

2. The test description notes that "After each call to Logger.log, the test waits for 10 microseconds * threadCount before continuing", but doesn't mention how this wait is achieved (e.g. Thread.sleep() or a spin). In any case, since 64 * 10 is 640 usec, I'll round up and assume that the wait is ~1 msec (the analysis below only gets worse if the wait is shorter than that). The loop shape I'm assuming is sketched below.
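For concreteness, a minimal sketch of the per-thread loop shape these assumptions describe (this is my assumption of the test's structure, not the actual log4j2 test code; the Runnable stands in for the Logger.log() call):

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.locks.LockSupport;

    // Assumed per-thread measurement loop: time the call, then wait a fixed interval.
    public final class AssumedTestLoop {
        static long[] run(Runnable loggingCall, int iterations, int threadCount) {
            long[] latenciesNanos = new long[iterations];
            long waitNanos = TimeUnit.MICROSECONDS.toNanos(10L * threadCount);
            for (int i = 0; i < iterations; i++) {
                long start = System.nanoTime();
                loggingCall.run();                          // the latency being measured
                latenciesNanos[i] = System.nanoTime() - start;
                // While loggingCall stalls (e.g. for 536 msec), this loop issues
                // nothing: the observations that "should" have happened in that
                // window are silently omitted.
                LockSupport.parkNanos(waitNanos);
            }
            return latenciesNanos;
        }
    }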

Analysis:

- The normal observation interval for this test is ~1 msec. I.e. each thread running the test would "normally" issue a new logging message once every ~1 msec (the ~1 msec wait plus the very tight ~1-5 usec average Logger.log() call latency).

- As long as the Logger.log() call latency is lower than ~1 msec (which is the vast majority of the time), things are "fine". However, when Logger.log() call latencies much larger than ~1 msec occur (e.g. the graph shows 10 recorded calls that took 536 msec each), each of those calls also "freezes" the testing thread for a duration much longer than the normal interval between two observations.

- Omitted observations: Effectively, each such "freeze" omits observations that would have normally happened had it not been for the back-pressure created by the call latency. I.e. Had an observer continued to issue logging calls at the normal observation interval during this freeze, additional results would have been included in the overall stats computations. E.g. for each of those ~536msec freezes, there are ~535 missing observations.

- Coordinated Omission: Omission on its own is not a big problem, as long as the omission is random. E.g. if we randomly threw away 3 million of those 5 million latency measurements each logging thread was doing, the statistical results would probably not be materially affected. However, when omission is coordinated with observed events, it can dramatically skew the statistical analysis of the remaining results. In this sort of coordinated omission, we are only omitting the "very bad results" (results that are larger than 1 msec, i.e. larger than 200x the average latency). This has a huge impact on the eventual percentile calculation.

- Based on the 10 observations of 536 msec each alone (and ignoring the 50+ other huge results next to them), we can deduce that an actual observer would have seen 5,350 additional results [10 x (536 - 1)] ranging linearly between 1 msec and 535 msec (with a median of ~268 msec).

- If we add those 5,350 results to the 5,000,000 results that the thread recorded, they would represent >0.1% of the overall data, and halfway into that additional data (at 0.05% of the overall data) we would see the median result of the missing set, which is ~268 msec.
 
- Therefore, based on the 10 outliers in the graph alone, your 99.95%'ile is at least 268 msec.

- Similarly, the 99.99%'ile is at least 582 msec (the 90%'ile of the missing data that represents 0.01% of the total observations).

That's at least 35,000x higher than reported...
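A back-of-envelope check of that math, under the same assumptions (the 10 x ~536 msec outliers and the ~1 msec interval are the ones read off the graph above):

    // Back-of-envelope check of the 99.95%'ile bound, under the stated assumptions.
    public final class CoBackOfEnvelope {
        public static void main(String[] args) {
            long recorded     = 5_000_000L;  // results actually recorded by the thread
            int  outliers     = 10;          // recorded calls of ~536 msec each
            long stallMsec    = 536;         // observed call latency during each stall
            long intervalMsec = 1;           // the "normal" ~1 msec observation interval

            // Observations omitted during each stall: one per missed interval,
            // ranging roughly linearly from ~535 msec down to ~1 msec.
            long missingPerStall = stallMsec / intervalMsec - 1;              // ~535
            long missing         = outliers * missingPerStall;                // ~5,350
            double missingShare  = (double) missing / (recorded + missing);
            long medianOfMissingMsec = (missingPerStall + 1) / 2 * intervalMsec; // ~268

            System.out.printf("omitted observations: %d (%.3f%% of all data)%n",
                    missing, missingShare * 100.0);
            System.out.printf("=> the 99.95%%'ile is at least ~%d msec%n",
                    medianOfMissingMsec);
            // (The 99.99%'ile bound above also leans on the other large results
            //  visible in the graph, which are not reproduced here.)
        }
    }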

There is good news here, too. HdrHistogram has a recording mode that automatically corrects for coordinated omission if you know what your expected interval between measurements is (and here it is ~1 msec). If you add an HdrHistogram to such a test and record your call latencies using the recordValue(value, expectedIntervalBetweenValueSamples) form, the histogram will automatically re-create the missing measurements. It is often useful to compare the output of two HdrHistograms, one using the correcting form and one just recording the raw values, as the difference in output at the higher percentiles is often "impressive".
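For reference, a sketch of what that comparison looks like (assuming a recent HdrHistogram, where the correcting form is recordValueWithExpectedInterval(); older versions exposed the same behavior via a two-argument recordValue()):

    import org.HdrHistogram.Histogram;

    // Sketch: record the same latencies into a raw histogram and a CO-corrected
    // one, then compare high percentiles. Values here are in microseconds.
    public final class RawVsCorrected {
        public static void main(String[] args) {
            long expectedIntervalMicros = 1_000;               // ~1 msec between observations
            Histogram raw       = new Histogram(3_600_000_000L, 3);   // up to 1 hour, 3 digits
            Histogram corrected = new Histogram(3_600_000_000L, 3);

            // Mostly-fast latencies with an occasional long stall.
            for (int i = 0; i < 1_000_000; i++) {
                long latency = (i % 100_000 == 0) ? 536_000 : 5;
                raw.recordValue(latency);
                // The corrected histogram re-creates the observations the stall
                // "swallowed": latency - interval, latency - 2*interval, ...
                corrected.recordValueWithExpectedInterval(latency, expectedIntervalMicros);
            }

            for (double p : new double[] {99.0, 99.9, 99.99}) {
                System.out.printf("%6.2f%%'ile  raw: %8d us   corrected: %8d us%n",
                        p, raw.getValueAtPercentile(p), corrected.getValueAtPercentile(p));
            }
        }
    }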

Footnote: anyone want to guess what that 582 msec 99.99%'ile is dominated by?

 

Vladimir Sitnikov

unread,
Aug 3, 2013, 12:43:08 PM8/3/13
to mechanica...@googlegroups.com

Gil,

Do you think coordinated omission will ever happen for a closed system?

I guess CO happens only for open systems, where the number of clients is effectively infinite.
If the number of clients is limited (e.g. just a single client), CO does not happen.

Regards,
Vladimir Sitnikov


Gil Tene

unread,
Aug 3, 2013, 5:39:27 PM8/3/13
to
Actually, Vladimir, it's the opposite. The higher the thread count is in your test system, the longer the natural "think time" that each thread will normally model, and the less impactful coordinated omission will be. A single client system is the most likely to exhibit this problem, but most multi-threaded testers exhibit it as well. It's actually a per-thread (local to the testing thread) problem, which occurs whenever an observed response time is longer than the interval that the testing thread is trying to maintain between requests. Basically, whenever a test thread that waits for a response before issuing a new request ends up "missing" on sending the next request it would have normally sent, omission occurs.

If you look at the detailed response log files for such system loaders, it's pretty evident. Whenever you see a long response time, you'll also see a long gap in timestamps between requests. It's as if someone just redacted or erased a whole bunch of bad results from the log file, leaving only the good results and "a few" bad results. Only the first bad result in each sequence of bad results usually survives the redaction. The result heavily skews all stats done on the recorded result set. [I encourage anyone reading this to dig up a results log file and look for those redacted gaps].
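A sketch of that "look for the redacted gaps" check, assuming a hypothetical per-thread log format of one "timestampMillis,latencyMillis" pair per line (adjust the parsing to whatever your load tester actually writes):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Scan a per-thread results log for "redacted gaps": places where the spacing
    // between consecutive requests is much larger than the intended interval.
    public final class GapFinder {
        public static void main(String[] args) throws IOException {
            long expectedIntervalMillis = Long.parseLong(args[1]);   // e.g. 1
            long previousTimestamp = -1;
            try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] parts = line.split(",");
                    long timestamp = Long.parseLong(parts[0].trim());
                    long latency   = Long.parseLong(parts[1].trim());
                    if (previousTimestamp >= 0) {
                        long gap = timestamp - previousTimestamp;
                        long missed = gap / expectedIntervalMillis - 1;
                        if (missed > 0) {
                            System.out.printf(
                                "gap of %d ms at t=%d (~%d omitted requests), recorded latency %d ms%n",
                                gap, timestamp, missed, latency);
                        }
                    }
                    previousTimestamp = timestamp;
                }
            }
        }
    }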

The issue lies in the mismatch between what percentiles mean to the reader of a report, and what "coordinating" testers are actually doing. When you say "99.99% of observations show an X msec or better latency", the reader isn't expecting that to mean "99.99% of the good observations were that good". They expect it to mean that 99.99% of all random, uncoordinated attempts were that good. The expectation is that if a request is sent with no warning, and at a random point in time, it will still have a roughly 99.99% chance of seeing the touted latency. But what each tester thread is actually doing as a result of waiting for a long response before sending the next one is the equivalent of asking the system "is it ok for me to test your latency right now? Because if it's not, I'm fine waiting and testing it only when you are ready..."

Note that Coordinated Omission is a measurement methodology problem in load testers, not a problem with the system under test itself. This methodology "glitch" is unfortunately very common. I rarely see load testers that don't suffer from it, including common testers used in the industry (JMeter, Grinder, HP LoadRunner), benchmarks (YCSB, SPECeverything, etc.), and custom-built testers I see used by many low-latency shops. I've seen a few rare environments that don't exhibit this problem (e.g. Twitter's Iago, and the asynchronous, thread-less client that Mike McCandless used in testing some stuff here), and low-latency shops that load things up with non-blocking testers, but those are usually the rare cases.

This very prevalent Coordinated Omission behavior basically makes load testers lie about the system they are observing, making you think the system is much better than it really is (especially in the higher percentiles). Since reality doesn't omit or forgive the bad results the way many load testers do, coordinated omission often results in poor decision-making when it comes to capacity planning and setting business or service level expectations. 



Martin Thompson

unread,
Aug 3, 2013, 6:44:18 PM8/3/13
to mechanica...@googlegroups.com
Gil you are right in that the original latency tests I did on the Disruptor do suffer from the CO issue.  It happens for all events injected when the injector is stalled for more than one microsecond.  It does not impact the consumer as all consumed events are recorded in the histogram with no omissions, provided they got injected.  The latency tests were done at an injection rate of 1 million events per second. 

Measuring CO on something like the Disruptor is very challenging, and some practical constraints have been applied.  The latencies I originally posted were for the full exchange of messages and not just the time to produce.  The time to produce is all that matters for the likes of a logger, and is the one to which CO correction can be accurately applied.

If one were to measure the cost of producing into the Disruptor alone, the cost of measurement is so great that it makes the measurement pointless.  On average it would take an order of magnitude more CPU cycles in measurement than in actual work.

Michael Barker

unread,
Aug 3, 2013, 7:34:47 PM8/3/13
to mechanica...@googlegroups.com
The Disruptor latency benchmarks have been long overdue for a
re-visit. I want to switch to using HdrHistogram, both for the
Disruptor and for within LMAX. I've also noticed that on some newer
hardware (Sandy/Ivy Bridge) I get some strange results, e.g. the mean
is higher for the Disruptor than for a BlockingQueue, but I get a fair
number of results <0ns for the Disruptor. I need to do a bit of
digging.

Mike.

Vladimir Sitnikov

unread,
Aug 4, 2013, 2:20:01 AM8/4/13
to mechanica...@googlegroups.com

> Actually, Vladimir, it's the opposite. The higher the thread count is in your test system, the longer the natural "think time" that each thread will normally model will be, and the less impactful coordinated omission will be.
Ok. Good point. What I meant was the following: CO happens when you try to model an open system with a closed system. Lots of load testers use a limited number of threads, and thus are closed.

>A single client system is the most likely to exhibit this problem, but most multi-threaded testers exhibit it as well.

In case your _real_ system has a single client, no CO happens. Say a specific task is performed by a single person. If his request gets stuck, there is no one to fire similar requests, thus no omissions, thus no CO. The case of 'a single person is not enough to generate the required load' is clear, and it has nothing to do with CO compensation.

As far as I understand, there is no rule of thumb to 'always use CO compensation in HdrHistogram'.
If the actual load rate matches the required one, no compensation is required.

Regards,
Vladimir Sitnikov

Martin Thompson

unread,
Aug 4, 2013, 3:32:08 AM8/4/13
to mechanica...@googlegroups.com
Your observation on CO is really interesting, and I have to admit I've made the mistake many times myself in the past.  More recently I've been trying to be more aware of what can potentially get hidden in measurement, and it occurred to me that while your view of CO does surface a whole class of missing observations when measuring, it also tends to assign the best-case scenario to them.  Let me try and explain this.

In the case of the Disruptor latency tests, for which I'm guilty :-), we inject a new event once per microsecond and then measure the end-to-end time, averaged per hop over a 3-stage pipeline.  The CO issue arises when something stalls the injector from sending its next event at the 1us boundary.  This could be GC, TCP back pressure, or any number of things.  If I understand CO as you put it correctly, any events that did not get sent because the injector is stalled should be included in the percentiles.  This does get us a lot closer to true percentiles.  However, it does not take into account the effect those missing events would have imposed on the system.  For example, what about the queueing effects, cache pressures, potential buffer exhaustion, etc.?  If in reality the actual number of events had been injected into the system to account for CO, then the system may even have collapsed under the load.

The more I dig into this subject, the more I see evidence that not only is CO happening at the point of stalls in a system when doing latency testing, we are also deluding ourselves about how good our systems actually are compared to reality.  To me the evidence suggests that when we do latency testing (with load testers) on our systems, our measurements reflect a much better picture than reality actually is.  If we measure actual latencies for all events in our production systems, especially with multiple points of injection, then we get a much more realistic picture.

While on the subject of confessing measurement sins: I, and I'm sure many on this list, have measured the cost of calling System.nanoTime() by repeatedly calling it in a tight loop on one thread.  If you do this you get 35-65ns between calls, depending on processor speed and which version of Linux and JVM you are using.  On Windows it often does not advance for 10s of calls at a time.  This is very misleading because you can typically add at least a "dirty hit" cache snoop cost, or much worse on a multi-socket server.  In a realistic scenario you need to assume a >100ns cost per call, with a fair bit of variance.
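The tight-loop measurement being confessed to looks roughly like this sketch; it mostly measures back-to-back calls on one warm core, which is why the number it produces is so flattering:

    // Sketch of the (misleading) tight-loop nanoTime "cost" measurement:
    // back-to-back calls on one warm core, with everything hot in cache.
    public final class NaiveNanoTimeCost {
        public static void main(String[] args) {
            int samples = 10_000_000;
            long minDelta = Long.MAX_VALUE;
            long sumDelta = 0;
            long last = System.nanoTime();
            for (int i = 0; i < samples; i++) {
                long now = System.nanoTime();
                long delta = now - last;   // a few tens of ns on Linux; often 0 on Windows
                minDelta = Math.min(minDelta, delta);
                sumDelta += delta;
                last = now;
            }
            System.out.printf("min delta: %d ns, avg delta: %.1f ns%n",
                    minDelta, (double) sumDelta / samples);
            // A real getTime(); doSomething(); getTime() interval typically pays
            // more than this (cold caches, cross-socket snoops), hence the >100ns
            // per-call budget suggested above.
        }
    }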

On the bright side of this, if the measurement technique, even when flawed, is showing a significant improvement between 2 systems then things are moving in the right direction even if the percentiles are a work of fiction.  Well done to the log4j v2 folk for taking a big step in the right direction.

Martin...

Peter Lawrey

unread,
Aug 4, 2013, 4:01:51 AM8/4/13
to mechanica...@googlegroups.com

The high-speed input data for most systems usually comes from network adapters, and these provide a constant and regular stream of data/events (when highly utilised). A single packet from a 1 Gb/s line can be ~1470 bytes and take around 13 microseconds to come down the line.  If this translates to a number of events, you have bursts of data of, say, 1-15 events every 13 microseconds.  The translation from raw data to events takes time, and if you have designed your system well it will be done within those 13 microseconds, i.e. you can parse the incoming data as fast as it comes and it shouldn't back up unless you have jitter.  When the network line is not so utilised, there are larger delays between packets/data/events.

So say you have 7 events of ~200 bytes each per packet, and you want to be able to typically parse them and add them to a queue/disruptor in half the time between packets, to allow for some jitter, i.e. ~7 microseconds. This means you will have micro-bursts of 7 events at a rate of one every microsecond (as each event takes one microsecond to parse and add) with a pause of 6 microseconds.

In short, there can be good reasons why a real system already has some throttling built in which you might not be able to change. e.g. your market data source is limited in some way like it is in a 1 Gb/s line.

It is also worth noting that your latencies around 99.99% are very dependent on the hardware (esp. the CPU) and the OS you are using at these time intervals.





Kirk Pepperdine

unread,
Aug 4, 2013, 4:18:26 AM8/4/13
to mechanica...@googlegroups.com
Hi Martin,

I dedicate a fair bit of time to this subject in the benchmarking section of my performance course. IMHO, the benchmarking section of my course, a section that I'm still far from satisfied with, is one of the *most* important sections. Yet people either don't get the difficulties of benchmarking, or don't get its importance, or think they already know how to tackle the problem. I have seen many, many benchmark failures that have resulted in losses for one reason or another. I've also seen benchmarks used as political and negotiating weapons in bake-offs. You might be able to use beer to ply a story or two out of me ;-)

Not getting an event or request or whatever injected into the system at the scheduled time indeed takes some pressure off the system, which allows it to either recover from or better cope with its existing workload. In fact, CO is more likely in a highly loaded system than in a lightly loaded one (though I'm not sure if pure thread count matters that much... I need to better understand what Gil is trying to communicate).

The problem is that in larger, more complex benchmarks, this effect is *often* overlooked because that's not the focus of the exercise. My Process Diagnostic Model (PDM) purposely includes the harness as part of the system, to try to highlight the point that if the problem is in the harness/injector, it will either hide or create phantom problems in the unit under test. I would also suggest that injecting on a heartbeat into a system that doesn't expect requests to arrive on a heartbeat will also have its own subtle effects on performance measurements (which can be seen and explained with queuing theory (Little and Markov)).


> While on the subject of confessing measurement sins: I, and I'm sure many on this list, have measured the cost of calling System.nanoTime() by repeatedly calling it in a tight loop on one thread.  If you do this you get 35-65ns between calls, depending on processor speed and which version of Linux and JVM you are using.  On Windows it often does not advance for 10s of calls at a time.  This is very misleading because you can typically add at least a "dirty hit" cache snoop cost, or much worse on a multi-socket server.  In a realistic scenario you need to assume a >100ns cost per call, with a fair bit of variance.

Yes, you have an overhead error on the front end and an error on the back end of any timed interval. Since the typical use case is getTime(); doSomething(); getTime(), this is the equivalent of the cost of a single call to the timer. The error comes from backing out the cost of getting the timer value in the first call plus getting the timer value in the second call. Do you have any idea of the effect of the distance between the two timing events in this use case?


> On the bright side of this, if the measurement technique, even when flawed, is showing a significant improvement between 2 systems then things are moving in the right direction even if the percentiles are a work of fiction.  Well done to the log4j v2 folk for taking a big step in the right direction.

Depends on how the fiction has been constructed.  ;-)

At any rate, the biggest problem with Log4J performance IME has always been the forced object (String or Object[]) construction, or the need to protect the method calls with an if statement. I've only once run into a case where the async logger was the problem. In that case I recommended using per-thread logging to individual files that could be merged later on. This drives the concurrency issues deeper into the system, where they are better managed.

-- Kirk

PS Thanks to Gil for giving this (well known) problem a name....
 


Rüdiger Möller

unread,
Aug 4, 2013, 6:30:25 AM8/4/13
to mechanica...@googlegroups.com
Interesting remark.

The skew induced by CO also depends on the type of jitter (locking-induced jitter might hit a single client only, while GC-induced jitter hits all clients) and on the frequency of test events vs. outlier duration.

E.g. a manual trading client sending at most 500 orders a day won't be able to hit a GC outlier of 400ms twice. An algo-trading or quote-machine client will be affected much harder.


Kirk Pepperdine

unread,
Aug 4, 2013, 10:03:33 AM8/4/13
to mechanica...@googlegroups.com
On 2013-08-04, at 12:30 PM, Rüdiger Möller <moru...@gmail.com> wrote:

> Interesting remark.

> The skew induced by CO also depends on the type of jitter (locking-induced jitter might hit a single client only, while GC-induced jitter hits all clients) and on the frequency of test events vs. outlier duration.

I would restrict this to behaviour in the test bed, not the component/app/whatever being tested.


> E.g. a manual trading client sending at most 500 orders a day won't be able to hit a GC outlier of 400ms twice. An algo-trading or quote-machine client will be affected much harder.

Why would a quote machine be hit?

-- Kirk

Gil Tene

unread,
Aug 4, 2013, 11:40:51 AM8/4/13
to
CO is a testing methodology and reporting problem, and has nothing to do with the system under test and what it is used for, so no type of system is immune, only types of testers are. Even real systems with a single real world client are susceptible.

The CO methodology problem amounts to dropping or ignoring bad results from your data set before computing summary statistics on them, and reporting very wrong stats as a result. The stats can often be orders of magnitude off, e.g. 35,000x off for the 99.99%'ile as I show in the example above, or the 99.9%'ile being reported as better than the real 99%'ile, etc. CO happens for all testers that avoid sending requests when any form of back-pressure occurs (usually in the form of some previous request not completing before a new one was supposed to be sent according to the testing model).

A simple way to demonstrate the CO problem on a "real system with a single client" would be this hypothetical:

Imagine that you have a high end concierge business with a single client, and that single client typically calls you on the phone about 10 times a day to perform some transaction (e.g. trade a stock, or check his account balances, buy a shirt, check the weather). In order to keep your customer happy, and avoid losing them to the internet businesses you compete with, you decide that you want to provide them with good customer service, which to you amounts to an actual person answering the phone within 3 rings, 99% of the time, as long as they call any time between 9AM and 5PM pacific time.

You decide to regularly measure your business performance to establish whether or not your behavior meets your goals (of 99% response within 3 rings), and to help you decide whether you need to hire additional people to answer the phone, or maybe replace someone if they are lazy.

So you build a test system. The test system is simple: during a day that your customer is away on vacation and won't be calling you, you ring the business once every minute during the entire 9AM to 5PM period, and check how many rings it took before someone answered each time. You then compute the 99%'ile of that set of samples, and if that 99%'ile is 3 rings or better, you are performing within expectations. If it's not, you know that you need to improve the way your business works somehow (replace someone, or hire additional people to cover for each other).

You do the test, and it shows that your business really does answer the phone within 3 rings more than 99% of the time. In fact, most of the time the phone was answered in 1 or 2 rings, and of all the times your test system called, it took more than 3 rings only once. You feel happy. You tell your wife things are going great. You give your employees bonuses for over-performing.

The next day your client fires you. He tried to call during the lunch hour, and nobody was there. In fact, this has been happening for a week now, and he just can't believe your outright dishonesty and false advertisement of your services.

What happened in the above scenario is simple: your testing methodology experienced Coordinated Omission. You dialed the business once a minute for the entire day, and in 420 out of the 421 dialing tests you made, the phone was promptly answered within 3 rings or less. That's 99.76%! That's great. What you missed is that your single switch operator, the one that started last week and didn't get proper training, thought that she gets 1 hour off for lunch every day, and at 12 noon she left her desk and went to have lunch with her friends across the street. Being a conscientious worker, she was back at her desk promptly at 1PM, answering the phone that had been ringing for a while.

When your test system encountered this, it recorded a single, 1800-ring phone call attempt at noon, followed by a 2-ring call at 1PM. Because it was busy waiting for the phone to be answered between 12 and 1, the test system missed 59 opportunities to call during lunch. Had it made those calls, it would have found that they all took longer than 3 rings to answer, that your better-than-3-rings call-answering percentage is only 87.5%, and that your 99%'ile answering time is actually 1,650 rings, and not 3.

And had you known that, you probably would have added capacity to your business so that when employees go out to lunch (or take bathroom breaks, or pause to take out the garbage), there is someone there to cover for them and answer the phone.
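For anyone who wants to check the numbers in the story, a sketch of the arithmetic (assuming ~30 rings per minute, which is what the 1800-ring noon call implies):

    import java.util.Arrays;

    // Sketch of the arithmetic behind the concierge story: one call per minute,
    // 9AM-5PM, a 60-minute lunch gap, and an assumed 30 rings per minute.
    public final class ConciergeMath {
        public static void main(String[] args) {
            int ringsPerMinute = 30;          // assumption implied by the 1800-ring call
            int goodAttempts = 420;           // answered in 1-3 rings
            int missedOpportunities = 59;     // calls not made between 12:01 and 12:59

            // All attempts, in rings: the good ones, the recorded noon call (60 min
            // of ringing), and the omitted lunch-hour calls (59, 58, ..., 1 minutes).
            int total = goodAttempts + 1 + missedOpportunities;   // 480
            long[] rings = new long[total];
            Arrays.fill(rings, 0, goodAttempts, 2);               // ~2 rings each
            rings[goodAttempts] = 60L * ringsPerMinute;           // the 1800-ring call
            for (int i = 1; i <= missedOpportunities; i++) {
                rings[goodAttempts + i] = (60L - i) * ringsPerMinute;
            }
            Arrays.sort(rings);

            long within3 = Arrays.stream(rings).filter(r -> r <= 3).count();
            System.out.printf("answered within 3 rings: %.1f%%%n", 100.0 * within3 / total);
            // 99%'ile: ~55-56 minutes of ringing depending on the rounding convention,
            // i.e. the ~1,650 rings quoted above rather than 3.
            long p99 = rings[(int) Math.ceil(0.99 * total) - 1];
            System.out.printf("99%%'ile: %d rings%n", p99);
        }
    }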

Chris Vest

unread,
Aug 4, 2013, 11:59:34 AM8/4/13
to mechanica...@googlegroups.com, mechanica...@googlegroups.com
What about benchmarks that don't try to make the calls at any particular interval, but instead place the calls back to back in a loop? Is it just plain wrong to try to extract data about latency percentiles from such tests?

Chris


Gil Tene

unread,
Aug 4, 2013, 12:00:50 PM8/4/13
to mechanica...@googlegroups.com
I find that in almost all high-speed Java-based applications that allocate any amount of memory on a regular basis (i.e. not the ones Peter writes ;-)), the 99.99%'ile is completely dominated by GC pauses. It's the misreporting of percentiles due to CO that makes people think otherwise.

For example, a 30 msec newgen pause occurring only once every 5 minutes will by definition completely dominate the 99.99%'ile of a system that shows typical 65 usec response times. Similarly, a 2-second oldgen pause occurring only once every 4 hours will by definition completely dominate the 99.99%'ile of a web application that typically responds in 200 msec or less.
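The "by definition" part is just arithmetic: with requests spread uniformly in time, at least pauseDuration / timeBetweenPauses of them land during (or queue up behind) the pause. A quick sketch:

    // Why a rare pause dominates a high percentile: the fraction of requests that
    // see the pause is at least pauseDuration / timeBetweenPauses.
    public final class PauseDominance {
        static void show(String label, double pauseMs, double pauseEveryMs, double typicalMs) {
            double affectedFraction = pauseMs / pauseEveryMs;
            System.out.printf(
                "%s: ~%.4f%% of requests see >> %.3f ms, so every percentile above %.4f%% is pause-dominated%n",
                label, affectedFraction * 100, typicalMs, (1 - affectedFraction) * 100);
        }

        public static void main(String[] args) {
            // 30 msec newgen pause every 5 minutes vs 65 usec typical response:
            show("newgen", 30, 5 * 60_000, 0.065);
            // 2-second oldgen pause every 4 hours vs <= 200 msec typical response:
            show("oldgen", 2_000, 4 * 3_600_000, 200);
        }
    }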

There is obviously a very selfish reason for me to care about CO. When a test system incorrectly reports the 99%'ile as 1,000x better than it really is, it often stands in the way of demonstrating how good Zing really is at affecting 99%, 99.9%, 99.99% results... 

Rüdiger Möller

unread,
Aug 4, 2013, 1:31:16 PM8/4/13
to mechanica...@googlegroups.com
I understood your point and admit all tests I have ever written are guilty of CO.

However, I'd like to point out the following (I'll stay with your picture):

If I test my service by calling every 5 minutes, and the operator goes to the toilet for 4 minutes, the skew of CO is not there (as with manual order submission). A single customer can't hit it twice.
If I test my service by calling every 1 minute, and the operator goes to the toilet for 4 minutes, one failure (letting the phone ring until the operator picks up) has to count as [wait time / 1 minute] failures (= 2.5 failures on average). (That's the quote engine.)

Or is this not correct (and why)?


> There is obviously a very selfish reason for me to care about CO. When a test system incorrectly reports the 99%'ile as 1,000x better than it really is, it often stands in the way of demonstrating how good Zing really is at affecting 99%, 99.9%, 99.99% results... 

I'd never have imagined that ;-) ...

I am not sure the percentile view is always the best way to do reporting. If, e.g., a Java application sleeps for 280ms and the application has to listen for high-volume, unreliable UDP market data (because some clever dude figured out that with good network hardware there are no packet losses -.- ), reporting "we miss xx thousands of datagrams a day" is likely to excite decision makers a lot more than arguing with 99.99% figures (because, well, 99.99% is actually 100%, isn't it :-) ).

Gil Tene

unread,
Aug 4, 2013, 1:58:20 PM8/4/13
to mechanica...@googlegroups.com
The CO problem is the same. It's the omission of attempts that is the problem, and a back-to-back test or a random-interval test would both experience the same omission problem as a fixed-interval test. In fact, a back-to-back, pedal-to-the-metal test system is the most likely to suffer from CO, as its "typical" interval is very short, and it will therefore miss many more sampling opportunities than a system with some amount of "think time".

Correcting for the problem is a different matter. The best way to correct the problem is to avoid it altogether (use a tester that sends stuff when it wants to, and never responds to any form of back pressure, like Twitter's Iago). But since changing testers is often impractical, and any system that uses TCP for a transport has built-in back pressure, correcting for the problem is the next best thing. Test systems that have a known expected interval between requests (within a single client thread) are the ones that are easiest to correct. E.g. when you know exactly what your test scenario was supposed to be doing, you can correct for it not doing so (e.g. tell HdrHistogram what the expected interval between samples is, and it will re-create the missing ones through interpolation when your recorded time is greater than the interval). A back-to-back pedal-to-the-metal test system still has an expected interval, but it's one you'll have to compute (a good estimate is the ongoing average response time; a better estimate is the ongoing average with some type of outlier filtering).
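A sketch of that "ongoing average with outlier filtering" idea for a back-to-back tester (the 10x filtering threshold and the names here are my own choices, not a prescription):

    import org.HdrHistogram.Histogram;

    // Sketch of CO correction for a back-to-back ("pedal-to-the-metal") tester:
    // estimate the expected interval as a running average of the non-outlier
    // latencies, and feed that estimate to the correcting record call.
    public final class BackToBackRecorder {
        private final Histogram histogram = new Histogram(3_600_000_000_000L, 3); // up to ~1h in ns
        private double avgLatencyNanos = 0;    // running average of "normal" latencies
        private long   normalCount = 0;

        public void record(long latencyNanos) {
            // Outlier filter: only let values near the current average update it.
            if (normalCount == 0 || latencyNanos < 10 * avgLatencyNanos) {
                normalCount++;
                avgLatencyNanos += (latencyNanos - avgLatencyNanos) / normalCount;
            }
            long expectedIntervalNanos = Math.max(1, (long) avgLatencyNanos);
            histogram.recordValueWithExpectedInterval(latencyNanos, expectedIntervalNanos);
        }

        public Histogram histogram() {
            return histogram;
        }
    }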

There are many ways to try to compensate for CO, and there is no one "right" way. E.g. at Azul, I started a side project this summer (with Mike Chmiel, an intern from U of I) to build an auto-corrector for JMeter and for generic tester log files (the kind that log all the results, not just the summary). Rather than just correcting the stats (as HdrHistogram does internally if asked to), the focus there is on re-creating missing results in log files and/or in JMeter listener streams, such that existing reporting tools (like the various graphing plugins in JMeter, or Excel sheets and other tools that read log files) will simply have "more correct" data to work on. After all, it's not the math in the reporting tools that is wrong, it's the missing data that screws up all the numbers.

When working on correction techniques, I currently try to stay conservative. I.e. correcting "less" is better than correcting "more", and I focus on building correctors that can reliably tell you that things are "at least this bad", rather than ones that say "it could be as bad as this". It's arguable that the more aggressive approach is warranted, as people who actually read percentile data expect them to mean "that percentile is at least this good", but I find that over-correcting leads to questions about the credibility of the correction technique, so I'm staying on the conservative side of that for now.


Gil Tene

unread,
Aug 4, 2013, 2:14:11 PM8/4/13
to
You don't need a customer to be able to "hit twice". They just need to experience stats that are something other than what you claim. The test technique should be used to establish that the stats are what you say they are (or need them to be). Similarly, in a multi-customer business, you don't need the same customer to even interact with you more than once to experience the issue. I.e. if 10% of your single-visit customers see really bad behavior (i.e. nobody answering the phone), your business is in trouble even if the other 90% see perfect behavior.

Think about it this way: when you tell your single customer that you'll answer the phone in 3 rings 99% of the time, they expect that to mean that if they randomly choose to call you (without warning you or making sure that you are really there first), the likelihood of your answering in 3 rings or less is 99% or better. If they called you 100 times over the course of 10 days, and on 5 of those attempts no one answered for 4 minutes (because they were off taking care of their digestive tract needs), they would rightly call you on it, and say that your 99% claims are completely bogus. You could claim that they just got unlucky, and that with a large enough sample your numbers are true, but you'd be lying if you did that based on test system results that had CO built in.

So if the operator goes to the toilet for 4 minutes less than once every 400 minutes, you are probably rightly operating within your claims, and deserve a ringing endorsement (let's leave the need for a max time requirement in SLAs out of this discussion). However, if your operator goes to the toilet more frequently than that (i.e. twice a day or more), then your 99%'ile claims are false even if no lunch break was taken.

Rüdiger Möller

unread,
Aug 4, 2013, 2:56:35 PM8/4/13
to
> Why would a quote machine be hit ?

Manual order entry cannot send faster than one order each ~1-2 seconds. A quote machine often sends several quotes/second.

So if I test for a manual order-entry client with a single-threaded program synchronously sending an order every 5 seconds, I don't have to adjust results (if the max outlier is <5s).
If I test for a quote-machine client with a single-threaded program synchronously sending 10 quotes/second, I have to add 'missing test cases' because the blocked test program is not sending quotes while waiting. Otherwise the data would report 1 latency incident instead of something like 20.

So a (stupid) way to escape CO is to lower the test request frequency until the interval is greater than the max latency outlier. One then has to multithread a lot in order to get load on the system... it's not practical. A better approach is CO compensation.


Gil Tene

unread,
Aug 4, 2013, 3:51:00 PM8/4/13
to <mechanical-sympathy@googlegroups.com>
That's exactly right.

Whether or not your test "missed" samples depends only on the rate at which the tester "intended" to send requests, and not at all on the rate that the system under test can handle.

So in your example below, the same exact quote system, behaving in exactly the same way (e.g. pausing for 1/2 a second every 5 minutes), could end up with a "right" test-results data set if tested [for a long enough period of time] under a "manual quote system" scenario with a 5-second interval between all attempts, while the same system would end up with a test-results set that under-reports its response time failures by a factor of 5x if tested under a "machine client" scenario with 10 quotes per second.

As long as the tester didn't skip any requests it would have normally made under the intended test plan, your test data remains a valid, uncoordinated sample of the tested system's response times, and you just need enough of it to make it likely that you've sampled enough data. E.g. it's obviously useless to deduce a 99.9%'ile from only 500 test samples, and risky to trust it based on only 2,000 samples, but you can probably safely project the 99.9%'ile from 10,000+ samples. Random "skips" in sampling are fine too, as long as they remain uncoordinated. So the client pausing for its own uncoordinated-with-the-tested-system reasons, for example, does not make it miss anything from a statistical perspective. It's only when the tester skips tests as a result of the tested system's long response times that coordination is introduced into the test, and that's when the results end up being completely skewed, since it's the equivalent of somebody taking a big thick black pen and selectively eliminating X results after each first occurrence of a bad result. The likelihood that the eliminated results are non-random and skewed towards "bad" is very high in such cases, becoming a near-certainty if the cause of the long response time was a systemic but momentary glitch in the tested system's responsiveness.

-- Gil.



Gil Tene

unread,
Aug 4, 2013, 7:36:17 PM8/4/13
to mechanica...@googlegroups.com
Martin, you are certainly right that there are other issues with testers that back off, CO being only one of them. But CO is certainly a very big one. I find that CO has such a huge impact on the wrongness of reported 99%+ numbers that it often results in bad decisions when it comes to capacity planning, software selection, and business expectations.

Of the non-CO effects you mention below, the queueing-pressure-avoidance one is probably the most significant, and I see that as failed load testing (failure to load the system up). I've seen tests (like the YCSB benchmark, for example) that attempt to compensate for delayed requests by playing "catch up" and always sending requests until the total issued requests matches the expected sustained rate. It's hard to tell if that bursting effect has any bearing on behavior in reality though, and while this trick does create large queuing pressure right after a large delay, it actually makes the coordinated omission effect twice as bad (at least in the way it is used in YCSB). Instead of just dropping bad results from the data set, it actually replaces them with good results measured at "a more convenient time" from the tested system's perspective. You can certainly design a test system that does this "catch up" thing to correctly report the bad response times, btw. But doing so usually involves changing the actual tester (e.g. we can probably easily change the actual YCSB benchmark to correctly log response times). But it's hard to fix this sort of thing in post-processing of recorded data.

The sad reality is that once a test system misses an opportunity to actually send a request when it should have, we simply don't know what would have happened if it had. All correction techniques are estimates from that point. The various CO correction techniques I play with work under certain assumptions, and I'm hoping that our JMeter auto-correcting tool thing will help some, but as it stands it will probably only work for non-random test pattern scenarios. Unfortunately, random behavior is one of the things you can program load testers to do.

Peter Lawrey

unread,
Aug 5, 2013, 3:13:14 AM8/5/13
to mechanica...@googlegroups.com
My rule of thumb is that if you want a reproducible N percentile, you want (1-N)^-2 samples, i.e. for:

99%       10,000
99.9%    one million
99.99%  100 million
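The rule of thumb as a tiny helper (a sketch; the formula is just (1-N)^-2):

    // Rule-of-thumb sample counts for a reproducible percentile: (1 - N)^-2.
    public final class SamplesForPercentile {
        static long samplesNeeded(double percentile) {        // e.g. 99.99
            double tail = 1.0 - percentile / 100.0;
            return Math.round(Math.pow(tail, -2));
        }

        public static void main(String[] args) {
            for (double p : new double[] {99.0, 99.9, 99.99}) {
                System.out.printf("%.2f%%'ile -> ~%,d samples%n", p, samplesNeeded(p));
            }
        }
    }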



Martin Thompson

unread,
Aug 5, 2013, 5:36:10 AM8/5/13
to mechanica...@googlegroups.com

On Sunday, August 4, 2013 9:18:26 AM UTC+1, Kirk Pepperdine wrote:

> Yes, you have an overhead error on the front end and an error on the back end of any timed interval. Since the typical use case is getTime(); doSomething(); getTime(), this is the equivalent of the cost of a single call to the timer. The error comes from backing out the cost of getting the timer value in the first call plus getting the timer value in the second call. Do you have any idea of the effect of the distance between the two timing events in this use case?

You have at least 2 classes of problem here.  

Firstly, when dealing with estimating the cost of getting the time: at the lowest level you are going to call RDTSC and add that to the real-time clock value.  Intel say that the first 3 times you call RDTSC are a fair bit slower, and this gets hidden in the tight-loop test.  Also, in the tight-loop test your real-time clock value is in local cache, which is unlikely when doing a real timing test.  You then need to have the start time in cache to do the duration calculation, which it may or may not be, depending on scheduling and what doSomething() executes.  If measuring something "large" then this does not really matter.

Secondly, when measuring something "small", i.e. things much less than a microsecond, a straight call to RDTSC is problematic.  RDTSC is not a serialising instruction and thus can be moved anywhere in the instruction stream.  To address this you need an ordering instruction paired with it, such as LFENCE;RDTSC or CPUID;RDTSC. This is OK for measuring the start time but not the stop time.  What you really need is the start time captured with CPUID;RDTSC and the end time captured with RDTSCP;CPUID, because no preceding instruction can be moved to after RDTSCP, but following instructions can be moved before it.  With System.nanoTime() there is no hint as to whether it is a start time or a stop time.

On a related issue, if you are running on Sandy Bridge or later with Turbo Boost 2.0 enabled then after about 20-25 seconds the CPU frequency will drop due to turbo boost scaling down.  This can result in the first part of the test showing better latency than the end of the test. You need to ensure you take control of C-states and P-states to get remotely reliable results even after applying thread affinity and CPU isolation.  Best to lock all CPU cores to a single frequency for the duration of a test.

Michael Barker

unread,
Aug 6, 2013, 6:11:46 AM8/6/13
to mechanica...@googlegroups.com
Hi,

Caveat: these are only half formed thoughts on the topic.

I've been re-visiting the Disruptor latency tests and looking into the
effects of Coordinated Omission and how it is impacting our numbers.
I'd like to confirm a couple of observations. I've looked at the
approach that the HdrHistogram uses to backfill the omitted values. I
think that algorithm makes a couple of assumptions:

- Desired input rate is constant.
- The load generation/latency measurement synchronously waits for
responses before issuing the next request.

For the Disruptor tests, we timestamp just before publishing in the producer thread and then take the end timestamp in the consumer thread, so a stall in the consumer thread may or may not have an impact on the producer thread. If we are producing at a rate of 1M events/sec and the consumer stalls for a short time (e.g. 30µs), the impact on the producer is absorbed by the Disruptor's buffer, so the producer can continue unimpeded (to a point) and the cost of the consumer stall is included in the latency measurements - no compensation for the stall is required. Obviously, if the pause is sufficiently long, the buffer will fill and the producer will be affected. So we can't tell from the latency measurement alone how much Coordinated Omission is present. I'm currently messing around with tracking the delta between start timestamps (in an HdrHistogram) as a way of trying to understand the performance tests' behaviour.
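
A rough sketch of what that delta tracking could look like on the producer side (hypothetical, not the actual Disruptor test code; it assumes HdrHistogram's org.HdrHistogram.Histogram):

import org.HdrHistogram.Histogram;

// Hypothetical sketch: record the gap between successive publish timestamps.
// Unusually large gaps show where the producer itself stalled, which the
// end-to-end latency numbers alone can't reveal.
public final class PublishIntervalTracker {
    private final Histogram intervalHistogram = new Histogram(10_000_000_000L, 3);
    private long lastPublishNanos = -1;

    public long onPublish() {
        final long now = System.nanoTime();
        if (lastPublishNanos != -1) {
            intervalHistogram.recordValue(now - lastPublishNanos);
        }
        lastPublishNanos = now;
        return now; // also used as the start timestamp carried to the consumer
    }

    public Histogram intervals() {
        return intervalHistogram;
    }
}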

The other area I'm looking at is our own system latency tests. In order to better simulate real-world behaviour we don't use a constant input rate; instead we attempt to simulate burst conditions. I'm thinking of using a similar measurement to the above (delta between start timestamps) to model the input rate as a distribution and compare that to the intended input distribution.

So, I'm not sure whether it will be possible in all cases to fix up latency results to compensate for Coordinated Omission, but it should be possible to track it as a form of measurement error. Anyway, I'm still experimenting, so I'll see how it goes. I'm interested in hearing others' experience in this area.

Mike.

Gil Tene

unread,
Aug 6, 2013, 10:01:05 AM8/6/13
to <mechanical-sympathy@googlegroups.com>
Mike, I think that you are correct in that the simple correction mode offered by HdrHistogram will not work for your scenarios below.

Correcting for coordinated omission can be challenging. When your test scenario is simple enough, it can be easily corrected, and HdrHistogram can optionally help with some of those scenarios. E.g. whenever you have a fixed expected interval between requests, or a fixed delay ("think time") between requests, and your load tester uses synchronous client threads (each thread waits for a response before issuing a new request), an expectedIntervalBetweenValueSamples can be computed and passed as an extra parameter to HdrHistogram's recordValue call. This sort of correction can be applied either at test time or as a post-processing step on tester log files.
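
For that simple synchronous case the correction is a single extra argument at recording time. A minimal sketch, assuming a fixed 1 msec pacing purely for illustration (later HdrHistogram versions expose the same thing as recordValueWithExpectedInterval):

import org.HdrHistogram.Histogram;
import java.util.concurrent.TimeUnit;

// Sketch of the simple synchronous-tester case: a fixed expected interval between
// requests lets the histogram back-fill the samples a stall would otherwise omit.
// The 1 msec interval is an assumption for illustration only.
public final class CorrectedRecorder {
    private static final long EXPECTED_INTERVAL_NANOS = TimeUnit.MILLISECONDS.toNanos(1);
    private final Histogram histogram = new Histogram(TimeUnit.SECONDS.toNanos(10), 3);

    public void recordRequest(Runnable request) {
        final long start = System.nanoTime();
        request.run();
        final long latency = System.nanoTime() - start;
        // The extra parameter tells the histogram what the tester intended to do,
        // so a 500 msec stall is recorded as ~500 missing samples rather than one.
        histogram.recordValue(latency, EXPECTED_INTERVAL_NANOS);
    }
}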

When the test scenarios are more complex than that, e.g. when there are patterns of different requests interleaved in the tester's work, correcting becomes more challenging, but linear and predictable behavior can still be corrected in post-processing in such cases. When random behavior is intentionally introduced in the tester, or when the tester does other reactive work (like compensate for not sending things when it should have by sending them later), it gets really hard to deduce and correct things in post-processing, and the tester itself needs to be changed in order to compensate.

I do believe that it should always be possible to correct for the backpressure-coordination cause in the tester itself, in order to avoid the coordinated omission of results. But in many cases it becomes a matter of the tester knowing what it should/would have done, and using that knowledge in deciding what to record. This approach (as opposed to post-processing) can work even when the tester picks random or varying interval scenarios, since the tester can and should always know what it had "intended" to do, and can record latencies accordingly. The trick is to have the tester act on some internally tracked "plan" and then compute response times based on that plan, rather than on what actually happened when back-pressure affected the behavior.

As an example, there is a style of load-generating that I call "constant throughput" (as opposed to "attempted-constant-interval" or "attempted-constant-think-time"). In this style, the tester always sends enough requests to match the number that a sustained rate would have required "up to now". In such testers, a request burst of some sort (dampened or not) will typically follow any significant hiccup or pause. The issue with this type of load tester is that if recorded times are treated naively, requests are effectively shifted from the "bad" times to the "good" times whenever back-pressure stalls the tester temporarily and delays some requests (and a burst of some sort obviously follows such a pause). The current YCSB benchmark uses this model, which results in a worse-than-coordinated-omission skew in its statistics. However, the same tester can easily be changed to avoid this problem by computing when it had "intended" to send each message, and recording the response time for each request not as the delta between the actual request issue time and the actual response time, but as the delta between the intended request sending time and the actual response time. The intended request time can be trivially computed on the fly for such constant-throughput style testers.
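
As a sketch of that style of fix (hypothetical names, not the actual YCSB code): each request's intended send time is derived from the target rate, and latency is measured from that intended time rather than from whenever the tester actually managed to send.

import org.HdrHistogram.Histogram;
import java.util.concurrent.TimeUnit;

// Hypothetical constant-throughput tester loop. Latency is charged from the
// *intended* send time (testStart + i * interval), so requests delayed by
// back-pressure still pay for the time they spent waiting to be sent.
public final class ConstantThroughputTester {
    public static Histogram runTest(Runnable request, int requestsPerSecond, long totalRequests) {
        final Histogram histogram = new Histogram(TimeUnit.MINUTES.toNanos(1), 3);
        final long intervalNanos = TimeUnit.SECONDS.toNanos(1) / requestsPerSecond;
        final long testStart = System.nanoTime();

        for (long i = 0; i < totalRequests; i++) {
            final long intendedSendTime = testStart + i * intervalNanos;
            while (System.nanoTime() < intendedSendTime) {
                // busy spin until the planned send time; if we are behind schedule this
                // falls through immediately (a catch-up burst), but latency is still
                // measured from the plan, not from the actual (late) send.
            }
            request.run();
            histogram.recordValue(System.nanoTime() - intendedSendTime);
        }
        return histogram;
    }
}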

The scenario you describe below is harder to model and correct, as your tester threads are not perfectly synchronous but they still experience back-pressure coordination. The fact that you measure your latencies at a point other than the source will also make it harder to model and communicate what "intended request sending time" was. If you could carry the intended request sending time in the payload (instead of the actual sending time), that would probably be easiest, but that is often not an option for testers that need to emulate "real traffic".

Another issue you will run into is that not all shifts from the intended plan can or should be assumed to be caused by the system-under-test. The tester itself is often not perfect, and if the tester experiences its own "glitch" that is uncoordinated with the tested system, that just amounts to [uncoordinated] omission, which does not skew stats and should not be compensated for. Telling the difference can be VERY hard, so you obviously want your tester to be as close to perfect as possible at executing to plan, e.g. by avoiding/minimizing all pauses in the tester through fine tuning, or by using your favorite non-pausing JVM.

One last thing to keep in mind is that partial correction is better than no correction at all. E.g. if you believe that buffering throughout your system can absorb 200usec of traffic before system stalls induce back pressure, you can just choose to avoid correcting for anything smaller than 200usec even though your message rate is one per usec. You'd remain "potentially blind" to coordinated omission at those levels (if it existed), but would still "mostly" correct the stats when larger glitches (like a rare 8 msec scheduling delay, or some GC pause) stall the tester.
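
A sketch of that kind of partial correction, using the numbers from this example (the 200 usec absorption budget and the one-per-usec message rate are both assumptions):

import org.HdrHistogram.Histogram;

// Partial correction: only back-fill when a recorded latency exceeds the buffering
// the system is believed to absorb. Below that threshold we accept potential
// blindness to coordinated omission.
public final class PartialCorrection {
    private static final long EXPECTED_INTERVAL_NANOS = 1_000;   // one message per usec
    private static final long ABSORPTION_BUDGET_NANOS = 200_000; // assumed 200 usec of buffering

    public static void record(Histogram histogram, long latencyNanos) {
        if (latencyNanos > ABSORPTION_BUDGET_NANOS) {
            // Large stall: assume the tester was held back, and back-fill the omitted samples.
            histogram.recordValue(latencyNanos, EXPECTED_INTERVAL_NANOS);
        } else {
            histogram.recordValue(latencyNanos);
        }
    }
}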

Whether you want to be conservative or aggressive in these corrections depends on whether you want your reported percentiles to carry an "at least as bad as" or an "at least as good as" meaning. I use the "at least as bad as" approach (i.e. be conservative and don't over-correct) when I try to demonstrate that the problem exists, but if you are basing your business decisions or placing bets on the reported percentiles, there is a good argument for going with the "at least as good as" approach and overcompensating when in doubt.


Attila-Mihaly Balazs

unread,
Aug 6, 2013, 12:25:16 PM8/6/13
to mechanica...@googlegroups.com
Hello all,

First of all, I would like to clarify that those measurements are not done by me but rather by some Log4j committers. I just found the page (by somebody linking to it) and ran the tests myself because I wanted to see if there is room for improvement (TL;DR - this rate of logging is reaching the limit of the memory bandwidth, however latencies could be made more consistent - please reply in the other thread if you want to discuss this).

Getting back to CO:

- you can see the actual test class here: https://svn.apache.org/repos/asf/logging/log4j/log4j2/trunk/core/src/test/java/org/apache/logging/log4j/core/async/perftest/RunLog4j2.java - there are two methods (runThroughputTest and runLatencyTest) so I assume that the throughput graphs and latency graphs are independent (were collected in separate runs)

- for the latency case it is indeed measuring the cost of calling logger.log and also adjusting for the cost of calling nanoTime. Waiting after each logging statement is accomplished using busy-wait.

Now getting back to your description of CO: do I understand correctly that the basic problem can be described as "percentiles are not guarantees for maximum values"? And isn't the solution as simple as just including the maximum value in the discussion? I.e. we say "if 99.99% of the time we respond in 100 msec and we always respond in < 1 sec, we make money"? I also think this is in line with Martin's methodology of talking with clients :-)

Attila

Attila-Mihaly Balazs

unread,
Aug 6, 2013, 1:09:07 PM8/6/13
to mechanica...@googlegroups.com
A potential problem I see with the benchmark code:

final long s1 = System.nanoTime();
logger.info(LATENCY_MSG);
final long s2 = System.nanoTime();
final long value = s2 - s1 - nanoTimeCost; 
if (value > 0) { 
   histogram.addObservation(value); 
}

It's omitting "too good to be true" measurements, which is understandable since we can't have < 0 time (I suspect this happens because nanoTimeCost is actually an average, and sometimes a call to nanoTime can take less than that). Probably this should be redone so that we measure the average time for X logger.info calls without correcting for nanoTimeCost (i.e. we do s1; for (0..X) logger.info; value = nanoTime() - s1), and we choose X such that nanoTimeCost / X is arbitrarily close to 0 (we set a limit we're comfortable with).
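
For illustration, the batched version could look something like this sketch (it reuses the names from the snippet above, and records the average latency per call in each batch of X, so per-call resolution is lost but the nanoTime cost becomes negligible):

// Hypothetical batched variant: time X calls per pair of nanoTime() reads so the
// timer cost per logged call becomes negligible and no per-sample correction is needed.
final int X = 1000;
for (int i = 0; i < samples / X; i++) {
    final long s1 = System.nanoTime();
    for (int j = 0; j < X; j++) {
        logger.info(LATENCY_MSG);
    }
    final long elapsed = System.nanoTime() - s1;
    histogram.addObservation(elapsed / X); // average cost per call in this batch
}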

And there is of course the issue with using nanoTime which can have all kinds of weird behavior (like going backwards, not advancing, etc) over a short period of time as mentioned by somebody (Martin?) on this thread.

Martin Thompson

unread,
Aug 6, 2013, 1:15:52 PM8/6/13
to mechanica...@googlegroups.com
System.nanoTime() is monotonic and should not go backwards.  There were some bugs with processors some time ago, but this should no longer be an issue.  System.currentTimeMillis() reads the wall clock, which could be adjusted backwards depending on what adjustment mechanism is used, e.g. ntpdate.

RDTSC is not guaranteed to be synchronised across sockets but an OS should hide this.

Gil Tene

unread,
Aug 6, 2013, 1:18:30 PM8/6/13
to mechanica...@googlegroups.com

On Tuesday, August 6, 2013 9:25:16 AM UTC-7, Attila-Mihaly Balazs wrote:
First of all, I would like to clarify that those measurements are not done by me but rather by some Log4j committers. I just found the page (by somebody linking to it) and ran the tests myself because I wanted to see if there is room for improvement (TL;DR - this rate of logging is reaching the limit of the memory bandwidth, however latencies could be made more consistent - please reply in the other thread if you want to discuss this).

Sorry for assuming it was your test, and thanks for posting. Let me re-iterate that I think this log4j2 stuff is super-cool and a great step forward for logging in Java. I'm sure that if the 99%'ile, 99.9%'ile and 99.99%'ile were reported correctly, it would still run circles around older loggers.
 

Getting back to CO:

- you can see the actual test class here: https://svn.apache.org/repos/asf/logging/log4j/log4j2/trunk/core/src/test/java/org/apache/logging/log4j/core/async/perftest/RunLog4j2.java - there are two methods (runThroughputTest and runLatencyTest) so I assume that the throughput graphs and latency graphs are independent (were collected in separate runs)

This one is very easy to fix. A runLatencyTest() that will correctly report percentiles when faced with occasional latencies larger than (10,000 usec * threadCount) would look something like this:

     @Override
    public void runLatencyTest(final int samples, final Histogram histogram,
                               final long nanoTimeCost, final int threadCount) {
        final Logger logger = LogManager.getLogger(getClass());
        final long PAUSE_NANOS = 10000 * threadCount;
        for (int i = 0; i < samples; i++) {
            final long s1 = System.nanoTime();
            logger.info(LATENCY_MSG);
            final long s2 = System.nanoTime();
            final long value = s2 - s1 - nanoTimeCost;
            if (value > 0) {
                histogram.addObservation(value);
            }
            // Compensate for missing samples:
            long remainingValue = value - PAUSE_NANOS; 
            while (value > PAUSE_NANOS) {
                histogram.addObservation(remainingValue);
                remainingValue -= PAUSE_NANOS;         
            }
            // busy-wait for PAUSE_NANOS (10 microseconds * threadCount)
            final long pauseStart = System.nanoTime();
            while (PAUSE_NANOS > (System.nanoTime() - pauseStart)) {
                // busy spin
            }
        }
    }
 

- for the latency case it is indeed measuring the cost of calling logger.log and also adjusting for the cost of calling nanoTime. Waiting after each logging statement is accomplished using busy-wait.

Now getting back to your description of CO: do I understand correctly that the basic problem can be described as "percentiles are not guarantees for maximum values"? And isn't the solution as simple as just including the maximum value in the discussion? I.e. we say "if 99.99% of the time we respond in 100 msec and we always respond in < 1 sec, we make money"? I also think this is in line with Martin's methodology of talking with clients :-)

No. The problem is that the values for the percentiles themselves are reported incorrectly. The problem in this specific report is best described as: "this test reported the 99.99%'ile as 16usec, even though the 99.99%'ile during the test period was clearly at least 526msec." A report generated with the corrected code above would not make that mistake.

Attila, it would be really cool if you could re-run the test with and without the code correction I suggest above and post the results here for comparison.

-- Gil.

Ishaaq Chandy

unread,
Aug 6, 2013, 4:55:47 PM8/6/13
to mechanica...@googlegroups.com
Gil,
In your amended code, when you wrote this:
while (value > PAUSE_NANOS)

did you mean:
while (remainingValue > 0)

Ishaaq


Gil Tene

unread,
Aug 6, 2013, 5:01:48 PM8/6/13