Google Groups

Coordinated Omission


Gil Tene 03.08.2013 13:10
Veröffentlicht in der Gruppe: mechanical-sympathy
I've been harping for a while now about a common measurement technique problem I call "Coordinated Omission" for a while, which can often render percentile data useless. You can find examples of me talking about this, with some detailed explanation of the problem in my "How Not to Measure Latency" talk (The Coordinated Omission part starts around at around 33:50).

I believe that this problem occurs extremely frequently in test results, but it's usually hard to deduce it's existence purely from the final data reported. But every once in a while, I see test results where the data provided is enough to demonstrate the huge percentile-misreporting effect of Coordinated Omission based purely on the summary report.

I ran into just such a case in Attila's cool posting about log4j2's truly amazing performance, so I decided to avoid polluting his thread with an elongated discussion of how to compute 99.9%'ile data, and started this topic here. That thread should really be about how cool log4j2 is, and I'm certain that it really is cool, even after you correct the measurements.

Attila's results are posted at http://logging.apache.org/log4j/2.x/manual/async.html#Performance, and while they demonstrate vastly superior throughput and latency behavior compared to other loggers (including log4j, obviously), I see an issue with the reported numbers for the 99.99% latencies, (and probably for the 99%). This gripe probably applies to how the LMAX disruptor numbers are reported for 99.99% as well, but there I don't have enough data within what's posted to prove it.

Basically, I think that the 99.99% observation computation is wrong, and demonstrably (using the data in the graph data posted) exhibits the classic "coordinated omission" measurement problem I've been preaching about. This test is not alone in exhibiting this, and there is nothing to be ashamed of when you find yourself making this mistake. I only figured it out after doing it myself many many times, and then I noticed that everyone else seems to also be doing it but most of them haven't yet figured it out. In fact, I run into this issue so often in percentile reporting and load testing that I'm starting to wonder if coordinated omission is there in 99.9% of  latency tests ;-)

Here are the basic assumptions, and the basic math:

Assumptions:

1. I am *assuming* that in the graphs titled "Async Logging Latency Histogram" and "Async Logging Max Latency of 99.99% of Observations", the latency being measured is the latency of the call to the logger. I.e. the time that the thread spent inside the call (as opposed to a time measured between the call and some asynchronous event recorded concurrently with the thread's execution). This assumption is key to the following math, so if it's wrong, the rest of the below is basically male cow dung.

2. The test description notes that "After each call to Logger.log, the test waits for 10 microseconds * threadCount before continuing", but doesn't mention how this wait is achieved (e.g. Thread.sleep() or spin). In any case, since 64 * 10 is 640usec, I'll round up and assume that a wait is for 1msec (analysis below will be even worse if the wait is shorter than that).

Analysis:

- The normal observation interval for this test is ~1 msec. I.e. each thread running the test would "normally" issue a new logging message once every ~1 msec. (the ~1msec wait plus the very tight ~1-5usec average Logger.log() call latency).

- As long as the  Logger.log() call latency is lower than ~1msec (which is the vast majority of the time), things are "fine". However, when Logger.log() call latencies much larger than ~1msec occur (e.g. the graph shows 10 recorded calls that took 536msec each), each of those calls also "freezes" the testing thread for a duration much longer that the normal interval between two observations.

- Omitted observations: Effectively, each such "freeze" omits observations that would have normally happened had it not been for the back-pressure created by the call latency. I.e. Had an observer continued to issue logging calls at the normal observation interval during this freeze, additional results would have been included in the overall stats computations. E.g. for each of those ~536msec freezes, there are ~535 missing observations.

- Coordinated Omission: Omission on it's own is not a big problem, as long as the omission is random. E.g. If we randomly threw away 3 million of those 5 million latency measurements each logging thread was doing, the statistical results would probably not be materially affected. However, when omission is coordinated with observed events, it can dramatically skew the statistical analysis of the remaining results. In the case of this sort of coordinated omission, we are only omitting the "very bad results" (results that are larger than 1msec. I.e. larger than 200x the average latency). This has a huge impact on the eventual percentile calculation.

- Based on the 10 observations of 526msec each alone (and ignoring the 50+ other huge results next to them), we can deduce that an actual observer would have seen 5,350 additional results [10 x (536 -1)] ranging linearly between 1 msec and 535 msec (with a median of ~268msec).

- If we add those 5,350 results to the 5,000,000 results that the thread recorded, they would represent >0.1% of the overall data, and half way into that additional data (at 0.05% of overall data) we would see the median result of the missing set, which is 268msec.
 
- Therefor, based on the 10 outlier  in the graph alone, your 99.95%'ile is at least 268msec.

- Similarly, the 99.99%'ile is at least 582 msec (the 90%'ile of the missing data that represents 0.01% of the total observations).

That's at least 35,000x higher than reported...

There is good news here, too. HdrHistogram has a recording mode that automatically corrects for coordinated omission if you know what your expected interval between measurements is (and here it is ~msec). If you add an HdrHistogram to such a test and record your call latencies using the recordValues(value, expectedIntervalBetweenValueSamples) form, the Histogram will automatically re-created the missing measurements. It is often useful to compare the output of two HdrHistogram's one of which uses the correcting form and one of which that just records the raw values, as the differences in output at the higher percentiles is often "impressive".

Foot note: anyone want to guess what that 582 msec 99.99%'ile is dominated by?