Coordinated Omission


Gil Tene

Aug 3, 2013, 4:10:58 PM
I've been harping for a while now about a common measurement technique problem I call "Coordinated Omission", which can often render percentile data useless. You can find examples of me talking about this, with some detailed explanation of the problem, in my "How Not to Measure Latency" talk (the Coordinated Omission part starts at around 33:50).

I believe that this problem occurs extremely frequently in test results, but it's usually hard to deduce its existence purely from the final data reported. But every once in a while, I see test results where the data provided is enough to demonstrate the huge percentile-misreporting effect of Coordinated Omission based purely on the summary report.

I ran into just such a case in Attila's cool posting about log4j2's truly amazing performance, so I decided to avoid polluting his thread with an elongated discussion of how to compute 99.9%'ile data, and started this topic here. That thread should really be about how cool log4j2 is, and I'm certain that it really is cool, even after you correct the measurements.

Attila's results are posted at http://logging.apache.org/log4j/2.x/manual/async.html#Performance, and while they demonstrate vastly superior throughput and latency behavior compared to other loggers (including log4j, obviously), I see an issue with the reported numbers for the 99.99% latencies (and probably for the 99%). This gripe probably applies to how the LMAX disruptor numbers are reported for 99.99% as well, but there I don't have enough data within what's posted to prove it.

Basically, I think that the 99.99% observation computation is wrong, and demonstrably (using the graph data posted) exhibits the classic "coordinated omission" measurement problem I've been preaching about. This test is not alone in exhibiting this, and there is nothing to be ashamed of when you find yourself making this mistake. I only figured it out after doing it myself many many times, and then I noticed that everyone else seems to also be doing it but most of them haven't yet figured it out. In fact, I run into this issue so often in percentile reporting and load testing that I'm starting to wonder if coordinated omission is there in 99.9% of latency tests ;-)

Here are the basic assumptions, and the basic math:

Assumptions:

1. I am *assuming* that in the graphs titled "Async Logging Latency Histogram" and "Async Logging Max Latency of 99.99% of Observations", the latency being measured is the latency of the call to the logger. I.e. the time that the thread spent inside the call (as opposed to a time measured between the call and some asynchronous event recorded concurrently with the thread's execution). This assumption is key to the following math, so if it's wrong, the rest of the below is basically male cow dung.

2. The test description notes that "After each call to Logger.log, the test waits for 10 microseconds * threadCount before continuing", but doesn't mention how this wait is achieved (e.g. Thread.sleep() or spin). In any case, since 64 * 10 is 640usec, I'll round up and assume that a wait is for 1msec (analysis below will be even worse if the wait is shorter than that).

Analysis:

- The normal observation interval for this test is ~1 msec. I.e. each thread running the test would "normally" issue a new logging message once every ~1 msec. (the ~1msec wait plus the very tight ~1-5usec average Logger.log() call latency).

- As long as the Logger.log() call latency is lower than ~1msec (which is the vast majority of the time), things are "fine". However, when Logger.log() call latencies much larger than ~1msec occur (e.g. the graph shows 10 recorded calls that took 536msec each), each of those calls also "freezes" the testing thread for a duration much longer than the normal interval between two observations.

- Omitted observations: Effectively, each such "freeze" omits observations that would have normally happened had it not been for the back-pressure created by the call latency. I.e. Had an observer continued to issue logging calls at the normal observation interval during this freeze, additional results would have been included in the overall stats computations. E.g. for each of those ~536msec freezes, there are ~535 missing observations.

- Coordinated Omission: Omission on its own is not a big problem, as long as the omission is random. E.g. If we randomly threw away 3 million of those 5 million latency measurements each logging thread was doing, the statistical results would probably not be materially affected. However, when omission is coordinated with observed events, it can dramatically skew the statistical analysis of the remaining results. In the case of this sort of coordinated omission, we are only omitting the "very bad results" (results that are larger than 1msec, i.e. larger than 200x the average latency). This has a huge impact on the eventual percentile calculation.

- Based on the 10 observations of 536msec each alone (and ignoring the 50+ other huge results next to them), we can deduce that an actual observer would have seen 5,350 additional results [10 x (536 - 1)] ranging linearly between 1 msec and 535 msec (with a median of ~268msec).

- If we add those 5,350 results to the 5,000,000 results that the thread recorded, they would represent >0.1% of the overall data, and halfway into that additional data (at 0.05% of overall data) we would see the median result of the missing set, which is 268msec.
 
- Therefore, based on those 10 outliers in the graph alone, your 99.95%'ile is at least 268msec.

- Similarly, the 99.99%'ile is at least 582 msec (the 90%'ile of the missing data that represents 0.01% of the total observations).

That's at least 35,000x higher than reported...
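For reference, here is the back-of-the-envelope arithmetic above as a rough code sketch (illustrative only; it just restates the assumptions above, it is not a re-run of the benchmark):

    // Back-fill the observations an uncoordinated observer would have seen during
    // each ~536 msec stall, and see where they land in the percentile ranking.
    long recorded = 5_000_000L;          // results the test thread actually recorded
    int  stalls = 10;                    // the 10 recorded ~536 msec outliers
    long stallMs = 536;
    long expectedIntervalMs = 1;         // the ~1 msec normal observation interval

    long missingPerStall = stallMs / expectedIntervalMs - 1;   // 535 hidden observations per stall
    long missingTotal = stalls * missingPerStall;              // 5,350
    long total = recorded + missingTotal;                      // 5,005,350

    double missingFraction = (double) missingTotal / total;    // ~0.00107, i.e. >0.1% of all data

    // The hidden results range roughly linearly from 1 msec to 535 msec, so their median
    // is ~268 msec. Half of the missing fraction (~0.053% of all samples) lies above that
    // median, which is more than 0.05% -- so the true 99.95%'ile is at least ~268 msec.
    long medianOfMissingMs = (1 + missingPerStall) / 2;        // 268
    boolean p9995IsAtLeastThat = missingFraction / 2 > 0.0005; // true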

There is good news here, too. HdrHistogram has a recording mode that automatically corrects for coordinated omission if you know what your expected interval between measurements is (and here it is ~1 msec). If you add an HdrHistogram to such a test and record your call latencies using the recordValue(value, expectedIntervalBetweenValueSamples) form, the Histogram will automatically re-create the missing measurements. It is often useful to compare the output of two HdrHistograms, one of which uses the correcting form and one of which just records the raw values, as the difference in output at the higher percentiles is often "impressive".
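A minimal sketch of that comparison, assuming the Java HdrHistogram API (in current versions the correcting form is spelled recordValueWithExpectedInterval(...); the logging call itself is just a stand-in here):

    import org.HdrHistogram.Histogram;

    Histogram raw       = new Histogram(60_000_000_000L, 3);  // track up to 60 s in nanoseconds, 3 significant digits
    Histogram corrected = new Histogram(60_000_000_000L, 3);
    long expectedIntervalNanos = 1_000_000L;                   // the ~1 msec expected interval between calls

    // inside the test loop:
    long start = System.nanoTime();
    // ... the call being measured, e.g. logger.info("some message") ...
    long latencyNanos = System.nanoTime() - start;

    raw.recordValue(latencyNanos);                             // leaves the coordinated omission in place
    corrected.recordValueWithExpectedInterval(latencyNanos, expectedIntervalNanos); // re-creates missing samples

    // after the run, compare the tails -- the difference is usually "impressive":
    System.out.printf("raw       99.99%%'ile: %d ns%n", raw.getValueAtPercentile(99.99));
    System.out.printf("corrected 99.99%%'ile: %d ns%n", corrected.getValueAtPercentile(99.99));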

Footnote: anyone want to guess what that 582 msec 99.99%'ile is dominated by?

 

Vladimir Sitnikov

Aug 3, 2013, 12:43:08 PM

Gil,

Do you think coordinated omission will ever happen for a closed system?

I guess CO happens only for open systems, where the number of clients is effectively infinite.
If the number of clients is limited (e.g. just a single client), CO does not happen.

Regards,
Vladimir Sitnikov


Gil Tene

Aug 3, 2013, 5:39:27 PM
Actually, Vladimir, it's the opposite. The higher the thread count is in your test system, the longer the natural "think time" that each thread will normally model, and the less impactful coordinated omission will be. A single client system is the most likely to exhibit this problem, but most multi-threaded testers exhibit it as well. It's actually a per-thread (local to the testing thread) problem, which occurs whenever an observed response time is longer than the interval that the testing thread is trying to maintain between requests. Basically, whenever a test thread that waits for a response before issuing a new request ends up "missing" on sending the next request it would have normally sent, omission occurs.

If you look at the detailed response log files for such system loaders, it's pretty evident. Whenever you see a long response time, you'll also see a long gap in timestamps between requests. It's as if someone just redacted or erased a whole bunch of bad results from the log file, leaving only the good results and "a few" bad results. Only the first bad result in each sequence of bad results usually survives the redaction. The result heavily skews all stats done on the recorded result set. [I encourage anyone reading this to dig up a results log file and look for those redacted gaps].

The issue lies in the mismatch between what percentiles mean to the reader of a report, and what "coordinating" testers are actually doing. When you say "99.99% of observations show an X msec or better latency", the reader isn't expecting that to mean "99.99% of the good observations were that good". They expect it to mean that 99.99% of all random, uncoordinated attempts were that good. The expectation is that if a request was sent with no warning, and at a random point in time, it will still have a roughly 99.99% chance of seeing the touted latency. But what each tester thread is actually doing as a result of waiting for a long response before sending the next one is the equivalent of asking the system "is it ok for me to test your latency right now? Because if it's not, I'm fine waiting and testing it only when you are ready..."

Note that Coordinated Omission is a measurement methodology problem in load testers, not a problem with the system under test itself. This methodology "glitch" is unfortunately very common. I rarely see load testers that don't suffer from it, including common testers used in the industry (JMeter, Grinder, HP LoadRunner), benchmarks (YCSB, SPECeverything, etc.), and custom-built testers I see used by many low latency shops. I've seen a few rare environments that don't exhibit this problem (e.g. Twitter's Iago, and the asynchronous, thread-less client that Mike McCandless used in testing some stuff here), and in low latency shops that load things up with non-blocking testers, but those are usually the rare cases.

This very prevalent Coordinated Omission behavior basically makes load testers lie about the system they are observing, making you think the system is much better than it really is (especially in the higher percentiles). Since reality doesn't omit or forgive the bad results the way many load testers do, coordinated omission often results in poor decision-making when it comes to capacity planning and setting business or service level expectations. 
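A minimal sketch of the loop such a "coordinating" tester thread runs (sendRequestAndWaitForResponse, histogram and running are hypothetical placeholders):

    import java.util.concurrent.locks.LockSupport;

    // One synchronous tester thread: wait for each response before issuing the next request.
    long intervalNanos = 1_000_000L;                  // intended ~1 msec between requests

    while (running) {
        long sendTime = System.nanoTime();
        sendRequestAndWaitForResponse();              // blocks -- this wait is where the coordination happens
        long latency = System.nanoTime() - sendTime;
        histogram.recordValue(latency);               // only requests that were actually sent get recorded

        // If that response took 500 msec, the ~500 requests this thread would normally have
        // issued during the stall were never sent and never recorded -- the "redacted gap"
        // visible in the response log files described above.
        LockSupport.parkNanos(intervalNanos);         // "think time" before the next request
    }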



Martin Thompson

Aug 3, 2013, 6:44:18 PM
Gil, you are right that the original latency tests I did on the Disruptor do suffer from the CO issue.  It happens for all events injected when the injector is stalled for more than one microsecond.  It does not impact the consumer as all consumed events are recorded in the histogram with no omissions, provided they got injected.  The latency tests were done at an injection rate of 1 million events per second.

Measuring CO on something like the Disruptor is very challenging and some practical constraints have been applied.  The latencies I originally posted were for the full exchange of messages and not just the time to produce.  The time to produce is all that matters for the likes of a logger, and is the one to which CO can be accurately applied.

If one were to measure the cost of producing into the Disruptor alone, the cost of measurement is so great that it makes the measurement pointless.  On average it would take an order of magnitude more CPU cycles in measurement than in actual work.

Michael Barker

Aug 3, 2013, 7:34:47 PM
The Disruptor latency benchmarks have been long overdue for a
re-visit. I want to switch to using the HdrHistogram, both for the
Disruptor and within LMAX. I've also noticed that on some newer
hardware (Sandy/Ivy Bridge) I get some strange results, e.g. the mean
is higher for the Disruptor than a BlockingQueue, but I get a fair
number of results <0ns for the Disruptor. I need to do a bit of
digging.

Mike.

Vladimir Sitnikov

Aug 4, 2013, 2:20:01 AM

> Actually, Vladimir, it's the opposite. The higher the thread count is in your test system, the longer the natural "think time" that each thread will normally model will be, and the less impactful coordinated omission will be.
Ok. Good point. I meant the following: CO happens when you try to model an open system with a closed system. Lots of load testers use a limited number of threads, and are thus closed.

>A single client system is the most likely to exhibit this problem, but most multi-threaded testers exhibit it as well.

If your _real_ system has a single client, no CO happens. Say a specific task is performed by a single person: if his request gets stuck, there is no one to fire similar requests, thus no omissions, thus no CO. The case of 'a single person is not enough to generate the required load' is a separate matter and has nothing to do with CO compensation.

As far as I understand, there is no rule of thumb to 'always use CO compensation in HdrHistogram'.
If the actual load rate matches the required one, no compensation is required.

Regards,
Vladimir Sitnikov

Martin Thompson

Aug 4, 2013, 3:32:08 AM
Your observation on CO is really interesting and I have to admit I've made the mistake many times myself in the past.  More recently I've been trying to be more aware of what can potentially get hidden in measurement and it occurred to me that while your view of CO does surface a whole class of missing observations when measuring, it does also tend to assign the best case scenario to them.  Let me try and explain this.

In the case of the Disruptor latency tests, for which I'm guilty :-), we inject a new event once per microsecond and then measure the end-to-end time averaged per hop over a 3 stage pipeline.  The CO issue arises when something stalls the injector from sending its next event at the 1us boundary.  This could be GC, TCP back pressure, or any number of things.  If I understand CO as you put it correctly, any events that did not get sent because the injector is stalled should be included in the percentiles.  This does get us a lot closer to true percentiles.  However it does not take into account the effect those missing events would have imposed on the system.  For example, what about the queueing effects, cache pressures, potential buffer exhaustion, etc.???  If the actual number of events needed to account for CO really had been injected into the system, it may even have collapsed under the load.

The more I dig into this subject the more I see evidence that not only is CO happening at the point of stalls in a system when doing latency testing, we are also deluding ourselves about how good our systems actually are compared to reality.  To me the evidence is suggesting that when we do latency testing (with load testers) on our systems, our measurements are reflecting a much better picture than reality actually is.  If we are measuring actual latencies for all events in our production systems, especially with multiple points of injection, then we are getting a much more realistic picture.

While on the subject of confessing measurement sins:  I, and I'm sure many on this list, have measured the cost of calling System.nanoTime() by repeatedly calling it in a tight loop on one thread.  If you do this you get 35-65ns between calls depending on processor speed and which version of Linux and the JVM you are using.  On Windows it often does not advance for 10s of calls at a time.  This is very misleading because you can typically add at least a "dirty hit" cache snoop cost, or much worse on a multi-socket server.  In a realistic scenario you need to assume a >100ns cost per call, with a fair bit of variance.
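The tight-loop measurement being confessed to looks roughly like this (a sketch; as noted above, it flatters the real cost because everything stays hot on one core):

    public class NanoTimeCostNaive {
        public static void main(String[] args) {
            long iterations = 10_000_000L;
            long blackhole = 0;                      // accumulate results so the JIT can't drop the calls
            long start = System.nanoTime();
            for (long i = 0; i < iterations; i++) {
                blackhole += System.nanoTime();      // back-to-back calls on a single thread
            }
            long end = System.nanoTime();
            System.out.printf("~%d ns per call (flattering), checksum=%d%n",
                    (end - start) / iterations, blackhole);
        }
    }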

On the bright side of this, if the measurement technique, even when flawed, is showing a significant improvement between 2 systems then things are moving in the right direction even if the percentiles are a work of fiction.  Well done to the log4j v2 folk for taking a big step in the right direction.

Martin...

Peter Lawrey

Aug 4, 2013, 4:01:51 AM

The high speed input data for most systems usually comes from network adapters, and these provide a constant and regular stream of data/events (when highly utilised). A single packet from a 1 Gb/s line can be ~1470 bytes and take around 13 microseconds to come down the line.  If this translates to a number of events, you have bursts of data of, say, 1-15 events every 13 microseconds.  The translation from raw data to events takes time, and if you have designed your system well it will fit within those 13 microseconds, i.e. you can parse the incoming data as fast as it comes and it shouldn't back up unless you have jitter.  When the network line is not so utilised, there are larger delays between packets/data/events.

So say you have 7 events of ~200 bytes each per packet, and you want to be able to typically parse them and add them to a queue/disruptor in half the time between packets to allow for some jitter, i.e. ~7 microseconds. This means you will have micro-bursts of 7 events at a rate of one every microsecond (as each event takes one microsecond to parse and add) with a pause of 6 microseconds.

In short, there can be good reasons why a real system already has some throttling built in which you might not be able to change, e.g. your market data source is limited in some way, as it is on a 1 Gb/s line.

It is also worth noting that your latencies around 99.99% are very dependent on the hardware (especially the CPU) and the OS you are using at these time intervals.





Kirk Pepperdine

Aug 4, 2013, 4:18:26 AM
Hi Martin,

I dedicate a fair bit of time to this subject in the benchmarking section of my performance course. IMHO, the benchmarking section of my course, a section that I'm still far from satisfied with, is one of the *most* important sections. Yet people either don't get the difficulties of benchmarking or its importance, or they think they already know how to tackle the problem. I have seen many many benchmark failures that have resulted in losses for one reason or another. I've also seen benchmarks used as political and negotiating weapons in bake-offs. You might be able to use beer to ply a story or two out of me ;-)

Not getting an event or request or whatever injected into the system at the scheduled time indeed does take some pressure off of the system, which allows it to either recover from or better cope with its existing workload. In fact, CO is more likely in a highly loaded system than in a lightly loaded one (though I'm not sure if pure thread count matters that much... I need to better understand what Gil is trying to communicate).

The problem is that in larger, more complex benchmarks, this effect is *often* overlooked because that's not the focus of the exercise. My Process Diagnostic Model (PDM) purposely includes the harness as part of the system to try to highlight the point that if the problem is in the harness/injector, it will either hide or create phantom problems in the unit under test. I would also suggest that injecting on a heartbeat into a system that doesn't expect requests to be injected on a heartbeat will also have its own subtle effects on performance measurements (which can be seen and explained with queueing theory (Little and Markov)).

While on the subject of confessing measurement sins.  I, and I'm sure many on this list, have measured the cost of calling System.nanoTime() by repeating calling it in a tight loop on one thread.  If you do this you get 35-65ns between calls depending on processor speed and what latest version of Linux and JVM you are using.  On Windows it often does not advance for 10s of calls at a time.  This is very misleading because you can typically add at least a "dirty hit" cache snoop cost, or much worse on a multi-socket server. In a realistic scenario so you need to be assuming >100ns cost per call with a fair bit of variance.

Yes, you have an overhead error on the front end and an error on the back end of any timed interval. Since the typical use case is getTime(); doSomething(); getTime(), this is the equivalent of the cost of a single call to the timer. The error comes from backing out the cost of getting the timer value in the first call plus getting the timer value in the second call. Do you have any idea of the effects of the distance between the two timing events in this use case?


On the bright side of this, if the measurement technique, even when flawed, is showing a significant improvement between 2 systems then things are moving in the right direction even if the percentiles are a work of fiction.  Well done to the log4j v2 folk for taking a big step in the right direction.

Depends on how the fiction has been constructed.  ;-)

At any rate, the biggest problem with Log4J performance IME has always been in the forced object (String or Object[]) construction or the need to protect the method calls with an if statement. I've only once run into a case where the async logger was a problem. In that case I recommended using per-thread logging to individual files that could be merged later on. This drives the concurrency issues deeper into the system where they are better managed.
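For reference, the guard and the construction cost being referred to look like this (a sketch; orderId and venue are just illustrative variables):

    // Without the guard, the String concatenation (and any Object[] varargs array) is
    // built even when DEBUG is disabled:
    if (logger.isDebugEnabled()) {
        logger.debug("Order " + orderId + " routed to " + venue);
    }

    // log4j2's parameterized form defers the formatting instead of requiring the guard:
    logger.debug("Order {} routed to {}", orderId, venue);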

-- Kirk

PS Thanks to Gil for giving this (well known) problem a name....
 

Martin...

Rüdiger Möller

Aug 4, 2013, 6:30:25 AM
Interesting remark.

The skew induced by CO also depends on the type of jitter (locking-induced jitter might hit a single client only, GC-induced jitter hits all clients) and on the frequency of test events vs. outlier duration.

E.g. a manual trading client sending at most 500 orders a day won't be able to hit a GC outlier of 400ms twice. An algo-trading or quote machine client will be affected much harder.


Kirk Pepperdine

Aug 4, 2013, 10:03:33 AM
On 2013-08-04, at 12:30 PM, Rüdiger Möller <moru...@gmail.com> wrote:

Interesting remark.

The skew induced by CO also depends on type of Jitter (locking induced jitter might hit a single client only, GC induced hits all clients) and the frequency of test events vs outlier duration.

I would restrict this to behaviour in the test bed, not the component/app/whatever being tested.


E.g. a manual trading client sending max 500 orders a day, won't be able to hit a GC outlier of 400ms twice. An algotrading or quote machine client will be affected much harder.  

Why would a quote machine be hit?

-- Kirk

Gil Tene

Aug 4, 2013, 11:40:51 AM
CO is a testing methodology and reporting problem, and has nothing to do with the system under test and what it is used for, so no type of system is immune, only types of testers are. Even real systems with a single real-world client are susceptible.

The CO methodology problem amounts to dropping or ignoring bad results from your data set before computing summary statistics on them, and reporting very wrong stats as a result. The stats can often be orders of magnitude off. E.g. 35,000x off for the 99.99%'ile as I show in the example above, or the 99.9%'ile being reported as better than the real 99%'ile, etc. CO happens for all testers that avoid sending requests when any form of back-pressure occurs (usually in the form of some previous request not completing before a new one was supposed to be sent according to the testing model).

A simple way to demonstrate the CO problem on a "real system with a single client" would be this hypothetical:

Imagine that you have a high end concierge business with a single client, and that single client typically calls you on the phone about 10 times a day to perform some transaction (e.g. trade a stock, or check his account balances, buy a shirt, check the weather). In order to keep your customer happy, and avoid losing them to the internet businesses you compete with, you decide that you want to provide them with good customer service, which to you amounts to an actual person answering the phone within 3 rings, 99% of the time, as long as they call any time between 9AM and 5PM pacific time.

You decide to regularly measure your business performance to establish whether or not your behavior meets your goals (of 99% response within 3 rings), and to help you decide whether you need to hire additional people to answer the phone, or maybe replace someone if they are lazy.

So you build a test system. The test system is simple: During a day that your customer is away on vacation and won't be calling you, you ring the business once every minute during the entire 9AM to 5PM period, and check how many rings it took before someone answered each time. You then compute the 99%'ile of that set of samples, and if that 99%'ile is 3 rings or better, you are performing within expectations. If it's not, you know that you need to improve the way your business works somehow (replace someone, or hire additional people to cover each other).

You do the test, and it shows that your business really does answer the phone within 3 rings more than 99% of the time. In fact, most of the time the phone was answered in 1 or 2 rings, and of all the times your test system called, it took more than 3 rings only once. You feel happy. You tell your wife things are going great. You give your employees bonuses for over-performing.

The next day your client fires you. He tried to call during the lunch hour, and nobody was there. In fact, this has been happening for a week now, and he just can't believe your outright dishonesty and false advertisement of your services.

What happened in the above scenario is simple: Your testing methodology experienced Coordinated Omission. You dialed the business once a minute for the entire day, and in 420 out of the 421 dialing tests you made, the phone was promptly answered within 3 rings or less. That's 99.76%! That's great. What you missed is that your single switch operator, the one that started last week and didn't get proper training, thought that she gets 1 hour off for lunch every day, and at 12 noon she left her desk and went to have lunch with her friends across the street. Being a conscientious worker, she was back at her desk promptly at 1PM, answering the phone that had been ringing for a while.

When your test system encountered this, it recorded a single, 1800-ring phone call attempt at noon, followed by a 2-ring call at 1PM. Because it was busy waiting for the phone to be answered between 12 and 1, the test system missed 59 opportunities to call during lunch. Had it made those calls, it would have found that they all took longer than 3 rings to answer, that your better-than-3-rings call-answering percentile is only 87.5%, and that your 99%'ile answering time is actually 1,650 rings, and not 3.

And had you known that, you probably would have added capacity to your business so that when employees go out to lunch (or take bathroom breaks, or pause to take out the garbage), there is someone there to cover for them and answer the phone.

Chris Vest

Aug 4, 2013, 11:59:34 AM
What about benchmarks that don't try to make the calls at any particular interval, but instead place the calls back to back in a loop? Is it just plain wrong to try to extract data about latency percentiles from such tests?

Chris

Gil Tene

Aug 4, 2013, 12:00:50 PM
I find that in almost all high speed Java-based applications that allocate any amount of memory on a regular basis (i.e. not the ones Peter writes ;-)), the 99.99%'ile is completely dominated by GC pauses. It's the misreporting of percentiles due to CO that makes people think otherwise.

For example, a 30msec newgen pause occurring only once every 5 minutes will by definition completely dominate the 99.99%'ile of a system that shows typical 65usec response times (30 msec out of every 5 minutes is 0.01% of wall-clock time, so at least ~0.01% of uniformly arriving requests land in or behind such a pause). Similarly, a 2 second oldgen pause occurring only once every 4 hours would by definition completely dominate the 99.99%'ile of a web application that typically responds in 200msec or less.

There is obviously a very selfish reason for me to care about CO. When a test system incorrectly reports the 99%'ile as 1,000x better than it really is, it often stands in the way of demonstrating how good Zing really is at affecting 99%, 99.9%, 99.99% results... 

Rüdiger Möller

Aug 4, 2013, 1:31:16 PM
I understood your point and admit all tests I have ever written are guilty of CO.

However I'd like to point out the following (I'll stay in your picture):

If I test my service calling every 5 minutes, and the operator goes to the toilet for 4 minutes, the skew of CO is not there (as in manual order submission). A single customer can't hit it twice.
If I test my service calling every 1 minute, and the operator goes to the toilet for 4 minutes, one failure (letting the phone ring until the operator picks up) has to count as [wait time / 1 minute] failures (= 2.5 failures on average). (That's the quote engine.)

Or is this not correct (and why) ?


> There is obviously a very selfish reason for me to care about CO. When a test system incorrectly reports the 99%'ile as 1,000x better than it really is, it often stands in the way of demonstrating how good Zing really is at affecting 99%, 99.9%, 99.99% results... 

I'd never have imagined that ;-) ...

I am not sure the percentile view is always the best way to do reporting. If, e.g., a Java application sleeps for 280ms and the application has to listen to high-volume, unreliable UDP market data (because some clever dude figured out that with good network hardware there are no packet losses -.- ), reporting "we miss xx thousand datagrams a day" is likely to excite decision makers a lot more than arguing with 99.99% figures (because, well, 99.99% is actually 100%, isn't it? :-) ).

Gil Tene

Aug 4, 2013, 1:58:20 PM
The CO problem is the same. It's the omission of attempts that is the problem, and a back-to-back test or a random-interval test would both experience the same omission problem as a fixed-interval test will. In fact, a back-to-back, pedal-to-the-metal test system is the most likely to suffer from CO, as its "typical" interval is very short, and it therefore will miss many more sampling opportunities than a system with some amount of "think time".

Correcting for the problem is a different matter. The best way to correct the problem is to avoid it altogether (use a tester that sends stuff when it wants to, and never responds to any form of back pressure, like Twitter's Iago). But since changing testers is often impractical, and any system that uses TCP for a transport has built-in back pressure, correcting for the problem is the next best thing. Test systems that have a known expected interval between requests (within a single client thread) are the ones that are easiest to correct. E.g. when you know exactly what your test scenario was supposed to be doing, you can correct for it not doing so (e.g. tell HdrHistogram what the expected interval between samples is, and it will re-create the missing ones through interpolation when your recorded time is greater than the interval). A back-to-back, pedal-to-the-metal test system still has an expected interval, but it's one you'll have to compute (a good estimate is the ongoing average response time. A better estimate is the ongoing average with some type of outlier filtering).
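One way that estimate could be wired up, as a sketch (the 3x outlier threshold here is an arbitrary choice, not a recommendation):

    import org.HdrHistogram.Histogram;

    // For a back-to-back tester: use the ongoing average response time, with a crude
    // outlier filter, as the "expected interval" when recording into HdrHistogram.
    class BackToBackRecorder {
        private final Histogram histogram = new Histogram(60_000_000_000L, 3);
        private long count;
        private double avgNanos;

        void record(long latencyNanos) {
            boolean outlier = count > 0 && latencyNanos > 3 * avgNanos;
            if (!outlier) {                                   // only "normal" results feed the estimate
                count++;
                avgNanos += (latencyNanos - avgNanos) / count;
            }
            long expectedInterval = Math.max(1L, (long) avgNanos);
            histogram.recordValueWithExpectedInterval(latencyNanos, expectedInterval);
        }
    }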

There are many ways to try and compensate for CO, and there is no one "right" way. E.g. at Azul, I started a side-project this summer (with Mike Chmiel, an intern from U of I) to build an auto-corrector for JMeter and for generic tester log files (the kind that log all the results, not just the summary). Rather than just correcting the stats (as HdrHistogram does internally if asked to), the focus there is on re-creating missing results in log files and/or in JMeter listener streams, such that existing reporting tools (like the various graphing plugins in JMeter, or Excel sheets and other tools that read log files) will simply have "more correct" data to work on. After all, it's not the math in the reporting tools that is wrong, it's the missing data that screws up all the numbers.

When working on correction techniques, I currently try to stay conservative. I.e. correcting "less" is better than correcting "more", and focusing on building correctors that can reliably tell you that things are "at least this bad", rather than ones that say "it could be as bad as this". It's arguable that the more aggressive approach is warranted, as people who actually read percentile data expect them to mean "that percentile is at least this good", but I find that over-correcting leads to questions of credibility of the correction technique, so I'm staying on the conservative side of that for now.


Gil Tene

Aug 4, 2013, 2:14:11 PM
You don't need a customer to be able to "hit twice". They just need to experience the stats being something other than what you claim. The test technique should be used to establish that the stats are what you say they are (or need them to be). Similarly, in a multi-customer business, you don't need the same customer to even interact with you more than once to experience the issue. I.e. if 10% of your single-visit customers see really bad behavior (i.e. nobody answering the phone), your business is in trouble even if the other 90% sees perfect behavior.

Think about it this way: When you tell your single customer that you'll answer the phone in 3 rings 99% of the time, they expect that to mean that if they randomly choose to call you (without warning you or making sure that you are really there first), the likelihood of your answering in 3 rings or less is 99% or better. If they called you 100 times over the course of 10 days, and on 5 of those attempts no one answers for 4 minutes (because they were off taking care of their digestive tract needs), they would rightly call you on it, and say that your 99% claims are completely bogus. You could claim that they just got unlucky, and that with a large enough sample your numbers are true, but you'd be lying if you did that based on test system results that had CO built in.

So if the operator goes to the toilet for 4 minutes less than once every 400 minutes, you are probably rightly operating within your claims, and deserve a ringing endorsement (let's leave the need for a max time requirement in SLAs out of this discussion). However, if your operator goes to the toilet more frequently than that (i.e. twice a day or more), then your 99%'ile claims are false even if no lunch break was taken.

Rüdiger Möller

Aug 4, 2013, 2:56:35 PM
> Why would a quote machine be hit ?

Manual order entry cannot send faster than one order every ~1-2 seconds. A quote machine often sends several quotes per second.

So if I test for a manual order entry client with a single-threaded program synchronously sending an order every 5 seconds, I don't have to adjust results (if the max outlier is <5s).
If I test for a quote machine client with a single-threaded program synchronously sending 10 quotes/second, I have to add 'missing test cases' because the blocked test program is not sending quotes while waiting. Otherwise the data would report 1 latency incident instead of something like 20.

So a (stupid) way to escape CO is to lower the test request frequency until the interval is greater than the max latency outlier. One then has to multithread a lot in order to get load on the system... it's not practical. A better approach is CO compensation.


Gil Tene

Aug 4, 2013, 3:51:00 PM
That's exactly right.

Whether or not your test "missed" samples depends only on the rate at which the tester "intended" to send requests, and not at all on the rate that the system under test can handle.

So in your example below, the same exact quote system, behaving in exactly the same way (e.g. pausing for 1/2 a second every 5 minutes), could end up with a "right" test results data set if tested [for a long enough period of time] under a "manual quote system" scenario with a 5 second interval between all attempts, while the same system would end up with a test results set that under-reports its response time failures by a factor of 5x if tested under a "machine client" scenario with 10 quotes per second.

As long as the tester didn't skip any requests it would have normally made under the intended test plan, your test data remains a valid, uncoordinated sample of the tested system's response times, and you just need enough of it to make it likely that you've sampled enough data. E.g. it's obviously useless to deduce the 99.9%'ile from only 500 test samples, and risky to trust it based on only 2,000 samples, but you can probably safely project the 99.9%'ile from 10,000+ samples. Random "skips" in sampling are fine too, as long as they remain uncoordinated. So the client pausing for its own uncoordinated-with-the-tested-system reasons, for example, does not make it miss anything from a statistical perspective. It's only when the tester skips tests as a result of the tested system's long response times that coordination is introduced into the test, and that's when the results end up being completely skewed, since it's the equivalent of somebody taking a big thick black pen and selectively eliminating X results after each first occurrence of a bad result. The likelihood that the eliminated results are non-random and skewed towards "bad" is very high in such cases, becoming a near-certainty if the cause of the long response time was a systemic but momentary glitch in the tested system's responsiveness.

-- Gil.


On Aug 4, 2013, at 11:41 AM, Rüdiger Möller <moru...@gmail.com> wrote:

> Why would a quote machine be hit ?

manual order entry cannot send faster than one order each ~1-2 seconds. A quote machine oftens sends several quotes/second. 

So if I test for manual order entry client with a single threaded program sending synchronous an order each 5 seconds, i don't have to adjust results (if max outlier is <5s).
If I test for quote machine client with a single threaded program sending synchronous 10 quotes/second, i have to add 'missing testcases' because of blocked test program not sending quotes while waiting. Else data would report 1 latency incident instead of like 20.


Gil Tene

Aug 4, 2013, 7:36:17 PM
Martin, you are certainly right that there are other issues with testers that back off, CO being only one of them. But CO is certainly a very big one. I find that CO has such a huge impact on the wrongness of 99%+ reported numbers that it often results in bad decisions when it comes to capacity planning, software selection, and business expectations.

Of the non-CO effects you mention below, the queueing-pressure-avoidance one is probably the most significant, and I see that as failed load testing (failure to load the system up). I've seen tests (like the YCSB benchmark, for example) that attempt to compensate for delayed requests by playing "catch up" and always sending requests until the total issued requests matches the expected sustained rate. It's hard to tell if that bursting effect has any bearing on behavior in reality though, and while this trick does create large queuing pressure right after a large delay, it actually makes the coordinated omission effect twice as bad (at least in the way it is used in YCSB). Instead of just dropping bad results from the data set, it actually replaces them with good results measured at "a more convenient time" from the tested system's perspective. You can certainly design a test system that does this "catch up" thing to correctly report the bad response times, btw. But doing so usually involves changing the actual tester (e.g. we can probably easily change the actual YCSB benchmark to correctly log response times). But it's hard to fix this sort of thing in post-processing of recorded data.

The sad reality is that once a test system misses an opportunity to actually send a request when it should have, we simply don't know what would have happened if it did. All correction techniques are estimates from that point. The various CO correction techniques I play with work under certain assumptions, and I'm hoping that our JMeter auto-correcting tool thing will help some, but as it stands it will probably only work for non-random test pattern scenarios. Unfortunately, random behavior is one of the things you can program load testers to do.

Peter Lawrey

Aug 5, 2013, 3:13:14 AM
My rule of thumb is that if you want a reproducible N percentile you want (1-N)^-2 samples, i.e. for

99%       10,000
99.9%    one million
99.99%  100 million



Martin Thompson

Aug 5, 2013, 5:36:10 AM

On Sunday, August 4, 2013 9:18:26 AM UTC+1, Kirk Pepperdine wrote:
While on the subject of confessing measurement sins.  I, and I'm sure many on this list, have measured the cost of calling System.nanoTime() by repeating calling it in a tight loop on one thread.  If you do this you get 35-65ns between calls depending on processor speed and what latest version of Linux and JVM you are using.  On Windows it often does not advance for 10s of calls at a time.  This is very misleading because you can typically add at least a "dirty hit" cache snoop cost, or much worse on a multi-socket server. In a realistic scenario so you need to be assuming >100ns cost per call with a fair bit of variance.

Yes, you have an overhead error on the front end and on error on the back end of any timed interval. Since the typical use case is get time(); doSomething(); getTime(), this is the equivalent of the cost of a single call to the timer. The error is in the backing out from getting the timer value in the first call +  getting the timer value in the second call. Do you have any idea on the effects of distance between the two timing events in this use case?

You have at least 2 classes of problem here.  

Firstly, when estimating the cost of getting the time: at the lowest level you are going to call RDTSC and add that to the real time clock value.  Intel say that the first 3 times you call RDTSC it is a fair bit slower, and this gets hidden in the tight loop test.  Also, in the tight loop test your real time clock value is in local cache, which is unlikely when doing a real timing test.  You then need to have the start time in cache to do the duration calculation, which it may or may not be depending on scheduling and what doSomething() executes.  If measuring something "large" then this does not really matter.

Secondly, when measuring something "small", i.e. things much less than a microsecond, a straight call to RDTSC is problematic.  RDTSC is not a serialising instruction and thus can be moved anywhere in the instruction stream.  To address this you need an ordering instruction paired with it, such as LFENCE;RDTSC or CPUID;RDTSC. This is OK for measuring the start time but not the stop time.  What you really need is the start time captured with CPUID;RDTSC and the end time captured with RDTSCP;CPUID, because no instruction can be moved to after RDTSCP, but instructions can be moved before it.  With System.nanoTime() there is no hint as to whether it is a start time or a stop time.

On a related issue, if you are running on Sandy Bridge or later with Turbo Boost 2.0 enabled then after about 20-25 seconds the CPU frequency will drop due to turbo boost scaling down.  This can result in the first part of the test showing better latency than the end of the test. You need to ensure you take control of C-states and P-states to get remotely reliable results even after applying thread affinity and CPU isolation.  Best to lock all CPU cores to a single frequency for the duration of a test.

Michael Barker

Aug 6, 2013, 6:11:46 AM
Hi,

Caveat: these are only half formed thoughts on the topic.

I've been re-visiting the Disruptor latency tests and looking into the
effects of Coordinated Omission and how it is impacting our numbers.
I'd like to confirm a couple of observations. I've looked at the
approach that the HdrHistogram uses to backfill the omitted values. I
think that algorithm makes a couple of assumptions:

- Desired input rate is constant.
- The load generation/latency measurement synchronously waits for
responses before issuing the next request.

For the Disruptor tests, we timestamp just before publishing in the
producer thread and then take the end timestamp in the consumer
thread. Therefore a stall in the consumer thread may or may not have
an impact on the producer thread. If we are producing at a rate of 1M
events/sec and we stall for a short time (e.g. 30µs) then the impact
on the producer is absorbed by the Disruptor's buffer, so the producer
can continue unimpeded (to a point), so the cost of the consumer stall
will be included in the latency measurements - no compensation for the
stall is required. Obviously, if the pause is sufficiently long, the
buffer will fill and the producer will be affected. So we can't tell from
the latency measurement alone the level of Coordinated Omission. I'm
currently messing around with tracking the delta between start
timestamps (in an HdrHistogram) as a way of trying to understand the
performance test's behaviour.
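In code, that delta tracking is roughly the following (a sketch with placeholder names, not the actual Disruptor perf test):

    // Record the gap between successive publish timestamps into its own
    // histogram, so producer-side stalls show up directly as a fat tail in
    // the inter-arrival distribution rather than being silently absorbed.
    final org.HdrHistogram.Histogram interArrival =
            new org.HdrHistogram.Histogram(3600L * 1000 * 1000 * 1000, 3);
    long previousStart = System.nanoTime();
    for (long i = 0; i < ITERATIONS; i++) {
        final long start = System.nanoTime();
        interArrival.recordValue(start - previousStart);
        previousStart = start;
        publishEvent(start);                   // hypothetical publish step
        while (PAUSE_NANOS > (System.nanoTime() - start)) {
            // busy spin to pace the intended input rate
        }
    }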

The other area I'm looking at is our own system latency tests. In
order to better simulate real-world behaviour we don't use a
consistent input rate, instead we attempt to simulate burst
conditions. I'm thinking of using a similar measurement to the above
(delta between timestamps) to model the input rate as a distribution
and compare that to the intended input distribution.

So, I'm not sure whether it will be possible in all cases to fix up
latency results to compensate for Coordinated Omission, but it should
be possible to track it as a form of measurement error. Anyway, I'm
still experimenting so I'll see how it goes. I'm interested in
hearing others' experience in this area.

Mike.

Gil Tene

unread,
Aug 6, 2013, 10:01:05 AM8/6/13
to <mechanical-sympathy@googlegroups.com>
Mike, I think that you are correct in that the simple correction mode offered by HdrHistogram will not work for your scenarios below.

Correcting for coordinated omission can be challenging. When your test scenario is simple enough, it can be easily corrected, and HdrHistogram can optionally help with some of those scenarios. E.g. whenever you have a fixed expected interval between requests or a fixed delay ("think time") between requests, and your load tester uses synchronous client threads (each thread waits for a response before adding a new request), an expectedIntervalBetweenValueSamples can be computed and passed as an extra parameter to HdrHistogram's recordValue call. This sort of correction can be applied either at test time or as a post-processing step to tester log files.
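At the recording site that is just the extra argument on the two-argument recordValue form used elsewhere in this thread (a sketch; the histogram and variable names are mine):

    // Fixed intended interval between requests, synchronous tester threads: the
    // only change is the extra expectedIntervalBetweenValueSamples argument on
    // the corrected histogram.
    final long expectedIntervalNanos = 1000000L;            // hypothetical 1 msec pacing
    rawHistogram.recordValue(latencyNanos);                               // CO-prone
    correctedHistogram.recordValue(latencyNanos, expectedIntervalNanos);  // CO-corrected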

When the test scenarios are more complex than that, e.g. when there are patterns of different requests interleaved in the tester's work, correcting becomes more challenging, but linear and predictable behavior can still be corrected in post-processing in such cases. When random behavior is intentionally introduced in the tester, or when the tester does other reactive work (like compensate for not sending things when it should have by sending them later), it gets really hard to deduce and correct things in post-processing, and the tester itself needs to be changed in order to compensate.

I do believe that it should always be possible to correct for the backpressure-coordination cause in the tester itself, in order to avoid the coordinated omission of results. But in many cases, it becomes a matter of the tester knowing what it should/would have done, and using that knowledge in deciding what to record. This approach (as opposed to post-processing) can work even when the tester picks random or varying interval scenarios, as the tester can/should always be able to know what it had "intended" to do, and record latencies accordingly. The trick there is to have the tester act on some internally tracked "plan" and to then compute response times based on that plan, rather than based on what actually happened when back-pressure affected the behavior.

As an example, there is a style of load-generating that I call "constant throughput" (as opposed to "attempted-constant-interval", or "attempted-constant-think-time"). In this style, the tester always sends enough requests to match the number of requests that a sustained rate would have required "up to now". In such testers, a request burst of some sort (dampened or not) will typically follow any significant hiccup or pause. The issue with this type of load tester is that if recorded times are treated "naively", the requests themselves are effectively shifted from the "bad" times to "good" times if back-pressure stalled the tester temporarily and delayed some requests (and a burst of some sort would obviously follow such a pause). The current YCSB benchmark uses this model, which results in a worse-than-coordinated-omission skew in its statistics. However, the same tester can easily be changed to avoid this problem by computing when it had "intended" to send each message, and recording response time for each request not as the delta between the actual request issue time and the actual response time, but instead as the delta between the intended request sending time and the actual response time. The intended request time can be trivially computed on the fly for such constant-throughput style testers.
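A sketch of that style of tester (my own naming, not YCSB's actual code) would look something like this:

    // "Constant throughput" tester sketch: the i-th request is *planned* for
    // testStart + i * interval, and latency is recorded against that intended
    // send time, not the actual send time.
    final long intervalNanos = 1000L;                       // hypothetical 1M requests/sec
    final long testStart = System.nanoTime();
    for (long i = 0; i < totalRequests; i++) {
        final long intendedSendTime = testStart + i * intervalNanos;
        while (System.nanoTime() < intendedSendTime) {
            // busy spin; after a stall this loop falls straight through,
            // producing the natural catch-up burst
        }
        issueRequestAndAwaitResponse();                     // hypothetical synchronous call
        final long responseTime = System.nanoTime();
        histogram.recordValue(responseTime - intendedSendTime);
    }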

The scenario you describe below is harder to model and correct, as your tester threads are not perfectly synchronous but they still experience back-pressure coordination. The fact that you measure your latencies at a point other than the source will also make it harder to model and communicate what "intended request sending time" was. If you could carry the intended request sending time in the payload (instead of the actual sending time), that would probably be easiest, but that is often not an option for testers that need to emulate "real traffic".

Another issue you will run into is that not all shifts from the intended plan can/should be assumed to be caused by the system-under-test. The tester itself is often not perfect, and if the tester experiences its own "glitch" that is uncoordinated with the test system, that just amounts to [uncoordinated] omission, which does not skew stats and should not be compensated for. Knowing the difference could be VERY hard, so you would obviously want your tester to be as near-perfect as possible at executing to plan, e.g. avoid/minimize all pauses in the tester by fine tuning it or using your favorite non-pausing JVM.

One last thing to keep in mind is that partial correction is better than no correction at all. E.g. if you believe that buffering throughout your system can absorb 200usec of traffic before system stalls induce back pressure, you can just choose to avoid correcting for anything smaller than 200usec even though your message rate is one per usec. You'd remain "potentially blind" to coordinated omission at those levels (if it existed), but would still "mostly" correct the stats when larger glitches (like a rare 8 msec scheduling delay, or some GC pause) stall the tester.
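In HdrHistogram terms, that partial correction is just recording with a larger expected interval than the actual send rate (a sketch, using the numbers from the example above):

    // Conservative partial correction: actual send rate is one per usec, but we
    // only backfill for stalls beyond the ~200usec the buffering can absorb.
    histogram.recordValue(latencyNanos, 200L * 1000);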

Whether you want to be conservative or aggressive in these corrections depends on whether you want your reported percentiles to carry an "at least as bad as" or an "at least as good as" meaning. I use the "at least as bad as" (i.e. be conservative and don't over-correct) approach when I try to demonstrate that the problem exists, but if you are basing your business decisions or placing bets based on the reported percentiles, there is a good argument for going with the "at least as good as" approach, and overcompensating when in doubt.


Attila-Mihaly Balazs

unread,
Aug 6, 2013, 12:25:16 PM8/6/13
to mechanica...@googlegroups.com
Hello all,

First of all, I would like to clarify that those measurements are not done by me but rather by some Log4j committers. I just found the page (by somebody linking to it) and ran the tests myself because I wanted to see if there is room for improvement (TL;DR - this rate of logging is reaching the limit of the memory bandwidth, however latencies could be made more consistent - please reply in the other thread if you want to discuss this).

Getting back to CO:

- you can see the actual test class here: https://svn.apache.org/repos/asf/logging/log4j/log4j2/trunk/core/src/test/java/org/apache/logging/log4j/core/async/perftest/RunLog4j2.java - there are two methods (runThroughputTest and runLatencyTest) so I assume that the throughput graphs and latency graphs are independent (were collected in separate runs)

- for the latency case it is indeed measuring the cost of calling logger.log and also adjusting for the cost of calling nanoTime. Waiting after each logging statement is accomplished using busy-wait.

Now getting back to your description of CO: do I understand correctly that the basic problem can be described as "percentiles are not guarantees for maximum values"? And isn't the solution as simple as just including the maximum value into the discussion? Ie. we say "if 99.99% of the time we respond in 100 msec and we always respond in < 1sec, we make money"? I also think this is in line with Martin's methodology of talking with clients :-)

Attila

Attila-Mihaly Balazs

unread,
Aug 6, 2013, 1:09:07 PM8/6/13
to mechanica...@googlegroups.com
A potential problem I see with the benchmark code:

final long s1 = System.nanoTime();
logger.info(LATENCY_MSG);
final long s2 = System.nanoTime();
final long value = s2 - s1 - nanoTimeCost; 
if (value > 0) { 
   histogram.addObservation(value); 
}

It's omitting "to good to be true" measurements, which is understandable since we can't have < 0 time (I suspect this is happening because nanoTimeCost is actually an average and sometimes it can take less time to call nano time). Probably this should be redone such that we measure the average time for X logger.info calls without correcting for nanoTimeCost (ie we do s1; for (0..X) logger.info; value = nanoTime-s1) and we determine X such that nanoTimeCost / X is arbitrarily close to 0 (we set a limit we're comfortable with).

And there is of course the issue with using nanoTime which can have all kinds of weird behavior (like going backwards, not advancing, etc) over a short period of time as mentioned by somebody (Martin?) on this thread.

Martin Thompson

unread,
Aug 6, 2013, 1:15:52 PM8/6/13
to mechanica...@googlegroups.com
System.nanoTime() is monotonic and should not go backwards.  There were some bugs with processors some time ago, but this should no longer be an issue.  System.currentTimeMillis() reads the clock, which could be adjusted backwards depending on what adjustment mechanism is used, e.g. ntpdate.

RDTSC is not guaranteed to be synchronised across sockets but an OS should hide this.

Gil Tene

unread,
Aug 6, 2013, 1:18:30 PM8/6/13
to mechanica...@googlegroups.com

On Tuesday, August 6, 2013 9:25:16 AM UTC-7, Attila-Mihaly Balazs wrote:
First of all, I would like to clarify that those measurements are not done by me but rather by some Log4j committers. I just found the page (by somebody linking to it) and ran the tests myself because I wanted to see if there is room for improvement (TL;DR - this rate of logging is reaching the limit of the memory bandwidth, however latencies could be made more consistent - please reply in the other thread if you want to discuss this).

Sorry for assuming it was your test, and thanks for posting. Let me re-iterate that I think this log4j2 stuff is super-cool and a great step forward for logging in Java. I'm sure that if the 99%'ile, 99.9%'ile and 99.99%'ile were reported correctly, it would still run circles around older loggers.
 

Getting back to CO:

- you can see the actual test class here: https://svn.apache.org/repos/asf/logging/log4j/log4j2/trunk/core/src/test/java/org/apache/logging/log4j/core/async/perftest/RunLog4j2.java - there are two methods (runThroughputTest and runLatencyTest) so I assume that the throughput graphs and latency graphs are independent (were collected in separate runs)

This one is very easy to fix. A runLatencyTest() that will correctly report percentiles when faced with the occasional latency larger than (10,000 nsec * threadCount) would look something like this:

    @Override
    public void runLatencyTest(final int samples, final Histogram histogram,
                               final long nanoTimeCost, final int threadCount) {
        final Logger logger = LogManager.getLogger(getClass());
        final long PAUSE_NANOS = 10000 * threadCount;
        for (int i = 0; i < samples; i++) {
            final long s1 = System.nanoTime();
            logger.info(LATENCY_MSG);
            final long s2 = System.nanoTime();
            final long value = s2 - s1 - nanoTimeCost;
            if (value > 0) {
                histogram.addObservation(value);
            }
            // Compensate for missing samples:
            long remainingValue = value - PAUSE_NANOS; 
            while (value > PAUSE_NANOS) {
                histogram.addObservation(remainingValue);
                remainingValue -= PAUSE_NANOS;         
            }
            // busy-spin wait for PAUSE_NANOS (10 usec * threadCount)
            final long pauseStart = System.nanoTime();
            while (PAUSE_NANOS > (System.nanoTime() - pauseStart)) {
                // busy spin
            }
        }
    }
 

- for the latency case it is indeed measuring the cost of calling logger.log and also adjusting for the cost of calling nanoTime. Waiting after each logging statement is accomplished using busy-wait.

Now getting back to your description of CO: do I understand correctly that the basic problem can be described as "percentiles are not guarantees for maximum values"? And isn't the solution as simple as just including the maximum value into the discussion? Ie. we say "if 99.99% of the time we respond in 100 msec and we always respond in < 1sec, we make money"? I also think this is in line with Martin's methodology of talking with clients :-)

No. The problem is that the values for the percentiles themselves are reported wrong. The problem in this specific report is best described as "this test reported the 99.99%'ile as 16usec, even though the 99.99%'ile during the test period was clearly at least 526msec." A report generated with the corrected code above would not make that mistake.

Attila, it would be really cool if you could re-run the test with and without the code correction I suggest above and post the results here for comparison.

-- Gil.

Ishaaq Chandy

unread,
Aug 6, 2013, 4:55:47 PM8/6/13
to mechanica...@googlegroups.com
Gil,
In your amended code, when you wrote this:
while (value > PAUSE_NANOS)

did you mean:
while (remainingValue > 0)

Ishaaq

--

Gil Tene

unread,
Aug 6, 2013, 5:01:48 PM8/6/13
to mechanica...@googlegroups.com
It might seem that way, but no...

It's a bit non-obvious, but the reason is that you only ever want to compensate with results that are larger than the expected interval. Whatever fractional leftover remains (< PAUSE_NANOS) you have to drop, because adding it could introduce a better-than-ever-experienced latency into the data. Basically, you can only start compensating when the measured response time is 2x the expected request interval, as up to that point you would only have fractional correction.

It can be changed to while (value >= PAUSE_NANOS) though...

-- Gil.


Michael Barker

unread,
Aug 6, 2013, 6:29:21 PM8/6/13
to mechanica...@googlegroups.com
> Mike, I think that you are correct in that the simple correction mode offered by HdrHistogram will not work for your scenarios below.
>
> Correcting for coordinated omission can be challenging. When your test scenario is simple enough, it can be easily corrected, and HdrHistogram's can optionally help with some of those scenarios. e.g. Whenever a you have a fixed expected interval between requests or a fixed delay ("think time") between requests, and your load tester uses synchronous client threads (each thread waits for a response before adding a new request), an expectedIntervalBetweenValueSamples can be computed and passed as an extra parameter to HdrHistogram's recordValue call. This sort of correction can be applied either at test time or as a post-processing step to tester log files.

So I've decided to do the latency measurement synchronously in the
Disruptor tests to see what results I get. My test loop is below (it would
be nice if it could be sanity checked):

for (long i = 0; i < ITERATIONS; i++)
{
    long t0 = System.nanoTime();
    long sequence = ringBuffer.next();
    ringBuffer.get(sequence).setValue(t0);
    ringBuffer.publish(sequence);

    while (stepThreeSequence.get() < sequence)
    {
        // busy spin
    }

    long t1 = System.nanoTime();
    histogram.recordValue(((t1 - t0) / 3) - nanoTimeCost, PAUSE_NANOS);
    // histogram.recordValue(((t1 - t0) / 3) - nanoTimeCost);

    while (PAUSE_NANOS > (System.nanoTime() - t1))
    {
        // busy spin
    }
}

I've tested with and without tracking Coordinated Omission. I'm
reasonably sure that the impact is low.

Without CO tracking (last of 5 runs/30M iterations/1µs interval):

%-tile: latency(µs) (count)
0.000200: 0.081 (6)
50.697467: 0.197 (69141)
75.910433: 0.204 (118580)
89.975933: 0.208 (95653)
95.403900: 0.210 (65536)
97.095200: 0.212 (20504)
98.556833: 0.216 (6635)
99.248500: 0.257 (1339)
99.616167: 0.263 (1842)
99.808067: 0.269 (615)
99.903633: 0.286 (41)
99.951267: 0.381 (5)
99.975667: 0.636 (4)
99.987833: 0.829 (2)
99.993900: 1.277 (1)
99.997000: 1.574 (2)
99.998500: 2.047 (2)
99.999267: 2.816 (1)
99.999633: 3.708 (1)
99.999833: 5.808 (1)
99.999933: 8.397 (1)
99.999967: 15.549 (1)
100.000000: 17.639 (1)
100.000000: 17.639 (1)

With CO tracking (last of 5 runs/30M iterations/1µs interval):

%-tile: latency(µs) (count)
0.000033: 0.070 (1)
50.210110: 0.149 (89374)
77.597080: 0.207 (157038)
90.293694: 0.209 (187058)
94.614563: 0.210 (129630)
97.573240: 0.212 (29029)
98.680907: 0.214 (13394)
99.269655: 0.217 (3378)
99.618445: 0.264 (727)
99.822072: 0.271 (824)
99.903103: 0.278 (109)
99.952001: 0.332 (44)
99.975601: 0.521 (2)
99.987800: 0.801 (1)
99.993900: 1.458 (1)
99.996967: 1.999 (1)
99.998500: 3.618 (1)
99.999267: 7.540 (1)
99.999633: 11.540 (1)
99.999833: 14.540 (1)
99.999933: 16.067 (1)
99.999967: 16.540 (1)
100.000000: 17.067 (1)
100.000000: 17.067 (1)

There is not a large difference between the 2. In fact, when we track
CO, the results are sometimes better, which indicates that local
system jitter has a bigger hand in this. I've seen the max jump
between 10µs and 200µs. I think the next thing is to play with thread
affinity and isolcpus.

Mike

P.S. the /3 is there because we push the event through 3 handlers to
try and increase the overall latency, to produce something that is more
measurable.

Ishaaq Chandy

unread,
Aug 6, 2013, 7:11:11 PM8/6/13
to mechanica...@googlegroups.com
Yes, I understand the reasoning behind the extra loop (after watching your talk on CO) but still think the loop condition as you originally wrote it is incorrect...

value is final so your while loop is going to either be a no-op (if value < PAUSE_NANOS) or non-terminating.



Gil Tene

unread,
Aug 6, 2013, 9:19:44 PM8/6/13
to mechanica...@googlegroups.com
Sorry. I was right on the "> 0" vs. the "> PAUSE_NANOS" part, but wrong on the variable. Serves me right for posting code that wasn't actually run.

Bottom line, the correct sequence (still posting without running it) is something like this: (changed to use >= as well)

    @Override
    public void runLatencyTest(final int samples, final Histogram histogram,
                               final long nanoTimeCost, final int threadCount) {
        final Logger logger = LogManager.getLogger(getClass());
        final long PAUSE_NANOS = 10000 * threadCount;
        for (int i = 0; i < samples; i++) {
            final long s1 = System.nanoTime();
            logger.info(LATENCY_MSG);
            final long s2 = System.nanoTime();
            final long value = s2 - s1 - nanoTimeCost;
            if (value > 0) {
                histogram.addObservation(value);
            }
            // Compensate for missing samples:
            long remainingValue = value - PAUSE_NANOS; 
            while (remainingValue >= PAUSE_NANOS) {
                histogram.addObservation(remainingValue);
                remainingValue -= PAUSE_NANOS;
            }
            // busy-spin wait for PAUSE_NANOS before the next sample
            final long pauseStart = System.nanoTime();
            while (PAUSE_NANOS > (System.nanoTime() - pauseStart)) {
                // busy spin
            }
        }
    }

Gil Tene

unread,
Aug 6, 2013, 9:50:03 PM8/6/13
to mechanica...@googlegroups.com
Three things here:

1. The best way to compare CO-corrected vs. raw histograms is to record the same data to two histograms (one with and without an expected interval) in the same run and print them both out. This way you take any cross-run test noise out of the picture. 

2. I think CO is not being corrected in your second data set, for some reason. If it were, and assuming PAUSE_NANOS is 1000, I would expect the histogram data to have at least 1 data value at each of the 1 usec intervals counting down from the max value (i.e. 17.067, 16.067, 15.067, 14.067, 13.067, ...), and the counts in the percentile buckets reported show that is not the case.

3. Your printout is in a format that's a bit different than what I usually use. I'm assuming that you somehow iterate on some percentile buckets and print counts in each bucket, and your code may very well be right for that, but to sanity check your printing, can you also use something like the below for the same data?

histogram.getHistogramData().outputPercentileDistribution(System.out, 5 /* steps per half */, 1000.0 /* value unit scaling */);

It prints cumulative counts up to a percentile rather than delta counts per line, so it has a slightly different meaning. I've tested that format many times so I have some confidence in its output being right. For example, using this outputPercentileDistribution() form, below is the output I get from raw and corrected histograms, each recording 100 values of 1000 and one value of 17,000 (the corrected histogram was given an expected interval of 1000):

Raw Histogram:

Value, Percentile, TotalCountIncludingThisValue

       1.000 0.000000000000        100
       1.000 0.100000000000        100
       1.000 0.200000000000        100
       1.000 0.300000000000        100
       1.000 0.400000000000        100
       1.000 0.500000000000        100
       1.000 0.550000000000        100
       1.000 0.600000000000        100
       1.000 0.650000000000        100
       1.000 0.700000000000        100
       1.000 0.750000000000        100
       1.000 0.775000000000        100
       1.000 0.800000000000        100
       1.000 0.825000000000        100
       1.000 0.850000000000        100
       1.000 0.875000000000        100
       1.000 0.887500000000        100
       1.000 0.900000000000        100
       1.000 0.912500000000        100
       1.000 0.925000000000        100
       1.000 0.937500000000        100
       1.000 0.943750000000        100
       1.000 0.950000000000        100
       1.000 0.956250000000        100
       1.000 0.962500000000        100
       1.000 0.968750000000        100
       1.000 0.971875000000        100
       1.000 0.975000000000        100
       1.000 0.978125000000        100
       1.000 0.981250000000        100
       1.000 0.984375000000        100
       1.000 0.985937500000        100
       1.000 0.987500000000        100
       1.000 0.989062500000        100
      17.007 0.990625000000        101
      17.007 1.000000000000        101
#[Mean    =        1.158, StdDeviation   =        1.584]
#[Max     =       16.992, Total count    =          101]
#[Buckets =           22, SubBuckets     =         2048]

Corrected Histogram:

Value, Percentile, TotalCountIncludingThisValue

       1.000 0.000000000000        101
       1.000 0.100000000000        101
       1.000 0.200000000000        101
       1.000 0.300000000000        101
       1.000 0.400000000000        101
       1.000 0.500000000000        101
       1.000 0.550000000000        101
       1.000 0.600000000000        101
       1.000 0.650000000000        101
       1.000 0.700000000000        101
       1.000 0.750000000000        101
       1.000 0.775000000000        101
       1.000 0.800000000000        101
       1.000 0.825000000000        101
       1.000 0.850000000000        101
       3.001 0.875000000000        103
       4.001 0.887500000000        104
       6.003 0.900000000000        106
       7.003 0.912500000000        107
       9.007 0.925000000000        109
      10.007 0.937500000000        110
      11.007 0.943750000000        111
      12.007 0.950000000000        112
      12.007 0.956250000000        112
      13.007 0.962500000000        113
      14.007 0.968750000000        114
      14.007 0.971875000000        114
      15.007 0.975000000000        115
      15.007 0.978125000000        115
      15.007 0.981250000000        115
      16.007 0.984375000000        116
      16.007 0.985937500000        116
      16.007 0.987500000000        116
      16.007 0.989062500000        116
      16.007 0.990625000000        116
      17.007 0.992187500000        117
      17.007 1.000000000000        117
#[Mean    =        2.163, StdDeviation   =        3.383]
#[Max     =       16.992, Total count    =          117]
#[Buckets =           22, SubBuckets     =         2048]

Michael Barker

unread,
Aug 6, 2013, 10:50:26 PM8/6/13
to mechanica...@googlegroups.com
I think that my attempts to correct for the three event handlers and
remove the System.nanoTime cost prior to recording the value mess with
the CO calculation in the HdrHistogram. If we remove both of those,
we get something more like:

ThrottledOnePublisherToThreeProcessorPipelineLatencyTest run 4
Disruptor org.HdrHistogram.Histogram@34470a37
Value, Percentile, TotalCountIncludingThisValue

0.2280 0.000000000000 1
0.4610 0.100000000000 309784
0.4700 0.200000000000 613122
0.4770 0.300000000000 924905
0.4850 0.400000000000 1220652
0.4990 0.500000000000 1519022
0.5100 0.550000000000 1662254
0.6170 0.600000000000 1808832
0.6460 0.650000000000 1969317
0.6530 0.700000000000 2114161
0.6590 0.750000000000 2265129
0.6620 0.775000000000 2343293
0.6650 0.800000000000 2421266
0.6680 0.825000000000 2494012
0.6720 0.850000000000 2581726
0.6750 0.875000000000 2639059
0.6780 0.887500000000 2693153
0.6800 0.900000000000 2722394
0.6830 0.912500000000 2757148
0.6870 0.925000000000 2791740
0.6920 0.937500000000 2826347
0.6960 0.943750000000 2849151
0.6990 0.950000000000 2863308
0.7040 0.956250000000 2882524
0.7110 0.962500000000 2901272
0.7240 0.968750000000 2919461
0.7430 0.971875000000 2928885
0.7640 0.975000000000 2938314
0.7930 0.978125000000 2947768
0.8270 0.981250000000 2957154
0.8430 0.984375000000 2967353
0.8480 0.985937500000 2971499
0.8540 0.987500000000 2976322
0.8610 0.989062500000 2981083
0.8700 0.990625000000 2985562
0.8830 0.992187500000 2990086
0.8980 0.992968750000 2992450
0.9380 0.993750000000 2994805
1.0490 0.994531250000 2997154
1.6340 0.995312500000 2999498
4.2810 0.996093750000 3001851
6.9900 0.996484375000 3003028
11.3620 0.996875000000 3004205
19.2050 0.997265625000 3005382
34.0190 0.997656250000 3006559
81.1550 0.998046875000 3007737
140.8150 0.998242187500 3008325
216.5430 0.998437500000 3008914
306.3510 0.998632812500 3009502
407.5510 0.998828125000 3010091
533.4070 0.999023437500 3010680
607.2950 0.999121093750 3010974
680.4150 0.999218750000 3011268
754.3030 0.999316406250 3011562
885.3430 0.999414062500 3011857
1032.3510 0.999511718750 3012151
1105.4070 0.999560546875 3012298
1179.3910 0.999609375000 3012445
1252.4150 0.999658203125 3012592
1394.4310 0.999707031250 3012740
1541.4390 0.999755859375 3012887
1614.4630 0.999780273438 3012960
1688.4470 0.999804687500 3013034
1761.4070 0.999829101563 3013107
1835.4550 0.999853515625 3013181
1909.4390 0.999877929688 3013255
1945.4070 0.999890136719 3013291
1982.4630 0.999902343750 3013328
2019.4550 0.999914550781 3013365
2056.4470 0.999926757813 3013402
2093.4390 0.999938964844 3013439
2111.4870 0.999945068359 3013457
2129.4070 0.999951171875 3013475
2148.4790 0.999957275391 3013494
2166.5270 0.999963378906 3013512
2185.4710 0.999969482422 3013531
2194.4310 0.999972534180 3013540
2203.5190 0.999975585938 3013549
2212.4790 0.999978637695 3013558
2221.4390 0.999981689453 3013567
2231.4230 0.999984741211 3013577
2235.5190 0.999986267090 3013581
2240.5110 0.999987792969 3013586
2244.4790 0.999989318848 3013590
2249.4710 0.999990844727 3013595
2254.4630 0.999992370605 3013600
2256.5110 0.999993133545 3013602
2258.4310 0.999993896484 3013604
2260.4790 0.999994659424 3013606
2263.4230 0.999995422363 3013609
2265.4710 0.999996185303 3013611
2266.4950 0.999996566772 3013612
2267.5190 0.999996948242 3013613
2268.4150 0.999997329712 3013614
2270.4630 0.999997711182 3013616
2271.4870 0.999998092651 3013617
2271.4870 0.999998283386 3013617
2272.5110 0.999998474121 3013618
2272.5110 0.999998664856 3013618
2273.4070 0.999998855591 3013619
2274.4310 0.999999046326 3013620
2274.4310 0.999999141693 3013620
2274.4310 0.999999237061 3013620
2274.4310 0.999999332428 3013620
2275.4550 0.999999427795 3013621
2275.4550 0.999999523163 3013621
2275.4550 0.999999570847 3013621
2275.4550 0.999999618530 3013621
2275.4550 0.999999666214 3013621
2276.4790 0.999999713898 3013622
2276.4790 1.000000000000 3013622
#[Mean = 1.9967, StdDeviation = 40.5951]
#[Max = 2276.3520, Total count = 3013622]
#[Buckets = 20, SubBuckets = 32768]

Michael Barker

unread,
Aug 6, 2013, 11:10:45 PM8/6/13
to mechanica...@googlegroups.com
I think it's capturing the CO correctly now. Here are the 2 histograms
for a single run, with and without compensation. The 99.99% latencies
are 247.1910µs and 8.4410µs for compensated and uncompensated,
respectively. Just need to figure out where that nasty ~millisecond
spike is coming from :-).

Mike.

#[Buckets = 20, SubBuckets = 32768]
ThrottledOnePublisherToThreeProcessorPipelineLatencyTest run 2
Disruptor org.HdrHistogram.Histogram@1fe5e871
Value, Percentile, TotalCountIncludingThisValue

0.2580 0.000000000000 1
0.5110 0.500000000000 15116389
0.6810 0.750000000000 22595493
0.6910 0.875000000000 26487585
0.7010 0.937500000000 28251471
0.7270 0.968750000000 29109860
0.8620 0.984375000000 29575154
0.8770 0.992187500000 29810562
0.8950 0.996093750000 29928821
1.2140 0.998046875000 29985583
3.9300 0.999023437500 30014903
7.9950 0.999511718750 30029573
17.1800 0.999755859375 30036908
45.2370 0.999877929688 30040576
247.1910 0.999938964844 30042410
670.2070 0.999969482422 30043327
1005.2150 0.999984741211 30043785
1234.2390 0.999992370605 30044014
1349.2470 0.999996185303 30044129
1406.2070 0.999998092651 30044186
1435.1990 0.999999046326 30044215
1449.2150 0.999999523163 30044229
1456.1910 0.999999761581 30044236
1460.2230 0.999999880791 30044240
1462.2070 0.999999940395 30044242
1463.2310 0.999999970198 30044243
1463.2310 1.000000000000 30044243
#[Mean = 0.6441, StdDeviation = 6.4251]
#[Max = 1463.1680, Total count = 30044243]
#[Buckets = 20, SubBuckets = 32768]
Value, Percentile, TotalCountIncludingThisValue

0.2580 0.000000000000 1
0.5110 0.500000000000 15116389
0.6810 0.750000000000 22595493
0.6910 0.875000000000 26487585
0.7000 0.937500000000 28148798
0.7200 0.968750000000 29065192
0.8580 0.984375000000 29538186
0.8750 0.992187500000 29771691
0.8830 0.996093750000 29885644
0.9050 0.998046875000 29942005
1.0470 0.999023437500 29970760
1.3380 0.999511718750 29985361
2.4570 0.999755859375 29992680
5.2630 0.999877929688 29996338
8.4410 0.999938964844 29998172
9.4300 0.999969482422 29999085
13.7910 0.999984741211 29999543
23.4930 0.999992370605 29999772
28.6220 0.999996185303 29999886
33.7170 0.999998092651 29999943
41.5310 0.999999046326 29999972
67.7710 0.999999523163 29999986
184.7990 0.999999761581 29999993
280.1750 0.999999880791 29999997
792.2870 0.999999940395 29999999
1463.2310 0.999999970198 30000000
1463.2310 1.000000000000 30000000
#[Mean = 0.5845, StdDeviation = 0.3638]
#[Max = 1463.1680, Total count = 30000000]
#[Buckets = 20, SubBuckets = 32768]

Gil Tene

unread,
Aug 6, 2013, 11:36:28 PM8/6/13
to <mechanical-sympathy@googlegroups.com>
Yup. A 3.75x difference at the 99.9%, and a 29x difference at the 99.99%. That's coordinated omission in action.

And this is for a run where the worst result was only 1.46msec, which is a pretty well tuned system.


Peter Lawrey

unread,
Aug 7, 2013, 2:14:25 AM8/7/13
to mechanica...@googlegroups.com
How do you work out your percentiles?  I use the count and divide the counts equally, with the highest being at the top.  This means the spacing between percentiles for a count of 1 will all be the same.


Gil Tene

unread,
Aug 7, 2013, 2:40:15 AM8/7/13
to
HdrHistogram provides a built in percentile iterator (along with linear iterator and logarithmic iterators, and a recorded value iterator). The percentile iterator lets you iterate the histogram's recorded value counts by percentiles in diminishing power of 2 steps until it reaches 100% (where the last finest granularity count step gets you to 100%). You can see the source code that has the math for stepping through percentiles here (see the incrementIterationLevel() method).
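For what it's worth, consuming that iterator looks roughly like the snippet below (method names are from current HdrHistogram versions and may differ in the build being discussed here):

    // Iterate percentile steps at 5 ticks per half-distance, printing
    // value / percentile / cumulative count, much like the listings above.
    for (org.HdrHistogram.HistogramIterationValue v : histogram.percentiles(5)) {
        System.out.printf("%.4f %.12f %d%n",
                v.getValueIteratedTo() / 1000.0,          // scale ns to µs
                v.getPercentileLevelIteratedTo() / 100.0, // percentile as a fraction
                v.getTotalCountToThisValue());            // cumulative count
    }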


Martin Thompson

unread,
Aug 7, 2013, 3:09:33 AM8/7/13
to mechanica...@googlegroups.com
Mike,

This method can work but does give additional latency in the test.  The reading of the stepThreeSequence will always be at least a dirty-hit cache snoop (~60 cycles) and possibly a full cache miss.  Plus without the PAUSE instruction it can mess with speculative execution.

I think a better approach would be to make a ping pong test so you are measuring time at the same location.  You can then just divide the result by 2.  This way you do not have an additional busy spin plus cache miss to deal with.
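For illustration only (not the actual Disruptor or ABQ test), a minimal single-measuring-thread ping-pong over a pair of SynchronousQueues would look something like this, with the full round trip recorded as-is:

    // Minimal ping-pong sketch: both timestamps are taken on the pinging thread,
    // and the recorded value is the full round trip.
    // (imports: java.util.concurrent.SynchronousQueue, org.HdrHistogram.Histogram)
    final SynchronousQueue<Long> ping = new SynchronousQueue<Long>();
    final SynchronousQueue<Long> pong = new SynchronousQueue<Long>();
    final Histogram histogram = new Histogram(3600L * 1000 * 1000 * 1000, 3);
    final long PAUSE_NANOS = 1000;

    final Thread echo = new Thread(new Runnable() {
        public void run() {
            try {
                while (true) {
                    pong.put(ping.take());       // bounce the message straight back
                }
            } catch (InterruptedException e) {
                // exit on interrupt
            }
        }
    });
    echo.setDaemon(true);
    echo.start();

    for (int i = 0; i < 1000000; i++) {
        final long t0 = System.nanoTime();
        ping.put(t0);
        pong.take();
        final long roundTrip = System.nanoTime() - t0;
        histogram.recordValue(roundTrip, PAUSE_NANOS);   // CO-corrected round-trip time
        final long pauseStart = System.nanoTime();
        while (PAUSE_NANOS > (System.nanoTime() - pauseStart)) {
            // busy spin until the next intended send
        }
    }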

Martin...

Peter Lawrey

unread,
Aug 7, 2013, 3:20:34 AM8/7/13
to mechanica...@googlegroups.com
For ping pong tests I assume that when there is an above average delay, the worst case is that the above average delay was entirely due to one hop, i.e. one direction was normal and the other was delayed.  The difference is that if the average is say 50 each way and it sees a delay of 1000, you can either say there was one hop of 50 and the other 950, or you might assume that there were two delays of 500.  The distribution of latencies suggests that two very high delays in a row are far less likely than one average delay and one long delay.  I don't have a simple solution for this, other than to be aware that if you are looking at half the round trip, the distribution is a little different to measuring a single hop.


Michael Barker

unread,
Aug 7, 2013, 3:26:16 AM8/7/13
to mechanica...@googlegroups.com
I think I'll probably take the ping pong option, but quote that the
latency is for a round trip rather than halving it. Halve it in the
wrong place and it messes up the CO tracking for the HdrHistogram. As
long as I measure the ABQ the same way then I think the test is still
valid.

Also, I'm no longer compensating for the cost to measure nanoTime. Not
sure what to do about that. I've noticed that it can also be a bit
inconsistent. It definitely appears that once you start going
below 1µs the relative error starts to get pretty large. I've still to
do the whole CPU pinning bit to see how consistent I can make the
results.

Mike.

Martin Thompson

unread,
Aug 7, 2013, 3:30:37 AM8/7/13
to mechanica...@googlegroups.com
I'd agree that any variation in delay is likely to be from one leg when not running close to saturation rates for throughput.  If the throughput rates are approaching saturation then timings could be more evenly spread.

If thread affinity and CPU isolation are not applied then anything is possible.  For example, any combination of threads could be context switched to another core mid-run.

Gil Tene

unread,
Aug 7, 2013, 9:56:13 AM8/7/13
to <mechanical-sympathy@googlegroups.com>, mechanica...@googlegroups.com
Capping the recorded value (from below) at zero may let you do the nanoTime correction, as the issue may be about trying to log negative values into the histogram. It should be throwing an ArrayIndexOutOfBoundsException if the value to be recorded is negative, which I think the nanoTime correction below is somewhat likely to make happen occasionally.
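Against Mike's loop above, that would be something like this (a sketch):

    // Clamp at zero before recording, so an over-estimated nanoTimeCost cannot
    // turn a small measurement into a negative (rejected) value:
    final long value = Math.max(0, ((t1 - t0) / 3) - nanoTimeCost);
    histogram.recordValue(value, PAUSE_NANOS);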

Sent from my iPad

Attila-Mihaly Balazs

unread,
Aug 7, 2013, 3:43:51 PM8/7/13
to mechanica...@googlegroups.com
Hello all,

I had a chance to rerun the test with Gil's modifications and also some changes thought up by me: https://gist.github.com/cdman/6177268

The gist of it (sorry for the pun) is that I don't seem to get differences with this test.

And sorry to harp on this, but I still don't get why CO is something to worry about, so I would appreciate a "for dummies" explanation:

- I get that no percentile is a guarantee for the maximum value (as is demonstrated by Gil's customer service metaphor) and you should always agree on both with the business sponsors when creating a system

- On the other hand, I find it to be just a flaw in a theoretical model where we insist that we want to "poke" the system every X ms (or maybe µs) and failing to do so deviates from our test plan. However, what would such a test model? Are we trying to write a hard-realtime system? :-) (or maybe a trading system as Peter has described, where we have X ms to process each network packet). In this case we should be worried about maximums. Other than that, systems are very non-deterministic and we're not showing anything by trying to create a very rigid benchmark runner. The only use case I can think of is if for some reason (marketing?) we are trying to show that the system can handle "a sustained load of X events / second" - btw, if we make such a claim, what is the difference between shoving down X events at once and waiting until the next second vs trickling them? The math works out the same.

- Finally, an observation: I'm no math person (just ask any of my university math teachers :-)), but AFAIK StdDev / Mean are only useful if we suppose that the underlying distribution is a normal one. However latency distributions (again, AFAIK) usually are not normal and as such mean / stddev are not useful. As such I am surprised that HdrHistogram prints them out.

Attila



Jason Koch

unread,
Aug 7, 2013, 5:37:27 PM8/7/13
to mechanica...@googlegroups.com
Hi Attila

I'll attempt to explain. Gil has a presentation on it and I'm sure there's a few analogies in these email threads. CO is something to do with your testing methodology, and I'll give an "obvious" example that shows why your tests might not show the real world problems if you're suffering CO.

The challenge is that, in the real world, your consumers are not coordinated with your system. The consumers are independent. They will send a request whether you have responded to someone else's request or not. So ..

Suppose you are using N consumer threads in a tester, and each is in a tight loop for 1 million "events". E.g.:

for (int i = 0; i < 1000000; i++) {
    long start = nanotime();
    log.info("xxxxxxx");
    long duration = nanotime() - start;
    addToHisto(duration);
}

Here's where CO comes in:
Suppose that during the middle of the run, there was something that causes the log.info() time to take longer. It could be a hashmap rebuild, a file system buffer full flush, a GC run; it doesn't really matter what it is, just that for some reason processing is stalled. In your test harness, the "consumer" threads wait for the system to recover before sending more requests (they are coordinated). So, only "N" requests would be showing the full pause time (e.g. 1 second), all other requests would show un-paused time (e.g. 1 ms).

However in a _real world_ situation, it's unlikely your consumers are going to recognise that your system is under load and stop sending requests for that full second. They're going to keep firing them, and probably a lot of times, during that 1 second. So in the real world, all of these consumers will ALSO see that pause. The issue of CO is how many requests are _truly_ impacted by that 1 second pause versus the N that you've shown in your test report (due to omitted results).

Hope this helps (?).

Thanks
Jason





Gil Tene

unread,
Aug 8, 2013, 12:01:11 AM8/8/13
to <mechanical-sympathy@googlegroups.com>

On Aug 7, 2013, at 12:43 PM, Attila-Mihaly Balazs <dify...@gmail.com> wrote:

> ...

I think Jason did a good job answering the "why CO matters" question. Simply put, dropping requests from a plan is fine, as long as the drops are not coordinated with the test system. Percentiles are meaningless if you allow coordinated drops to affect them.

>
> - Finally, an observation: I'm no math person (just ask any of my university math teachers :-)), but AFAIK StdDev / Mean are only useful if we suppose that the underlying distribution is a normal one. However latency distributions (again, AFAIK) usually are not normal and as such mean / stddev are not useful. As such I am surprised that HdrHistogram prints them out.


You are absolutely right, in that standard deviation is virtually always a useless metric for latency distribution, as in virtually all software-based systems I've ever seen, latency is strongly multi-modal, exhibiting the absolute opposite of a normal distribution. Standard deviation is meaningless at best, and highly misleading at worst.

A perfect example of this can be found in Mike's posted result comparing CO-corrected and uncorrected Disruptor latency behaviors. Ignoring the CO and percentile data, and looking purely at the std. deviation and mean numbers, you can see how standard deviation is useless at giving the reader a "feel" for how the latency is distributed. This is true for both the corrected and the uncorrected output, as even though the std. deviation in the CO-corrected data set is 17x (!!) larger than the one in the uncorrected set, it would still give the casual reader absolutely no hint of what the Max is (it is 227 std. deviations away from the mean) or what the 99.99%'ile is (it is 38 std. deviations away from the mean). For context, if a distribution is anywhere near normal, you would expect 99.999999999% (no exaggeration here, that's 11 9's) to fall within 7 standard deviations of the mean.

As to your very valid question about why HdrHistogram prints out mean and standard deviation. The explanation is simple: it is mostly so I can highlight exactly how ridiculous they are as metrics for describing latency behavior...

Well, that and the hope that being a generic histogram for generic value sets (not just for latency measurements on software based systems), people may actually use HdrHistogram for data sets that actually do have a normal distribution...

-- Gil.

Gil Tene

unread,
Aug 8, 2013, 12:25:15 AM8/8/13
to <mechanical-sympathy@googlegroups.com>
On Aug 7, 2013, at 12:43 PM, Attila-Mihaly Balazs <dify...@gmail.com> wrote:

> I had a chance to rerun the test with Gil's modifications and also some changes though up by me: https://gist.github.com/cdman/6177268

[note: to be very clear, there is absolutely no disrespect intended by the below]

If I read and understand your version correctly, you are trying to amortize the cost of nanoTime by only hitting each logging event with 1 nanosecond of that cost.

The problem with that is that, assuming e.g. an avg. logging latency of 4,135nsec (per the original page at http://logging.apache.org/log4j/2.x/manual/async.html#Performance), you'll be measuring latencies across bulks of 4,000+ logging calls per latency measurement, and then logging 4,000+ identical latencies into the histogram (each of which is the average of the latencies). The effect of this on the outcome should be obvious. Given that the test only measures a total of 5,000,000 logging samples, you will only be logging a total of ~1,250 actual latency measurements. You can only tout the resulting histogram as a "histogram depicting the distribution of the total time it took to do bulk logging of 4,000 log entries", and cannot expect to deduce anything about the behavior of individual logging call latency from it.

While these extreme numbers make it obvious why averaging latencies before logging them is a bad idea (it's not that different from simply dividing the total time for the 5,000,000 results by the total count), I'd like to highlight that this is not a question of magnitude, but a binary yes/no thing. Any amount of averaging or amortization is wrong when trying to characterize the latency behavior of individual calls, as a histogram of averages is a pretty meaningless thing.

>
> The gist of it (sorry for the pun) is that I don't seem to get differences with this test.

Can you post some results data with and without the correction? Assuming your un-corrected numbers are similar to the ones posted in the original page (i.e. they include several 500+ msec individual call events), I have a hard time seeing how the correction loop won't shift the results in a pretty dramatic way from the 99.99%'ile being ~16usec.

-- Gil.

Michael Barker

unread,
Aug 8, 2013, 12:51:24 AM8/8/13
to mechanica...@googlegroups.com
> I think a better approach would be to make a ping pong test so you are
> measuring time at the same location. You can then just divide the result by
> 2. This way you do not have an additional busy spin plus cache miss to deal
> with.

Done:

https://github.com/LMAX-Exchange/disruptor/blob/master/src/perftest/java/com/lmax/disruptor/PingPongLatencyTest.java

Gil Tene

unread,
Aug 8, 2013, 1:11:45 AM8/8/13
to mechanica...@googlegroups.com
Correction: The bulks won't be 4,000+, just (2 * nanoTimeCost) (which is ~70-100 on most systems, I think). Serves me right for doing math in a car (and no, the car wasn't moving...).

But the same issue holds. Averaging will make it meaningless. E.g. that ~500+msec outlier will disappear, and will be replaced by 70-100 outliers of 5-7msec each. Other distortions will occur, e.g. its contribution towards the average (in a properly CO-corrected system) will be cut by a factor of 70-100 (the contribution towards the average is ~N^2/2 for an outlier of length N, so dividing N by 70-100 reduces the contribution by the same ratio).
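(Spelling that arithmetic out: a single stall of N expected-interval units, when properly CO-corrected, adds samples of roughly N, N-1, ..., 1, i.e. about N(N+1)/2 ≈ N^2/2 towards the sum of recorded values. Bulk the measurements by a factor of k ≈ 70-100 and that one stall instead shows up as k stalls of ~N/k units each, contributing about k × (N/k)^2/2 = N^2/(2k), i.e. the contribution shrinks by roughly the bulking factor.)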

Peter Lawrey

unread,
Aug 8, 2013, 2:01:29 AM8/8/13
to mechanica...@googlegroups.com
I don't know if this falls under the banner of CO, but something to consider is that you can see much higher latencies if a process slows down.  E.g. say you have a process which gets interrupted 100 times per second, and when it does, the next task takes much longer, let's say 100µs longer (one thread time slice).  If you measure one million tasks, this may show up as a jump in the 99.99%tile, i.e. it impacts one in 10,000 tasks.  However, the reality may be that you have, say, an FX feed and you are getting less than 100 ticks per second.  It takes a long time to measure one million samples if they only arrive 20 times per second :(  The problem with not testing this, however, is that every one of those tasks is now being impacted, i.e. the jump has moved from the 99.99%tile to the 0.01%tile.

There are a number of locking strategies which busy wait for a short period, but fall back to sleeping the process if it cannot obtain something in that time.  

I have found that a thread which is 100% busy is less likely to be interrupted by the OS scheduler than a thread which is 50% busy.

I am sure there are other examples of where the latency can increase at lower throughputs.

In short, there can be a number of reasons that higher loads can appear to have better latencies.  This is fine if those loads are the expected behaviour of the system, however not everyone might really be doing one million tasks per second, or whatever you chose to test, and this can lead to higher latencies which might be surprising to some.



--

Remko Popma

unread,
Aug 8, 2013, 5:02:36 AM8/8/13
to mechanica...@googlegroups.com
Atilla, Gil, Martin, All,

Thanks for dissecting the Log4j2 async logger latency tests!
 
The published results are actually slightly out of date, as they were done on beta5, and we've since made a few performance improvements. One improvement specifically has to do with nulling ringbuffer references when the consumer is done so that objects can be GC'd sooner (oops, a bit embarrassing, that one... :-). I suspect that this may have some impact on the latency "bump" between 33 and 536 ms. I really need to run the tests again and update the results but I've been busy with other Log4j bugfixes and $work...

Fortunately I hope to have a new toy to play with at work soon, as we'll be evaluating something called "Zing" or so... :-) I will try to re-run the log4j performance tests on Zing as well, see if Gil's hunch is right and the 33-536ms latency bump will go away with Zing (I'm just guessing that that is what you meant, Gil :-).

A few questions:
  • I copied a lot of performance test ideas from the Disruptor test code, including the idea to measure the cost of System.nanoTime() and subtract that cost from the measurement. I did not see that idea discussed, so I assume that is still valid, correct?
  • I actually considered only using the histogram diagram to show Log4j2 latency and not including the "average latency" and "Latency of 99.99% of Observations" diagrams. Unfortunately the histogram gets too busy when comparing more than two results, where the "average" and 99.99% diagrams allowed me to compare the disruptor-based async loggers to a number of other loggers (logback, log4j-1.x and log4j2's queue-based async appender). Thoughts?

Attila, thanks for re-running the test. Apologies that I have not committed the code that prints out the detailed numbers you need to create the latency histogram, at the moment it only prints the percentiles. I will try to rectify this.

Gil, I will try to run the 64-thread latency test twice, once with and once without the CO correction you suggested. I will post the results here. If possible I'll also do a comparison between the Oracle JVM and Zing.

BTW, I am currently on holidays and travelling, so I won't be able to start working on redoing tests until a few weeks from now.

Any other feedback or ideas to improve the test or the presentation of the results would be welcome!

Remko Popma


On Sunday, August 4, 2013 12:20:55 AM UTC+9, Gil Tene wrote:

- As long as the Logger.log() call latency is lower than ~1msec (which is the vast majority of the time), things are "fine". However, when Logger.log() call latencies much larger than ~1msec occur (e.g. the graph shows 10 recorded calls that took 536msec each), each of those calls also "freezes" the testing thread for a duration much longer than the normal interval between two observations.

- Omitted observations: Effectively, each such "freeze" omits observations that would have normally happened had it not been for the back-pressure created by the call latency. I.e. Had an observer continued to issue logging calls at the normal observation interval during this freeze, additional results would have been included in the overall stats computations. E.g. for each of those ~536msec freezes, there are ~535 missing observations.

- Coordinated Omission: Omission on its own is not a big problem, as long as the omission is random. E.g. if we randomly threw away 3 million of those 5 million latency measurements each logging thread was doing, the statistical results would probably not be materially affected. However, when omission is coordinated with observed events, it can dramatically skew the statistical analysis of the remaining results. In the case of this sort of coordinated omission, we are only omitting the "very bad results" (results that are larger than 1msec, i.e. larger than 200x the average latency). This has a huge impact on the eventual percentile calculation.

- Based on the 10 observations of 536msec each alone (and ignoring the 50+ other huge results next to them), we can deduce that an actual observer would have seen 5,350 additional results [10 x (536 - 1)] ranging linearly between 1 msec and 535 msec (with a median of ~268msec).

- If we add those 5,350 results to the 5,000,000 results that the thread recorded, they would represent >0.1% of the overall data, and half way into that additional data (at 0.05% of overall data) we would see the median result of the missing set, which is 268msec.
 
- Therefore, based on the 10 outliers in the graph alone, your 99.95%'ile is at least 268msec.

- Similarly, the 99.99%'ile is at least 582 msec (the 90%'ile of the missing data that represents 0.01% of the total observations).

That's at least 35,000x higher than reported...
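
For concreteness, here is a rough back-of-the-envelope sketch of that bound (a restatement of the reasoning above, assuming the ~1msec expected interval and the 10 x 536msec freezes):

public class MissingObservationsSketch {
    public static void main(String[] args) {
        long recorded = 5_000_000L;                 // results each logging thread actually recorded
        long freezes = 10;                          // the 536msec outliers visible in the graph
        long missingPerFreeze = 536 - 1;            // one missing ~1msec-interval observation per msec of freeze
        long missing = freezes * missingPerFreeze;  // 5,350 re-created observations
        long total = recorded + missing;            // 5,005,350
        double missingFraction = (double) missing / total;  // ~0.107%, i.e. more than 0.1% of all data
        double medianOfMissingMsec = (1 + 535) / 2.0;        // ~268 msec
        // Because the missing (large) results alone make up more than 0.1% of the data, the top
        // 0.05% of all observations falls entirely within them, so the 99.95%'ile must be at
        // least their median, i.e. at least ~268 msec.
        System.out.printf("missing=%d (%.3f%% of %d), 99.95%%'ile >= ~%.0f msec%n",
                missing, missingFraction * 100, total, medianOfMissingMsec);
    }
}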

There is good news here, too. HdrHistogram has a recording mode that automatically corrects for coordinated omission if you know what your expected interval between measurements is (and here it is ~1msec). If you add an HdrHistogram to such a test and record your call latencies using the recordValues(value, expectedIntervalBetweenValueSamples) form, the Histogram will automatically re-create the missing measurements. It is often useful to compare the output of two HdrHistograms, one of which uses the correcting form and one of which just records the raw values, as the differences in output at the higher percentiles are often "impressive".
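
For the curious, the correcting recording form boils down to something like the following sketch (a paraphrase of the idea, not the actual HdrHistogram source):

import org.HdrHistogram.Histogram;

class CoCorrectingRecorder {
    // Record the observed value, then back-fill the observations the stalled thread would have
    // made had it kept measuring once per expected interval during the long call.
    static void recordWithCorrection(Histogram histogram, long value, long expectedInterval) {
        histogram.recordValue(value);
        if (expectedInterval <= 0) {
            return;
        }
        for (long missing = value - expectedInterval;
             missing >= expectedInterval;
             missing -= expectedInterval) {
            histogram.recordValue(missing);  // re-created observations: value - i, value - 2i, ...
        }
    }
}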

Foot note: anyone want to guess what that 582 msec 99.99%'ile is dominated by? 

 

Martin Thompson

unread,
Aug 8, 2013, 5:15:08 AM8/8/13
to mechanica...@googlegroups.com
It is useful for telling you the average of possible best-case scenarios. In a real test, expect the cost of System.nanoTime() to be greater than the reported average of the tight-loop test.

Peter Lawrey

unread,
Aug 8, 2013, 11:02:47 AM8/8/13
to mechanica...@googlegroups.com
I agree the minimum is safer than the average. If you take the average, I am sure you could see *negative latencies* for short enough tests.
I haven't corrected for the delay in nanoTime because I know that real applications will struggle to do anything useful in this order of time frame, and will likely use nanoTime() or currentTimeMillis() at least once, or do something just as expensive.


On 8 August 2013 16:57, Gil Tene <g...@azulsystems.com> wrote:

On Thursday, August 8, 2013 2:02:36 AM UTC-7, Remko Popma wrote:
...

A few questions:
  • I copied a lot of performance test ideas from the Disruptor test code, including the idea to measure the cost of System.nanoTime() and subtract that cost from the measurement. I did not see that idea discussed, so I assume that is still valid, correct?
I think such corrections are valid, but that one should be careful with establishing the estimated cost of nanoTime that is actually safe to subtract. Specifically, I believe that it is valid and safe to subtract the minimum measured cost of nanoTime from all measurements, but not the average cost. So in establishing the cost, I'd run a long enough loop of two back-to-back calls to nanoTime, and look for the minimum experienced gap between them. And yes, the minimum could end up being 0, which would mean that's the maximum safe amount to shave off your latency measurement, because it would mean that you don't actually know that the measurement cost was not zero at your sample point.
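
A minimal sketch of that calibration loop (the iteration count is arbitrary):

public class NanoTimeCostEstimator {
    public static void main(String[] args) {
        long minGap = Long.MAX_VALUE;
        // Run long enough that the true minimum gap is very likely to be observed at least once:
        for (int i = 0; i < 10_000_000; i++) {
            long t1 = System.nanoTime();
            long t2 = System.nanoTime();
            long gap = t2 - t1;
            if (gap < minGap) {
                minGap = gap;
            }
        }
        // minGap (which may well be 0) is the most that is safe to subtract from individual
        // latency measurements; subtracting the *average* cost is not safe.
        System.out.println("Minimum observed back-to-back nanoTime() gap: " + minGap + " ns");
    }
}
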
  • I actually considered only using the histogram diagram to show Log4j2 latency and not including the "average latency" and "Latency of 99.99% of Observations" diagrams. Unfortunately the histogram gets too busy when comparing more than two results, where the "average" and 99.99% diagrams allowed me to compare the disruptor-based async loggers to a number of other loggers (logback, log4j-1.x and log4j2's queue-based async appender). Thoughts?
Take a look at the percentile distribution diagram style I use in jHiccup. Rather than choosing one or two percentiles to report, I like to plot the entire percentile spectrum. The Excel (or other plotting setup) work to make this chart happen can be a bit tricky (I can detail it and provide an Excel sheet example for anyone interested), but it's worth it presentation-wise. While showing the distribution in counts-per-bucket is useful (keep doing that), I find that the most informative "relevant to operational and business requirements" chart is a percentile distribution chart that looks like the below, as it allows you to directly compare the reported behavior against requirements (an SLA, for example, would be represented as a simple step-function thing on such a graph):



The output from HdrHistogram's histogram.getHistogramData().outputPercentileDistribution() was specifically designed so that a few tens of percentile report lines still provide enough dynamic range and percentile resolution that it can be taken straight into a plot like the above, and keep the plot nice, smooth and accurate. This is exactly where reports from linear or logarithmic histogram buckets fail, as quantization kills the graphs, and that (I think) is why most people report histograms in counts-per-bucket format instead of like the above...

Note: This specific plot may be recognized by some on this list, as it is someone else's actual test, data and plot (go ahead and identify yourself if you want ;-). It was not prepared by me, except that I anonymized all the labels.
 

Attila, thanks for re-running the test. Apologies that I have not committed the code that prints out the detailed numbers you need to create the latency histogram, at the moment it only prints the percentiles. I will try to rectify this.

Gil, I will try to run the 64-thread latency test twice, once with and once without the CO correction you suggested. I will post the results here. If possible I'll also do a comparison between the Oracle JVM and Zing.

The cleanest way to do a CO-correction vs. non-corrected comparison (like Mike B. did) is to collect and report two separate histograms (one with CO-correction and one without) in the same run, on the same data. I.e. measure only once but record twice. This will take any cross-test noise (of which there is usually plenty) out of the picture, and makes the CO (and CO correction) effect very clear and easy to do math on. Text output from histogram.getHistogramData().outputPercentileDistribution(System.out, 5 /* steps per half */, 1000.0 /* unit value scaling */); is enough to see the difference, and a plot like the above would make it very clear visually.
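
A minimal sketch of that "measure once, record twice" setup (assuming a current HdrHistogram release, whose method names differ slightly from the 2013-era getHistogramData() form quoted above; the operation being measured is a placeholder):

import org.HdrHistogram.Histogram;

public class DualRecordingExample {
    public static void main(String[] args) {
        long expectedIntervalNanos = 1_000_000;  // ~1 msec expected interval between calls (assumed)
        Histogram raw = new Histogram(3_600_000_000_000L, 3);
        Histogram corrected = new Histogram(3_600_000_000_000L, 3);

        for (int i = 0; i < 1_000_000; i++) {
            long start = System.nanoTime();
            operationUnderTest();                        // placeholder for the measured call
            long latency = System.nanoTime() - start;
            raw.recordValue(latency);                                                   // no CO correction
            corrected.recordValueWithExpectedInterval(latency, expectedIntervalNanos);  // CO-corrected
        }
        raw.outputPercentileDistribution(System.out, 1000.0);        // report values scaled to usec
        corrected.outputPercentileDistribution(System.out, 1000.0);
    }

    private static void operationUnderTest() {
        // placeholder for the call whose latency is being measured
    }
}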

I'd obviously love to see Zing numbers too, of course. I think those will be much more relevant after you settle on whether or not the CO correction is something that is needed.

Gil Tene

unread,
Aug 8, 2013, 11:12:55 AM8/8/13
to
On Thursday, August 8, 2013 2:02:36 AM UTC-7, Remko Popma wrote:
...

A few questions:
  • I copied a lot of performance test ideas from the Disruptor test code, including the idea to measure the cost of System.nanoTime() and subtract that cost from the measurement. I did not see that idea discussed, so I assume that is still valid, correct?
I think such corrections are valid, but that one should be careful with establishing the estimated cost of nanoTime that is actually safe to subtract. Specifically, I believe that it is valid and safe to subtract the minimum measured cost of nanoTime from all measurements, but not the average cost. So in establishing the cost, I'd run a long enough loop of two back-to-back calls to nanoTime, and look for the minimum experienced gap between them. And yes, the minimum could end up being 0, which would mean that's the maximum safe amount to shave off your latency measurement, because it would mean that you don't actually know that the measurement cost was not zero at your sample point.
  • I actually considered only using the histogram diagram to show Log4j2 latency and not including the "average latency" and "Latency of 99.99% of Observations" diagrams. Unfortunately the histogram gets too busy when comparing more than two results, where the "average" and 99.99% diagrams allowed me to compare the disruptor-based async loggers to a number of other loggers (logback, log4j-1.x and log4j2's queue-based async appender). Thoughts?
Take a look at the percentile distribution diagram style I use in jHiccup. Rather than choosing one or two percentiles to report, I like to plot the entire percentile spectrum. The Excel (or other plotting setup) work to make this chart happen can be a bit tricky (I can detail it and provide an Excel sheet example for anyone interested), but it's worth it presentation-wise. While showing the distribution in counts-per-bucket is useful (keep doing that), I find that the most informative "relevant to operational and business requirements" chart is a percentile distribution chart that looks like the below, as it allows you to directly compare the reported behavior against requirements (an SLA, for example, would be represented as a simple step-function thing on such a graph):

The output from HdrHistogram's histogram.getHistogramData().outputPercentileDistribution() was specifically designed so that a few tens of percentile report lines still provide enough dynamic range and percentile resolution that it can be taken straight into a plot like the above, and keep the plot nice, smooth and accurate. This is exactly where reports from linear or logarithmic histogram buckets fail, as quantization kills the graphs, and that (I think) is why most people report histograms in counts-per-bucket format instead of like the above...

Note: This specific plot may be recognized by some on this list, as it is someone else's actual test, data and plot (go ahead and identify yourself if you want ;-). It was not prepared by me, except that I anonymized all the labels.
 
Attila, thanks for re-running the test. Apologies that I have not committed the code that prints out the detailed numbers you need to create the latency histogram, at the moment it only prints the percentiles. I will try to rectify this.

Gil, I will try to run the 64-thread latency test twice, once with and once without the CO correction you suggested. I will post the results here. If possible I'll also do a comparison between the Oracle JVM and Zing.

Attila-Mihaly Balazs

unread,
Aug 8, 2013, 12:54:13 PM8/8/13
to mechanica...@googlegroups.com
I think I got it!

Here it goes: 

- percentiles are not guarantees for maximum values. Measure maximum values if that's what you need.

- if you measure percentiles, measuring it this way (send a request / wait / send another request) can give you an overly optimistic estimate of the percentile, because it is a more "kind" distribution than the random one you will have in production. Or to put it otherwise: if you use such benchmarks for SLAs, your clients will quickly prove you wrong.

- possible solutions: use CO correction (already included in HdrHistogram). Or "Peter's method": give yourself a 10x headroom (build a 10x better system than you think you need).

Did I achieve clarity? :-)

And thank you for explaining it over and over again :-)

Attila

Gil Tene

unread,
Aug 8, 2013, 1:18:54 PM8/8/13
to <mechanical-sympathy@googlegroups.com>
Yup. good summary.

Except that 10x is nowhere near enough headroom to combat the hyper-optimism at the high percentile levels. Headroom is always good to have. E.g. padding by 5x-10x against wrongly predicting what reality will demand of you is a good idea, especially since people often run into "3.7x higher than ever seen or projected" things in markets.

Since the "optimistic" (more like "ignore most of the bad results") reports are quite often multiple orders of magnitude off in the latencies they show for higher percentiles (99.9%, 99.99%), they will consume a very different level of headroom. Measuring correctly (or correcting for measurement issues) is much cheaper than padding by 1,000x.

-- Gil.
 

Peter Lawrey

unread,
Aug 8, 2013, 1:30:05 PM8/8/13
to mechanica...@googlegroups.com
Even when measuring a micro-benchmark correctly, I would add 10x to allow for how it might behave in a real-world, complex system over a longer period of time, where it doesn't have all the system resources to itself.



Remko Popma

unread,
Aug 9, 2013, 2:36:04 AM8/9/13
to mechanica...@googlegroups.com
On Friday, August 9, 2013 12:08:09 AM UTC+9, Gil Tene wrote:
On Thursday, August 8, 2013 2:02:36 AM UTC-7, Remko Popma wrote:
...

A few questions:
  • I copied a lot of performance test ideas from the Disruptor test code, including the idea to measure the cost of System.nanoTime() and subtract that cost from the measurement. I did not see that idea discussed, so I assume that is still valid, correct?
I think such corrections are valid, but that one should be careful with establishing the estimated cost of nanoTime that is actually safe to subtract. Specifically, I believe that it is valid and safe to subtract the minimum measured cost of nanoTime from all measurements, but not the average cost. So in establishing the cost, I'd run a long enough loop of two back-to-back calls to nanoTime, and look for the minimum experienced gap between them. And yes, the minimum could end up being 0, which would mean that's the maximum safe amount to shave off your latency measurement, because it would mean that you don't actually know that the measurement cost was not zero at your sample point.

Ok, interesting. 
  • I actually considered only using the histogram diagram to show Log4j2 latency and not including the "average latency" and "Latency of 99.99% of Observations" diagrams. Unfortunately the histogram gets too busy when comparing more than two results, where the "average" and 99.99% diagrams allowed me to compare the disruptor-based async loggers to a number of other loggers (logback, log4j-1.x and log4j2's queue-based async appender). Thoughts?
Take a look at the percentile distribution diagram style I use in jHiccup. Rather than choosing one or two percentiles to report, I like to plot the entire percentile spectrum. The Excel (or other plotting setup) work to make this chart happen can be a bit tricky (I can detail it and provide an Excel sheet example for anyone interested), but it's worth it presentation-wise. While showing the distribution in counts-per-bucket is useful (keep doing that), I find that the most informative "relevant to operational and business requirements" chart is a percentile distribution chart that looks like the below, as it allows you to directly compare the reported behavior against requirements (an SLA, for example, would be represented as a simple step-function thing on such a graph):

This is an interesting idea, thanks! I look forward to playing with this when I get back. I would definitely be interested in an Excel sheet example, I know how fiddly this can get...
 

The output from HdrHistogram's histogram.getHistogramData().outputPercentileDistribution() was specifically designed so that a few tens of percentile report lines still provide enough dynamic range and percentile resolution that it can be taken straight into a plot like the above, and keep the plot nice, smooth and accurate. This is exactly where reports from linear or logarithmic histogram buckets fail, as quantization kills the graphs, and that (I think) is why most people report histograms in counts-per-bucket format instead of like the above...

Note: This specific plot may be recognized by some on this list, as it is someone else's actual test, data and plot (go ahead and identify yourself if you want ;-). It was not prepared by me, except that I anonymized all the labels.
 
Attila, thanks for re-running the test. Apologies that I have not committed the code that prints out the detailed numbers you need to create the latency histogram, at the moment it only prints the percentiles. I will try to rectify this.

Gil, I will try to run the 64-thread latency test twice, once with and once without the CO correction you suggested. I will post the results here. If possible I'll also do a comparison between the Oracle JVM and Zing.
The cleanest way to do a CO-correction vs. non-corrected comparison (like Mike B. did) is to collect and report two separate histograms (one with CO-correction and one without) in the same run, on the same data. I.e. measure only once but record twice. This will take any cross-test noise (of which there is usually plenty) out of the picture, and makes the CO (and CO correction) effect very clear and easy to do math on. Text output from histogram.getHistogramData().outputPercentileDistribution(System.out, 5 /* steps per half */, 1000.0 /* unit value scaling */); is enough to see the difference, and a plot like the above would make it very clear visually.

Ok, taking 2 histograms should be easy and there is definitely noise between separate measurements so I see the benefit...
 

I'd obviously love to see Zing numbers too, of course. I think those will be much more relevant after you settle on whether or not the CO correction is something that is needed.

-- Gil.

Thanks for the many great tips. I look forward to trying them when I get back.

Many thanks,
-Remko 

Simone Bordet

unread,
Aug 9, 2013, 3:57:11 AM8/9/13
to mechanica...@googlegroups.com
Hi,

On Thu, Aug 8, 2013 at 6:54 PM, Attila-Mihaly Balazs <dify...@gmail.com> wrote:
> I think I got it!
>
> Here it goes:
>
> - percentiles are not guarantees for maximum values. Measure maximum values
> if that's what you need.
>
> - if you measure percentiles, measuring it this way (send a request / wait /
> send an other request) can give you an overly optimistic estimation of the
> percentile because it is a more "kind" distribution than the random one you
> will have in production. Or to put it otherwise: if you use such benchmarks
> for SLAs, your clients will prove you quickly long.

On the other hand, if you run the same test, but you don't measure
latency but throughput (that is, how many sends you can do in the unit
of time, perhaps with wait = 0), then the result does not suffer from
CO: if you have a GC pause between 2 requests, the total number of
requests will be less and the throughput will be less (not measuring
percentiles here).

Also, if you measure latency asynchronously so that your test has one
thread running the requests in a send + wait loop, but other threads
measuring the response arrival time, then also in that case there is
no CO: a GC pause in the sends will not affect the roundtrip time
measured, and a GC pause in the responses will be included in the
measurements, and the percentiles will be right.

Did I get these two correct ?

Thanks !

--
Simone Bordet
http://bordet.blogspot.com
---
Finally, no matter how good the architecture and design are,
to deliver bug-free software with optimal performance and reliability,
the implementation technique must be flawless. Victoria Livschitz

Martin Thompson

unread,
Aug 9, 2013, 4:12:52 AM8/9/13
to mechanica...@googlegroups.com

> - percentiles are not guarantees for maximum values. Measure maximum values
> if that's what you need.
>
> - if you measure percentiles, measuring it this way (send a request / wait /
> send an other request) can give you an overly optimistic estimation of the
> percentile because it is a more "kind" distribution than the random one you
> will have in production. Or to put it otherwise: if you use such benchmarks
> for SLAs, your clients will prove you quickly long.

On the other hand, if you run the same test, but you don't measure
latency but throughput (that is, how many sends you can do in the unit
of time, perhaps with wait = 0), then the result does not suffer of
CO: if you have a GC pause between 2 requests, the total number of
requests will be less and the throughput will be less (not measuring
percentiles here).

For measuring average throughput, the measurement is fair.

Also, if you measure latency asynchronously so that your test has one
thread running the requests in a send + wait loop, but other threads
measuring the response arrival time, then also in that case there is
no CO: a GC pause in the sends will not affect the roundtrip time
measured, and a GC pause in the responses will be included in the
measurements, and the percentiles will be right. 

This would still be an issue. The fact that you cannot send adds latency. If you needed to send at the point of pause you would have incurred the latency. Round trip delay because of a GC pause versus recovery from loss exhibits the same behaviour from a latency perspective.

Simone Bordet

unread,
Aug 9, 2013, 5:51:45 AM8/9/13
to mechanica...@googlegroups.com
Hi,

On Fri, Aug 9, 2013 at 10:12 AM, Martin Thompson <mjp...@gmail.com> wrote:
> This would still be an issue. The fact that you cannot send adds latency.
> If you needed to send at the point of pause you would have incurred the
> latency. Round trip delay because of a GC pause verses recovery from loss
> exhibits the same behaviour from a latency perspective.

Ah, right, thanks for clarifying this.

Alex Averbuch

unread,
Apr 17, 2014, 9:35:08 AM4/17/14
to mechanica...@googlegroups.com
Are there any published papers about Coordinated Omission?

Gil Tene

unread,
Apr 17, 2014, 11:17:26 AM4/17/14
to mechanica...@googlegroups.com
Not yet (for academic papers). Some folks at UIUC are potentially working on one.

You can find some discussion in my Latency and Response time related presentations (e.g. http://www.infoq.com/presentations/latency-pitfalls), in the posting here and elsewhere.

Robert Engels

unread,
Apr 18, 2014, 4:47:30 PM4/18/14
to mechanica...@googlegroups.com
We are lucky in this regard, as we can use the received packet hardware timestamps at the network layer to perform all relative performance analysis; by carrying this timestamp throughout the stages in the system, we can detect any latency in the upper layers (on some systems we need to use kernel timestamps, and they can be slightly subject to kernel latency).

Gil Tene

unread,
Apr 18, 2014, 9:54:12 PM4/18/14
to mechanica...@googlegroups.com
Unfortunately, measuring on the wire doesn't add or remove luck here.

If your clients (e.g. in a load tester, or in the real world) back off and avoid sending more stuff when they don't get responses or ACKs (e.g. in TCP), your results can still be wildly swayed by coordinated omission any time the system glitches for longer than the expected interval between messages, or long enough to make protocols create back pressure. Avoiding this in loaders requires either a non-blocking loader, or a loader that reports latency based on when things were supposed to be sent, rather than when they were actually sent. Avoiding it with real world traffic is much harder, since there is no "plan" you know was supposed to be followed and which you can use to correct results with.

E.g. TCP will back-pressure to the client very quickly when the receiver process stops reading for any reason, and the receive window is usually only good for a very short length of time in high volume connections on high speed networks. So snooping a TCP stream for timestamps will not help avoid getting your percentiles wrong by a few orders of magnitude when, e.g., a 10-20msec hiccup affects the reader.
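
As an illustration, here is a minimal sketch (the rate, class name, and request method are hypothetical) of a loader that reports latency against the intended send time from its fixed-rate plan, rather than the actual send time, so that back-pressure from the system under test cannot hide its own stalls:

import org.HdrHistogram.Histogram;

public class PlannedRateLoader {
    public static void main(String[] args) {
        long intervalNanos = 1_000_000;  // planned ~1 msec between requests (assumed rate)
        Histogram histogram = new Histogram(3_600_000_000_000L, 3);
        long intendedSendTime = System.nanoTime();

        for (int i = 0; i < 1_000_000; i++) {
            intendedSendTime += intervalNanos;
            while (System.nanoTime() < intendedSendTime) {
                // spin (or sleep) until the planned send time; if a previous response stalled
                // us past the plan, we fall straight through and are already "late".
            }
            sendRequestAndWaitForResponse();  // hypothetical blocking request/response round trip
            long latency = System.nanoTime() - intendedSendTime;  // measured from the plan
            histogram.recordValue(latency);
        }
        histogram.outputPercentileDistribution(System.out, 1000.0);
    }

    private static void sendRequestAndWaitForResponse() {
        // placeholder for the synchronous request being load tested
    }
}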

robert engels

unread,
Apr 18, 2014, 10:11:23 PM4/18/14
to mechanica...@googlegroups.com, mechanica...@googlegroups.com
Luckily it is all external asynchronous events via udp (market data) that we have to deal with, so it works very well for us. And if the packet gets dropped by the kernel we can detect this as well (doesn't happen) and factor it into the performance metrics. Our loader is non blocking and ultra efficient and can easily saturate the network.

This won't allow us to detect latency in switches and such, but my side of the house is only concerned with the performance once it hits the server's NIC.

Don't get me wrong, we use your product for a good reason, and it's extremely valuable bringing this topic to light. I guess we were just ahead of the curve in figuring out the true cost of GC pauses...


Gil Tene

unread,
Apr 18, 2014, 11:18:22 PM4/18/14
to mechanica...@googlegroups.com
Async everywhere, UDP, and a non blocking loader. Cool!

Then yes. You are in the lucky group that probably doesn't experience coordinated omission. ;-).


alex bagehot

unread,
Apr 21, 2014, 6:14:16 PM4/21/14
to mechanica...@googlegroups.com
Hi Gil,

I watched the presentation -  really good.

I have some thoughts though to check - 

in the video at minute 35 (just to ref. back) co-ordinated omission can be caused by a load generator like jmeter or grinder (~ 37:45) both of which I've used in the past.

The test requirement as I understand it is to test a request rate of 100 requests /sec.

Absolutely agree with the analysis in terms of (1) identifying there is an issue with the data and (2) fixing it where possible. I come at it from a slightly different direction though.

(1)
the workload model : "1 user" - this frames it as a closed (as in feedback loop) workload model where as you said each subsequent request is blocked by the current one. The requirement is an arrival rate which is an open workload model where each request is independent.

From there it's not going to work, as the load tool selected will be wrong - eg. most (web) load generators are vuser-per-thread implementations.

This paper, which I stumbled across (funnily enough) via the jmeter plugins website, seems to sum up in similar terms the catastrophic impact of getting the workload model wrong, both for latency (which you have covered) and for system utilisation and other effects.

(2)
In terms of how to fix the situation described -  ideally get an open workload model load generator (the paper identifies some).

Given the difficulty of implementing a truly open generator, and the fact that you have identified that some (spec) are not in fact open at all, I suspect that some thorough testing of the tool would be needed before trusting the results and what bounds they are reliable within.

A lot of systems/components service open workloads, so it seems important to get this right.



Related - on your HdrHistogram - how do you think it compares with Ted Dunning's T-Digest? I think they are both trying to achieve similar goals(?) with different implementations; I guess there will be different performance characteristics when measuring at high rates. Accuracy-wise, I would be interested in a comparison: https://github.com/tdunning/t-digest

Thanks,
Alex




alex bagehot

unread,
Apr 23, 2014, 4:46:07 AM4/23/14
to mechanica...@googlegroups.com
hmm, very embarrassingly, I didn't realise there were 70 posts prior to this on this thread last year (!) as I didn't have them in my inbox...
in any case, the open vs. closed academic paper should add something and is worth reading if you haven't seen it before.
as someone who is interested in load tool design/implementation and measurement (often percentile estimation), it's good to see (in some respects) that this area is still alive and kicking

Gil Tene

unread,
Apr 28, 2014, 3:13:41 AM4/28/14
to mechanica...@googlegroups.com
Sorry for the multi-day pause preceding my response (traveling and bogged down in work). See some commentary inline.


On Monday, April 21, 2014 3:14:16 PM UTC-7, alex bagehot wrote:
Hi Gil,

I watched the presentation -  really good.

I have some thoughts though to check - 

in the video at minute 35 (just to ref. back) co-ordinated omission can be caused by a load generator like jmeter or grinder (~ 37:45) both of which I've used in the past.

The test requirement as I understand it is to test a request rate of 100 requests /sec.

Not quite. The key thing is that this is not about a test, or test conditions. It's about describing the behavior of a system, and about how the description resulting from measuring with coordinated omission can fall very far away from the actual behavior as it happens. I describe a system exhibiting a specific, known, well defined behavior, whose proper description can be easily agreed on. I then ask a simple question: How will your measurement tools describe the behavior of this system?

The specific known behavior description is simple: The system is able to perfectly handle 100 requests per second, answering each in 1msec, but it is artificially stalled for a period of 100 seconds every 200 seconds (i.e. 100 seconds of perfect operation, followed by 100 seconds of a complete stall, repeat). The easy to agree upon description of the behavior of such a system by any external observer (open or closed would not matter) is this: The 50%'lie is 1 msec, the Max and 99.99%'ile will both be very close to ~100 seconds, and the 75%'lie would be ~50 seconds. In addition, it's easy to see that the mean should be around 25 seconds. Any measurement that describes this system in significantly more positive (or negative) terms would be flat out lying about what is going on.

I then demonstrate how coordinated omission, which amounts to preferentially throwing away bad results (as opposed to good results) that would/should have been measured or kept if a valid description of system behavior is to be made, results in dramatically swayed reports. As in reporting both the 99.99%'lie and 75%'lie as 1msec (which is 100,000x and 50,000x off from reality), and reporting a mean of 10.9msec (which is ~2300x off from reality).
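
A rough back-of-the-envelope sketch of where those two means come from (reconstructing the arithmetic for the hypothetical system above; the exact figures depend on rounding):

public class CoordinatedOmissionMeanSketch {
    public static void main(String[] args) {
        double goodResults = 100.0 * 100;   // 10,000 results at 1 msec each during the good 100 seconds
        double goodLatencyMsec = 1.0;
        double stallMsec = 100_000.0;       // the 100 second stall

        // A coordinated tester records only the single ~100 second result it was stuck behind:
        double coordinatedMean = (goodResults * goodLatencyMsec + stallMsec) / (goodResults + 1);
        // An honest observer would also have the ~10,000 results that should have been issued
        // during the stall, with latencies spread from ~0 to ~100 seconds (mean ~50 seconds):
        double missedResults = 100.0 * 100;
        double trueMean = (goodResults * goodLatencyMsec + missedResults * (stallMsec / 2))
                / (goodResults + missedResults);

        System.out.printf("coordinated mean ~= %.1f msec, true mean ~= %.0f msec%n",
                coordinatedMean, trueMean);  // ~11 msec vs. ~25,000 msec (~25 seconds)
    }
}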
 

Absolutely agree with the analysis in terms of (1) identifying there is an issue with the data and (2) fixing it where possible. I come at it from a slightly different direction though.

(1)
the workload model : "1 user" - this frames it as a closed (as in feedback loop) workload model where as you said each subsequent request is blocked by the current one. The requirement is an arrival rate which is an open workload model where each request is independent.

There is no requirement stated, or needed, for a test system. There is a description of a known system behavior, and an expected sensible report on it's performance. The test system is supposed to be designed to supply that description, and this discussion is a demonstration of how it does not. A "correct" test system can certainly be built to report on this system's behavior. Any test system that fails to collect the same number of measurements per second during the stall period as it does outside of it will always fail to describe the system behavior. But pretty much any test setup which will be able to keep 10,000 requests in flight would work for this example. Why 10,000? Because that is the number of requests (in this example) that must be in flight during the 100 second stall, if the system is tested with 100 requests per second (closed or open does not matter) outside of the stall. 

In the specific example, I demonstrated the problem with a single client tester. But it would separately happen for each client of any multi-client tester that does not issue requests asynchronously, without waiting for responses first. E.g. if I change the description to "able to handle 100 clients, with 100 requests per second from each client" the numbers would remain exactly the same. And if I changed it to "handles 10 clients totaling 100 requests per second", the numbers of a classic load generator would get 10x closer to reality. Which would still be meaningless when facing being 100,000x off. Basically, each client will be off by roughly the factor between its expected interval between issuing requests and the system's stall time. If the stall time is larger than the expected interval, the results will be off. By a lot.

And no, stating that there should be enough clients to sustain an "open" system test does not help avoid this mis-representation, unless the world you live in is one where each client sends exactly one request, and never follows up. Very few real world applications are like that. Certainly no web application looks like that. For example, a tcpdump of the https session re-loading this discussion page involves many tens of serial round trips, spread across more than 10 separate https socket connections, and spanning more than 4 seconds. There is no valid "open" load tester way of describing what the experienced response time of a user of this google-groups page would be.
 

From there it's not going to work as the load tool selected will be wrong - eg. most (web) load generators are vuser per thread implementation.

This paper which I stumbled across funnily enough from the jmeter plugins website seems to sum up in similar terms the catastrophic impact of getting the workload model wrong, for latency which you have covered and system utilisation/other effects.

The paper does a good job of describing the difference between open and closed load testing systems, and demonstrates how the results for mean response times can vary widely between them. But it focuses on load testers and on mean measurements, neither of which is really of interest to someone looking at an SLA report, at a bunch of angry users, or at failed trades.

One of the first points I try to make in my "How NOT to measure" talks is that any latency or response time behavior reported with a mean and/or a standard deviation metric is wrong from the start. This is because of the simple reality we face, where latency and response time behavior is virtually always (as in "in all software based systems I've ever measured") a classic example of non-normal distribution. Latency behavior tends to exhibit multiple strong "modes" (where a normal distribution only has one), and those modes tend to dominate the important part of latency reporting, but are invisible in mean/std.dev. reports.

E.g. in the real world users will regularly encounter the 99.9%'lie of a web system about once per hour (on the average) if their web browser interacts with the system more than 1000 times in that hour. And 10% of users with more than 100 interactions will experience the 99th percentile (on the average). Those are very low interaction levels, BTW. E.g. given that loading this discussion page once involves over 100 individual interactions, users loading it will experience the 99'th percentile once per page load on average.
 


(2)
In terms of how to fix the situation described -  ideally get an open workload model load generator (the paper identifies some).

As explained above, open load generators are not very useful for describing the latency experienced by users in most applications, as most application interactions are not open.
 
Given the difficulty of implementing a truely open generator, and the fact that you have identified that some (spec) are not in fact open at all, I suspect that some thorough testing of the tool would be needed before trusting the results and what bounds they are reliable within.

Yes. The tools should always be tested. It's sad to see how rarely they are, especially given how trivial doing such a test is.

All you need to do is cause a system behavior you know to be true (e.g. stall the system for 10 seconds with a ^Z), and see if your test system reports results that bear any resemblance to what you *know* was going on. If it does not, wasting your time looking at any of its other results (until the system can be demonstrated to have been corrected reliably) is nothing more than a way to keep from doing productive work.
 
A lot of systems/components service open workloads so seems important to get this right.

I see very few things that exhibit open workload characteristics. Certainly nothing that arrives over TCP connections that live for more than one request does.
 
Related - on your hdrhistogram - how do you think that compares with Ted Dunning's T-Digest? I think they are both trying to achieve similar goals(?) but different implementations, I guess there will be different performance characteristics when measuring at high rates. accuracy wise would be interested in comparison https://github.com/tdunning/t-digest

I'm not an expert on the various forms of Q-digest, T-digest, and other approximation techniques. My impression is that Q-digest and T-digest both try to scientifically address a similar need to HdrHistogram's: the ability to track quantiles with good accuracy and acceptable storage. But their approach seems to come at it from a very different angle than mine, in that they are still trying to maintain accuracy benefits for semi-normal distributions, or for random arrival distributions, and in that recording values appears to be a pretty involved operation (which includes growing data structures). In contrast, HdrHistogram starts with the assumption that required accuracy and range are dictated by the user (as opposed to "try to be as accurate as possible"), and it is designed to (also) work well in the very real world where results can be "clumped" in both value range and arrival times (e.g. bad results tend to happen close to each other) - which is the situation that seems to describe the degenerate case (from a size or computation complexity perspective) for t-digest. HdrHistogram is also designed to be FAST, and fixed in both space and time.

Still, I'd be happy to see a performant implementation of t-digest compared head-to-head against HdrHistogram...


Thanks,
Alex



On Thu, Apr 17, 2014 at 4:17 PM, Gil Tene <g...@azulsystems.com> wrote:
Not yet (for academic papers). Some folks at UIUC are potentially working on one.

You can find some discussion in my Latency and Response time related presentations (e.g. http://www.infoq.com/presentations/latency-pitfalls), in the posting here and elsewhere.


On Thursday, April 17, 2014 6:35:08 AM UTC-7, Alex Averbuch wrote:
Are there any published papers about Coordinated Omission?


alex bagehot

unread,
May 2, 2014, 12:49:36 PM5/2/14
to mechanica...@googlegroups.com
no problem at all - it took me a while to get the "partly open" information together.
I am interested in how best CO can be avoided/prevented/minimised.



On Mon, Apr 28, 2014 at 8:13 AM, Gil Tene <g...@azulsystems.com> wrote:
Sorry for the multi-day pause preceding my response (traveling and bigger down in work). See some commentary inline.


On Monday, April 21, 2014 3:14:16 PM UTC-7, alex bagehot wrote:
Hi Gil,

I watched the presentation -  really good.

I have some thoughts though to check - 

in the video at minute 35 (just to ref. back) co-ordinated omission can be caused by a load generator like jmeter or grinder (~ 37:45) both of which I've used in the past.

The test requirement as I understand it is to test a request rate of 100 requests /sec.

Not quite. The key thing is that this is not about a test, or test conditions.
It's about describing the behavior of a system, 
and about how the description resulting from measuring with coordinated omission
can fall very far away from the actual behavior as it happens. I describe a system exhibiting a specific, known, well defined behavior, whose proper description can be easily agreed on. I then ask a simple question: How will your measurement tools describe the behavior of this system? 

The specific known behavior description is simple: The system is able to perfectly handle 100 requests per second answering each in 1msec,
but it is artificially stalled for a period 100 seconds every 200 seconds (i.e. 100 seconds of perfect operation, followed by 100 seconds of a complete stall, repeat). The easy to agree upon description of the behavior of such a system by any external observer (open or closed would not matter) is this: The 50%'lie is 1 msec, the Max and 99.99%'ile will both be very close to ~100 seconds, and the 75%'lie would be ~50 seconds. In addition, it's easy to se that the mean should be around 25 seconds. Any measurement that describes this system in significantly more positive (or negative) terms would be flat out lying about what is going on.

I then demonstrate how coordinated omission, which amounts to preferrably throwing away bad results (as opposed to good results) that would/should have been measured or kept if a valid description of system behavior is to be made, results in dramatically swayed results. As in reporting both the 99.99%'lie and 75%'lie as 1msec (which is 100,000x and 50,000x off from reality), and reporting a mean of 10.9msec (which is ~2300x off from reality).
 

Absolutely agree with the analysis in terms of (1) identifying there is an issue with the data and (2) fixing it where possible. I come at it from a slightly different direction though.

(1)
the workload model : "1 user" - this frames it as a closed (as in feedback loop) workload model where as you said each subsequent request is blocked by the current one. The requirement is an arrival rate which is an open workload model where each request is independent.

There is no requirement stated, or needed, for a test system. There is a description of a known system behavior, and an expected sensible report on it's performance. The test system is supposed to be designed to supply that description, and this discussion is a demonstration of how it does not. A "correct" test system can certainly be built to report on this system's behavior. Any test system that fails to collect the same number of measurements per second during the stall period as it does outside of it will always fail to describe the system behavior. But pretty much any test setup which will be able to keep 10,000 requests in flight would work for this example. Why 10,000? Because that is the number of requests (in this example) that must be in flight during the 100 second stall, if the system is tested with 100 requests per second (closed or open does not matter) outside of the stall. 

Agree in theory it doesn't matter which model open/closed.
 

In the specific example, I demonstrated the problem with a single client tester. But it would separately happen for each client of any multi-client tester that does not issue requests asynchronously, without waiting for responses first.
E.g. if I change the description to "able to handle 100 clients, with 100 requests per second from each client" the numbers would remain exactly the same. And if I changed it to "handles 10 clients totaling 100 requests per second", the numbers of a classic load generator would get 10x closer to reality. Which would still be meaningless when facing being 100,000x off. Basically, each client will be off by roughly the factor between it's expected interval between issuing requests and the system's stall time. If the stall time is larger than the expected interval, the results will be off. By a lot.

I think this demonstrates the risks of using a classic load generator in closed-model mode. I don't advocate doing this, but as you said above, in a closed model you would need 10,000 users to cover the example 100 second stall time, plus a type of think time generally called pacing, where, as long as the stall time is less than the pacing time, the request rate is held constant. Pacing introduces additional risks like uneven load, etc.

 

And no, stating that there should be enough clients to sustain an "open" system test does not help avoid this mis-representation, unless the world you live in is one where each client sends exactly one request, and never follows up. Very few real world application are like that. Certainly no web application looks like that.



(1) is an open model real world?

There is a middle ground which I failed to mention - mixed or partly open.

Gunther 2005a p121:

"A common example where a mixed-class circuit might be used to represent a computer system [] in which the interactive (closed) flow of requests is mixed with HTTP Get (open) requests from the internet."

There may be synchronous round trips at lower layers in the system, but that doesn't stop the workload from being driven as open at the top level, e.g. a "user" in the web case. In other words, there is some kind of session (http, tcp, etc.). Within the session the requests are closed, but sessions are driven to the system openly. The arrival of a new session is not blocked by a system freeze. This closely models what happens in real life.

reading back through the thread:
Jason Koch said:
"However in a _real world_ situation, it's unlikely your consumers are going to recognise that your system is under load and stop sending requests for that full second. They're going to keep firing them, and probably a lot of times, during that 1 second. So in the real world, all of these consumers will ALSO see that pause."

> this looks like an open system as a user will not get blocked and keep firing async requests. If it's partly open then some workloads might retry requests after some timeout





 
For example, a tcpdump of the https session re-loading this discussion page involves many tens of serial round trips, spread across more than 10 separate https socket connections, and spanning more than 4 seconds. There is no valid "open" load tester way of describing what the experienced response time of a user of this google-groups page would be.

The user would arrive, refresh the page, then leave. The http or tcp requests that are part of the user's session would be in a closed feedback loop. Agreed, it looks like a closed model if there is only 1 user in the test. It would suffer from CO as the http / tcp requests would block.
Subsequent users, if any, would arrive asynchronously forming a partly open model. 

To have a load test, at least for google groups page refresh, generally speaking more users would be needed. The model at the user level would be async or open but at the request level closed/ sync; resulting in a partly open model.




 
 

From there it's not going to work as the load tool selected will be wrong - eg. most (web) load generators are vuser per thread implementation.

This paper which I stumbled across funnily enough from the jmeter plugins website seems to sum up in similar terms the catastrophic impact of getting the workload model wrong, for latency which you have covered and system utilisation/other effects.

The paper does a good job of describing the difference between open and closed load testing systems, and demonstrates how the results for mean response times can vary widely between them. But it focuses on load testers and on mean measurements, neither rod which are really of interest to someone looking at an SLA report, at a bunch opt angry users, or at failed trades.

Agree the use of mean, and possibly a lot of other stats detail in the paper, is not of direct use. My takeaway was how to better model workloads. Or do it in a more informed way. Also why getting the workload model wrong will invalidate the test and as a result the measurements.
If we do workload modelling better then we can implement the tools better and measure better.

Nick Yantikov

unread,
May 4, 2014, 6:27:47 PM5/4/14
to mechanica...@googlegroups.com
Hello Gil

First of all, let me salute you for the amount of writing you have done on this topic and how patiently you explain the CO concept over and over again. It took me at least 6 hours just to read (and try to make sense of) it all in this thread. I can imagine how much time it took you to write it. Therefore I appreciate in advance the time you put in your response (if you decide to respond).

I agree with many things you said. But since no "new value" can be created by agreeing with you I will try to "poke holes" in your theory or define some boundaries for it.

Let me start with this unpretentious :-) statement "Coordinated omission (the way you described it) does not exist in production environments".
I let myself be blunt with a statement like this as I could not read you with 100% certainty. In some posts you say that CO is entirely a testing technique problem; in other cases I think you say that CO (and as a result compensating for CO) does have its merits in the context of production monitoring.

I am approaching this discussion from a production performance monitoring angle (which might be an important distinction). In my opinion the production environment is the true source of the events that are happening in it, and therefore there could be nothing (or at least very little) coordinated in it. The number of use cases in a production environment is so large that even processes that may seem to fall under the "coordinated omission" scenario may very well be application features. For example, even if periodic/scheduled processes are paused and delayed, it can very well be a business requirement that they skip the next execution. Web requests from users are pretty much always uncoordinated.

As far as "omission" goes I feel that as long as you do account for every single monitored event that has happened in production environment (as apposed to event sampling for example) then there is no omission.

You correctly pointed out in other threads that it is fatal for the results of your monitoring to assume latencies are normally distributed. This is true for the rate of events too. You can never assume any particular distribution of rates, and they are rarely uniform in production environments. In addition, rate distributions can vary during the day, and they can vary due to business cycles (announcement of a new product or feature), etc., etc.

Trying to compensate for CO in the context of production environment monitoring makes little sense to me, because in order to compensate you have to predict the rate of events that might or might not have happened. Which is impossible. In my opinion, time spent trying to correctly compensate for CO inside the production codebase is better spent understanding what the appropriate reporting/alerting is and how to correctly align monitoring with SLAs. You have to work with what has really happened in the system and build your monitoring and alerting tools around it.

The problem that does exist is that performance monitoring is either insufficient or is disconnected from SLAs.

Let's take your wonderful concierge business example.

If the business owner had had monitoring tools correctly aligned with his service level agreement, he would not have missed the problem during testing. The disconnect that happened in this example is that the business owner promised to answer the phone within 3 rings 99% of the time for all calls placed during the day. I would like to emphasize the "during the day" part of that sentence, as I think it got neglected in your example. In other words, the business owner promised to answer within 3 rings 99% of the time for every 10 calls a customer may have made. In his test, however, the business owner calculated the 99th percentile over 420 calls, which is 2 months' worth of calls for his business. This is the disconnect between his performance reporting and his own SLA. What he should have done is calculate the 99th percentile for every 10 test calls that were made. Or he should have been more aggressive, built in some cushion, and calculated the 99th percentile for every 5-6 calls.

The other way to approach his reporting would be to calculate the 99th percentile for every 30 minute time frame (which at most would have 6 calls in every 30 minute set of results) and plot it as a time series. This is essentially how reporting is normally done in software production environments.

In addition (however, this is off-topic), the business owner and all his personnel should have been paged every time the phone rang more than 6-8 times. He should have had a list of all missed calls and proactively called his customers back, he should have plotted a time series of the call throughput his business handles, etc. etc.

To summarize: attempting to solve the "Coordinated Omission" problem in the context of production monitoring is akin to misdirecting your effort. Since CO compensation cannot be applied reliably and universally, it will lead to misleading and confusing results.

Solving "Coordinated Omission" problem inside benchmarking/load testing tools is however very reasonable approach, but in my opinion compensating for CO should remain within the realm of testing tools and must not be spilled into production codebase.

Thanks
-Nick

Gil Tene

unread,
May 5, 2014, 1:04:04 AM5/5/14
to
I started talking about Coordinated Omission in test systems mostly, but I kept finding it everywhere, including in many production monitoring systems. Coordinated Omission is nothing more than a measurement methodology problem. It exists in both test and production environments, and occurs whenever data is omitted ahead of analysis in a way that is coordinated with the [usually bad] behavior of the system. Coordinated omission basically guarantees a huge garbage-in-garbage-out relationship between the data feeding into [downstream] analysis and most operational conclusions that people make based on that data.

It is true that CO is most easily demonstrated in test environments, because test environments commonly have a "plan" they are supposed to execute, and it is easy to demonstrate that the expected plan was not actually executed, and that the results collected are therefore bogus. Avoiding CO in test environments can sometimes be done if a valid asynchronous tester can be built. This is NOT possible for many applications, e.g. ones that include synchronous request-response semantics in their protocols, like all those that use HTTP, and most of those that use TCP. Correcting for CO is a separate thing: it can be done to varying degrees depending on what is known and what can be assumed about the system. Test environments with known expected scenario plans usually include enough information for pretty good correction to be possible.

But CO happens in production. All the time. And production monitoring tools that attempt to report on latency behavior often encounter it. It is certainly harder to correct for CO in production, since there is no "plan" that you can use for correction like you can in a test system. But it is easy to show that CO happens (or at least demonstrate an outrageously high likelihood that it happened). E.g. whenever a system freeze seems to magically coincide with a suspicious gap in recorded information, you should suspect CO. The existence of CO in production is sometimes easy to demonstrate even in summary latency reports whenever you notice that reported stats (e.g. Max, 99.9%, and average) are "obviously impossible". It usually takes CO + significant actual production time pauses to make the mess big enough that it is obviously wrong in the summaries, and most of the time production %'ile reporting will be 2-4 orders of magnitude off from reality before the fallacy is easily demonstrated in summary reports. But I see this so often that I've started doing simple math tests on the common metrics people give me, and I often get a hit with one of them. E.g.:

When your summary report includes a Max, an Average, and a total test time, it is easy to apply the following pseudo-logic:

static boolean suspectCoordinatedOmission(double reportedAverageLatency,
                                          double maxReportedLatency,
                                          double totalTestLength) {
    // Logic: If the Max latency was related to a process freeze of some sort (commonly true
    // for Java apps, for example), and if taking measurements was not coordinated to avoid
    // measuring during the freeze, then even if all results outside the pause were 0, the
    // average during the pause would have been Max/2, the pause covers a Max/totalTestLength
    // fraction of the run, and the overall average would be at least the bound computed below.
    // Note: Usually, when this test triggers due to CO, it does not trigger by a small margin...
    double minimumPossibleAverage = (maxReportedLatency * maxReportedLatency) / (2.0 * totalTestLength);
    return reportedAverageLatency < minimumPossibleAverage;  // true => suspect Coordinated Omission
}


This sanity test is valid for any system (test or production) as long as reported Max times are related to system freezes. In Java based systems, this is easy to glean from looking at the max GC pause time in the GC logs. If it correlates in magnitude to the Max reported latency, the test above will reliably tell you whether your measurements include CO.
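
For example, applying this test to the hypothetical 100-second-stall-every-200-seconds system described earlier in this thread: the smallest possible average is 100,000^2 / (2 x 200,000) = 25,000 msec, so a reported average of ~10.9 msec fails the test by more than three orders of magnitude.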

In addition, if the report includes a Max latency, a total test time, and some percentile level, you can apply the following pseudo-logic:

// Sanity test reportedLatencyAtPercentileLevel for a given
// percentileLevel, maxReportedLatency, and totalTestLength.

Double fractionOfTimeInPause = maxReportedLatency / totalTestLength;
Double fractionOfTimeAbovePercentileLevel = 1.0 - percentileLevel; // Assumes percentileLevel is a fraction, between 0.0 and 1.0
Double smallestPossibleLatencyAtPercentileLevel = maxReportedLatency * (1.0 - (fractionOfTimeAbovePercentileLevel / fractionOfTimeInPause));
if (smallestPossibleLatencyAtPercentileLevel > reportedLatencyAtPercentileLevel) {
    Suspect Coordinated Omission.
}
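
The same check, again as my own sketch rather than library code, in compilable Java (assuming percentileLevel is a fraction between 0.0 and 1.0 and all times share one unit):

// Hypothetical helper: the smallest latency a given percentile level could honestly
// report if the Max came from a single process-wide freeze during the test.
static boolean suspectCoFromPercentile(double reportedLatencyAtPercentileLevel,
                                       double percentileLevel,    // e.g. 0.99
                                       double maxReportedLatency,
                                       double totalTestLength) {
    double fractionOfTimeInPause = maxReportedLatency / totalTestLength;
    double fractionOfTimeAbovePercentileLevel = 1.0 - percentileLevel;
    double smallestPossibleLatencyAtPercentileLevel =
            maxReportedLatency * (1.0 - (fractionOfTimeAbovePercentileLevel / fractionOfTimeInPause));
    return smallestPossibleLatencyAtPercentileLevel > reportedLatencyAtPercentileLevel;
}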

A common cause for production-time CO has to do with the fact that production time latencies are often reported not based on client perception, but based on internally collected information in what we would (outside of production) call the "system under test". E.g. many servers will collect internal operation latency stats and float them up to consoles. E.g. Cassandra reports the internally observed 99%'lies of read and write operations, and those numbers are floated up and shown across clusters in monitoring dashboards. The times are collected by measuring time before and after each operation and recording it into a stats bucket, and that measurement technique has classic CO: it will basically ignore or dramatically diminish the reporting of any "freeze" in the process.

I've had interesting discussions with people who say that the reported percentiles are "honest" because those are the time stats for operations that actually happened. But where such arguments miss the mark is when you look at what people reading the dashboards *think* the %'lies are telling them. I've tested this by asking ops people (without raising suspicion of a "trap") what they think dashboard reports mean, and how they can use them. In my samples, all ops people, developers, and testers I've asked assume that when a console says "99%'lie", it is effectively saying something along the lines of "99% of reads took 5msec or less", and that they can use that as a way to either describe the behavior the system had exhibited, or to project how it may behave. E.g. they may use this information in capacity planning, or in pre-deployment verification of SLA readiness, or look at current production numbers to judge operational readiness ahead of expected load increases [think "planning for holiday season shopping"]. In all these cases, the numbers they are looking at often drive very bad decisions. Unfortunately, in systems that display this sort of %'lie data (like Cassandra's read and write latency reports for the 99%'lie), the actual meaning of the number displayed for 99% is: "1% of reads took at least this much time, and it could be 1,000x as much as what is shown here. We only measured a lower bound on how bad things were in that remaining 1%, not an upper bound on how bad things were in the 99% we show." 
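
The measurement idiom being described is roughly the following (a schematic sketch of the pattern, not Cassandra's actual code; I'm using HdrHistogram as a stand-in for whatever stats bucket the server keeps):

import org.HdrHistogram.Histogram;

// Schematic of the common in-process measurement idiom (not Cassandra's actual code):
// time each operation and record the result into a stats bucket. A process-wide freeze
// gets attributed to at most one in-flight sample per thread, and all the operations
// that *would* have run during the freeze are simply never recorded: classic CO.
class NaiveLatencyRecorder {
    private final Histogram histogram = new Histogram(3600_000_000_000L, 3);

    void timeAndRecord(Runnable operation) {
        long startNanos = System.nanoTime();
        operation.run();
        histogram.recordValue(System.nanoTime() - startNanos);
    }
}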

For anyone who doubts this, here is a simple exercise:  go watch the %'lie outputs in Cassandra monitoring output. Manually ^Z the Cassandra servers for 6 seconds every minute, and see what your console reports about the cluster's 99%'ile... Now think about what the ops and capacity planning people responsible for service level management of these clusters are thinking when they look at that output. 

[Oh, and for those who doubt that this ^Z is a valid test for Cassandra: Read your GC logs. ;-) ]

More comments inline below.

On Sunday, May 4, 2014 3:27:47 PM UTC-7, Nick Yantikov wrote:
Hello Gil

First of all let me salute you for the amount of writing you have done on this topic and how patiently you explain the CO concept over and over again. It took me at least 6 hours just to read (and try to make sense of) it all in this thread. I can imagine how much time it took you to write it. Therefore I appreciate in advance the time you put in your response (if you decide to respond).

I agree with many things you said. But since no "new value" can be created by agreeing with you I will try to "poke holes" in your theory or define some boundaries for it.

Let me start with this unpretentious :-) statement: "Coordinated omission (the way you described it) does not exist in production environments".
I let myself be blunt with a statement like this as I could not read you with 100% certainty. In some posts you say that CO is entirely a testing-technique problem; in other posts I think you say that CO (and, as a result, compensating for CO) does have its merits in the context of production monitoring. 

I am approaching this discussion from a production performance monitoring angle (which might be an important distinction). In my opinion the production environment is the true source of events that are happening in it and therefore there could be nothing (or at least very little) coordinated in it. The number of use cases in a production environment is so large that even processes that may seem to fall under the "coordinated omission" scenario may very well be application features. For example, even if periodic/scheduled processes are paused and delayed it can very well be a business requirement that they skip the next execution. Web requests from users are pretty much always uncoordinated.

As far as "omission" goes I feel that as long as you do account for every single monitored event that has happened in production environment (as apposed to event sampling for example) then there is no omission.

Actually, sampling would be fine, and omission would be fine, as long as no coordination happens.


But unfortunately, production monitoring of event latency is just as susceptible to coordination as test measurement is. 

Even if you collected all actual production client-observed response times, you would remain exposed. E.g. if your data indicated that during each period of 1000 seconds, each production client reported 1 response taking 500 seconds, and 9999 responses taking 20 msec each, and you did not recognize that there is coordinated omission in the incoming data set, you would erroneously report that the 99.99%'lie observed by clients is 20msec (which would be wrong by a factor of 2,500x). Natural noise and variability in these numbers would not make the conclusion any less wrong.

Attached is a chart of an actual production system doing exactly this sort of reporting (at a different time scale, but with a similar ~2,500x discrepancy), plotted against the reporting of two different levels of CO-correction of the same data. All three sets of data were collected at the same time, on the same system, in production. The corrected data sets include a correction technique that is now part of the LatencyStats class, meant to drop straight into this sort of measurement idiom (see http://latencyutils.org)


 

You correctly pointed out in other threads that it is fatal for the results of your monitoring to assume latencies are normally distributed. This is true for the rate of events too. You can never assume any particular distribution of rates and they are rarely uniform in production environments. In addition rate distributions can vary during the day, they can vary due to business cycles (announcement of a new product or feature), etc., etc.

Trying to compensate for CO in the context of production environment monitoring makes little sense to me because in order to compensate you have to predict the rate of events that might or might not have happened. 
Which is impossible.

For in-process latency measurement in production, that's what I thought, too. But when I tell myself something is "impossible", I often find it hard to resist the urge to prove myself wrong. So I did; that's what the LatencyUtils library is about.

It turns out that while correcting for all causes of CO is "very hard", correcting for process freezes during in-process latency measurement is not that hard. And in much of the world I see, CO reporting errors are dominated by such process freezes. The technique I use under the hood in a LatencyStats object leverages a process-wide pause detector with a per-recorded-stats-object interval estimator. The pause detector is pluggable; the provided default detector is based on consensus pause observation across a number of detection threads. It uses settable thread counts and thresholds. It is very reliable at detecting pauses of 1msec or more using three sleeping threads (which is nearly free to do), and seems to detect process-wide pauses as small as 50usec reliably if you are willing to burn 3-4 cpus spinning non-stop to do so (which most people will likely not be willing to do). The interval estimator is also pluggable, but the one provided is basically a time-capped moving average estimator, which does a pretty good job. 

The answer to the "in order to compensate you have to predict rate of events that might or might not have happened" point you make above (which is a good way of stating the problem) is that when a known process freeze occurs, projecting from the reality that preceded the freeze into the freeze period is a very good guess. Certainly much better than ignoring the fact that the freeze erased a whole bunch of known-to-be-terrible results, destroying the validity of the data set. Since we just need counts at various latencies, all we need to know is how many missing measurements there were, and what their magnitude would have been, and this is trivial to reconstruct given the freeze size and the expected interval estimation. Using a constant rate projection into the freeze works fine, with no need to model rate variability *within* the freeze, since it would not make much a difference to the impact of correction on the percentiles picture.

With all this, LatencyUtils only provides partial correction, but that partial correction often covers 99%+ of the error. The bigger and/or more frequent the actual process freezes are in your actual system, the more important the correction becomes, especially when freezes are a significant multiple of the typical measured latency (in which case the correction may often become the only valid signal above ~99%).
 
In my opinion spending your time trying to correctly compensate for CO inside a production codebase is time that is better spent understanding what the appropriate reporting/alerting is, and how to correctly align monitoring with SLAs. You have to work with what has really happened in the system and build your monitoring and alerting tools around it.

The problem that does exist is that performance monitoring is either insufficient or is disconnected from SLAs.

Let's take your wonderful concierge business example.

If the business owner had monitoring tools that were correctly aligned with his service level agreement he would not have missed the problem during testing. The disconnect that happened in this example is that the business owner promised to answer the phone within 3 rings 99% of the time for all calls placed during the day. I would like to emphasize the "during the day" part of the last sentence as I think it got neglected in your example. In other words the business owner promised to answer within 3 rings 99% of the time for every 10 calls a customer may have made. In his test however the business owner calculated the 99th percentile for 420 calls, which is 2 months' worth of calls for his business. This is the disconnect between his performance reporting and his own SLA. What he should have done is calculate the 99th percentile for every 10 test calls that were made. Or he should have been more aggressive, built in some cushion, and calculated the 99th percentile for every 5-6 calls.

In the specific scenario, I don't see a need for groups-of-10-calls testing. While it will probably work, a very simple test system, as well as a very simple production-time sanity-checking system, can easily measure the right thing (the right thing is "what would your customer's perception of the percentile be?"): just poll the business at random intervals some number of times per day (which should also vary somewhat), and report on the stats of what you see. Just make sure not to coordinate between polling attempts, and your percentile numbers will match what your customer would consider to be a valid representation, and therefore a useful way to gauge the quality of your system.
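
To make "uncoordinated polling" concrete, here is a small illustrative sketch (my own code, with made-up names and intervals): each probe is scheduled independently with a jittered delay, and the next probe is scheduled before the current one runs, so a slow or stuck probe does not push back the sampling cadence (given enough pool threads):

import java.util.Random;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class UncoordinatedPoller {
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(4);
    private final Random random = new Random();

    // Schedule probes at randomized intervals around meanIntervalMillis (0.5x .. 1.5x).
    void scheduleNextPoll(Runnable probe, long meanIntervalMillis) {
        long jitteredDelay = (long) (meanIntervalMillis * (0.5 + random.nextDouble()));
        scheduler.schedule(() -> {
            scheduleNextPoll(probe, meanIntervalMillis); // schedule the next poll *before* probing
            probe.run();                                 // then measure and record what the probe saw
        }, jitteredDelay, TimeUnit.MILLISECONDS);
    }
}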

A scenario like the one I describe (the concierge service) is actually one where you would want your production-time validation to be happening much more frequently than the customer's actual use, so that you will be likely to detect (and potentially correct) issues before your customer does. This sort of "failure is very expensive" driven continuous production validation is actually common in the real world as well. E.g. for algo trading systems that only trade a few times a day, it is common practice to do sanity test trades (small or irrelevant) at a significantly higher rate than the rate at which the trades that actually matter happen, just to maintain an evaluation of what expected trade behavior may be like (some systems can even loop this knowledge back into the actual trading decisions).
 
The other way to approach his reporting would be to calculate the 99th percentile for every 30-minute time frame (which at most would have 6 calls in every 30-minute set of results) and plot it as a time series. This is essentially how reporting is normally done in software production environments.

I agree that percentiles should usually be reported in time intervals, as that's how they are most often stated. When requirements are set, they are usually stated about a worst case time interval. E.g. "The 99.9%'lie should be 200usec or better in any given 15 minute period during the day" as opposed to "in at least one 5 minute interval during the day" or "across the entire day, even if it is much worse during some 15 minute period." 
 
In addition (however this is off-topic) the business owner and all his personnel should have been paged every time the phone rings more than 6-8 times. He should have had a list of all missed calls and proactively called his customers back, he should have plotted a time series of the call throughput his business handles, etc. etc.

To summarize: attempting to solve the "Coordinated Omission" problem in the context of production monitoring is misdirected effort. Since CO compensation cannot be applied reliably and universally, it will lead to misleading and confusing results.

Look at LatencyUtils.org, think of how using LatencyStats would apply to something like Cassandra reporting its production-time latency stats, and let me know if that changes your mind...
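
To give a feel for what that would look like, here is a minimal sketch of the LatencyStats recording idiom as I understand it (class names come from the LatencyUtils library, but treat the exact constructors and signatures as assumptions and check the library's own javadoc):

import org.HdrHistogram.Histogram;
import org.LatencyUtils.LatencyStats;

public class LatencyStatsSketch {
    // One LatencyStats per tracked operation type; a process-wide pause detector
    // feeds detected freezes into every LatencyStats instance behind the scenes.
    private final LatencyStats readLatencyStats = new LatencyStats();

    void timeAndRecordRead(Runnable readOperation) {
        long startNanos = System.nanoTime();
        readOperation.run();
        // Same recording idiom as the naive version, but pause-compensated under the hood.
        readLatencyStats.recordLatency(System.nanoTime() - startNanos);
    }

    void reportInterval() {
        // The interval histogram includes back-filled samples for any detected freezes.
        Histogram intervalHistogram = readLatencyStats.getIntervalHistogram();
        System.out.println("corrected 99%'ile (ns): " + intervalHistogram.getValueAtPercentile(99.0));
    }
}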

Nick Yantikov

unread,
May 5, 2014, 2:11:52 PM5/5/14
to mechanica...@googlegroups.com
Hello Gil.


On Sunday, May 4, 2014 6:26:24 PM UTC-7, Gil Tene wrote:
I started talking about Coordinated Omission in test systems mostly, but I kept finding it everywhere, including in many production monitoring systems.

 
[...skip...]

I skip (but do not dismiss) CO sanity check logic for now because to be honest I will need significantly more time to parse through it...
 
For anyone who doubts this, here is a simple exercise:  go watch the %'lie outputs in Cassandra monitoring output. Manually ^Z the Cassandra servers for 6 seconds every minute, and see what your console reports about the cluster's 99%'ile... Now think about what the ops and capacity planning people responsible for service level management of these clusters are thinking when they look at that output. 

I do not have a Cassandra cluster in front of me to complete this exercise so I will have to trust your inputs on this.
- What kind of cluster size are we talking about?
- Is it a 6-second pause in one of the nodes of a 100-node cluster?
- Does the dashboard only display per-cluster stats or is there a way to see stats per node?
- What is the stats aggregation technique to arrive at the per-cluster level?
- If you are concerned with a 6-second pause, are Cassandra stats configured to report on at least 6-second intervals?
- Does the dashboard register a drop in throughput during the pause?
 

[Oh, and for those who doubt that this ^Z is a valid test for Cassandra: Read your GC logs. ;-) ]

 
More comments inline below.

On Sunday, May 4, 2014 3:27:47 PM UTC-7, Nick Yantikov wrote:
Hello Gil

First of all let me solute you for the amount of writing you have done on this topic and how patiently you explain CO concept over and over again. It took me at least 6 hours just to read (and try to make sense of) it all in this thread. I can imaging how much time it took you to write it. Therefore I appreciate in advance the time you put in your response (if you decide to respond).

I agree with many things you said. But since no "new value" can be created by agreeing with you I will try to "poke holes" in your theory or define some boundaries for it.

Let me start with this unpretentious :-) statement "Coordinated omission (the way you described it) does not exist in production environments".
I let myself to be blunt with statement like this as I could not read you with 100% certainty. In some posts you say that CO is entirely testing technique problem in other cases I think you say that CO (and as a result compensating for CO) does have its merits in context of production monitoring. 

I am approaching this discussion with production performance monitoring angle (which might be an important distinction). In my opinion production environment is the true source of events that are happening in it and therefore there could be nothing (or at least very little) coordinated in it. The number of use cases in production environment is so large that even processes that may seem to fall under "coordinated omission" scenario may very well be application's features. For example, even if periodic/scheduled processes are paused and delayed it can very well be a business requirement that they skip next execution. Web requests from users are pretty much always uncoordinated.

As far as "omission" goes I feel that as long as you do account for every single monitored event that has happened in production environment (as apposed to event sampling for example) then there is no omission.

Actually, sampling would be fine, and omission would be fine, as long as no coordination happens.


I was talking about different sampling. I meant sampling of response times as opposed to sampling of percentiles.
 

But unfortunately, production monitoring of event latency is just as susceptible to coordination as test measurement is. 

Even if you collected all actual production client-observed response times, you would remain exposed. E.g. if you data indicated that during each period of 1000 seconds, each production client reported 1 response taking 500 seconds, and 9999 responses taking 20 msec each, and you did not recognize that there is coordinated omission in the incoming data set, you will erroneously report that the 99.99%'lie observed by clients is 20msec (which would be wrong by a factor of 2,500x). Natural noise and variability in these numbers would not make the conclusion any less wrong.


I don't think you are telling the whole story here. I need more data. Let's dig in. Just to continue with the Cassandra theme, let's assume our clients are tomcat threads in the same webapp instance and there is some code that is measuring Cassandra read response times. I emphasized "same webapp" as that's what I believe will make clients "coordinated". Am I on the right track so far? I have my metrics reported to graphite as "cassandra.read.9999thpercentile" and "cassandra.read.count". Now one of the issues that I see in your example is that percentiles are reported at 1000-second intervals. This is a red flag for me as it seems way too infrequent. The first question I will ask is "Does that reporting frequency align with the SLA we are monitoring?". This is a very important question which is the basis of my argument. This is the same as asking "Are we OK to miss a 500-second full GC?"

Let's continue. So we set it up the way you described it. I plot my 99.99th percentile and it shows 20msec. But am I exposed? What happens to my
"cassandra.read.count" metric which I use to derive my rate from? Do I see that my rate has dropped 50% compared to the previous 1000-second reporting interval? Let's say I don't see a drop because every time, without fail, every 1000 seconds there is a full GC and my rate is constant at 10000 per 1000 seconds.

Now help me please understand how CO correction works here. The freeze detection mechanism has figured out that there was a full webapp pause. What rate would you use as the expected rate? Will you calculate the rate based on a 1000-second moving average? Or will you use, say, a 100-second moving average? If you do use a 100-second moving average then what you essentially do is align your monitoring with the SLA, which I am arguing you should do in the first place.

This is not it yet. Let's say we corrected for CO by recording "fake" or "virtual" cassandra read response times. What is going to happen to my cassandra.read.9999thpercentile and cassandra.read.count metrics? They will show that the 99.99th percentile is 500 seconds and my rate is 20000 reads per 1000 seconds. Am I still on the right track here? Will you use these stats for Cassandra capacity planning? What should a person looking at this stat conclude? That Cassandra was slow? But Cassandra was just fine. It was idle all that time while the webapp was GC'ing, and Cassandra does not need any additional capacity. Are we going to send ops chasing Cassandra's slow response times which did not exist?

Another piece that is missing in this story, which I was trying to convey, is that if I had invested in adequate monitoring I would have other uncoordinated "clients" that would detect that the webapp has not responded for 500 seconds.
Moreover, my access-log-based stats will show a spike in HTTP request response times pointing to the true source of the issue - the webapp.
 
Attached is a chart of an actual production system doing exactly this sort of reporting (at a different time scale, but with a  similar ~2,500x discrepancy), plotted against the reporting of two different level of CO-correction of the same data. All three sets of data were collected at the same time, on the same system, in production. The corrected data sets include a correction technique that is now part of LatencyStats class, meant to drop straight into this sort of measurement idiom (see http://latencyutils.org)


 
You correctly pointed out in other threads that it is fatal for the results of your monitoring to assume latencies are normally distributed. This is true for the rate of events too. You can never assume any particular distribution of rates and they are rarely uniformed in production environments. In addition rate distributions can vary during the day, they can vary due to business cycles (announcement of new product or feature), etc., etc.

Trying to compensate for CO in the context of production environment monitoring makes little sense to me because in order to compensate you have to predict rate of events that might or might not have happened. 
Which is impossible.

For in-process latency measurement in production, that's what I thought, too. But when I tell myself something is "impossible", I often find it hard to resist the urge to prove myself wrong. So I did, that's what the LatencyUtils library is about.

It turns out that while correcting for all causes of CO is "very hard", correcting for process freezes during in-process latency measurement is not that hard. And in much of the world I see, CO reporting errors are dominated by such process freezes. The technique I use under the hood in a LatencyStats object leverages a process-wide pause detector with a per-recorded-stats-object interval estimator. The pause detector is pluggable, with a provided default detector is based on consensus pause observation across a number of detection threads. It uses settable thread counts and thresholds. It is very reliable at detecting pauses of 1msec or more using three sleeping threads (which is nearly free to do), and seems to detect process-wide pauses as small as 50usec reliably if you are willing to burn 3-4 cpus spinning non-stop to do so (which most people will likely not be willing to do). The interval estimator is also pluggable, but the one provided is basically a time-capped moving average estimator, which does a pretty good job. 

The answer to the "in order to compensate you have to predict rate of events that might or might not have happened" point you make above (which is a good way of stating the problem) is that when a known process freeze occurs, projecting from the reality that preceded the freeze into the freeze period is a very good guess. Certainly much better than ignoring the fact that the freeze erased a whole bunch of known-to-be-terrible results, destroying the validity of the data set. Since we just need counts at various latencies, all we need to know is how many missing measurements there were, and what their magnitude would have been, and this is trivial to reconstruct given the freeze size and the expected interval estimation. Using a constant rate projection into the freeze works fine, with no need to model rate variability *within* the freeze, since it would not make much a difference to the impact of correction on the percentiles picture.

With all this, LatencyUtils only provides partial correction, but that partial correction is often of 99%+ of the error. The bigger and/or more frequent the actual process freezes are in your actual system, the more important the correction becomes, especially when freezes are a significant multiple of the typical measured latency (in which case the correction may often become the only valid signal above ~99%).

The use cases where you have no way of reliably detecting and correcting for CO worry me a great deal, for the simple reason that people become complacent thinking they have covered everything when they have not. You cannot reliably detect whether Cassandra's threads paused because they entered into prolonged contention due to a locking bug in code, or because a RAID controller was acting up and slowing down all threads, etc, etc. If, instead of chasing every single use case where a pause can occur in a production environment, you invest in more adequate monitoring, you would be covered better than by correcting for CO.
 
 
In my opinion spending your time to try to correctly compensate for CO inside production codebase is just a time that is better be spend to understand what the appropriate reporting/alerting is, how to correctly align monitoring with SLA. You have to work with what has really happened in the system and build your monitoring and alerting tools around it.

The problem that does exist is that performance monitoring is either insufficient or is disconnected from SLAs.

Let's take your wonderful concierge business example.

If business owner did have monitoring tools that are correctly aligned with his service level agreement he would have not missed a problem during testing. The disconnect that happened in this example is that business owner promised to answer phone within 3 rings 99% of the time for all calls placed during the day. I would like to emphasize the "during the day" part of last sentence as I think it got neglected in your example. In other words business owner promised to answer within 3 rings 99% of the time for every 10 calls customer may have made. In his test however business owner calculated 99th percentile for 420 calls which is 2 months worth of calls for his business. This is the disconnect between his performance reporting and his own SLA. What he should have done is to calculate 99th percentile for every 10 test calls that were made. Or he should have been more aggressive and have built in some cushion and calculated 99th percentile for every 5-6 calls.

In the specific scenario, I don't see a need for groups-of-10-calls testing. While it will probably work, a very simple test system, as well as a very simple production-time-sanity-checking system can easily measure the right thing (the right thing is "what would your customer's perception of the percentile be?"): just poll the business at random intervals some number of times per day (which should also vary somewhat), and report of the stats of what you see. Just make sure not to coordinate between polling attempts, and your percentile numbers will match what your customer would consider to be a valid representation, and therefore a useful way to gauge the quality of your system.

Yes #1.
 

A scenario like the one I describe (the concierge service) is actually one where you would want you rproduction-time validation to be happening much more frequently than the customer's actual use, so that you will be likely to detect (and potentially correct) issues before your customer does. This sort of "failure is very expensive" driven continuous production validation is actually common in the real world as well. E.g. of algo trading system that only trade a few times a day, it is common practice to do sanity test trades (small or irrelevant) at a significanlty higher rate than the actual trades that matter happen, just to maintain an evaluation of what expected trade behavior may be like (some systems can even loop this knowledge back into the actual trading decisions).

Yes #2.
 
 
The other way to approach his reporting would be to calculate 99th percentile for every 30 minute time frame (which at the most would have 6 calls in every 30 minute set of results) and plot it as time series. This is essentially how reporting is normally done in software production environments.

I agree that percentiles should usually be reported in time intervals, as that's how they are most often stated. When requirements are set, they are usually stated about a worst case time interval. E.g. "The 99.9%'lie should be 200usec or better in any given 15 minute period during the day" as opposed to "in at least one 5 minute interval during the day" or "across the entire day, even if it is much worse during some 15 minute period." 

Yes #3.

Thank you. These were my arguments all along. Understand your SLA and make sure you have adequate monitoring techniques aligned with that SLA, and you will be covered regardless of whether you detected/corrected for CO or not.
 
 
In addition (however this is off-topic) business owner and all his personnel should have been paged every time phone rings for more than 6-8 times. He should have had a list of all missed calls and proactively should have called his customer back, he should have plotted time series of the calls throughput his business handles, etc. etc.

To summarize. Attempt to solve "Coordinated Omission" problem in context of production monitoring is akin to misdirecting your effort. Since CO compensating cannot be applied reliably and universally it will lead to misleading and confusing results.

Look at LatencyUtils.org, think of how using LatencyStats would apply to something like Cassandra reporting it's production-time latency stats, and let me know if that changes your mind...

I will certainly take a look at it.

Thanks
-Nick
 

Gyula Csom

unread,
Jan 13, 2015, 3:10:09 PM1/13/15
to mechanica...@googlegroups.com
Hi Jason,

Although it's a bit of an old thread, let me add a few thoughts/questions. From the previous explanation(s)...

To my understanding the CO issue fires iff (if and only if) there's a global effect that delays (almost) all tasks. By 'global effect' I mean a kind of 'stop the world' event which affects not just one (or a few) threads but all the threads in the system at some point in time. Otherwise, if only one (or a few) threads are affected, then it is not an issue; the unaltered stat will work.

Hence, besides the testing method, this might be related to the (probably unknown) reasons causing the unexpected latencies (high above the average). If the effect is global then CO fires and breaks the unaltered stats. If the effect is local (delays only one or a few threads) then probably CO doesn't, and the unaltered stat might work.

Although the event 'as is' might not be observable, its effect (whether it is global or local) might be detected, at least in a statistical way. We just have to check whether the high latency points of different threads are correlated with each other or not. If they tend to happen at the same time then one would suspect global effects; if not, then they probably represent local events.

After all, I guess that before altering the original stats it would be reasonable to check whether the delays were correlated or not. If this makes sense :-) then I would like to ask whether there's any (practical or lab) experience regarding this. That is: Is it typical that events causing delays are global ones? Are there any local events at all?

Regards,
Gyula

On Wednesday, August 7, 2013 at 11:37:27 PM UTC+2, Jason Koch wrote:
Hi Attila

I'll attempt to explain. Gil has a presentation on it and I'm sure there's a few analogies in these email threads. CO is something to do with your testing methodology, and I'll give an "obvious" example that shows why your tests might not show the real world problems if you're suffering CO.

The challenge is that, in the real world, your consumers are not coordinated with your system. The consumers are independent. They will send a request whether you have responded to someone else's request or not. So ..

Suppose you are using N consumer threads in a tester, and each is in a tight loop for 1 million "events". E.g.:
for (int i = 0; i < 1000000; i++) {
    long start = System.nanoTime();
    log.info("xxxxxxx");
    long duration = System.nanoTime() - start;
    addToHisto(duration);
}

Here's where CO comes in:
Suppose that during the middle of the run, there was something that causes the log.info() time to take longer. It could be a hashmap rebuild, a file system buffer full flush, a GC run; it doesn't really matter what it is, just that for some reason processing is stalled. In your test harness, the "consumer" threads wait for the system to recover before sending more requests (they are coordinated). So, only "N" requests would be showing the full pause time (e.g. 1 second), all other requests would show un-paused time (e.g. 1 ms).

However in a _real world_ situation, it's unlikely your consumers are going to recognise that your system is under load and stop sending requests for that full second. They're going to keep firing them, and probably a lot of times, during that 1 second. So in the real world, all of these consumers will ALSO see that pause. The issue of CO is how many requests are _truly_ impacted by that 1 second pause versus the N that you've shown in your test report (due to omitted results).

Hope this helps (?).

Thanks
Jason





On Thu, Aug 8, 2013 at 5:43 AM, Attila-Mihaly Balazs <dify...@gmail.com> wrote:
Hello all,

I had a chance to rerun the test with Gil's modifications and also some changes thought up by me: https://gist.github.com/cdman/6177268

The gist of it (sorry for the pun) is that I don't seem to get differences with this test.

And sorry to harp on this, but I still don't get why CO is something to worry about, so I would appreciate a "for dummies" explanation:

- I get that no percentile is a guarantee for the maximum value (as is demonstrated by Gil's customer service metaphor) and you should always agree on both with the business sponsors when creating a system

- On the other hand I find it to be just a flaw in a theoretical model where we insist that we want to "poke" the system every X ms (or maybe us) and failing to do so deviates from our test plan. However, what would such a test model? Are we trying to write a hard-realtime system? :-) (or maybe a trading system as Peter has described, where we have X ms to process each network packet). In this case we should be worried about maximums. Other than that, systems are very nondeterministic and we're not showing anything by trying to create a very rigid benchmark runner. The only use case I can think of is if for some reason (marketing?) we are trying to show that the system can handle "a sustained load of X events / second" - btw if we make such a claim, what is the difference between shoving down X events at once and waiting until the next second vs trickling them? The math works out the same.

- Finally, an observation: I'm no math person (just ask any of my university math teachers :-)), but AFAIK StdDev / Mean are only useful if we suppose that the underlying distribution is a normal one. However latency distributions (again, AFAIK) usually are not normal and as such mean / stddev are not useful. As such I am surprised that HdrHistogram prints them out.

Attila




Michael Barker

unread,
Jan 13, 2015, 4:49:46 PM1/13/15
to mechanica...@googlegroups.com
To my understanding the CO issue fires iff (if and only) there's a global effect that delays (almost) all tasks. By 'global effect' I mean kinda 'stop the world event' which effects not just one (or a few) thread but all the threads in the system at some point in time. Otherwise if only one (or a few) thread is effected then it is not an issue, the unaltered stat will work.

I don't think this is correct. CO can be the result of other aspects of the system, e.g. a single thread (perhaps the network one) could cause TCP back pressure on the load generator, delaying the sending of requests, thus creating coordinated omission. 

Gyula Csom

unread,
Jan 13, 2015, 5:37:00 PM1/13/15
to mechanica...@googlegroups.com
On Tuesday, January 13, 2015 at 10:49:46 PM UTC+1, mikeb01 wrote:
To my understanding the CO issue fires iff (if and only) there's a global effect that delays (almost) all tasks. By 'global effect' I mean kinda 'stop the world event' which effects not just one (or a few) thread but all the threads in the system at some point in time. Otherwise if only one (or a few) thread is effected then it is not an issue, the unaltered stat will work.

I don't think this is correct.  CO can be the result of other aspects of the system, e.g. a single thread (perhaps the network one) could cause TCP back pressure causing the load generator, delaying the sending of requests.  Thus creating coordinated omission. 

I'm not focusing on the "why" (could be a network issue as you put or something), I'm focusing on the effect: whether it is a global event effecting the whole system (this case the whole benchmark), or a local one, effecting only one, or a few tasks (this case a particular thread).

Also note that, my argument is only relevant if multiple test threads are running in parallel.

For instance your sample seems to be a global event/effect: since it hits the load generator as a whole, all (parallel) threads will be delayed and hence CO fires. Meanwhile... a local effect could be for instance a cache miss for a particular thread, then it should not effect anything but that particular thread. Hence to my understanding... CO does not fire this case. But I could be wrong again:-)

Gil Tene

unread,
Jan 13, 2015, 5:42:59 PM1/13/15
to mechanica...@googlegroups.com


On Tuesday, January 13, 2015 at 12:10:09 PM UTC-8, Gyula Csom wrote:
Hi Jason,

Although it's a bit old thread, let me add a few thoughts/questions. From the previous explanation(s)...

To my understanding the CO issue fires iff (if and only) there's a global effect that delays (almost) all tasks. By 'global effect' I mean kinda 'stop the world event' which effects not just one (or a few) thread but all the threads in the system at some point in time. Otherwise if only one (or a few) thread is effected then it is not an issue, the unaltered stat will work.

Any effect that spans more than a single operation applies. The only situation where omission that results from a longer-than-expected operation will *not* lead to coordinated omission is when the longer-than-expected operation has no correlation whatsoever to other operations taking longer than normal.

Thus, the effect doesn't have to be global, doesn't have to be stop-the-world, and does not have to be cross-thread. There are many examples of non-global and non-stop-the-world situations that would cause a higher-than-normal likelihood of nearby (in time) operations taking longer than is typical. E.g. it is enough to have some thrashing situation cause a 10x slowdown in the operation of a single thread in a 10-thread work pool for CO to be significant.
 
Hence, besides the testing method, this might be related to the (probably unknown) reasons causing the unexpected latencies (high over the average). Iff the effect is global then CO fires and breaks the unaltered stats. If the effect is local (delays only one or some thread) then probably CO doesn't, the unaltered stat might work.

When it comes to test systems, "might work" is a funny term. If your test system didn't do the test that you wanted it to do, the result "might" still mean something. And it might not. If you bet on the not, you'll probably win most of the time.

A simple way to look at things is this: Had all responses been instantaneous, the test system would have taken measurements at certain points in time. You can think of this as the "test plan". Whenever the test system fails to precisely execute the test plan as a result of longer-than-normal operation time (e.g. it had to wait for a long response before sending the next request at the "on plan" time), CO is extremely likely. The only times CO will not occur in such situations are ones where there is no higher likelihood of slow operations being measured near other slow operations than near any other operation.

Although the event 'as is' might not be observable, its effect (whether it is global or local) might be detected, at least in a statistical way. We just have to check whether the high latency points of different threads are correlated to each other or not. If they tend to happen at the same time then one would suspect global effects, if not then probably they represent local events.

"Tend" or not does't matter. Even if the slowdowns "tend" to be unrelated (e.g. only 1 in 1000 slow operations happen right after another slow one), if their likelihood is any higher than normal (e.g. if slow operation happen only 1 in 10,000) the increase likelihood of slow operations being bunched closer in time is enough for the results to be off. The only "safe" situation in one where there is no additional likelihood of slow things happening near each other in time. And establishing that with any confidence is very hard.

I differentiate between testing and monitoring:

In testing, the simple way to avoid CO is to measure all latencies from the time the test system was supposed to start an operation until the operation completed (as opposed to measuring from the time the test system started the operation until it completed). This can easily be done for constant throughput testers, and fairly easily done for test systems with much more complex "plans" (e.g. random/stochastic start times still describe a plan). Test systems for which an expected start time for each operation cannot be computed are the ones that are hard to correct, and arguably such systems are broken to begin with (the fix is to have a plan).
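
As a minimal sketch of that idea (my own illustrative code, not any particular load tester): a constant-throughput driver computes each operation's intended start time from the plan and measures latency from that intended start, so time spent queued behind a stalled operation is charged to the samples it delayed:

import org.HdrHistogram.Histogram;

public class PlannedRateTester {
    // Illustrative constant-throughput driver: latency is measured from the *planned*
    // start time of each operation, not from whenever the loop finally got around to it.
    public static void runPlan(Runnable operation, long operationsPerSecond, long totalOperations) {
        Histogram histogram = new Histogram(3600_000_000_000L, 3);
        long intervalNanos = 1_000_000_000L / operationsPerSecond;
        long testStart = System.nanoTime();

        for (long i = 0; i < totalOperations; i++) {
            long plannedStart = testStart + i * intervalNanos;
            while (System.nanoTime() < plannedStart) {
                Thread.onSpinWait(); // wait until the planned start time (spin for simplicity)
            }
            operation.run();
            // Any delay behind a previous slow operation is charged to this sample too.
            histogram.recordValue(System.nanoTime() - plannedStart);
        }
        System.out.println("99.9%'ile (ns): " + histogram.getValueAtPercentile(99.9));
    }
}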

In monitoring, the problem is that the "plan" is unknown, and that is where detection and correction can help alleviate the effects of CO. LatencyUtils shows an example of doing that for in-process latency monitoring code: e.g. when using LatencyStats objects to record latencies, a process-global pause detector notifies the individual LatencyStats objects that a verifiable global stall event has occurred and must be compensated for. You can similarly correct for other detectable "pauses" (e.g. a DB stall). But detecting and correcting for a "slowdown" is much harder to do.

After all I guess that before altering the original stats, it would be reasonable to check whether the delays were correlated or not. If this makes sense:-) then I would like to ask, whether there's any (practical or lab) experience regarding this? That is: Is it typical that events causing delays are global ones? Are there any local events at all?

There are certainly cases where individual operations may be slow with no correlation to any other operation (e.g. maybe that was just a very complex query), but my personal experience is that in the real world abnormally long operations tend to be highly correlated in time with other abnormally long operations. Whether it is CPU scheduling artifacts (affecting one thread), lock contention storms, disk I/O storms, or network storms affecting many threads, or larger scale things like GC pauses, log rolling or flushing, and in-cluster failover events, bad things often affect many operations in a time-correlated way.

Gyula Csom

unread,
Jan 13, 2015, 6:26:53 PM1/13/15
to mechanica...@googlegroups.com
Thanks for your response!

I think I got the meaning... at least some piece of it...

To my understanding, you mean that one must test (in a statistical sense) whether (nearby) high latency events (as random variables) are correlated with each other or not. (For instance, to put it simply, for two threads the 'correlation' would mean that P(AB) > P(A)*P(B), where P(A), P(B) are the probabilities of a high latency event for the A and B thread, and P(AB) is the probability of two 'nearby' high latency events. If I understood it well, then my intuition says that CO-related correction should somehow depend on the degree of correlation... That is, CO might not be a binary, 0 or 1 thing.)

Meanwhile, as I understand your thoughts... the correlation does not necessarily mean a 'common cause'; that is, the higher likelihood between nearby high latency events could be caused by non-global events/effects.

This is an interesting subject:-)

Benedict Elliott Smith

unread,
Jan 13, 2015, 7:13:04 PM1/13/15
to mechanica...@googlegroups.com
 I've had interesting discussion with people who say that the reported percentiles are "honest" because those are the time stats for operations that actually happened

As one of the people who've had a spirited hour+ long debate with Gil on this topic, I would like to defend the position briefly: coordinated omission detection is far from perfect (it will by no means detect all sources), and requires various assumptions such as a steady rate of message arrival, a symmetrical distribution of message size/workload, pauses being system-wide... These things are not always true, but even if they were, the same assumptions could be applied as a post process of the raw data from a stream of imperfect measurements (it would be extraordinarily rare to have lengthy pauses be omitted completely from such a steady state system, so the event will ordinarily be present in a non-lossy histogram of latencies), by detecting temporally adjacent outliers and padding their number. In fact LatencyUtils essentially does this, only with explicitly detected and assumed-to-be-global pause overlays instead of applying those assumptions to the raw data (only the manipulation of the data).

The nice thing about this is that we (Cassandra) don't have to explain what our concept of truth is: it may not be perfect, but it is well defined, and can be modified for interpretation however any operator wants. In reality, simply looking at max latency over a given time horizon is usually more than sufficient to help operators spot problems. That's not to say there isn't more to be done to help here (there's lots). But ultimately, if you really care about this being accurate, your applications need to monitor each discrete unit of work themselves. Expecting your datastore with its limited view of the whole system to monitor said whole system is never going to be sufficient.


On Mon, May 5, 2014 at 2:26 AM, Gil Tene <g...@azulsystems.com> wrote:
I started talking about Coordinated Omission in test systems mostly, but I kept finding it everywhere, including in many production monitoring systems. Coordinated Omission is nothing more than a measurement methodology problem. It exists in both test and production environments, and occurs whenever data is omitted ahead of analysis in a way that is coordinated with the [usually bad] behavior of the system. Coordinated omission basically guarantees a huge garbage-in-garbage-out relationship between the data feeding into [downstream] analysis and most operational conclusions that people make based on that data.

It is true that CO is most easily demonstrated in test environments, because test environments commonly have a "plan" they are supposed to execute, and it is easy to demonstrate that the expected plan was not actually executed, and that the results collected are therefore bogus. Avoiding CO in test environments can sometimes be done if a valid asynchronous tester can be build. This is NOT possible for many applications, e.g. ones that include synchronous request-reponse semantics in their protocols, like all those that use HTTP, and most of those that use TCP. Correcting for CO is a separate thing: it can be done to varying degrees depending on what is known and what can be assumed about the system. In test environments with known expected scenario plans usually include enough information for pretty good correction to be possible.

But CO happens in production. All the time. And production monitoring tools that attempt to report on latency behavior often encounter it. It is certainly harder to correct for CO in production, since there is not "plan" that you can use for correction like you can in a test system. But it is easy to show that CO happens (or at least demonstrate an outrageously high likelihood that it happened). E.g. whenever a system freeze seems to magically coincide with a suspicious gap in recorded information, you should suspect CO. The existence of CO in production is sometimes easy to demonstrate even in summary latency reports whenever you notice that reported stats (e.g. Max, 99.9%, and average) are "obviously impossible". It usually takes CO + significant actual production time pauses to make the mess big enough that it is obviously wrong in the summaries, and most of the time production %'ile reporting will be be 2-4 orders of magnitude off from reality before the fallacy is easily demonstrated in summary reports. But I see this so often that I've started doing simple math tests on the common metrics people give me, and I often hit with one of them E.g.: 

When your summary report includes have a Max, Average, and total test time, the following test is easy to use the following pseudo logic::

if (reportedAverageLatency < ((maxReportedLatency * maxReportedLatency) / (2 * totalTestLength))) {
     Suspect Coordinated Omission.
     // Note: Usually, when this test triggers due to CO. it will not by a small margin...
     // Logic:
     // If Max latency was related to a process freeze of some sort (commonly true
     // for Java apps, for example), and if taking measurements were not coordinated
     // to avoid measuring during the freeze, then even if all results outside the pause
     // were 0, the average during the pause would have been Max/2, and the overall
     // average would be at least as high as stated above.
}


This sanity test is valid for any system (test or production) as long as reported Max times are related to system freezes. In Java based systems, this is easy to glean from looking the the max GC pause time in the GC logs. If it correlates in magnitude to the Max reported latency, the test above will reliably tell you whether your measurements include CO.

In addition, In the test includes a Max latency, a total test time, and some percentile level, you can do the following pseudo logic:

// Sanity test reportedLatencyAtPercentileLevel for a given
// percentileLevel, maxReportedLatency, and totalTestLength. 

Double fractionOfTimeInPause = maxReportedLatency / totalTestLength;
Double fractionOfTimeAbovePercentileLevel = 1.0 -  percentileLevel; // Assumes percentileLevel is a fraction, between 0.0 and 1.0
Double smallestPossibleLatencyAtPercentileLevel = maxReportedLatency * ( 1.0 - (fractionOfTimeAbovePercentileLevel / fractionOfTimeInPause));
if (smallestPossibleLatencyAtPercentileLevel > reportedLatencyAtPercentileLevel) {
    
Suspect Coordinated Omission.
}

A common cause for production-time CO has to do with the fact that production time latencies are often reported not based on client perception, but based on internally collected information in what we would (outside of production) call the "system under test". E.g. many servers will collect internal operation latency stats and float them up to consoles. E.g. Cassandra reports the internally observed 99%'lies of read and writes operations, and those numbers are floated up and shown across clusters in monitoring dashboards. The times are collected by measuring time before and after each operation and recording it into a stats bucket, and that measurement technique has classic CO: It will basically ignore of dramatically diminish the reporting of any "freeze" in the process.  I've had interesting discussion with people who say that the reported percentiles are "honest" because those are the time stats for operations that actually happened. But where such arguments miss the mark is when you look at what people reading the dashboards *think* the %'lies are telling them. I've tested this by asking ops people (without raising suspicion of a "trap") what they think dashboard reports mean, and how they can use them. In my samples, all ops people, developers, and testers I've asked assume that when a console says "99%'lie", is effectively saying something along the lines of "99% of reads took 5msec or less", and that they can use that as a way to either describe the behavior the system had exhibited, or use it to project how it may behave. E.g. they may use this information in capacity planning, or in pre-deployment verification of SLA readiness, or look at current production numbers to judge the operational readiness reports ahead of expected load increases [think "planning for holiday season shopping"]. I all these cases, the numbers they are looking at often drive very bad decisions. Unfortunately, in systems that display this sort of %'lie data (like Cassandra's read and write latency reports for the 99%'lie), the actual meaning of the number displayed for 99% is: "1% of reads took at least this much time, but it could be 1,000x as much as what is shown here. We only measured a lower bound on how bad things were in the remaining %'lie (1-X), not an upper bound on how bad things were in the % we show." 

For anyone who doubts this, here is a simple exercise:  go watch the %'lie outputs in Cassandra monitoring output. Manually ^Z the Cassandra servers for 6 seconds every minute, and see what your console reports about the cluster's 99%'ile... Now think about what the ops and capacity planning people responsible for service level management of these clusters are thinking when they look at that output. 

[Oh, and for those who doubt that this ^Z is a valid test for Cassandra: Read your GC logs. ;-) ]

More comments inline below.

On Sunday, May 4, 2014 3:27:47 PM UTC-7, Nick Yantikov wrote:
Hello Gil

First of all let me solute you for the amount of writing you have done on this topic and how patiently you explain CO concept over and over again. It took me at least 6 hours just to read (and try to make sense of) it all in this thread. I can imaging how much time it took you to write it. Therefore I appreciate in advance the time you put in your response (if you decide to respond).

I agree with many things you said. But since no "new value" can be created by agreeing with you I will try to "poke holes" in your theory or define some boundaries for it.

Let me start with this unpretentious :-) statement "Coordinated omission (the way you described it) does not exist in production environments".
I let myself to be blunt with statement like this as I could not read you with 100% certainty. In some posts you say that CO is entirely testing technique problem in other cases I think you say that CO (and as a result compensating for CO) does have its merits in context of production monitoring. 

I am approaching this discussion with production performance monitoring angle (which might be an important distinction). In my opinion production environment is the true source of events that are happening in it and therefore there could be nothing (or at least very little) coordinated in it. The number of use cases in production environment is so large that even processes that may seem to fall under "coordinated omission" scenario may very well be application's features. For example, even if periodic/scheduled processes are paused and delayed it can very well be a business requirement that they skip next execution. Web requests from users are pretty much always uncoordinated.

As far as "omission" goes I feel that as long as you do account for every single monitored event that has happened in production environment (as apposed to event sampling for example) then there is no omission.

Actually, sampling would be fine, and omission would be fine, as long as no coordination happens.


But unfortunately, production monitoring of event latency is just as susceptible to coordination as test measurement is. 

Even if you collected all actual production client-observed response times, you would remain exposed. E.g. if your data indicated that during each period of 1000 seconds, each production client reported 1 response taking 500 seconds, and 9999 responses taking 20 msec each, and you did not recognize that there is coordinated omission in the incoming data set, you would erroneously report that the 99.99%'lie observed by clients is 20msec (which would be wrong by a factor of 2,500x). Natural noise and variability in these numbers would not make the conclusion any less wrong.

Attached is a chart of an actual production system doing exactly this sort of reporting (at a different time scale, but with a similar ~2,500x discrepancy), plotted against the reporting of two different levels of CO-correction of the same data. All three sets of data were collected at the same time, on the same system, in production. The corrected data sets include a correction technique that is now part of the LatencyStats class, meant to drop straight into this sort of measurement idiom (see http://latencyutils.org).

You correctly pointed out in other threads that it is fatal for the results of your monitoring to assume latencies are normally distributed. This is true for the rate of events too. You can never assume any particular distribution of rates, and they are rarely uniform in production environments. In addition, rate distributions can vary during the day, they can vary due to business cycles (announcement of a new product or feature), etc., etc.

Trying to compensate for CO in the context of production environment monitoring makes little sense to me, because in order to compensate you have to predict the rate of events that might or might not have happened.
Which is impossible.

For in-process latency measurement in production, that's what I thought, too. But when I tell myself something is "impossible", I often find it hard to resist the urge to prove myself wrong. So I did; that's what the LatencyUtils library is about.

It turns out that while correcting for all causes of CO is "very hard", correcting for process freezes in in-process latency measurement is not that hard. And in much of the world I see, CO reporting errors are dominated by such process freezes. The technique I use under the hood in a LatencyStats object leverages a process-wide pause detector with a per-recorded stats interval estimator. The pause detector is pluggable, but the default detector is based on consensus pause observation across a number of detection threads. It uses settable thread counts and thresholds. It is very reliable at detecting pauses of 1msec or more with sleeping threads (which is nearly free to do), and seems to detect process-wide pauses as small as 50usec reliably if you are willing to burn 3-4 cpus to do so (which will likely not happen much). The interval estimator is also pluggable, but the one provided is basically a time-capped moving average estimator, which does a pretty good job. 

The answer to the "in order to compensate you have to predict rate of events that might or might not have happened" part above is that when a known process freeze occurs, projecting from the reality that preceded the freeze into the freeze period is a pretty good guess. Certainly much better than ignoring the fact that the freeze erased a whole bunch of known-to-be-terrible results, destroying the validity of the data set.
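To make the projection idea concrete, here is a minimal sketch of the back-fill step (my illustration of the idea, not the actual LatencyStats internals), recording the projected latencies into an HdrHistogram:

import org.HdrHistogram.Histogram;

/**
 * Minimal sketch of the freeze back-fill idea: when a process-wide pause of
 * pauseLengthNanos is detected, project the estimated inter-operation interval
 * that preceded the freeze into the freeze window, and record the latencies
 * the "missing" operations would have experienced.
 */
public final class FreezeBackfill {
    public static void backfill(Histogram histogram,
                                long pauseLengthNanos,
                                long estimatedIntervalNanos) {
        // Operations that would have started during the pause would have seen
        // latencies of pauseLength, pauseLength - interval, pauseLength - 2*interval, ...
        for (long missed = pauseLengthNanos; missed > 0; missed -= estimatedIntervalNanos) {
            histogram.recordValue(missed);
        }
    }

    public static void main(String[] args) {
        Histogram h = new Histogram(3600_000_000_000L, 3);          // track up to 1 hour, 3 decimal digits
        for (int i = 0; i < 10_000; i++) h.recordValue(1_000_000);  // 10K ops at 1 msec each
        System.out.println("99.9%'ile before back-fill (nsec): " + h.getValueAtPercentile(99.9));
        backfill(h, 2_000_000_000L, 1_000_000L);                    // a detected 2 second freeze
        System.out.println("99.9%'ile after back-fill (nsec):  " + h.getValueAtPercentile(99.9));
    }
}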

LatencyUtils only provides partial correction, but that partial correction is often 99%+ of the error. The bigger and/or more frequent process freezes are in your actual system, the more important the correction becomes, especially when freezes are a significant multiple of the typical measured latency (in which case the correction may often become the only valid signal above ~99%).
 
In my opinion, spending your time trying to correctly compensate for CO inside a production codebase is time that is better spent understanding what the appropriate reporting/alerting is, and how to correctly align monitoring with SLAs. You have to work with what has really happened in the system and build your monitoring and alerting tools around it.

The problem that does exist is that performance monitoring is either insufficient or is disconnected from SLAs.

Let's take your wonderful concierge business example.

If the business owner had monitoring tools that were correctly aligned with his service level agreement, he would not have missed the problem during testing. The disconnect in this example is that the business owner promised to answer the phone within 3 rings 99% of the time for all calls placed during the day. I would like to emphasize the "during the day" part of that sentence, as I think it got neglected in your example. In other words, the business owner promised to answer within 3 rings 99% of the time for every 10 calls a customer may have made. In his test, however, the business owner calculated the 99th percentile for 420 calls, which is 2 months worth of calls for his business. This is the disconnect between his performance reporting and his own SLA. What he should have done is calculate the 99th percentile for every 10 test calls that were made. Or he should have been more aggressive, built in some cushion, and calculated the 99th percentile for every 5-6 calls.

In the specific scenario, I don't see a need for groups-of-10-calls testing. While it will probably work, a very simple test system, as well as a very simple production-time-sanity-checking system, can easily measure the right thing (the right thing is "what would your customer's perception of the percentile be?"): just poll the business at random intervals some number of times per day (which should also vary somewhat), and report the stats of what you see. Just make sure not to coordinate between polling attempts, and your percentile numbers will match what your customer would consider to be a valid representation, and therefore a useful way to gauge the quality of your system.
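As a sketch of what such an uncoordinated prober could look like (the names, the probe body, and the 1-10 minute interval range here are all made up for illustration):

import java.util.concurrent.*;
import org.HdrHistogram.SynchronizedHistogram;

// Minimal sketch of an uncoordinated prober: probe start times are drawn
// independently of how long earlier probes take, so a slow or stuck system
// cannot suppress its own bad measurements.
public final class RandomProber {
    static final SynchronizedHistogram HISTOGRAM = new SynchronizedHistogram(3600_000_000_000L, 3);
    static final ExecutorService WORKERS = Executors.newCachedThreadPool();

    public static void main(String[] args) throws InterruptedException {
        while (true) {
            // the next probe time is chosen up front, regardless of any in-flight probes
            long delayMinutes = ThreadLocalRandom.current().nextLong(1, 11);
            Thread.sleep(TimeUnit.MINUTES.toMillis(delayMinutes));
            WORKERS.submit(() -> {
                long start = System.nanoTime();
                callService();                                     // the actual "phone call" goes here
                HISTOGRAM.recordValue(System.nanoTime() - start);
            });
        }
    }

    private static void callService() { /* hypothetical probe of the service being watched */ }
}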

A scenario like the one I describe (the concierge service) is actually one where you would want your production-time validation to be happening much more frequently than the customer's actual use, so that you will be likely to detect (and potentially correct) issues before your customer does. This sort of "failure is very expensive" driven continuous production validation is actually common in the real world as well. E.g. for algo trading systems that only trade a few times a day, it is common practice to do sanity test trades (small or irrelevant) at a significantly higher rate than the actual trades that matter, just to maintain an evaluation of what expected trade behavior may be like (some systems can even loop this knowledge back into the actual trading decisions).
 
The other way to approach his reporting would be to calculate 99th percentile for every 30 minute time frame (which at the most would have 6 calls in every 30 minute set of results) and plot it as time series. This is essentially how reporting is normally done in software production environments.

I agree that percentiles should usually be reported in time intervals, as that's how they are most often stated. When requirements are set, they are usually stated about a worst case time interval. E.g. "The 99.9%'lie should be 200usec or better in any given 15 minute period during the day" as opposed to "in at least one 5 minute interval during the day" or "across the entire day, even if it is much worse during some 15 minute period." 
 
In addition (however, this is off-topic), the business owner and all his personnel should have been paged every time the phone rings more than 6-8 times. He should have had a list of all missed calls and proactively called his customer back, he should have plotted a time series of the call throughput his business handles, etc. etc.

To summarize: attempting to solve the "Coordinated Omission" problem in the context of production monitoring is misdirected effort. Since CO compensation cannot be applied reliably and universally, it will lead to misleading and confusing results.
Look at LatencyUtils.org, think of how using LatencyStats would apply to something like Cassandra reporting its production-time latency stats, and let me know if that changes your mind...


Gil Tene

unread,
Jan 13, 2015, 7:39:25 PM1/13/15
to mechanica...@googlegroups.com


On Tuesday, January 13, 2015 at 3:26:53 PM UTC-8, Gyula Csom wrote:
Thanks for your response!

I think I got the meaning... at least some piece of it...

To my understanding, you mean that one must test (in a statistical sense) whether (nearby) low latency events (as random variables) are correlated with each other or not. (For instance, to put it simply, for two threads the 'correlation' would mean that P(AB) > P(A)*P(B), where P(A), P(B) is the probability of low latency events for the A and B thread, and P(AB) is the probability for two 'nearby' low latency events.

Replace "low" with "high" (which is what I think you meant anyway) and we agree. And yes, you can say that P(high) = P(1 - low) and it already means that...
 
If I understood it well, then my intuition says that CO-related correction should be somehow dependent on the factor of correlation... That is, CO might not be a binary, 0 or 1 thing.)

In a test system, the first goal should be to avoid coordinated omission, not to correct for it. And in a test system, this is very possible. [Unfortunately most test systems don't do that, and have coordinated omission galore]

If correction must be done (mostly because the test system has no viable way to deduce expected start time for operations), then the only reliable corrections I've played with so far are ones that take a 0 or 1 approach. Yes, there are probabilistic levels that can fill in the gaps some beyond the 0/1 level ("correcting more" than the certain corrections that are safe to do when 0/1 conditions are detected), but those get very hard to detect, estimate, and apply.
  
Meanwhile, as I understand your thoughts... the correlation does not necessarily mean 'common cause', that is, the higher likelihood between nearby high latency events could be caused by non-global events/effects.

Yes. 

The good news is that test systems can always be built to make this a non-issue. Coordinated omission in a test system only happens when the test system mis-represents the start time of an operation, and/or changes its execution script in response to the actual behavior of the system. When a test system executes "to plan" (i.e. all operations that it planned to execute are actually performed, even if it takes 17 retries and 3 reconnects), and considers the start time to be the planned time (not the send time of the last attempt that actually came back with a result), it does not experience coordinated omission, and its collected latency results are a good representation of the latency distribution actually experienced.
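A minimal sketch of that "execute to plan" idiom, assuming a constant-rate plan and a hypothetical doOperation() against the system under test:

import org.HdrHistogram.Histogram;

// Minimal sketch of a CO-free measurement loop: every operation has a planned
// start time, the plan is never stretched by back-pressure, and latency is
// measured from the *planned* start time, not from whenever the operation
// actually got sent.
public final class PlannedRateTester {
    public static void main(String[] args) {
        long opsPerSec = 1_000;
        long intervalNanos = 1_000_000_000L / opsPerSec;
        long planStart = System.nanoTime();
        Histogram histogram = new Histogram(3600_000_000_000L, 3);

        for (long i = 0; i < 100_000; i++) {
            long plannedTime = planStart + i * intervalNanos;
            // wait only if we are ahead of plan; if we are behind (e.g. the system
            // stalled), issue immediately: the plan does not get altered
            while (System.nanoTime() < plannedTime) { /* spin; could park for longer gaps */ }
            doOperation();                                           // retries/reconnects would happen in here
            histogram.recordValue(System.nanoTime() - plannedTime);  // measured from the planned start
        }
        histogram.outputPercentileDistribution(System.out, 1000.0);  // report values in usec
    }

    private static void doOperation() { /* hypothetical operation against the system under test */ }
}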

Monitoring is harder...

Gil Tene

unread,
Jan 13, 2015, 8:30:16 PM1/13/15
to mechanica...@googlegroups.com


On Tuesday, January 13, 2015 at 4:13:04 PM UTC-8, Benedict Elliott Smith wrote:
 I've had interesting discussion with people who say that the reported percentiles are "honest" because those are the time stats for operations that actually happened

As one of the people who've had a spirited hour+ long debate with Gil on this topic, I would like to defend the position briefly: coordinated omission detection is far from perfect (it will by no means detect all sources), and requires various assumptions such as a steady rate of message arrival, a symmetrical distribution of message size/workload, pauses being system-wide...

Coordinated omission correction is tricky business, for sure. Detection is a bit easier. E.g. if you detect a process-wide-affecting glitch (e.g. a GC pause) that covers a period of time significantly larger than the recent time distance between operations being recorded into some latency stats bucket, you are pretty much assured that an in-process latency measurement around your operation execution will experience CO, and that the stats collected will be way off of what anyone looking at them thinks they represent.
 
These things are not always true, but even if they were, the same assumptions could be applied as a post process of the raw data from a stream of imperfect measurements (it would be extraordinarily rare to have lengthy pauses be omitted completely from such a steady state system, so the event will ordinarily be present in a non-lossy histogram of latencies),

What you call "extraordinarily rare" is measured as ~20-60% likelihood in my experience. In most systems I see that report latencies measured within the code in the classic [take start time; do op; take end time; record time diff] method, 20-60% of large GC pauses in the GC logs fail to show up in the results.

E.g. in a Cassandra system doing 95% writes and 5% reads, under a non-saturating load, it would be common to have a GC pause triggered by a write operation occur when no read operation is in flight (when the pause started), but with 10s of read operations being attempted during the pause. The read stats would then never see the pause (not even once), and post-correction would not work.
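To make that measurement idiom explicit, it is roughly this shape (a generic sketch, not Cassandra's actual code); a read suppressed by the freeze never reaches the recording line at all:

import org.HdrHistogram.Histogram;

// The classic in-process measurement idiom: [take start time; do op; take end
// time; record time diff]. Only operations that actually ran get recorded, so a
// process-wide freeze that begins while no read is in flight never shows up in
// the read stats bucket at all.
final class InProcessReadTimer {
    private final Histogram readStats = new Histogram(3600_000_000_000L, 3);

    void timedRead() {
        long start = System.nanoTime();
        doRead();                                          // hypothetical read operation
        readStats.recordValue(System.nanoTime() - start);  // reads suppressed by the freeze are simply absent
    }

    private void doRead() { /* the operation being measured */ }
}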

by detecting temporally adjacent outliers and padding their number. In fact LatencyUtils essentially does this, only with explicitly detected and assumed-to-be-global pause overlays instead of applying those assumptions to the raw data (it applies them only to the manipulation of the data).

LatencyStats does not do perfect correction. It may fail to detect some pauses and other process-global glitches, and may under-correct. But the corrections it does do are very real, and are for known process-global events. The corrected results with LatencyStats are much closer to reality than the uncorrected ones.

But it's true that the right thing may be to over-correct rather than to under-correct.
 

The nice thing about this is that we (Cassandra) don't have to explain what our concept of truth is: it may not be perfect, but it is well defined, and can be modified for interpretation however any operator wants.

It may be well defined, but its definition does not match what an operator looking at the 99%'lie output indicator thinks they are looking at... When an operator is looking at a 99%'lie-labeled thing that shows 1msec even when 10% of the total wall clock time is spent in 500msec or larger pause times, they get unpleasantly surprised, and usually have words to say about the accuracy or truthfulness of the thing they are being shown. Telling them that the number is right because its definition is "99% of the operations that have good latencies" (or an equivalent attempt at avoiding the fact that it's simply not showing what they expected you to be measuring) will usually not make them happier.
 
In reality, simply looking at max latency over a given time horizon is usually more than sufficient to help operators spot problems.

If that's the reality, then that's what you should be showing. If you believe that your max time measurement is pretty good and that just looking at it is sufficient for operators to spot problems, why not just stick to that? Why add (or replace it with) a 99%'lie-labeled thing that shows the 99%'lie of good results (rather than the 99%'lie of all results)?
 

Kirk Pepperdine

unread,
Jan 14, 2015, 3:34:18 AM1/14/15
to mechanica...@googlegroups.com
Hi,

To my understanding the CO issue fires iff (if and only if) there's a global effect that delays (almost) all tasks. By 'global effect' I mean a kind of 'stop the world' event which affects not just one (or a few) threads but all the threads in the system at some point in time. Otherwise, if only one (or a few) threads are affected then it is not an issue; the unaltered stat will work.

I think what you are describing is just jitter in the server. This is different from omission, which refers to the load injector failing to send a request when it was expected to do so on its own, and from coordinated omission, where the load injector fails to send a request due to a dependency on the server. The effect in all three cases is the same: lower than expected load on the server.


Hence, besides the testing method, this might be related to the (probably unknown) reasons causing the unexpected latencies (high relative to the average). Iff the effect is global then CO fires and breaks the unaltered stats. If the effect is local (delays only one or some threads) then CO probably doesn't, and the unaltered stat might work.

You can get a global CO as well as a per use case CO. In the case of single trigger workloads (aka matching engines) it’s likely that you won’t see the per use case CO. However for more complex multi-step use cases it’s very likely that you’ll see a per use case CO.


Although the event 'as is' might not be observable, its effect (whether it is global or local) might be detected, at least in a statistical way. We just have to check whether the high latency points of different threads are correlated to each other or not. If they tend to happen at the same time then one would suspect global effects, if not then probably they represent local events.

CO is always statistically visible as long as you treat the data properly. As Gil says, if you want to hide a problem, only report an average. It's something that I've seen and have been showing people for years, so to have Gil name this observation is a huge step forward in facilitating discussions about the problem.



After all I guess that before altering the original stats, it would be reasonable to check whether the delays were correlated or not. If this makes sense:-) then I would like to ask, whether there's any (practical or lab) experience regarding this? That is: Is it typical that events causing delays are global ones? Are there any local events at all?

GC pauses in the server can cause this.


Regards,
Gyula

On Wednesday, August 7, 2013 at 11:37:27 PM UTC+2, Jason Koch wrote:
Hi Attila

I'll attempt to explain. Gil has a presentation on it and I'm sure there's a few analogies in these email threads. CO is something to do with your testing methodology, and I'll give an "obvious" example that shows why your tests might not show the real world problems if you're suffering CO.

The challenge is that, in the real world, your consumers are not coordinated with your system. The consumers are independent. They will send a request whether you have responded to someone else's request or not. So ..

Suppose you are using a few N consumer threads in a tester, and each is in a tight loop for 1 million "events". Eg:
for (int i = 0; i < 1000000; i++) { long start = System.nanoTime(); log.info("xxxxxxx"); long duration = System.nanoTime() - start; addToHisto(duration); }

Here's where CO comes in:
Suppose that during the middle of the run, there was something that causes the log.info() time to take longer. It could be a hashmap rebuild, a file system buffer full flush, a GC run; it doesn't really matter what it is, just that for some reason processing is stalled. In your test harness, the "consumer" threads wait for the system to recover before sending more requests (they are coordinated). So, only "N" requests would show the full pause time (eg 1 second), all other requests would show un-paused time (eg 1 ms).

However in a _real world_ situation, it's unlikely your consumers are going to recognise that your system is under load and stop sending requests for that full second. They're going to keep firing them, and probably a lot of times, during that 1 second. So in the real world, all of these consumers will ALSO see that pause. The issue of CO is how many requests are _truly_ impacted by that 1 second pause versus the N that you've shown in your test report (due to omitted results).

What Gil generally sees are single-step use cases that are triggered on a heartbeat, so his descriptions are typically from that perspective. What is being discussed here is a multi-step use case where the inter-request arrival rate is randomly distributed. In this case any individual inside a step would feel the pause, and that will slow them down and delay them from triggering the next step. IMHO that is a normal condition. The question you want to answer is: will that individual step always complete within the specified SLA? Next, anyone outside of a step will be *blind* to the problem in the server, and so those requests would trigger as per normal. But then they are in the same position as those already in a step. Again, if you’ve met the SLA it’s not a problem.

Regards,
Kirk


Benedict Elliott Smith

unread,
Jan 14, 2015, 5:41:16 AM1/14/15
to mechanica...@googlegroups.com
I should make it clear at this point that I speak only for myself, and not for the project as a whole.

bq. What you call "extraordinarily rare" is measured as ~20-60% likelihood in my experience

1. I am unconvinced of this number for cross-cluster (QUORUM) reads measured at the coordinator on common systems (having only 10s of reads is a very light cluster load though, so perhaps...), since one of the coordinators sending work to the node in GC would not be in GC, so the number of missed occurrences cluster-wide should be few. Further, a coordinator request spends a majority of its time in flight to, from, or at the owning nodes, so if either it or one of the owning nodes enters a GC cycle this is most likely to occur whilst requests are in flight and being measured. So even with a small cluster this holds; with a large cluster and a hash partitioner, at least one coordinator not in GC will have a request out to the paused node in a majority of cases, and will register it. For local only reads (smart routed CL.ONE) I accept this may be a possibility, but you made this claim when we last spoke, when such features weren't around, so I suspect you are looking at the internal latency rather than coordinator latency, or are extrapolating from experience on a single node.
2. Even if this were true, this would still result in enough data points to paint a near-to-true picture.

bq.  Then that's what you should be showing

Cassandra doesn't show anything at all. The nodetool offers a complete histogram of latency distribution, measured both internally and at the coordinator. The tool we recommend for testing how your cluster responds to representative workloads is called cassandra-stress, and this displays the .5, .95, .99, .999 and max latency as measured at the client. I stand by this being the best way to establish latencies. For a live system, if you have SLAs, you should be measuring it in your application, and not relying on your data store. If you really want to, though, we provide you with enough information to estimate coordinated omission.

That all said, I'm not unsympathetic to introducing some of these capacities to Cassandra, so operators that want this view can see it. Last we spoke you were planning to investigate plugging into yammer/codahale, so that it could easily be offered as a config option. As yet, no movement :)

Gil Tene

unread,
Jan 14, 2015, 12:43:11 PM1/14/15
to mechanica...@googlegroups.com


On Wednesday, January 14, 2015 at 2:41:16 AM UTC-8, Benedict Elliott Smith wrote:
I should make it clear at this point that I speak only for myself, and not for the project as a whole.

I'd also like to make it clear that I'm not looking to single out Cassandra in any way. I don't think Cassandra is "special" here, or that it is doing something wrong that most other things do right. It is just a very good example of a latency monitoring and reporting problem that is very prevalent, and that people care about. It is a problem that most latency monitoring stats being displayed on most dashboards and ops screens today share.
 
bq. What you call "extraordinarily rare" is measured as ~20-60% likelihood in my experience

1. I am unconvinced of this number for cross-cluster (QUORUM) reads measured at the coordinator on common systems (having only 10s of reads is a very light cluster load though, so perhaps...), since one of the coordinators sending work to the node in GC would not be in GC, so the number of missed occurrences cluster-wide should be few. Further, a coordinator request spends a majority of its time in flight to, from, or at the owning nodes, so if either it or one of the owning nodes enters a GC cycle this is most likely to occur whilst requests are in flight and being meausred. So even with a small cluster this holds; with a large cluster and a hash partitioner, at least one coordinator not in GC will have a request out to the paused node in a majority of cases, and will register it. For local only reads (smart routed CL.ONE) I accept this may be a possibility, but you made this claim when we last spoke, when such features weren't around, so I suspect you are looking at the internal latency rather than coordinator latency, or are extrapolating from experience on a single node.

The coordinator is just as susceptible to pauses (and related omissions) as anything else is...

We can analyze it and rationalize all we want, but it is easy enough to measure. In measurement you'll find plenty of GC-log-reported pauses that have no corresponding latency measurement (not even one big latency) in the reported histograms.
 
2. Even if this were true, this would still result in enough data points to paint a near-to-true picture.

If you consider not reporting on 1/3rd to 2/3rds of pauses and glitches to be near-to-true, then yes, you will likely see some indication that pauses exist, because you will sometimes see single data points at those pauses. Your monitoring max time has about a 1 in 3 chance of showing the actual max time (which is much better than 1 in 1000, I guess). All frequency and percentile data is completely off when this happens, but since they were already off by multiple orders of magnitude even without missing these pauses, this extra "wrongness" doesn't make much of a difference, I guess.
 
bq.  Then that's what you should be showing

Cassandra doesn't show anything at all. The nodetool offers a complete histogram of latency distribution, measured both internally and at the coordinator.

Well, those complete histograms being offered by nodetool often end up somewhere that people look at. Like a monitoring screen or a dashboard. Or a threshold triggering system. When we speak to people who say "my 99%'ile latency on the writes on the cluster is X", they tend to get that data from these histograms. And that 99%'lie is very very wrong (if what they think they are looking at is what the latency behavior or service levels actually were).

Unfortunately, people looking at monitoring screens usually don't even look at the max time (which has a lower chance of being wrong), and tend to plot and look at percentiles and averages from the histogram because they are there. When I ask why, they'll often say something like "the max time often shows me numbers that are strange or out of whack with all the rest of the stats". They are right, the max is often out of whack with the rest of the histogram distribution, but that's because it's the only number in that histogram that is actually correct when those disparities occur...
 
The tool we recommend for testing how your cluster responds to representative workloads is called cassandra-stress, and this displays the .5, .95, .99, .999 and max latency as measured at the client. I stand by this being the best way to establish latencies.

Um... cassandra-stress will report wrong percentiles across the board due to its own classic CO issues (the same ones most load testers have). "Wrong percentiles" here means "percentile values that have absolutely nothing to do with the percentile values experienced by clients under this load, and that will usually be 1-3 orders of magnitude better than what really happens". The fact that those wrong percentiles will tend to match those reported by the nodetool histograms probably makes people think they are both right. But they are both wrong. They agree because they are both wrong in the same way (same CO effects).

To be specific: [due to some other stuff I'm doing] I am staring at a cassandra-stress output right now that shows the following:

op rate                   : 20004
partition rate            : 10354
row rate                  : 10354
latency mean              : 0.4
latency median            : 0.3
latency 95th percentile   : 0.4
latency 99th percentile   : 0.5
latency 99.9th percentile : 1.4
latency max               : 320.8
Total operation time      : 00:04:09 

Your first hint that something is badly off in this report is the fact that the latency max is 2-3 orders of magnitude higher than the 99%'lie and 99.9%'lie values, even though it represents a significant % of the total runtime (0.12%). On its own, this doesn't prove anything: it is still possible that only a single glitched operation existed in the ~5M ops that were run in the test, with all the rest of the operations being near-perfect. That's possible, but unlikely. Whenever you see this sort of relationship, you are more likely than not staring Coordinated Omission in the face.

We can dig in by looking at the detailed cassandra-stress interval output and the server GC logs, where we see this:

- The interval logs for the same run show at least 16 intervals with a max latency that is > 200msec (some as big as 320msec).

- The GC log shows GC pauses at matching times, with slightly lower magnitudes (205 to 290msec).

- Adding up the >200msec pause times, the total time spent in GC pauses above 200msec is ~3.2 seconds. [Note that this ignores the many smaller glitches; this is an optimistic view.]

- Given that the total runtime is 249 seconds (00:04:09), we know from the GC pauses that at least 1.28% of the run was spent in a pause that was at least 200msec in length. 

- Since 1.28% of the time was spent within pauses longer than 200msec, the best possible 99%'lie that could be experienced by an uncoordinated client is (200 * 0.28/1.28) which is 43msec (>80x greater than reported by cassandra-stress)

- Similarly, since 1.28% of the time was spent in pauses longer than 200msec, the best possible 99.9%'lie that could be experienced by an uncoordinated client is (200 * 1.18/1.28), which is 184msec (> 130x greater than reported by cassandra-stress).
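For reference, the bound used in the last two bullets can be computed like this (a small sketch of the arithmetic, not part of any tool):

// If a fraction F of wall-clock time is spent inside pauses of length P, an
// uncoordinated client issuing requests uniformly in time cannot see a better
// value than this at the given percentile: requests landing inside a pause see
// latencies spread uniformly between 0 and P.
public final class BestCaseBound {
    static double bestPossibleMsec(double pauseMsec, double pausedTimeFraction, double percentile) {
        double badFraction = 1.0 - percentile / 100.0;   // e.g. 0.01 for the 99%'lie
        if (badFraction >= pausedTimeFraction) {
            return 0.0;                                  // the pauses alone don't constrain this percentile
        }
        return pauseMsec * (pausedTimeFraction - badFraction) / pausedTimeFraction;
    }

    public static void main(String[] args) {
        System.out.println(bestPossibleMsec(200, 0.0128, 99.0));  // ~43.8 msec
        System.out.println(bestPossibleMsec(200, 0.0128, 99.9));  // ~184.4 msec
    }
}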

These are results for a real run, which often take some math to show the CO issue exists. But it is also trivial to demonstrate this problem with my recommended manual calibration test for all load testers: Suspending the tested process with ^Z or SIGSTOP for part of the run (e.g. 1/4 of the total run time) so that loaders that have the CO measurement problem show such glaring omissions that no math is needed to declare "this reported 99%'lie number is utter nonsense".

After proving coordinated omission exists in the report, we can also inspect the code to show where it comes from. 
Sure enough, cassandra-stress is a constant rate load generator that makes the classic CO measurement mistake of measuring latency from the time an operation actually starts to the time it completes (rather than from the time the operation was supposed to start until it completed). The good news is that, much like wrk2 (my corrected version of wrk) fixes this issue in wrk, cassandra-stress can be modified to correctly measure and report latency distribution (when run at a given op rate) by avoiding coordinated omission to begin with. I'll try to find time to play around with a cassandra-stress2 to demonstrate this.
 
 
For a live system, if you have SLAs, you should be measuring it in your application, and not relying on your data store. If you really want to, though, we provide you with enough information to estimate coordinated omission.


People do measure it in the application as a whole as well, but when they have SLA-breaking latency behavior in the application as a whole, they look for the latency behaviors of the components to see where they need to spend their time fixing things. If that were not the case, you wouldn't be doing all these measurements, offering those histograms, or showing the read and write latency stats on your monitoring screens.

And nobody wants to estimate coordinated omission. They want to estimate the latency behavior they have.
 

That all said, I'm not unsympathetic to introducing some of these capacities to Cassandra, so operators that want this view can see it. Last we spoke you were planning to investigate plugging into yammer/codahale, so that it could easily be offered as a config option. As yet, no movement :)

There actually is some movement there. I do have a day job, but other people have done some good work on this. HdrHistogram's newer features (like auto-sizing histograms and the Recorder class for safe interval sampling of histograms) were built with this in mind, in order to make it more convenient to fit into the concept of a Metrics reservoir. And Marshall Pierce (@runswithbricks) built this very cool Metrics variant that incorporates them, and put it on Maven Central. This variant uses HdrHistogram for recording, but does not attempt to correct for CO yet. Optional CO-correction (using tricks like LatencyStats or something similar) would be a good next step.
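For reference, the Recorder-based interval sampling idiom looks roughly like this (a minimal sketch of the idiom, not Marshall's actual reservoir code):

import org.HdrHistogram.Histogram;
import org.HdrHistogram.Recorder;

// Recording threads call recordValue() without coordinating with the reporting
// thread; the reporting thread periodically swaps out a stable interval
// histogram and accumulates/publishes it.
public final class IntervalReporter {
    private final Recorder recorder = new Recorder(3);         // 3 significant value digits, auto-sizing
    private final Histogram accumulated = new Histogram(3);

    // called from the operation path
    public void record(long latencyNanos) {
        recorder.recordValue(latencyNanos);
    }

    // called by a reporting thread, e.g. once per reporting interval
    public void report() {
        Histogram interval = recorder.getIntervalHistogram();  // stable snapshot since the last call
        accumulated.add(interval);
        System.out.printf("interval 99%%'lie: %d nsec, max: %d nsec%n",
                interval.getValueAtPercentile(99.0), interval.getMaxValue());
    }
}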
 

Benedict Elliott Smith

unread,
Jan 14, 2015, 1:54:07 PM1/14/15
to mechanica...@googlegroups.com
The coordinator is just as susceptible to pauses (and related omissions) as anything else is...

Sure, but for every coordinator to be affected simultaneously would be less likely (sure, there would be correlations between node GC incidence); since requests are ~randomly proxied on from any given coordinator, it will be extremely unlikely to see a real system fail to spot as much as 60% (or even 20%) of lengthy events. The chance reduction of missing an event is exponential in the size of cluster, although the exact behaviour would be dependent on correlation between GC events. I would change my position if presented with a common cluster configuration in which this is reproducible, but it seems to me you are basing your position on generalized experience with single node application behaviours.

There is still a reporting issue for making use of this information, I agree. My position is that this isn't really an issue the database should try to solve for you. To me this is a problem with monitoring analysis tools, which could apply a coordinated omission overlay with the raw data already available. I don't like doctoring the raw data reported by an application when it can be doctored in post processing with conscious intent, especially when accurate measurement by the application is still the gold standard. That said, I would really like to see better application level access to STW times so that these could be better reported, as currently it is pretty insufficient, and this would help those tools correctly model this (if any are trying).

 I am staring at a cassandra-stress output right now that shows the following:

Each simulated client awaits the result of the prior operation before performing another. If you have very few such clients then there will be very few operations being performed, and those %iles are accurate. If you want to measure the performance for an operation arrival rate, more clients than the number of concurrent ops are needed, and a rate limit should be specified. Admittedly there is some work to be done to ensure this would report correctly for arbitrarily large pauses, and to improve the documentation (amongst a hundred other things).



Gyula Csom

unread,
Jan 14, 2015, 2:22:10 PM1/14/15
to mechanica...@googlegroups.com

Thanks for your response!

I think I got the meaning... at least some piece of it...

To my understanding, you mean that one must test (in a statistical sense) whether (nearby) low latency events (as random variables) are correlated with each other or not. (For instance, to put it simply, for two threads the 'correlation' would mean that P(AB) > P(A)*P(B), where P(A), P(B) is the probability of low latency events for the A and B thread, and P(AB) is the probability for two 'nearby' low latency events.

Replace "low" with "high" (which is what I think you meant anyway) and we agree. And yes, you can say that P(high) = P(1 - low) and it already means that...

Sorry for the typo!

 
If I understood it well, then my intuition says that CO-related correction should be somehow dependent on the factor of correlation... That is, CO might not be a binary, 0 or 1 thing.)

In a test system, the first goal should be to avoid coordinated omission, not to correct for it. And in a test system, this is very possible. [Unfortunately most test systems don't do that, and have coordinated omission galore]

 
I see! It is an interesting direction...
 
If correction must be done (mostly because the test system has no viable way to deduce expected start time for operations), then the only reliable corrections I've played with so far are ones that take a 0 or 1 approach. Yes, there are probabilistic levels that can fill in the gaps some beyond the 0/1 level ("correcting more" than the certain corrections that are safe to do when 0/1 conditions are detected), but those get very hard to detect, estimate, and apply.

I've got the same impression (after a quick, hence limited, analysis). Attacking the problem seems to be beyond elementary techniques and also my stats knowledge:-)
Meanwhile the linear fit approach (which you seem to use) is a simple technique. The constant fit (assuming the worst case only) is even simpler. Another simple technique could be counting high latency events as much as their time span, i.e. if the regular time frequency of the test plan is 't' and a high latency event's time span is 'T', then count it T/t times in the distribution. (However, whether it is practical or not I don't know.)
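[For what it's worth, HdrHistogram's recordValueWithExpectedInterval() already implements a close variant of this: it records T once and then back-fills T - t, T - 2t, ..., i.e. the linear-fit flavor of the same idea, rather than counting T itself T/t times. A minimal sketch:]

import org.HdrHistogram.Histogram;

// 't' is the plan's normal inter-op interval (1 msec here); one 2-second outlier
// gets back-filled with the latencies its "blocked" successors would have seen.
public final class ExpectedIntervalExample {
    public static void main(String[] args) {
        long expectedIntervalNanos = 1_000_000L;                 // t = 1 msec
        Histogram h = new Histogram(3600_000_000_000L, 3);
        for (int i = 0; i < 100_000; i++) {
            h.recordValueWithExpectedInterval(1_000_000L, expectedIntervalNanos);   // normal ops: no back-fill
        }
        h.recordValueWithExpectedInterval(2_000_000_000L, expectedIntervalNanos);   // T = one 2 second stall
        System.out.println("99.9%'ile (msec): " + h.getValueAtPercentile(99.9) / 1_000_000.0);
    }
}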
 

Gyula Csom

unread,
Jan 14, 2015, 2:43:41 PM1/14/15
to mechanica...@googlegroups.com
Hi,

Thanks for your answer!

My orientation was refined/corrected by the previous messages. My question is not whether the CO phenomenon exists or not, but (1) how it behaves/looks, (2) how to detect/measure it, and (3) how to handle it (in either benchmarks or real life data):-)

For instance, an interesting thing is that CO is probably not a binary phenomenon. Though attacking the problem is beyond my knowledge...

Regards,
Gyula

Gil Tene

unread,
Jan 14, 2015, 3:08:13 PM1/14/15
to mechanica...@googlegroups.com


On Wednesday, January 14, 2015 at 10:54:07 AM UTC-8, Benedict Elliott Smith wrote:
The coordinator is just as susceptible to pauses (and related omissions) as anything else is...

Sure, but for every coordinator to be affected simultaneously would be less likely (sure, there would be correlations between node GC incidence); since requests are ~randomly proxied on from any given coordinator, it will be extremely unlikely to see a real system fail to spot as much as 60% (or even 20%) of lengthy events. The chance reduction of missing an event is exponential in the size of cluster, although the exact behaviour would be dependent on correlation between GC events. I would change my position if presented with a common cluster configuration in which this is reproducible, but it seems to me you are basing your position on generalized experience with single node application behaviours.

I'll try to get some multi-node examples of missed GC pauses for you from runs we're doing for other reasons.
 

There is still a reporting issue for making use of this information, I agree. My position is that this isn't really an issue the database should try to solve for you. To me this is a problem with monitoring analysis tools, which could apply a coordinated omission overlay with the raw data already available. I don't like doctoring the raw data reported by an application when it can be doctored in post processing with conscious intent, especially when accurate measurement by the application is still the gold standard.
 
That said, I would really like to see better application level access to STW times so that these could be better reported, as currently it is pretty insufficient, and this would help those tools correctly model this (if any are trying).

To establish STW times, you can use my observation-based pause detector without taking on the rest of LatencyUtils/LatencyStats. The SimplePauseDetector uses observation-based consensus across multiple threads to establish (with very high confidence) that a process-wide stall event has occurred. It is typically used in sleeping mode to detect stalls on the order of multiple milliseconds (e.g. sleep for 1 msec, call it a pause when 3 threads all saw an overlapping 10msec stall), but when used in spinning mode it can reliably detect process-wide stalls as small as 50usec [not many people outside of some aggressive application domains will be willing to burn 3 cpus in spinning mode just to do that detection].
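A stripped-down illustration of the consensus idea (not the actual SimplePauseDetector code, which is considerably more careful about overlapping observation windows):

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLongArray;

// A few sleeping threads each note when they wake up much later than requested;
// a process-wide stall is only suspected when all of them report such an
// overshoot at roughly the same time. A single late thread (e.g. descheduled on
// a busy core) will not trigger it.
public final class ConsensusPauseSketch {
    private static final int THREADS = 3;
    private static final long SLEEP_NANOS = TimeUnit.MILLISECONDS.toNanos(1);
    private static final long THRESHOLD_NANOS = TimeUnit.MILLISECONDS.toNanos(10);
    private final AtomicLongArray lastStallWakeup = new AtomicLongArray(THREADS);

    public void start() {
        for (int i = 0; i < THREADS; i++) {
            final int id = i;
            new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    long before = System.nanoTime();
                    try { Thread.sleep(1); } catch (InterruptedException e) { return; }
                    long woke = System.nanoTime();
                    if (woke - before - SLEEP_NANOS > THRESHOLD_NANOS) {
                        lastStallWakeup.set(id, woke);
                        checkConsensus(woke, woke - before);
                    }
                }
            }, "pause-detector-" + id).start();
        }
    }

    // all detection threads must have reported a late wakeup within ~one threshold of "now"
    private void checkConsensus(long now, long observedStallNanos) {
        for (int i = 0; i < THREADS; i++) {
            if (now - lastStallWakeup.get(i) > THRESHOLD_NANOS) return;
        }
        System.out.println("process-wide stall of ~" + observedStallNanos / 1_000_000 + " msec suspected");
    }
}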
 

 I am staring at a cassandra-stress output right now that shows the following:

Each simulated client awaits the result of the prior operation before performing another. If you have very few such clients then there will be very few operations being performed, and those %iles are accurate.

Accurate? Really?

If you had 10 clients trying to do 2K ops/sec each, and each client saw a 10 second total stall in a 40 second run, but completed 60,000 ops at 1msec each in the remaining 30 seconds, do you think that those clients (or the people behind them) would consider a report of the 99.99%'lie to be 1msec? Or the 99.9%'lie, or the 99%'lie?

People looking at the behavior of these clients would clearly call the behavior like they see it: the 75%'lie is 1msec, but everything above that is bad. The 99%'lie, 99.9%'lie and 99.99%'lie are all pretty close to 10 seconds (10,000x as high as the numbers that cassandra-stress will currently report for that scenario).
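Concretely, a quick sketch of that arithmetic, assuming the operations missed during the stall would have been spread evenly across the 10 seconds:

import java.util.Arrays;

// Per client: 80,000 ops were *supposed* to happen in 40 sec at 2K/sec; 60,000
// completed at 1 msec, and the 20,000 that fell inside the 10 sec stall would
// have seen latencies spread between ~0 and 10 sec. Percentiles over the full
// intended set:
public final class StallScenario {
    public static void main(String[] args) {
        int intended = 80_000, completed = 60_000, stalled = intended - completed;
        double[] latenciesMsec = new double[intended];
        Arrays.fill(latenciesMsec, 0, completed, 1.0);                    // the 60,000 fast ops
        for (int i = 0; i < stalled; i++) {
            latenciesMsec[completed + i] = 10_000.0 * (i + 1) / stalled;  // evenly spread, up to 10 sec
        }
        Arrays.sort(latenciesMsec);
        for (double p : new double[]{75, 99, 99.9, 99.99}) {
            int rank = (int) Math.ceil(p / 100.0 * intended) - 1;
            System.out.printf("%.2f%%'lie: %.0f msec%n", p, latenciesMsec[rank]);
        }
    }
}

(Which prints roughly 1 msec for the 75%'lie, and ~9,600, ~9,960, and ~9,996 msec for the higher ones.)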

In this specific result set, in which there were 16 stalls of 200msec or more in a 249 second run, the logic is exactly the same, and the math I included shows how the cassandra-stress data is off by a factor of "only" 130x.
 
If you want to measure the performance for an operation arrival rate, more clients than the number of concurrent ops are needed, and a rate limit should be specified.

If you want to report on the latency behavior that would be experienced by 10 clients, you do a test with 10 clients. And the test needs to report what those 10 clients actually see. Real clients in a real world situation that pauses for 10 seconds every 40 seconds will see good latency only below the 75%'ile, and will see terrible 90%'lie, 99%'lie, etc. values. Tests that fail to report that simple fact should not be reporting latency percentiles. The same logic applies for 100 msec pauses every 400 msec. And for 200msec pauses every 15 seconds. Anything that reports a 99.9%'lie of ~1msec in any of those scenarios is wrong by multiple orders of magnitude.

Using pre-planned operation start times in a tester is a good way to make sure it does not fudge the latency numbers in its reports, as it eliminates the coordination at the source. Yes, this is most easily done with constant-rate tests (since the math for determining the expected start time of each op is trivial), but it can be done for any scenario you wish to test. You just need a way to determine the expected start time. The key is to measure latency from the time operations were going to start, and not from the time they ended up being sent due to back pressure.

And since cassandra-stress is a constant rate test, it is fairly easy to fix...
 
Admittedly there is some work to be done to ensure this would report correctly for arbitrarily large pauses, and to improve the documentation (amongst a hundred other things).

If you catch a yardstick telling you that a meter is an inch, you should simply stop believing everything else it says. At any magnitude. You could try to say "I'll keep using it to measure centimeters, because I'm not absolutely sure how wrong it is there, and we should fix it to be correct even for arbitrarily large lengths", but you'd be wasting your time, or worse.

Wrong is wrong. The current cassandra-stress percentile measurement and reporting method is wrong for any pause or stall size. It can be trivially demonstrated to be wrong with large pauses, and also with enough medium sized pauses (as I showed with the math below). Its accuracy doesn't get any better with smaller pauses. It just gets harder to conclusively prove how wrong it really is.
 

Benedict Elliott Smith

unread,
Jan 14, 2015, 3:27:44 PM1/14/15
to mechanica...@googlegroups.com
 Accurate? Really?

Yes, really. cassandra-stress is not a constant rate test, it only runs a steady number of clients. These clients perform as much work as they can without constraint (and the work can be highly non-uniform also), but each act as an independent sequence of synchronous database actions. Without a target operation rate, this is a throughput test. To test for latencies, you need to specify a rate limit (target rate); to overcome current limitations with stress, you need to oversubscribe the client (thread) count so that during a pause the target rate can still be met. This would then give a true picture up to the limit of oversubscription of clients (i.e. a long enough pause would exhaust the threads used by the simulator, at which point no more measurements would be made). The "improvement" I discuss is to untie the client count from the operation rate, so that there is no threshold beyond which it would fail (although realistically there is always such a threshold, it would just be higher); this is a hangover of legacy access modes (which are all synchronous).

This does not in any way make it inaccurate. You are simply using it incorrectly for the thing you would like to measure, which is why I highlight documentation as a known improvement area, as well as improvements to make it harder to go wrong.



Benedict Elliott Smith

unread,
Jan 14, 2015, 3:34:22 PM1/14/15
to mechanica...@googlegroups.com, bellio...@datastax.com
(It might be worth pointing out you may have some other issues with your utilisation of the tool, from the output you pasted, since you're performing a number of no-ops. Presumably performing reads on data that hasn't been populated. If you want to have a chat about how best to use the tool, feel free to give me a shout. It might help us to produce some better example use cases and shape the documentation as well)

Kirk Pepperdine

unread,
Jan 14, 2015, 4:25:15 PM1/14/15
to mechanica...@googlegroups.com
On Jan 14, 2015, at 9:27 PM, Benedict Elliott Smith <bellio...@datastax.com> wrote:

 Accurate? Really?

Yes, really. cassandra-stress is not a constant rate test, it only runs a steady number of clients. These clients perform as much work as they can without constraint (and the work can be highly non-uniform also), but each act as an independent sequence of synchronous database actions. Without a target operation rate, this is a throughput test.

You can’t simply put any number of unthrottled clients on the system and then call it a throughput test. At some load, throughput will back off because the system will start thrashing and throughput will start ramping down. The more load you attempt under those conditions, the lower the throughput will get. To do a proper throughput test you need to understand what the limiting resource is and then load to that level. And just because there isn’t any throttling on the injector doesn’t mean there isn’t any CO.


To test for latencies, you need to specify a rate limit (target rate); to overcome current limitations with stress, you need to oversubscribe the client (thread) count so that during a pause the target rate can still be met. This would then give a true picture up to the limit of oversubscription of clients

You need to differentiate between operational load and stress. At the point of oversubscription of resources in the server, latency will mostly be composed of dead time.

Kind regards,
Kirk Pepperdine

Gil Tene

unread,
Jan 14, 2015, 5:19:39 PM1/14/15
to mechanica...@googlegroups.com, bellio...@datastax.com


On Wednesday, January 14, 2015 at 12:27:44 PM UTC-8, Benedict Elliott Smith wrote:
 Accurate? Really?

Yes, really. cassandra-stress is not a constant rate test, it only runs a steady number of clients.

When given the option of (for example) "-rate threads=50 limit=20000/s", cassandra-stress is a constant rate test (total ops per sec is 20K, spread across 50 threads, so each thread will do roughly 400 ops/sec). And as you can see from the output below, it actually keeps to that global throughput...

Yes, it can be run without a rate limit, but that would be useless for doing any sort of latency behavior measurements. Latency stats in an all-out throughput saturation test are always nonsense, by definition: they are supposed to be terrible, and if they are not, then the saturation test is not being done right (it hasn't saturated enough).
 
These clients perform as much work as they can without constraint (and the work can be highly non-uniform also), but each act as an independent sequence of synchronous database actions.

According to the code, that's true only when no rate limit parameter is given. With a rate limit, the current cassandra-stress code throttles all clients using a global rate limiter here.
 
Without a target operation rate, this is a throughput test.

Agreed.
 
To test for latencies, you need to specify a rate limit (target rate); to overcome current limitations with stress, you need to oversubscribe the client (thread) count so that during a pause the target rate can still be met.

I haven't seen anywhere where it suggests you need to do that for your latency percentile reports to be accurate, and I don't see anything that invalidates the latency percentile reports if you fail to have enough threads to make that happen. 

But this technique is also virtually impossible to do in practice. E.g. when you hit a 700msec pause running at 20Kops/sec, you'd need 14,000 threads to make this work-around work. For the occasional 4 second pause you'd need 80K threads.
 
This would then give a true picture up to the limit of oversubscription of clients (i.e. a long enough pause would exhaust the threads used by the simulator, at which point no more measurements would be made). The "improvement" I discuss is to untie the client count from the operation rate, so that there is no threshold beyond which it would fail (although realistically there is always such a threshold, it would just be higher); this is a hangover of legacy access modes (which are all synchronous).

It's much simpler to fix than that. And you don't have to create an artificially high client thread count to do it. The overall stress tool looks like it has good, sound logic. It's just the time recording that has bad math in it. All that is needed in the code is:

A. For the rate limit enforcement to be changed to use thread-local limiters (rate limited at rate/thread-count) rather than a single global shared limiter. This will establish the ability to compute a clear expected start time for each operation in each thread without changing what the test actually does.

B. For the operation latencies to be measured from the time the rate limiter would have allowed the operation through to the time it completed. [An example of rate limiting logic and matching expected time computation can be found in wrk2 here and here]

C. For the Timer class to collect the time stats in something better than its current statistical sample (e.g. an HdrHistogram will do well under the hood here for lossless capture, and will also help report fine grained percentiles later).

I'll try to give that a stab over the weekend.
 

This does not in any way make it inaccurate. You are simply using it incorrectly for the thing you would like to measure,

You said earlier in this thread that "...The tool we recommend for testing how your cluster responds to representative workloads is called cassandra-stress, and this displays the .5, .95, .99, .999 and max latency as measured at the client." How is trying to use the recommended tool for this stated purpose wrong? I'm just feeding it a rate and reading the displayed latency percentiles. The reason that the percentile values will always be way off (if there is even one GC pause or another server glitch in the run) is simply built in to the current code.
 
which is why I highlight documentation as a known improvement area, as well as improvements to make it harder to go wrong.

Fixing this is easy. I'll work up a correctly-measuring version and post when it is done. You can see before/after chart examples of how big I expect the discrepancies to be  in the wrk2 readme. wrk2 fixes the exact same latency measurement problem in wrk, so I expect the effect on reported latency distributions to be very similar.
 

Benedict Elliott Smith

unread,
Jan 14, 2015, 5:46:58 PM1/14/15
to mechanica...@googlegroups.com, Benedict Elliott Smith
I'm just feeding it a rate 

Actually, you're feeding it a rate limit. To model a target rate requires the method I outlined. With the 320ms max pause seen, a total of 6400 threads are needed, and validating you had enough requires only some very limited math. I agree this is suboptimal, and is the precise aspect I was referring to that needs improvement. Whilst your suggestion is one possible solution, the other is to move to clients submitting requests asynchronously - which is the encouraged modern access method, so is something we need to migrate to anyway. However, using 6.4k threads (even 20k) is workable in the meantime, and does achieve the desired ends, with some slight inefficiency on the client side.

I haven't seen anywhere where it suggests you need to do that for your latency percentile reports to be accurate, and I don't see anything that invalidates the latency percentile reports if you fail to have enough threads to make that happen. 

Like I said, our documentation is imperfect - in fact, it is entirely missing. The tool also needs improvement in a number of areas, and this is indeed one of them. Right now, using it requires care, and understanding of its limitations.

Of course, if you knock up a version that delivers this improvement in the meantime, that's also great, and I'd be delighted to incorporate it.

 


Benedict Elliott Smith

unread,
Jan 14, 2015, 5:58:57 PM1/14/15
to mechanica...@googlegroups.com
You can’t simply put as many unthrottled clients on the system and then call it a throughput test. At some load throughput will back off because the system will start thrashing and throughput will start ramping down.

This seems to be a different discussion; one of semantics, or an abstract discussion of the art of benchmarking. But to indulge regardless: a throughput test does not necessarily test for the maximum throughput as you seem to be implicitly assuming; it only tests throughput under certain conditions (i.e. unthrottled client number in this instance, although I was incorrectly assuming Gil was not providing a rate parameter).
 
And just because there isn’t any throttling on the injector doesn’t mean there isn’t any CO.

I think perhaps this is another issue of semantics: the precise meaning of "coordinated omission." Since Gil coined the term, I guess he can define it for us, but since I've never seen a dictionary definition I have taken it to mean the omission of samples measuring work, not the omission of work itself. The former is a failure of monitoring or measurement, and the latter is an issue of defining your experimental setup.




Gil Tene

unread,
Jan 14, 2015, 6:15:51 PM1/14/15
to mechanica...@googlegroups.com, bellio...@datastax.com


On Wednesday, January 14, 2015 at 2:46:58 PM UTC-8, Benedict Elliott Smith wrote:
I'm just feeding it a rate 

Actually, you're feeding it a rate limit.

That rate limit parameter translates to the average sustained rate of actual execution in the actual code, not the peak rate. That's why the outcome is so close to what is asked for. The rate limiter used by the code uses a permit-based system: permits are issued at the given rate, and work consumes those permits as it goes. Work is only held back if no permits are available (and is allowed to continue when more permits become available with time). So while the argument appears to specify a limit, it is actually specifying the actual rate.
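For anyone who hasn't seen one, a permit-based limiter of the kind described here can be sketched in a few lines (illustrative Java, not the actual cassandra-stress implementation; the class and field names are made up):

    import java.util.concurrent.TimeUnit;

    // Sketch of a permit-based limiter: permits accrue with time at the target
    // rate, and a caller blocks only when none are available. Because unused
    // permits pile up during a stall, the workload catches back up afterwards,
    // so the *average* executed rate tracks the configured rate rather than
    // acting as a pure ceiling.
    public final class PermitRateLimiter {
        private final double permitsPerNano;
        private double storedPermits = 0.0;
        private long lastRefillNanos = System.nanoTime();

        public PermitRateLimiter(double permitsPerSecond) {
            this.permitsPerNano = permitsPerSecond / 1_000_000_000.0;
        }

        public void acquire() throws InterruptedException {
            while (true) {
                long sleepNanos;
                synchronized (this) {
                    refill();
                    if (storedPermits >= 1.0) {
                        storedPermits -= 1.0;
                        return;
                    }
                    sleepNanos = (long) ((1.0 - storedPermits) / permitsPerNano);
                }
                TimeUnit.NANOSECONDS.sleep(sleepNanos);  // sleep outside the lock, then retry
            }
        }

        private void refill() {  // caller must hold the monitor
            long now = System.nanoTime();
            storedPermits += (now - lastRefillNanos) * permitsPerNano;
            lastRefillNanos = now;
        }
    }

With a ceiling-style limiter the stored permits would be capped and the average rate would fall behind after every stall; the uncapped accrual in this sketch is what makes the configured value behave as an actual rate.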
 
To model a target rate requires the method I outlined. With the 320ms max pause seen, a total of 6400 threads are needed, and validating you had enough requires only some very limited math. I agree this is suboptimal, and is the precise aspect I was referring to that needs improvement.

Unless you can predict the max pause for the run, you would only know the number of threads needed after the fact (after a pause has occurred, and you have missed measurements). And since a reasonable max pause to expect is multiple seconds, you are looking at many tens of thousands of threads. As you note, an async approach would be better, but you'd still need to deduce the number of concurrent operations to keep in flight. 
 
Whilst your suggestion is one possible solution, the other is to move to clients submitting requests asynchronously - which is the encouraged modern access method, so is something we need to migrate to anyway.

Note that wrk was already using async request processing, and still needed the correction I'm talking about. That's because testing a server with many more connections than it will really handle doesn't give you a good way of establishing the latency behavior of the real connection scenario: it may be much worse than reality (thrashing on connections and less efficient as a result) or it may be much better than reality (e.g. missing out on the effects of connection latency, buffering, back pressure, and other network effects that occur within a single connection). Creating 10,000 concurrent requests when the real world you are testing for only has 20 concurrent clients operating at a high rate does not end up testing the same scenario.
 
However, using 6.4k threads (even 20k) is workable in the meantime, and does achieve the desired ends, with some slight inefficiency on the client side.

So the math is: MinThreadCount = MaxLatencyExperienced * OpsPerSec

Gets a little big for 100K ops per sec and 5 second pauses (500,000 threads)...

But in any case, based on this math, the current tool should basically flag all latency percentiles as bogus whenever the max latency encountered is larger than ThreadCount / OpsPerSec.
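In code, that flag is essentially a one-liner (a sketch with made-up names, not actual cassandra-stress code):

    // Sketch of the suggested sanity check: with synchronous clients, the schedule
    // can only be kept if no single operation stalls longer than
    // threadCount / opsPerSec seconds. If the observed max exceeds that, the
    // reported percentiles understate what the target rate would really have
    // experienced and should be flagged as bogus.
    public class PercentileSanityCheck {
        static boolean percentilesAreTrustworthy(int threadCount,
                                                 double targetOpsPerSec,
                                                 double maxLatencySeconds) {
            double maxToleratedStallSeconds = threadCount / targetOpsPerSec;
            return maxLatencySeconds <= maxToleratedStallSeconds;
        }

        public static void main(String[] args) {
            // Example: 300 threads at 20,000 ops/sec tolerate at most 15 ms of stall,
            // so a 320 ms pause means the percentile report should not be trusted.
            System.out.println(percentilesAreTrustworthy(300, 20_000, 0.320));  // false
        }
    }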
 
I haven't seen anywhere where it suggests you need to do that for your latency percentile reports to be accurate, and I don't see anything that invalidates the latency percentile reports if you fail to have enough threads to make that happen. 

Like I said, our documentation is imperfect - in fact, it is entirely missing. The tool also needs improvement in a number of areas, and this is indeed one of them. Right now, using it requires care, and understanding of its limitations.

Of course, if you knock up a version that delivers this improvement in the meantime, that's also great, and I'd be delighted to incorporate it.

Will do.
 
...

Benedict Elliott Smith

unread,
Jan 14, 2015, 6:35:22 PM1/14/15
to mechanica...@googlegroups.com
Creating 10,000 concurrent requests when the real world you are testing for only has 20 concurrent clients operating at a high rate does not end up testing the same scenario. 

This is a different area of improvement necessary for cassandra-stress. In fact, right now you're always simulating only one client at a high rate, since every request goes through the same TCP connection (per node, and by default; if testing the legacy mode, this is not the case). Asynchronous requests are just a different model for modulating the rate, one that is not bound by a resource as constrained as thread count, and one that makes it easy to achieve a target rate instead of a rate limit.
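As a rough sketch of that async model (illustrative Java only; the client call, names, and rate are hypothetical): a scheduler fires requests at the target rate regardless of how many are still outstanding, so the offered load isn't bound by a pool of blocked threads:

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    // Sketch: submit one request per schedule tick, asynchronously, and measure
    // completion latency against the tick's intended time. Outstanding requests
    // don't hold threads, so a server pause can't throttle the offered rate.
    public class AsyncConstantRateDriver {
        public static void main(String[] args) throws Exception {
            final long targetRate = 20_000;                          // ops/sec (made-up)
            final long periodNanos = 1_000_000_000L / targetRate;
            final long baseNanos = System.nanoTime();
            final AtomicLong sequence = new AtomicLong();
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

            scheduler.scheduleAtFixedRate(() -> {
                // The intended send time comes from the schedule, not from when we ran.
                long intendedStart = baseNanos + sequence.getAndIncrement() * periodNanos;
                sendRequestAsync().whenComplete((result, error) ->
                        record(System.nanoTime() - intendedStart));
            }, 0, periodNanos, TimeUnit.NANOSECONDS);

            Thread.sleep(10_000);                                    // run for 10 seconds
            scheduler.shutdown();
        }

        private static CompletableFuture<Void> sendRequestAsync() {
            // Placeholder for a real async client call (e.g. a non-blocking driver execute).
            return CompletableFuture.runAsync(() -> { });
        }

        private static void record(long latencyNanos) {
            // Placeholder: feed a histogram here.
        }
    }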
 
That rate limit parameter translates to the average sustained rate of actual execution in the actual code. not the peak rate.

Only if it can keep up, which it can't without enough threads. Though we're going around in circles now. I think we've pretty much reached accord on matters of substance.

 


Gil Tene

unread,
Jan 14, 2015, 6:42:45 PM1/14/15
to mechanica...@googlegroups.com


On Wednesday, January 14, 2015 at 2:58:57 PM UTC-8, Benedict Elliott Smith wrote:

I think perhaps this is another issue of semantics: the precise meaning of "coordinated omission." Since Gil coined the term, I guess he can define it for us, but since I've never seen a dictionary definition I have taken it to mean the omission of samples measuring work, not the omission of work itself. The former is a failure of monitoring or measurement, and the latter is an issue of defining your experimental setup.

When I originally coined the term, I was thinking of specific test system problems (e.g. the back pressure issue as seen in JMeter). I've since realized it applies much more broadly in test systems (to many sampling techniques that assume normal distributions, and to constant throughput testers), and that it also applies outside of test systems (i.e. in monitoring and reporting systems). Simply browsing graphite dashboard images on the web will give you plenty of evidence of CO in the wild. 

Luckily, the term has a clear and plain meaning that doesn't require it to be redefined to describe how the effect happens in those other places.

The plain meaning of "Coordinated Omission" is the coordinated (as opposed to un-coordinated) omission of results from a data set used to report on behavior. There is nothing wrong with omitting results randomly (as long as min and max values are correctly tracked), but omission that biases towards the erasure of bad results is coordinated omission.

In a test system it can be caused by omitting results for work that actually happened (e.g. biased sampling), or by omitting work (e.g. back pressure), or by moving work from a time that would show a bad result to a time that would show a good result without accounting for the original intended start time (e.g. what most constant rate load generators do). I call this last effect (moving the work to a better time) "cheating twice".

In a monitoring system it can be caused by recording latencies only when the system is working, and not recording it during stalls.

And I'm sure it can happen in other places and in other ways that I haven't thought to specify yet.

But in all cases the effect is the same: If the data in your stats buckets is [potentially highly] skewed, any stats you derive from those buckets will be wrong by definition. It's plain GIGO (Garbage In Garbage Out).
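A concrete illustration of one mitigation (a sketch, assuming the HdrHistogram library is on the classpath; the surrounding setup here is made up): HdrHistogram's expected-interval recording back-fills the samples a stalled constant-rate tester would otherwise have omitted.

    import org.HdrHistogram.Histogram;

    // Sketch: recordValueWithExpectedInterval synthesizes the measurements a stall
    // would otherwise have coordinated away. It is a post-hoc correction of the
    // stats, not a substitute for measuring from intended start times.
    public class ExpectedIntervalExample {
        public static void main(String[] args) {
            // Track values up to 1 hour with 3 significant decimal digits.
            Histogram histogram = new Histogram(3_600_000_000_000L, 3);
            long expectedIntervalNanos = 1_000_000L;   // e.g. one op intended per millisecond

            // Pretend one operation stalled for 500 ms:
            long stalledLatencyNanos = 500_000_000L;

            // Recording with the expected interval also adds synthetic samples at
            // 499 ms, 498 ms, ... 1 ms -- one per missed interval -- instead of
            // leaving a silent, coordinated gap in the data set.
            histogram.recordValueWithExpectedInterval(stalledLatencyNanos, expectedIntervalNanos);

            System.out.println("99.99%'ile: " + histogram.getValueAtPercentile(99.99) + " ns");
        }
    }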




On 14 January 2015 at 21:25, Kirk Pepperdine <ki...@kodewerk.com> wrote:

On Jan 14, 2015, at 9:27 PM, Benedict Elliott Smith <bellio...@datastax.com> wrote:

 Accurate? Really?

Yes, really. cassandra-stress is not a constant-rate test; it only runs a steady number of clients. These clients perform as much work as they can without constraint (and the work can be highly non-uniform, too), but each acts as an independent sequence of synchronous database actions. Without a target operation rate, this is a throughput test.

You can’t simply put as many unthrottled clients on the system and then call it a throughput test. At some load throughput will back off because the system will start thrashing and throughput will start ramping down. The more load you attempt under those conditions the lower the throughput will get. To do a proper throughput test you need to understand what the limiting resource is and then load to that level. And just because there isn’t any throttling on the injector doesn’t mean there isn’t any CO.


To test for latencies, you need to specify a rate limit (target rate); to overcome current limitations with stress, you need to oversubscribe the client (thread) count so that during a pause the target rate can still be met. This would then give a true picture up to the limit of oversubscription of clients.

You need to differentiate between operational load and stress. At the point of oversubscription of resources in the server, latency will mostly consist of dead time.

Kind regards,
Kirk Pepperdine
