Google Groups

Re: Coordinated Omission and its proposed solutions


Mirolyub Hristov Aug 11, 2013 1:59 AM
Posted in group: mechanical-sympathy
Your comment about my assumption that the system is serially blocking is actually spot on. I come from a game development perspective and the most important latency we measure is frame latency which is indeed serially blocking.

So to take your example. The system has 20ms response rate and can easily do 10000 reqs/sec, so we can model it as a 0.1ms serially blocking part and a 19.9ms parallelizable part during normal operation. Your assumption is also that most real world pauses above that are the result of adding more work to the serial part, so 1 request would see a 10 second pause, but 1000 concurrent requests would also see a 10 second pause each. Is that a good description?

I don't understand what you mean that the choice of time units will affect the recording. In one case do bucket += latency, in the other bucket += 1000 * latency, but that's just a constant factor across the board so it won't affect the histogram shape at all.

The idea of postprocessing should actually work really well even if the measurement rate changes. You can just keep a second HdrHistogram which records only those samples that are above the measurement rate, and then correct only those in post processing. That will require twice the current memory, but keep the time per measurement constant.

10 август 2013, събота, 18:45:50 UTC+3, Gil Tene написа:
Thoughts inline.

On Saturday, August 10, 2013 4:58:35 AM UTC-7, Mirolyub Hristov wrote:
Hello,

Coordinated Omission is the measurement error which is introduced by naively
recording synchronous requests, sorting them and reporting the result as the
percentile distribution of the request latency.

This is a good description of a specific manifestation of CO. E.g. when percentiles are reported.

However the CO issue itself is is completely contained in the "[naively] recording" part of the statement. Any recording method, synchronous or asynchronous, which results in [even partially] coordinating sample times with the system under test, and as a result avoids recording some some of the [originally] intended samples will exhibit CO. Synchronous methods tend to naturally exhibit this sort of naiivity when intended sample time are not kept far apart from each other (which in practice seems to currently mean always).

CO will affect all stats describing the latency behavior except the min and max values. Not just percentiles. The Mean, Std. Deviation , Kurtosis are all just as affected by non-random omission of data. The effect of CO is most noticeable in percentiles, and more noticeable the higher the percentile level being reported is.
 
One proposed solution implemented in HdrHistrogram is to know the expected
measurement rate and introduce virtual results which approximate the lost
measurements while waiting for a long response. The virtual results are linearly
interpolated between the actual response time and the expected measurement rate.

Exactly.
 

This solution has two major assumptions.

First, that there is an expected measurement rate, or in other words that the
sampling rate is uniform.

Yes. I'd highlight that this is an *optional* recording+correction capability of HdrHistogram, with a separate recording API call used when a user wishes to apply this "insertion of virtual results via linear interpolation" correction mode. As you note, this mode is only useful when a known expected interval exists. It is also important to note that HdrHistogram's normal recording modes make no assumptions or attempt to correct for CO. [This makes it easy to use 2 HdrHistograms for comparison of CO-correction with non-corrected data].
 
Second, that the cause for slow responses is in some sense external to the
system being measured.

I don't think I would use the term "external" to describe this, or call the cause external to the system under test. Quite the opposite. This correction method assumes that all abnormally slow responses are not caused by anything outside the system under test. To be clear on terminology, I consider any actual disruption to the system-under-test's ability to process requests (including things like swapping, ^Z, a GC pause, or temporary loss of network connectivity) to be something inside the system under test.

I would say that the assumption build into this correction method is that the cause is both internal and systemic in the test system. I.e. the assumption is that the system under test is the one that caused the slow response, and that the slow response is not localized to the single slow operation performed. I.e. that had additional operations been attempted during the time the tester was waiting for the slow response, some degree of lineary-abnormal-slowness would have been experiences by those additional attempts as well. 
 
For example, if I measure every hour how much time it takes me to buy a beer
from a store which is not open on weekends, then this solution will correctly
identify the real latency. The 2-day pauses are an external cause for the store
being slow.

For this reason I will call this the External solution.

I would describe being shut down on a 2 day weekend as part of the behavior of the system-under-test in this example, and therefore internal and systemic (in my terminology).

If the tester took a 2 day weekend off and didn't test the store's responsiveness during the weekend, that would be an external cause. Such a purely external pause would not cause coordinated omission (just omission), as long as it is not coordinated with the store's schedule. E.g. if the store remained open and we didn't test, we'd just be missing some presumably random samples, and our samples would still be valid and useful.

However, if the tester's time off (or inability to measure for any reason) non-randomly coincides with the store's closing time (or even with the stores peak line length period), that's where you would get coordinated omission. If that happened, you samples will become very wrong and any subsequent math done on them becomes mostly useless.
 
Now let's assume something different. If the long pauses are caused by a
periodic pause inherent in the requests, for example every 1000th request causes
a GC pause, then we can solve Coordinated Omission in a different way.

There are several potentially valid ways to try to correct for CO once it occurs. So lets analyze this one...
 
Rather than recording the count of all requests with a certain response time, we
can record the amount of time spent waiting for a request with a certain
latency.

More simply stated, instead of doing:

  bucket[BucketIndex(latency)] += 1;

We can do:

  bucket[BucketIndex(latency)] += latency;

Let's call this the Internal solution, because it assumes that there is a cause
inherent in the system for the slow response.

One issue I see here is that due to the N^2 nature of the recording, the arbitrary choice of linear time units (second, millisecond, nanosecond) will dramatically affect the magnitudes recorded. This seems intuitively wrong, as at the end, reported magnitudes per latency level obviously need to be completely independent of the unit chosen. This can probably be compensated for with some normalization technique that would bring things back to linear in some way.

Even with some sort of normalization, another key issue I would expect with this method is that it probably assumes that the system under test was a serially blocking system. I.e. that each request blocks all other requests from being served. This is built-in due to the fact that response lengths that do not exhibit back pressure on the tester (and don't caused CO) will be treated the same as ones that do cause CO...  E.g. even a short/"normal" one that would not prevent a synchronous tester from issuing the next request in time would get a non-linear count. Many/most SUTs can handle and serve requests concurrently. E.g. a SUT that normally takes 20msec to respond (due to the actual work and internal wait times involved in processing a request) could also quite normally do so happily for 10,000 requests per second through the use of thread pools or asynchronous processing techniques which both leverage multiple cores and significantly interleave internally blocking work (like a query to a DB or a wait for an external K/V store) within each core.

But the most important issue I see with this method is that it cannot (doesn't try to) distinguish between "good" and "bad" sampling, and will therefor probably dramatically over-corrcet, both on perfect testers that exhibit no actual CO, and on real-world testers that exhibit significant CO (but probably still work fine for 90%+ of the samples they make). See more discussion below.
 

The Internal solution has one obvious advantage, which is that it is constant in
both time and memory access per request.

That would be a very interesting quality, especially the constant in time one.

HdrHistogram is constant in memory already. But while it is constant in time for regular value recording, CO-corrected recordings will show a log(N) cost behavior. Not as bad as linear, bad still not constant (it's not linear in cost because the internal representation can be though of logarithmic buckets with linear sub-buckets).

Doing CO-correction in constant time would be interesting, so this is certainly making me think...

One way I can see for doing that would be to do the CO-correction not at recording time (where I would care about constant time behavior), but as a post-processing step. This would be similar to your "interpret them as a product" notion below, but rather than use the product math (which I see issues with), I could just do linear CO-correction using the same logic (given an expected interval to work with), producing a CO-corrected histogram copy outside of the critical recoding path. This would be fairly straight forward to add to HdrHistogram, but it would still carry the following assumptions:

1. An expected interval exists and is known at post-processing time.

2. Expected behavior that takes longer than the expected interval should be corrected by assuming that the samples skipped would have seen linearly decreasing times. [I think there is sound logic for why this is the right thing to do even when you are not sure this is not a single glitch, as "not being sure" is a reason to be aggressive in correction (and conservative in reporting as a result)].

3. The expected interval is constant across the entire run, for all recorded values. [This one is new for this post-processing mode, as the correct-during-recording mode I currently use allows each recorded value to provide it's own expected interval].

I could probably build something that doesn't include assumption #3 if we recorded the expected interval supplied in each recording, but that would either need to be summarized (making some assumptions about at least partial non-variablity), or will result in maintaining growing state. 
 

An obvious disadvantage is that the bucket can overflow if it is not large
enough to hold the total time the system is being measured.

A third solution is to record the bucket counts as normal, and instead to
interpret them as the product of their count and their bucket period. So, when
reporting the result, we use bucket[i] * BucketTime(i) for the reported value.
This solution will not be accurate if the bucket size is large, so I will call
it the Approximate Internal solution.

It seems to me that both the Internal and the Approximate Internal solutions are
better than the External solution, if we can safely assume that the causes are
inherent in the system. They can be used with a nonuniform measurement rate and
they keep the time and memory access per measurement constant. However, I do not
know if the assumptions of the External solution are not better in the real
world.

Any thoughts?

Unfortunately, if/when CO occurs, I think that proper correction *must* take into account the tester's intended behavior (without CO). Correcting "across the board" will either dramatically over-correct or dramatically under-correct.

Take this simple thought exercise: A tester is only susceptible to CO if it experiences responses that are longer than the tester's intended interval between requests. E.g. At an extreme, imagine a latency-characterizing synchronous tester built to statistically sample a SUT every 1 hour (per tester thread). The SUT would be concurrently loaded by another load generator, but that load-generator's CO-affected latency recordings would be properly thrown away if not corrected. Such a sampling tester would not experience CO unless it saw a response that took longer than an hour.
[Note: Obviously, it would take a while for such a tester to establish a 99.9% percentile with any degree of confidence. Per Peter's (statistically sound, I think) N^2 rule, it would take ~1,000,000 samples to gain that confidence, so it would take a while even if the sampling tester used 3,600 threads to do it's job (It still take 1M seconds. I'm using some extremes, but this "I don't want to wait 1,000,000 seconds" reasoning explains why people often opt for too-short sampling intervals, leading to CO]

While this hypothetical tester is still susceptible to CO in extreme (>1hr response) cases, results shorter than an hour would not cause it to exhibit CO, and should not be corrected. Only results longer than an hour need correction.

And before someone says "What could take longer than an hour, we really don't need to correct for that", I'll point out that without CO-correction for longer-lthan-1-hour results, even this extremely stretched tester would still report a a better than 99% (24x7) 1 second response time and availability record for a web system that took a 48-hour weekend outage every weekend.