Coordinated Omission and its proposed solutions


Mirolyub Hristov

Aug 10, 2013, 7:58:35 AM
to mechanica...@googlegroups.com
Hello,

Coordinated Omission is the measurement error which is introduced by naively
recording synchronous requests, sorting them and reporting the result as the
percentile distribution of the request latency.

One proposed solution, implemented in HdrHistogram, is to know the expected
measurement rate and introduce virtual results which approximate the lost
measurements while waiting for a long response. The virtual results are linearly
interpolated between the actual response time and the expected measurement interval.
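
As a rough illustration of the mechanics, something like the following toy sketch captures the idea (the linear bucket layout, class and method names are made up for illustration, and this is not HdrHistogram's actual code):

  // CoCorrectionSketch.java -- a toy, stand-alone illustration of the idea.
  public final class CoCorrectionSketch {
    static final long BUCKET_WIDTH_MS = 10;
    static final long[] bucket = new long[10_000];     // covers 0..100,000 ms in 10 ms buckets

    static int bucketIndex(long latencyMs) {
      return (int) Math.min(latencyMs / BUCKET_WIDTH_MS, bucket.length - 1);
    }

    // Record the observed latency, plus "virtual" samples for the measurements that
    // were skipped while the tester was stuck waiting on the slow response.
    static void recordWithExpectedInterval(long latencyMs, long expectedIntervalMs) {
      bucket[bucketIndex(latencyMs)]++;
      if (expectedIntervalMs <= 0) return;             // no expected rate known: nothing to correct
      for (long missed = latencyMs - expectedIntervalMs;
           missed >= expectedIntervalMs;
           missed -= expectedIntervalMs) {
        bucket[bucketIndex(missed)]++;                 // linearly decreasing virtual results
      }
    }

    public static void main(String[] args) {
      recordWithExpectedInterval(1, 100);              // a normal 1 ms sample: one count
      recordWithExpectedInterval(1000, 100);           // a 1 s stall at a 100 ms expected interval:
                                                       // also records 900, 800, ..., 100
    }
  }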

This solution has two major assumptions.

First, that there is an expected measurement rate, or in other words that the
sampling rate is uniform.

Second, that the cause for slow responses is in some sense external to the
system being measured.

For example, if I measure every hour how much time it takes me to buy a beer
from a store which is not open on weekends, then this solution will correctly
identify the real latency. The 2-day pauses are an external cause for the store
being slow.

For this reason I will call this the External solution.

Now let's assume something different. If the long pauses are caused by a
periodic pause inherent in the requests, for example every 1000th request causes
a GC pause, then we can solve Coordinated Omission in a different way.

Rather than recording the count of all requests with a certain response time, we
can record the amount of time spent waiting for a request with a certain
latency.

More simply stated, instead of doing:

  bucket[BucketIndex(latency)] += 1;

We can do:

  bucket[BucketIndex(latency)] += latency;

Let's call this the Internal solution, because it assumes that there is a cause
inherent in the system for the slow response.

The Internal solution has one obvious advantage, which is that it is constant in
both time and memory access per request.

An obvious disadvantage is that the bucket can overflow if it is not large
enough to hold the total time the system is being measured.

A third solution is to record the bucket counts as normal, and instead to
interpret them as the product of their count and their bucket period. So, when
reporting the result, we use bucket[i] * BucketTime(i) for the reported value.
This solution will not be accurate if the bucket size is large, so I will call
it the Approximate Internal solution.
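
One way the reporting step for this could look is sketched below (a sketch only; the class name, the linear bucket layout, and the use of each bucket's upper edge as its representative latency are all assumptions, not part of the proposal above):

  final class ApproximateInternalSketch {
    // Report a percentile weighted by time spent, i.e. treating each bucket's
    // contribution as count[i] * BucketTime(i) rather than count[i].
    static long timeWeightedPercentile(long[] count, long bucketWidthMs, double percentile) {
      long totalWeight = 0;
      for (int i = 0; i < count.length; i++) {
        totalWeight += count[i] * bucketTimeMs(i, bucketWidthMs);
      }
      if (totalWeight == 0) return 0;                  // nothing recorded
      long target = (long) Math.ceil(totalWeight * (percentile / 100.0));
      long running = 0;
      for (int i = 0; i < count.length; i++) {
        running += count[i] * bucketTimeMs(i, bucketWidthMs);
        if (running >= target) {
          return bucketTimeMs(i, bucketWidthMs);       // latency at which the time-weight crosses the percentile
        }
      }
      return bucketTimeMs(count.length - 1, bucketWidthMs);
    }

    static long bucketTimeMs(int i, long bucketWidthMs) {
      return (i + 1) * bucketWidthMs;                  // upper edge of bucket i as its representative value
    }
  }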

It seems to me that both the Internal and the Approximate Internal solutions are
better than the External solution, if we can safely assume that the causes are
inherent in the system. They can be used with a nonuniform measurement rate and
they keep the time and memory access per measurement constant. However, I do not
know whether the assumptions of the External solution are in fact better suited
to the real world.

Any thoughts?

Gil Tene

Aug 10, 2013, 11:45:50 AM
to mechanica...@googlegroups.com
Thoughts inline.


On Saturday, August 10, 2013 4:58:35 AM UTC-7, Mirolyub Hristov wrote:
Hello,

Coordinated Omission is the measurement error which is introduced by naively
recording synchronous requests, sorting them and reporting the result as the
percentile distribution of the request latency.

This is a good description of a specific manifestation of CO, e.g. when percentiles are reported.

However, the CO issue itself is completely contained in the "[naively] recording" part of the statement. Any recording method, synchronous or asynchronous, which results in [even partially] coordinating sample times with the system under test, and as a result avoids recording some of the [originally] intended samples, will exhibit CO. Synchronous methods tend to naturally exhibit this sort of naivety when intended sample times are not kept far apart from each other (which in practice seems to currently mean always).

CO will affect all stats describing the latency behavior except the min and max values. Not just percentiles. The mean, standard deviation, and kurtosis are all just as affected by non-random omission of data. The effect of CO is most noticeable in percentiles, and more noticeable the higher the percentile level being reported is.
 
One proposed solution, implemented in HdrHistogram, is to know the expected
measurement rate and introduce virtual results which approximate the lost
measurements while waiting for a long response. The virtual results are linearly
interpolated between the actual response time and the expected measurement interval.

Exactly.
 

This solution has two major assumptions.

First, that there is an expected measurement rate, or in other words that the
sampling rate is uniform.

Yes. I'd highlight that this is an *optional* recording+correction capability of HdrHistogram, with a separate recording API call used when a user wishes to apply this "insertion of virtual results via linear interpolation" correction mode. As you note, this mode is only useful when a known expected interval exists. It is also important to note that HdrHistogram's normal recording modes make no such assumptions and do not attempt to correct for CO. [This makes it easy to use two HdrHistograms to compare CO-corrected and non-corrected data.]
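
A minimal sketch of that side-by-side setup, assuming the Java HdrHistogram API (method names here are taken from recent versions of the library and may differ slightly from the 2013 API; the measurement stub is a stand-in):

  import org.HdrHistogram.Histogram;

  public final class RawVsCorrected {
    // Stand-in for issuing one synchronous request; returns latency in microseconds.
    static long measureOneRequestUs() {
      long start = System.nanoTime();
      // ... send the request and wait for the response here ...
      return Math.max(1, (System.nanoTime() - start) / 1000);
    }

    public static void main(String[] args) {
      Histogram raw       = new Histogram(3600L * 1000 * 1000, 3);  // track up to 1 hour, in usec
      Histogram corrected = new Histogram(3600L * 1000 * 1000, 3);
      long expectedIntervalUs = 10_000;                             // tester intends 1 request per 10 ms

      for (int i = 0; i < 100_000; i++) {
        long latencyUs = measureOneRequestUs();
        raw.recordValue(latencyUs);                                 // no assumptions, no correction
        corrected.recordValueWithExpectedInterval(latencyUs, expectedIntervalUs);
      }
      System.out.printf("99.9%%'ile: raw=%d us, corrected=%d us%n",
          raw.getValueAtPercentile(99.9),
          corrected.getValueAtPercentile(99.9));
    }
  }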
 
Second, that the cause for slow responses is in some sense external to the
system being measured.

I don't think I would use the term "external" to describe this, or call the cause external to the system under test. Quite the opposite. This correction method assumes that all abnormally slow responses are not caused by anything outside the system under test. To be clear on terminology, I consider any actual disruption to the system-under-test's ability to process requests (including things like swapping, ^Z, a GC pause, or temporary loss of network connectivity) to be something inside the system under test.

I would say that the assumption built into this correction method is that the cause is both internal and systemic in the system under test. I.e. the assumption is that the system under test is the one that caused the slow response, and that the slow response is not localized to the single slow operation performed. I.e. that had additional operations been attempted during the time the tester was waiting for the slow response, some degree of linearly-abnormal slowness would have been experienced by those additional attempts as well.
 
For example, if I measure every hour how much time it takes me to buy a beer
from a store which is not open on weekends, then this solution will correctly
identify the real latency. The 2-day pauses are an external cause for the store
being slow.

For this reason I will call this the External solution.

I would describe being shut down on a 2 day weekend as part of the behavior of the system-under-test in this example, and therefore internal and systemic (in my terminology).

If the tester took a 2 day weekend off and didn't test the store's responsiveness during the weekend, that would be an external cause. Such a purely external pause would not cause coordinated omission (just omission), as long as it is not coordinated with the store's schedule. E.g. if the store remained open and we didn't test, we'd just be missing some presumably random samples, and our samples would still be valid and useful.

However, if the tester's time off (or inability to measure for any reason) non-randomly coincides with the store's closing time (or even with the store's peak line length period), that's where you would get coordinated omission. If that happened, your samples would become very wrong and any subsequent math done on them becomes mostly useless.
 
Now let's assume something different. If the long pauses are caused by a
periodic pause inherent in the requests, for example every 1000th request causes
a GC pause, then we can solve Coordinated Omission in a different way.

There are several potentially valid ways to try to correct for CO once it occurs. So let's analyze this one...
 
Rather than recording the count of all requests with a certain response time, we
can record the amount of time spent waiting for a request with a certain
latency.

More simply stated, instead of doing:

  bucket[BucketIndex(latency)] += 1;

We can do:

  bucket[BucketIndex(latency)] += latency;

Let's call this the Internal solution, because it assumes that there is a cause
inherent in the system for the slow response.

One issue I see here is that due to the N^2 nature of the recording, the arbitrary choice of linear time units (second, millisecond, nanosecond) will dramatically affect the magnitudes recorded. This seems intuitively wrong, as at the end, reported magnitudes per latency level obviously need to be completely independent of the unit chosen. This can probably be compensated for with some normalization technique that would bring things back to linear in some way.

Even with some sort of normalization, another key issue I would expect with this method is that it probably assumes that the system under test is a serially blocking system. I.e. that each request blocks all other requests from being served. This is built in due to the fact that response times that do not exert back pressure on the tester (and don't cause CO) will be treated the same as ones that do cause CO... E.g. even a short/"normal" one that would not prevent a synchronous tester from issuing the next request in time would get a non-linear count. Many/most SUTs can handle and serve requests concurrently. E.g. a SUT that normally takes 20msec to respond (due to the actual work and internal wait times involved in processing a request) could quite happily do so for 10,000 requests per second through the use of thread pools or asynchronous processing techniques which both leverage multiple cores and significantly interleave internally blocking work (like a query to a DB or a wait for an external K/V store) within each core.

But the most important issue I see with this method is that it cannot (doesn't try to) distinguish between "good" and "bad" sampling, and will therefore probably dramatically over-correct, both on perfect testers that exhibit no actual CO, and on real-world testers that exhibit significant CO (but probably still work fine for 90%+ of the samples they make). See more discussion below.
 

The Internal solution has one obvious advantage, which is that it is constant in
both time and memory access per request.

That would be a very interesting quality, especially the constant in time one.

HdrHistogram is constant in memory already. But while it is constant in time for regular value recording, CO-corrected recordings will show a log(N) cost behavior. Not as bad as linear, but still not constant (it's not linear in cost because the internal representation can be thought of as logarithmic buckets with linear sub-buckets).

Doing CO-correction in constant time would be interesting, so this is certainly making me think...

One way I can see for doing that would be to do the CO-correction not at recording time (where I would care about constant time behavior), but as a post-processing step. This would be similar to your "interpret them as a product" notion below, but rather than use the product math (which I see issues with), I could just do linear CO-correction using the same logic (given an expected interval to work with), producing a CO-corrected histogram copy outside of the critical recording path. This would be fairly straightforward to add to HdrHistogram, but it would still carry the following assumptions:

1. An expected interval exists and is known at post-processing time.

2. Recorded values that take longer than the expected interval should be corrected by assuming that the skipped samples would have seen linearly decreasing times. [I think there is sound logic for why this is the right thing to do even when you are not sure this is not a single glitch, as "not being sure" is a reason to be aggressive in correction (and conservative in reporting as a result)].

3. The expected interval is constant across the entire run, for all recorded values. [This one is new for this post-processing mode, as the correct-during-recording mode I currently use allows each recorded value to provide its own expected interval].

I could probably build something that doesn't include assumption #3 if we recorded the expected interval supplied in each recording, but that would either need to be summarized (making some assumptions about at least partial non-variability), or would result in maintaining growing state.
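
In terms of the simple bucket arrays from the original post, such a post-processing correction might look roughly like this (a sketch only, with assumption #3 in place, i.e. one constant expected interval for the whole run; the name and the linear bucket layout are made up, and this is not HdrHistogram code):

  final class PostCorrectionSketch {
    static long[] correctedCopy(long[] count, long bucketWidthMs, long expectedIntervalMs) {
      long[] corrected = count.clone();
      for (int i = 0; i < count.length; i++) {
        if (count[i] == 0) continue;
        long latencyMs = (i + 1) * bucketWidthMs;          // representative value for bucket i
        // For each recorded slow value, add the linearly decreasing virtual samples
        // that a non-coordinating tester would have recorded instead.
        for (long missed = latencyMs - expectedIntervalMs;
             missed >= expectedIntervalMs;
             missed -= expectedIntervalMs) {
          int idx = (int) Math.min(missed / bucketWidthMs, count.length - 1);
          corrected[idx] += count[i];
        }
      }
      return corrected;
    }
  }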
 

An obvious disadvantage is that the bucket can overflow if it is not large
enough to hold the total time the system is being measured.

A third solution is to record the bucket counts as normal, and instead to
interpret them as the product of their count and their bucket period. So, when
reporting the result, we use bucket[i] * BucketTime(i) for the reported value.
This solution will not be accurate if the bucket size is large, so I will call
it the Approximate Internal solution.

It seems to me that both the Internal and the Approximate Internal solutions are
better than the External solution, if we can safely assume that the causes are
inherent in the system. They can be used with a nonuniform measurement rate and
they keep the time and memory access per measurement constant. However, I do not
know whether the assumptions of the External solution are in fact better suited
to the real world.

Any thoughts?

Unfortunately, if/when CO occurs, I think that proper correction *must* take into account the tester's intended behavior (without CO). Correcting "across the board" will either dramatically over-correct or dramatically under-correct.

Take this simple thought exercise: A tester is only susceptible to CO if it experiences responses that are longer than the tester's intended interval between requests. E.g. at an extreme, imagine a latency-characterizing synchronous tester built to statistically sample a SUT every 1 hour (per tester thread). The SUT would be concurrently loaded by another load generator, but that load generator's CO-affected latency recordings would be properly thrown away if not corrected. Such a sampling tester would not experience CO unless it saw a response that took longer than an hour.
[Note: Obviously, it would take a while for such a tester to establish a 99.9th percentile with any degree of confidence. Per Peter's (statistically sound, I think) N^2 rule, it would take ~1,000,000 samples to gain that confidence, so it would take a while even if the sampling tester used 3,600 threads to do its job (it would still take 1M seconds). I'm using some extremes, but this "I don't want to wait 1,000,000 seconds" reasoning explains why people often opt for too-short sampling intervals, leading to CO.]
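
Roughly, the arithmetic behind that note (reading the N^2 rule as needing on the order of 1/(1-p)^2 samples to resolve the p'th percentile):

  1 / (1 - 0.999)^2             = 1,000,000 samples
  3,600 threads * 1 sample/hour = 3,600 samples/hour = 1 sample/second
  1,000,000 samples at 1/second = ~1,000,000 seconds (~11.6 days)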

While this hypothetical tester is still susceptible to CO in extreme (>1hr response) cases, results shorter than an hour would not cause it to exhibit CO, and should not be corrected. Only results longer than an hour need correction.

And before someone says "What could take longer than an hour? We really don't need to correct for that", I'll point out that without CO-correction for longer-than-1-hour results, even this extremely stretched tester would still report a better than 99% (24x7) 1-second response time and availability record for a web system that took a 48-hour outage every weekend.




Martin Thompson

Aug 10, 2013, 5:59:10 PM

Second, that the cause for slow responses is in some sense external to the
system being measured.

I don't think I would use the term "external" to describe this, or call the cause external to the system under test. Quite the opposite. This correction method assumes that all abnormally slow responses are not caused by anything outside the system under test. To be clear on terminology, I consider any actual disruption to the system-under-test's ability to process requests (including things like swapping, ^Z, a GC pause, or temporary loss of network connectivity) to be something inside the system under test.

I would say that the assumption built into this correction method is that the cause is both internal and systemic in the system under test. I.e. the assumption is that the system under test is the one that caused the slow response, and that the slow response is not localized to the single slow operation performed. I.e. that had additional operations been attempted during the time the tester was waiting for the slow response, some degree of linearly-abnormal slowness would have been experienced by those additional attempts as well.

I think there are some subtleties to this which are important for a soft real-time system, e.g. a financial exchange.   Let's assume a load generator is sending in events/messages, and is external to the system under test.  The load generator can pause at any point in the request response cycle. 

REPEAT
 1. Create request
 2. Take timing
 3. Send request
 4. Receive response
 5. Take timing
 6. Process response
UNTIL done condition

If the load generator pauses anywhere between steps 2 and 5 then this is an issue for the system under test.  I think a much better solution is to not have the load generator record timings at all; it just sends requests at a specified rate.  I would then intercept network traffic with a probe/tap that can timestamp network packets accurately, e.g. a large-buffer network adapter with timestamping support.  With this approach I can accurately measure the delta from request to response in the system under test.  I can then apply CO correction for requests that should have occurred while the load generator was paused for some reason.  If the load tester is doing the measurements then you have a dirty experiment.
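
A toy sketch of the post-processing side of that approach (the record type, field names, and the assumption of a per-request correlation id are all hypothetical; real captures need protocol-aware request/response matching):

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  final class WireTapLatencies {
    record Packet(long timestampNs, String correlationId, boolean isRequest) {}

    // Pair each request packet with its response packet and return latencies in ns.
    static Map<String, Long> latencies(List<Packet> capture) {
      Map<String, Long> requestTs = new HashMap<>();
      Map<String, Long> latencyNs = new HashMap<>();
      for (Packet p : capture) {
        if (p.isRequest()) {
          requestTs.put(p.correlationId(), p.timestampNs());
        } else {
          Long sent = requestTs.remove(p.correlationId());
          if (sent != null) {
            latencyNs.put(p.correlationId(), p.timestampNs() - sent);
          }
        }
      }
      return latencyNs;
    }
  }

From per-request deltas like these, the CO correction for intended-but-never-sent requests can then be applied as a separate step.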

This is obviously a more sophisticated approach but if latency really matters then it needs to be considered.

Regards,
Martin...

Gil Tene

Aug 10, 2013, 1:44:38 PM
to mechanica...@googlegroups.com
Recording timing on the wire will certainly help. As long as the system doing the recording doesn't have its own glitchy issues, you'll get an honest account of the timing of the requests that did happen. If the load tester has no backoff affecting it, timing on the wire would be enough on its own. If there is any sort of backoff mechanism involved (and if TCP is involved, there will be), then some sort of correction for CO can be done based on the observed timing and the knowledge of what the tester was supposed to be doing if backoff had not occurred.

It's a shame that most enterprise testing environments can't afford to do this sort of thing (collect and correlate latencies from accurate wire-tapping-based observation). But arguably the ones that can, should (e.g. financial systems with lots at stake).


Martin Thompson

Aug 10, 2013, 1:51:11 PM
to mechanica...@googlegroups.com
If your network adapter can do the timestamping and has sufficient buffer capacity then it does not matter if the recording system pauses in system/user land.  There are numerous cards that can do this.

Writing a probe to take this data and process it is something I've had fun doing in the past.  It is also something I've considered doing again as an open source offering as I believe it would greatly help many industries.  I could then focus on the really interesting problems :-)

Gil Tene

Aug 10, 2013, 2:41:11 PM
to <mechanical-sympathy@googlegroups.com>, mechanica...@googlegroups.com
When TCP is involved (on a link that carries many seemingly unrelated and non-blocking orders, for example) the total buffer depth you have before CO kicks in is just the sum of the send buffer in the client's kernel and the receive buffer in the server's kernel. That's typically going to total less than 128KB unless it is tweaked, and even then it won't be that much more.

A 10msec scheduler event (from a trivial CPU contention blip, for example) is enough to cause this back pressure and make CO effects happen on systems that move more than 12MBytes/sec on a socket.

On sockets that use even a small fraction of a 10GigE link, even a 1-2 msec hiccup on a server app will back-pressure the tester. Once that happens, CO is in play, and no matter where you collect timing you will need to correct or account for it for the numbers to reflect reality.
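
Rough back-of-the-envelope numbers behind those two claims (assuming ~128KB of combined socket buffering):

  128 KB / 12 MB/s  ~= 10.7 ms of slack  -> a ~10 ms scheduler blip fills the buffers
  10 GigE ~= 1.25 GB/s; at 10% utilization that is ~125 MB/s
  128 KB / 125 MB/s ~= 1 ms of slack     -> a 1-2 msec hiccup is enough to back-pressure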

Getting the timing from the wire or from cards will certainly help make the timing believable though, which makes using it for CO-correction more accurate.

Sent from Gil's iPhone

Martin Thompson

Aug 10, 2013, 5:55:26 PM
If you are playing in the low-latency space without tuning kernel buffers then...how can I put it without being rude? ;-)

Gil Tene

Aug 10, 2013, 5:59:04 PM
to <mechanical-sympathy@googlegroups.com>
I agree, but even if you grew the server receive buffers to 1-2MB per socket, you'd still be looking at very little buffering time with modern message rates before back pressure would prevent traffic from arriving at your NIC, and therefore it will get time-stamped without accounting for the actual blocking/waiting time. This is true for pretty much any reliable, in-order delivery, single-pair mechanism (TCP, SCTP, RDMA, iWarp, etc. etc.). A huge receive buffer (like a GB) would allow you to track things right, but in my experience nobody sets network buffers big enough for that.
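
Roughly (again assuming ~125 MB/s, i.e. about 10% of a 10GigE link):

  2 MB buffer / 125 MB/s ~= 16 ms of slack before back pressure
  1 GB buffer / 125 MB/s ~= 8 seconds of slack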

-- Gil.


Martin Thompson

Aug 10, 2013, 6:02:38 PM
to mechanica...@googlegroups.com
You are absolutely right.  We need people to take things back to basics and do some arithmetic.


Martin Thompson

Aug 10, 2013, 6:11:02 PM
to mechanica...@googlegroups.com
BTW if you are capturing all network traffic you also see retransmissions. This is important to diagnosing problems.  A network probe is such a useful tool.

Kirk Pepperdine

Aug 11, 2013, 4:55:22 AM
to mechanica...@googlegroups.com

I completely concur with this approach (keeping timing off the load generator and measuring on the wire with a probe/tap), though in practice it's not so easy to implement. There are already more than enough hard problems to be solved....

I'm just catching up on this thread, but one of the issues I have is that this all seems to be predicated on the injector working at a fixed rate. In most cases, having a rate of injection that is not Poisson distributed causes its own problems in the test, and those problems will be reflected in the response times. By simply looking at GC logs I can, in most cases, separate benchmarks from live systems and then infer the quality of the harness and/or the benchmark itself.

Regards,
Kirk

Mirolyub Hristov

Aug 11, 2013, 4:59:05 AM
to mechanica...@googlegroups.com
Your comment about my assumption that the system is serially blocking is actually spot on. I come from a game development perspective and the most important latency we measure is frame latency which is indeed serially blocking.

So, to take your example: the system has a 20ms response time and can easily do 10000 reqs/sec, so we can model it as a 0.1ms serially blocking part and a 19.9ms parallelizable part during normal operation. Your assumption is also that most real-world pauses above that are the result of adding more work to the serial part, so 1 request would see a 10 second pause, but 1000 concurrent requests would also see a 10 second pause each. Is that a good description?

I don't understand what you mean about the choice of time units affecting the recording. In one case you do bucket += latency, in the other bucket += 1000 * latency, but that's just a constant factor across the board so it won't affect the histogram shape at all.

The idea of postprocessing should actually work really well even if the measurement rate changes. You can just keep a second HdrHistogram which records only those samples that are longer than the expected measurement interval, and then correct only those in post-processing. That will require twice the current memory, but keep the time per measurement constant.

Gil Tene

Aug 11, 2013, 12:04:50 PM
to <mechanical-sympathy@googlegroups.com>
Comments inline.

On Aug 11, 2013, at 1:59 AM, Mirolyub Hristov <mir...@gmail.com>
 wrote:

Your comment about my assumption that the system is serially blocking is actually spot on. I come from a game development perspective and the most important latency we measure is frame latency which is indeed serially blocking.

So, to take your example: the system has a 20ms response time and can easily do 10000 reqs/sec, so we can model it as a 0.1ms serially blocking part and a 19.9ms parallelizable part during normal operation. Your assumption is also that most real-world pauses above that are the result of adding more work to the serial part, so 1 request would see a 10 second pause, but 1000 concurrent requests would also see a 10 second pause each. Is that a good description?

Yes. That's a good description.

And those 1000 concurrent requests will each miss measuring and recording 498 other "bad" (9980msec, 9960msec, ... 40msec) response times in such a case, all of which would have been measured had the testing threads not politely waited for the 10 second gap to be over. All subsequent math on the data is then skewed towards "looking way too good" by this coordinated omission of data.

I don't understand what you mean about the choice of time units affecting the recording. In one case you do bucket += latency, in the other bucket += 1000 * latency, but that's just a constant factor across the board so it won't affect the histogram shape at all.

It was an intuitive statement, as the bucket[BucketIndex(latency)] += latency method intuitively seemed wrong to me because time units are involved in both the index selection and in the count, creating some sort of non-linearity and sensitivity to the time unit selection, making me nervous about how it will behave with varying scenarios and unit selections. It's easiest to demonstrate this as a sensitivity to the sampling rate (but there may be other non-linearity effects as well):

Let's observe this sampling rate sensitivity by going through a simple exercise: imagine we use linear buckets of 10 msec in size each for collecting counts. We (synchronously) record a hypothetical periodic system that responds perfectly in 1 msec for 1 second, and then stalls for 1 second (and repeats this 2 second periodic cycle). Now let's measure this same system at three different sampling rates (10Hz, 100Hz, and 1000Hz):

- With no correction at all:
@10 Hz we'd have 10x as many counts in the [0-10msec] bucket as in the [1000-1010msec] bucket.
@100 Hz we'd have 100x as many counts in the [0-10msec] bucket as in the [1000-1010msec] bucket.
@1000 Hz we'd have 1000x as many counts in the [0-10msec] bucket as in the [1000-1010msec] bucket.

[Obviously wrong; this exhibits the classic CO effect we are talking about, with the effect growing as sampling rates grow.]

- With linear correction (like the one used by HdrHistogram's optional API):
@10 Hz: 10 counts per period in the [0-10msec] bucket, and 1 count per cycle in each of the following 10 buckets: [1000-1010], [900-910], [800-810], ..., [100-110].
@100 Hz: 100 counts per period in the [0-10msec] bucket, and 1 count per cycle in each of the following 100 buckets: [1000-1010], [990-1000], [980-990], ..., [10-20].
@1000 Hz: 1000 counts per period in the [0-10msec] bucket, and 10 counts per cycle in each of the following 100 buckets: [1000-1010], [990-1000], [980-990], ..., [10-20].

[Short of some obvious "quantization" at lower sample rates, all three of these will end up with similar averages and percentiles given enough sampling time, and the existence of natural noise will take out some of the quantization.]

- With a bucket[BucketIndex(latency)] += latency recording method, using 1 msec units:
@10 Hz: 10 counts per period (1 msec * 10 occurrences) in the [0-10msec] bucket, and 1,000 counts in the [1000-1010msec] bucket
[a 1:100 ratio]
@100 Hz: 100 counts per period (1 msec * 100 occurrences) in the [0-10msec] bucket, and 1,000 counts in the [1000-1010msec] bucket
[a 1:10 ratio]
@1000 Hz: 1000 counts per period (1 msec * 1000 occurrences) in the [0-10msec] bucket, and 1,000 counts in the [1000-1010msec] bucket
[a 1:1 ratio]

[Over-correction for anything but back-to-back testing, and the over-correction grows at lower and lower sampling rates.]
 
So while at the "back-to-back" loading level this method seems to establish a 1:1 ratio (which is not quite right, but sort of close, to within 2x or so), the amount of correction is clearly sensitive to the sampling period. This means that whatever stats you compute are highly sensitive to the measurement technique. Even if you had a formula to correct/normalize that, the sensitivity would cause a lot of noise, as real gaps are never as perfect as this scenario, and the above shows that the result can be highly sensitive to small variations in those gaps.
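
A toy simulation (a sketch only, using an idealized system and a tester model that skips missed sample ticks rather than catching up; real testers differ) roughly reproduces those ratios for the bucket += latency method:

  // RateSensitivitySketch.java -- idealized system: answers in 1 ms during the
  // first second of each 2 s cycle, and blocks through the second second.
  public final class RateSensitivitySketch {
    static long responseMs(long tMs) {
      long inCycle = tMs % 2000;
      return inCycle < 1000 ? 1 : (2000 - inCycle) + 1;  // wait out the stall, then answer in 1 ms
    }

    public static void main(String[] args) {
      for (long intervalMs : new long[] {100, 10, 1}) {  // 10 Hz, 100 Hz, 1000 Hz sampling
        long fastWeight = 0, slowWeight = 0;             // "+= latency" weight of the [0-10) and [1000-1010) buckets
        long t = 0;
        while (t < 200_000) {                            // simulate 200 seconds
          long latency = responseMs(t);
          if (latency < 10) fastWeight += latency; else slowWeight += latency;
          long done = t + latency;                       // synchronous tester: blocked until the response arrives
          t = ((done + intervalMs - 1) / intervalMs) * intervalMs;  // then wait for the next intended tick
        }
        System.out.printf("%5d Hz: fast=%d slow=%d (fast/slow = %.2f)%n",
            1000 / intervalMs, fastWeight, slowWeight, (double) fastWeight / slowWeight);
      }
    }
  }

The fast/slow weight ratio swings from roughly 1:100 at 10 Hz to roughly 1:1 at 1000 Hz, illustrating the sensitivity to the sampling period.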


The idea of postprocessing should actually work really well even if the measurement rate changes. You can just keep a second HdrHistogram which records only those samples that are longer than the expected measurement interval, and then correct only those in post-processing. That will require twice the current memory, but keep the time per measurement constant.

Yes, but we wouldn't really need the separate histogram for outliers-only. The correction can be validly done on a single source histogram (knowing that correction wasn't done at recording time) as long as a planned/expected interval is provided,  which would need to be common across the entire recording. 1x the memory (during measurement) and time per measurement is still constant. The correction work would produce a corrected copy of the originally recorded histogram at post-processing time.

I think I'll build such a post-processing corrector as an additional API call in a future version of HdrHistogram. Thanks for the idea!


pron

Aug 11, 2013, 5:23:04 PM
to mechanica...@googlegroups.com
Gil, could you please explain if and how CO can occur when the load generator uses asynchronous IO? TCP backpressure would then cause the requests (which would continue to be generated at a constant rate) to queue up at the client, but wouldn't that still be a valid measurement of the system's latency?

Gil Tene

Aug 11, 2013, 5:44:21 PM
to <mechanical-sympathy@googlegroups.com>
If each request arrived on a separate TCP connection (as would be common on the front end of a web app), and the load generator did not block and kept everything going asynchronously such that no reduction or backoff in request rate happened, AND the timing was measured by the load generator as the gap from when it sent the request until it got the response, then no CO would occur.

However, it is quite common for the system under test to communicate with the outside world using a set of long-lived TCP streams that are shared by many unrelated requests, as some sort of muxing upstream is a very common thing. E.g. in trading systems, messaging systems, and in most cases anywhere that is one step back from the front end of a web system that actually terminates short-lived client sockets. In all such cases the shared reliable link will respond to a stalled server by filling up the buffers on the line (server and client receive buffers, etc.), quickly coming to the point where no more requests can be enqueued into it at the load generator side. At that point, even an asynchronous load generator will stop sending requests until there is room in the buffers again. In a blocking setup, the writes would simply block. In an asynchronous setup, the asynchronous writer would in one way or another wait for any of its output sockets to become writeable (e.g. in Java NIO, waiting for an OP_WRITE), and until some are, it won't be sending any more data.

So even an asynchronous tester will exhibit CO if it can't continually keep pumping data towards the server for some reason.

-- Gil.


pron

Aug 11, 2013, 5:53:27 PM
to mechanica...@googlegroups.com
> In an asynchronous setup, the asynchronous writer would in one way or another wait for any of its output sockets to become writeable (e.g. in Java NIO, waiting for an OP_WRITE), and until some are, it won't be sending any more data.

It won't be sending any more requests, but if the requests wait for the socket to clear in some Executor's queue, and the client records the time the request was generated (rather than the time it was written to the socket), wouldn't that be fine even when a single TCP connection is used? Sure, in that case, much of the latency would be due to the request waiting in the queue at the client end, but still, it would be a valid measurement of the system's responsiveness, wouldn't it?

Jason Koch

Aug 11, 2013, 7:39:37 PM
to mechanica...@googlegroups.com
Yes, I think the situation with CO really occurs when you have a higher number of "real world" consumer threads/processes than you do independent threads/processes in your test environment. For example, if you have 10 test threads and 100 real-world concurrent users, you're going to be impacted by CO in your results. However, if you have 10 test threads and only 5 real-world users you're probably not impacted (at least as far as application-level impacts go; some network/system impacts would not be modeled).

So - a suitable workaround to decrease (not eliminate) CO impacts would be to increase the number of test threads and/or use asynchronous/concurrent requests.

For an integration-level test, rather than use 10 threads at 10000 requests per second, use 1000 threads at 100 requests per second or even 10000 threads at 1 request per second. You need to ensure local threading does not cause coordination impacts, but at least your measurements should be cleaner.

Gil - is this reasonable or am I missing something here? Obviously this approach will only work for some class of application.





Gil Tene

Aug 11, 2013, 7:47:11 PM
to <mechanical-sympathy@googlegroups.com>
Yes. If you have a queue in your tester, and the queue is deep (enough to hold as many minutes of unsent requests as needed and never fill up), and the tester records the time that it put things on the queue as the start time, then you would not suffer from CO.
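
A minimal sketch of that pattern (not from this thread; the queue choice, record type, and helper stubs are assumptions):

  import java.util.concurrent.LinkedBlockingQueue;

  final class EnqueueTimedTester {
    record Request(long enqueueNanos, byte[] payload) {}

    static final LinkedBlockingQueue<Request> queue = new LinkedBlockingQueue<>();  // effectively unbounded

    // Producer: runs on its own schedule and never blocks on the SUT.
    static void enqueue(byte[] payload) {
      queue.add(new Request(System.nanoTime(), payload));
    }

    // Sender/receiver loop: may be back-pressured by the SUT, but the recorded
    // latency still includes the time spent waiting in the queue.
    static void senderLoop() throws InterruptedException {
      while (true) {
        Request r = queue.take();
        sendAndAwaitResponse(r.payload());           // stand-in for real blocking I/O
        long latencyNanos = System.nanoTime() - r.enqueueNanos();
        record(latencyNanos);                        // stand-in for histogram recording
      }
    }

    static void sendAndAwaitResponse(byte[] payload) { /* stand-in for real I/O */ }
    static void record(long latencyNanos)            { /* stand-in for histogram */ }
  }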


Gil Tene

Aug 11, 2013, 7:52:10 PM
to <mechanical-sympathy@googlegroups.com>
If the test system is one that talks to many (a variable number of) clients, as is the case with web apps, then having many more testers than actual clients and making each tester have a much longer think time for each step compared to what an actual session would do (e.g. keeping several minutes of spacing between any two requests in any single client) is a way to avoid introducing CO into the data.

There are many applications where this is not an option. E.g. the number of connections served is often fixed and sessions multiplexed well ahead of the server, or the server may be serving a handful of very high speed request streams. In such situations, increasing the number of client threads is not an option.


Michael Barker

Aug 11, 2013, 11:08:35 PM
to mechanica...@googlegroups.com
> Yes. If you have a queue in your tester, and the queue is deep (enough to
> hold as many minutes of unsent requests as needed and never fill up), and
> the tester records the time that it put things on the queue as the start
> time, then you would not suffer from CO.

You also have to consider the buffer itself. In my ping-pong latency tests that compare LinkedBlockingQueue to the Disruptor, I am seeing up to 4+ ms latency spikes for LinkedBlockingQueue. So as your desired throughput rates start to hit the ~1000 tx/sec range, even the buffer that you put in place can become a factor.

Mike.

pron

Aug 12, 2013, 10:08:26 AM
to mechanica...@googlegroups.com
Right. And I guess that as the buffer fills up you might get GC pauses on the client which are coordinated with the load on the server (i.e., not independently distributed), which means CO again...

ymo

Aug 13, 2013, 3:50:37 PM
to mechanica...@googlegroups.com
We had to build a completely separate HTTP sniffer (based on the Wireshark API) at one point to be able to show the existence of this "coordinated omission". Not only were the timestamps (sending and receiving time) more correct, it also proved that the generator was suffering from the delay caused by the SUT (system under test).

If the protocol is based on HTTP (or at least sits above TCP) it is much easier to do the timestamping and response analysis at the network level. This would be much more accurate than trying to compensate for "coordinated omission" with some sort of algorithm.

Gil Tene

Aug 13, 2013, 4:42:10 PM
to mechanica...@googlegroups.com
Wire-based timestamps are certainly better for recording response times or latencies. But if the load generator is delaying requests due to back-pressure from the SUT, getting the time measured right on the wire won't help. When such coordination between the tester and the SUT exists, the only two things you can do to get data that is valid to analyze are:

1. Correct the data set for the coordinated omission using some sort of algorithm.

or:

2. Get the tester to do its job right by never delaying due to SUT back-pressure.

ymo

Aug 13, 2013, 5:42:32 PM
to mechanica...@googlegroups.com
Our simple solution was "if something hurts then stop doing it", meaning that if the generator was not able to keep up we would just use more machines. However, my point was that you need a separate network probe to tell you whether you are getting coordinated omission or not. Relying on the generator (alone) is usually error prone.

Howard Chu

Aug 13, 2013, 7:39:10 PM
to mechanica...@googlegroups.com


On Saturday, August 10, 2013 3:11:02 PM UTC-7, Martin Thompson wrote:
BTW if you are capturing all network traffic you also see retransmissions. This is important to diagnosing problems.  A network probe is such a useful tool.

Wish I could still find cheap desktop ethernet hubs somewhere. The proliferation of ethernet switches makes it a lot harder to use a spare PC as a network monitor these days. My last hub bit the dust recently, an old Linksys 10/100 unit.

Michael Barker

Aug 13, 2013, 9:48:47 PM
to mechanica...@googlegroups.com
My Draytek Switch/VDSL/Wifi box allows for a mirror port to be set up.  Haven't tried it out though.

Mike.



Gil Tene

Aug 14, 2013, 11:14:02 AM
to mechanica...@googlegroups.com
When the generator can't keep up, that's not coordinated omission. It's just omission. Coordinated omission happens when the generator is holding back while waiting for an answer from the SUT, and by doing so missing out on planned samples that would have seen the coinciding-in-time bad behavior of the SUT. Coordinated omission can (and usually does) happen even on load testers that are 1% utilized.

Adding test machines (or tester threads on currently under-utilized machines) will only help if you make the "think time" interval between requests that the latency-reporting tester threads send much, much larger than the numbers people typically use. So large that no amount of bad behavior by the SUT would ever be larger than your think time. If you want to go to that extreme (e.g. a gap of at the very least 5 minutes between any two requests a tester thread sends) you would typically avoid coordinated omission. With thread-based testers, all it takes is tens of thousands of tester threads and lots and lots of patience (even with 30,000 threads, you can only reasonably measure about 100 latencies per second with 5 minute intervals).

BTW, while avoiding rapid-fire testing in a thread (like back-to-back requests, or 100 requests per thread) does help a bit, thinking that a 1-10 second think time is "big" is a common mistake I see, as a single event over that size will completely skew all your data and all results based on that data.

Gil Tene

Aug 16, 2013, 1:06:40 PM
to mechanica...@googlegroups.com
Per the idea that Mirolyub gave me earlier in this thread for post-processing CO correction in HdrHistogram, to keep the recording time constant in the fast path, I updated HdrHistogram to include a copyCorrectedForCoordinatedOmission() method. Now the caller has a choice between correcting at record time or as a post-recording copy. The results will be virtually identical if the same expectedIntervalBetweenSamples is passed (but not exactly the same, due to slight round-offs on the recorded values used in the correction, caused by pixelization of value buckets).

Enjoy.
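
For anyone following along, usage looks roughly like this (assuming the Java HdrHistogram API; method names are from current versions and exact signatures may have differed in 2013):

  import org.HdrHistogram.Histogram;

  public final class PostCorrectionUsage {
    public static void main(String[] args) {
      Histogram raw = new Histogram(3600L * 1000 * 1000, 3);
      // Fast path: plain, constant-time recording with no CO assumptions.
      raw.recordValue(1_000);            // e.g. a normal 1 ms (1,000 us) response
      raw.recordValue(10_000_000);       // e.g. one 10 s stall observed by the tester

      // Post-processing: produce a CO-corrected copy for a tester that intended
      // to sample every 10 ms (10,000 us).
      Histogram corrected = raw.copyCorrectedForCoordinatedOmission(10_000);
      System.out.println("raw total count:       " + raw.getTotalCount());
      System.out.println("corrected total count: " + corrected.getTotalCount());
      System.out.println("corrected 99.9%: " + corrected.getValueAtPercentile(99.9) + " us");
    }
  }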