Making sure coordinated omission is impossible


ymo

Mar 31, 2014, 1:05:51 AM
to mechanica...@googlegroups.com
I am testing a queue implementation that needs to process a certain number of elements as fast as possible. The queue's worker threads update an atomic long with the number of elements processed; this is effectively an ever-growing sequence number. A logger thread sleeps for, say, 1000 ms, then logs 1) the current time (e.g. System.currentTimeMillis()) and 2) the latest sequence number in each queue. The operations-per-second calculation is based on those logged timestamps *after* the test is over.

Since I am logging this on the worker-thread side, I want to make sure my assumption holds that this setup can never suffer from coordinated omission. Is that assumption correct?

Peter Lawrey

Mar 31, 2014, 5:22:50 AM
to mechanica...@googlegroups.com
AFAIK coordinated omission only really matters when measuring individual latencies. If you are measuring throughput (or average latency) it doesn't matter so much. This is because throughput (or average latency) is very good at hiding even very long pauses in the system. E.g. if you run a test for an hour and it has 5-minute GC pauses, you can still get a high throughput (and a low average latency) as long as it goes fast enough between the pauses.

The latency measurement problem is: when you have a pause in the system, your producers stop generating the messages which would have had high latency timings.

What I do is send data at a fixed rate, e.g. one million per second, and include the time the message *should* have been sent, not when it was actually sent. Obviously it should never actually be sent before it should, only after. For each message, I also record the time when the message is finished with, and the difference is the latency I use. If the producer is delayed for any reason, this is reflected in my numbers looking worse.
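
A minimal sketch of the fixed-rate, intended-send-time approach Peter describes (editor's illustration; the class, the send method and the chosen rate are hypothetical, not from this thread):

    import java.util.concurrent.TimeUnit;

    public class FixedRateSender {
        static final long RATE_PER_SECOND = 1_000_000;
        static final long INTERVAL_NANOS = TimeUnit.SECONDS.toNanos(1) / RATE_PER_SECOND;

        public static void main(String[] args) {
            long totalMessages = 10_000_000;
            long start = System.nanoTime();
            for (long i = 0; i < totalMessages; i++) {
                long intendedSendTime = start + i * INTERVAL_NANOS;
                while (System.nanoTime() < intendedSendTime) {
                    // spin (or park) until the intended send time; never send early, only late
                }
                send(i, intendedSendTime); // hypothetical send; the intended time travels with the message
            }
        }

        static void send(long seq, long intendedSendTimeNanos) {
            // In a real test this would enqueue the message. When the consumer finishes with it,
            // latency = System.nanoTime() - intendedSendTimeNanos, so any producer delay shows up
            // as extra latency instead of being silently omitted.
        }
    }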

BTW Gil Tene has a number of talks on Coordinated Omission which are well worth seeing.



Nitsan Wakart

Mar 31, 2014, 8:20:28 AM
to mechanica...@googlegroups.com
You can use HdrHistogram to record the latency at each consumer, which will give you a much better idea of the latency distribution than an average. It also supports data correction for an expected arrival interval. The histograms are additive, so you can collect all the consumer histograms at the end of your run and summarize the data.
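
A small sketch of what Nitsan describes, using the HdrHistogram API (editor's illustration; the bounds, units and helper names are made up):

    import java.util.Arrays;
    import java.util.List;
    import org.HdrHistogram.Histogram;

    public class ConsumerHistograms {
        // One histogram per consumer: track values up to 10 seconds with 3 significant digits.
        static Histogram newConsumerHistogram() {
            return new Histogram(10_000_000_000L, 3);
        }

        // In the consumer loop, record each operation's latency in nanoseconds.
        static void record(Histogram h, long latencyNanos, long expectedIntervalNanos) {
            // Plain recording would be: h.recordValue(latencyNanos);
            // When the expected arrival interval is known, HdrHistogram can back-fill the
            // samples a pause would have swallowed (its coordinated-omission correction):
            h.recordValueWithExpectedInterval(latencyNanos, expectedIntervalNanos);
        }

        // Histograms are additive, so the per-consumer histograms can be merged at the end of the run.
        static Histogram summarize(List<Histogram> perConsumer) {
            Histogram total = newConsumerHistogram();
            for (Histogram h : perConsumer) {
                total.add(h);
            }
            return total;
        }

        public static void main(String[] args) {
            Histogram h = newConsumerHistogram();
            record(h, 1_500, 1_000); // e.g. a 1.5 us latency with a 1 us expected interval
            summarize(Arrays.asList(h)).outputPercentileDistribution(System.out, 1000.0); // report in us
        }
    }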

ymo

Mar 31, 2014, 4:14:39 PM
to mechanica...@googlegroups.com


On Monday, March 31, 2014 5:22:50 AM UTC-4, Peter Lawrey wrote:
AFAIK coordinated omission only really matters when measuring individual latencies. If you are measuring throughput (or average latency) it doesn't matter so much. This is because throughput (or average latency) is very good at hiding even very long pauses in the system. E.g. if you run a test for an hour and it has 5-minute GC pauses, you can still get a high throughput (and a low average latency) as long as it goes fast enough between the pauses.


Exactly .. I did not realize I was testing throughput!!!
 
The latency measurement problem is: when you have a pause in the system, your producers stop generating the messages which would have had high latency timings.

What I do is send data at a fixed rate, e.g. one million per second, and include the time the message *should* have been sent, not when it was actually sent. Obviously it should never actually be sent before it should, only after. For each message, I also record the time when the message is finished with, and the difference is the latency I use. If the producer is delayed for any reason, this is reflected in my numbers looking worse.


What do you do to make sure you are sending at exactly a fixed rate?

BTW Gil Tene has a number of talks on Coordinated Omission which are well worth seeing.



I did see them ... thanks a lot :-)
 

tm jee

Apr 1, 2014, 1:56:12 AM
to mechanica...@googlegroups.com
| BTW Gil Tene has a number of talks on Coordinated Omission which are well worth seeing.

Pete, do you happen to have links to those talks? TIA.

Peter Lawrey

Apr 1, 2014, 5:55:58 AM
to mechanica...@googlegroups.com

There are a few versions available and I am not sure which is best.

Search for "how not to measure latency"


Gil Tene

Apr 1, 2014, 10:32:27 PM
to mechanica...@googlegroups.com
I keep updating the talk, but keep using the titles "How NOT to measure latency" or "Understanding Latency and Response Time". My latest version (QCon London earlier this month) should hopefully go up on InfoQ soon. In the meantime, you can find others (including a roughly one-year-old version) at: http://www.infoq.com/author/Gil-Tene


Gil Tene

Apr 1, 2014, 10:45:54 PM
to mechanica...@googlegroups.com
As Nitsan suggests, recording the individual latency of each and every operation (measured with nanoTime immediately before and after the operation being measured) into an HdrHistogram, and reporting the accumulated stats at the end, will work well. If you want HdrHistogram to correct for coordinated omission (which will absolutely occur whenever the JVM pauses on you), you can tell it what the expected interval time is. This works well when you are feeding your test with a known input rate, but is harder to do when you are running a variable rate, or an "as fast as I can" test, since in those situations you end up having to estimate the intervals according to actual activity. There is also the problem of a pause occurring outside of your timing window, which will cause coordinated omission that HdrHistogram won't know how to correct, because you would never even have a single high-latency item measured for the pause...

This is what I built LatencyUtils for. It lets you use a LatencyStats object into which you log your individually measured latencies. Under the hood, LatencyStats has a built-in interval estimator and a process-wide pause detector, which together allow it to correct for coordinated omission caused by any (detected) pauses in the process. LatencyStats reports results with HdrHistograms: you can get both the raw and the corrected histograms. In addition, LatencyStats conveniently includes an interval sampling mechanism (using atomic double-buffered flipping of the histograms you record into), so you can observe stable interval histogram values while wait-free recording keeps going on.

So I'd use a LatencyStats object to collect what you want, and then report the histograms that result (either for each interval, or the total accumulated, or both).
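
A rough sketch of what using a LatencyStats object could look like (editor's illustration; the method names are my recollection of the LatencyUtils API and should be verified against the library before use):

    import org.HdrHistogram.Histogram;
    import org.LatencyUtils.LatencyStats;

    public class LatencyStatsSketch {
        // One LatencyStats per consumer; under the hood it combines an interval estimator with a
        // process-wide pause detector to synthesize the samples a detected pause would have hidden.
        static final LatencyStats stats = new LatencyStats(); // default construction; tunable via its builder

        static void timedOperation(Runnable work) {
            long start = System.nanoTime();
            work.run();
            stats.recordLatency(System.nanoTime() - start);
        }

        public static void main(String[] args) {
            timedOperation(() -> { /* the operation under test */ });
            // Pause-corrected histogram for the interval since the last sample was taken
            // (method name as I recall it; check the LatencyUtils javadoc before relying on it):
            Histogram corrected = stats.getIntervalHistogram();
            corrected.outputPercentileDistribution(System.out, 1000.0);
        }
    }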

ymo

Apr 2, 2014, 5:02:54 PM
to mechanica...@googlegroups.com
Gil. Thank you so much for your input. 

How does LatencyUtils compare to something like JMH? Are they complementary, or would you only use one of them?

My use case is that I have a bunch of threads generating input into a queue at an "as fast as I can" rate. Filling the input queues, as well as draining them, is driven by JMH threads. Assuming I can get the latency on a per-call basis from JMH, where would you put LatencyUtils/HdrHistogram in this scenario?

Peter Lawrey

Apr 2, 2014, 5:24:57 PM
to mechanica...@googlegroups.com
"as fast as I can" translates to "I don't care how bad the latencies are"

If you have an upper bound on acceptable latency, e.g. it could be as high as minutes or as low as sub-millisecond, you need to find out what throughput you can sustain while still meeting the latency requirement.



ymo

Apr 2, 2014, 6:05:08 PM
to mechanica...@googlegroups.com
More like: what is my resulting throughput/latency if I do not throttle the incoming queues? I do not want to fix the incoming queues' throttling issues for now. But I hear your argument )

Ultimately what you want is what they do in hardware testing:
1) be able to generate the requests at a fixed rate
2) measure the resulting throughput/latency
3.1) if it passes the requirement, increase the "range" of the incoming number of requests by half
3.2) if it does not pass the requirement, divide the "range" of the incoming number of requests by two
4) start over.

Basically a binary search for the ultimate throughput/latency (a rough sketch follows below). I am definitely not there ... )
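
An illustrative sketch of that search (editor's addition; everything here, including runAtRate, is hypothetical):

    public class RateSearch {
        public static void main(String[] args) {
            double low = 0;            // highest rate known to meet the requirement
            double high = 100_000;     // initial guess at a rate that might fail

            // Grow geometrically until a failing rate is found...
            while (runAtRate(high)) {
                low = high;
                high *= 2;
            }
            // ...then binary-search between the last passing and first failing rate.
            while (high - low > high * 0.01) { // stop at ~1% resolution
                double mid = (low + high) / 2;
                if (runAtRate(mid)) low = mid; else high = mid;
            }
            System.out.printf("Max sustainable rate ~ %.0f ops/s%n", low);
        }

        // Hypothetical: drive the system at the given fixed rate for a while, measure the
        // latency distribution, and return whether the percentile requirement was met.
        static boolean runAtRate(double opsPerSecond) {
            return opsPerSecond < 350_000; // stand-in for a real measurement
        }
    }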


Gil Tene

Apr 2, 2014, 7:20:29 PM
to mechanica...@googlegroups.com
The force is strong with this one.

ymo

Apr 2, 2014, 11:39:23 PM
to mechanica...@googlegroups.com
"do or do not .. there is no try"

Peter Lawrey

Apr 3, 2014, 12:58:54 AM
to mechanica...@googlegroups.com

.... said the master of latency measuring.


Kirk Pepperdine

Apr 3, 2014, 3:20:45 AM
to mechanica...@googlegroups.com

On Apr 3, 2014, at 12:05 AM, ymo <ymol...@gmail.com> wrote:

> More like: what is my resulting throughput/latency if I do not throttle the incoming queues? I do not want to fix the incoming queues' throttling issues for now. But I hear your argument )

Sorry to say but if you’re using a queue you’re by default throttling… queuing or any form of scheduling suggests I have more work than resources and therefore I need to control access to those resources. That is throttling.

Regards,
Kirk

Peter Lawrey

Apr 3, 2014, 6:54:48 AM
to mechanica...@googlegroups.com

A queue is fine if you time from when the task is *added* to the queue, not when it is removed.
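
For example (editor's sketch; the Task wrapper and queue choice are illustrative), the enqueue timestamp can travel with the task so that the measured latency includes the time spent waiting in the queue:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class EnqueueTimedQueue {
        // The enqueue timestamp travels with the task, so latency covers queue wait + processing.
        static final class Task {
            final long payload;
            final long enqueuedNanos;
            Task(long payload, long enqueuedNanos) { this.payload = payload; this.enqueuedNanos = enqueuedNanos; }
        }

        static final BlockingQueue<Task> queue = new ArrayBlockingQueue<>(1024);

        static void produce(long payload) throws InterruptedException {
            queue.put(new Task(payload, System.nanoTime())); // stamp at *add* time
        }

        static void consume() throws InterruptedException {
            Task t = queue.take();
            // ... process the task ...
            long latencyNanos = System.nanoTime() - t.enqueuedNanos; // includes time spent queued
            // record latencyNanos into a histogram here
        }
    }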


Kirk Pepperdine

Apr 3, 2014, 6:59:48 AM
to mechanica...@googlegroups.com
On Apr 3, 2014, at 12:54 PM, Peter Lawrey <peter....@gmail.com> wrote:

A queue is fine if you time from when the task is *added* to the queue, not when it is removed.


I’m not arguing against queues. In fact I like queues.. some of my best friends are queues… I’m just saying that they act as a throttle. :-)

Regards,
Kirk

ymo

Apr 3, 2014, 7:13:20 AM
to mechanica...@googlegroups.com
I started with an "as fast as I can" rate ... Soon I am going to run out of English words to explain this )

For lack of better words, I would say that I am trying to do both of these, in this order:

1) On the test side: be able to generate the requests in a fixed-rate fashion and record the individual latencies *correctly*
2) On the app side: be able to service the requests in a fixed-rate fashion and apply proper back pressure on incoming requests. Doing it properly is what I mean when I say throttling *issues*

I am still at bullet number one here !

Peter Lawrey

Apr 3, 2014, 7:15:15 AM
to mechanica...@googlegroups.com

I agree. If your latencies look high, put a queue in front of your workload: not only will your numbers look better, but you can see an actual improvement in throughput in many cases.

Peter Lawrey

Apr 3, 2014, 7:19:03 AM
to mechanica...@googlegroups.com

You need to ensure that applying back pressure is realistic. For example, if you are an exchange you can't apply back pressure and get everyone on the exchange to wait for you. If you have web users you can do this, but I would only consider it a last resort. Better to give good service a high percentage of the time.


ymo

Apr 3, 2014, 7:25:14 AM
to mechanica...@googlegroups.com
Peter, exactly what I want. But I am starting with the test side so that at least I can measure my worst cases. There is no point trying to fix the service if I can't measure/test this *correctly*.

Peter Lawrey

Apr 3, 2014, 7:35:32 AM
to mechanica...@googlegroups.com

Right, but the problem you can have is that if you can't actually sustain the rate you are trying to test, the average latency will end up as some multiple of how long you run the test for.

In this case I suggest starting with a very low rate, increasing geometrically, and seeing at what point the worst-case or high-percentile latencies appear to be getting out of hand. That is the point at which you should start tuning so they get back to something reasonable.


ymo

Apr 3, 2014, 7:47:57 AM
to mechanica...@googlegroups.com
Cool .. will do that!

As always, I appreciate and thank you for your valuable input )



Peter Lawrey

Apr 3, 2014, 11:21:08 AM
to mechanica...@googlegroups.com
Fixed rate is a good starting point and most likely all you need. 
However, unless you know you will have a fixed rate in production, e.g. because you have data coming down a speed-limited line like 100 Mb/s or 1 Gb/s, you may want to model something closer to production, either by replaying real timestamps from a recording of production or by modelling bursts of data, e.g. batches of random size with some random interval. Whether this really matters or not depends on your use case.
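
For example, a very rough burst model along those lines might look like this (editor's sketch; the distributions and bounds are invented and would ideally be fitted to production recordings):

    import java.util.Random;
    import java.util.concurrent.TimeUnit;

    public class BurstyLoad {
        public static void main(String[] args) throws InterruptedException {
            Random rnd = new Random(42);
            for (int burst = 0; burst < 100; burst++) {
                int batchSize = 1 + rnd.nextInt(1000);        // batch of random size
                for (int i = 0; i < batchSize; i++) {
                    send(i);                                  // hypothetical send
                }
                long gapMicros = 100 + rnd.nextInt(10_000);   // random interval between batches
                TimeUnit.MICROSECONDS.sleep(gapMicros);
            }
        }

        static void send(long seq) {
            // enqueue the message, stamped with its intended send time
        }
    }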



Aleksey Shipilev

Apr 4, 2014, 4:36:44 AM
to mechanica...@googlegroups.com
JMH already records latencies in SampleTime mode, in an HdrHistogram-like fashion.
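
For reference, a minimal benchmark in that mode might look like this (editor's sketch; the benchmark body is only a placeholder):

    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;

    public class QueueSampleBenchmark {
        // SampleTime mode samples the time of individual invocations and reports a
        // latency distribution (percentiles) rather than a single average.
        @Benchmark
        @BenchmarkMode(Mode.SampleTime)
        @OutputTimeUnit(TimeUnit.MICROSECONDS)
        public void offerAndPoll() {
            // the queue operation under test goes here (an empty body is only a placeholder)
        }
    }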

But reading this thread resurrects the idea of introducing more or less standard load generators, to model the incoming load with something different from "push as fast as you can".

-Aleksey


ymo

Apr 4, 2014, 2:56:38 PM
to mechanica...@googlegroups.com
Now ... I just fell off my chair .. is it already Christmas? Aleksey .. that would be awesome .... I wish I could help!!!

Maybe this requires another thread on how one would go about doing that?