Gil,
Do you think coordinated omission will ever happen for a closed system?
I guess CO happens only for open systems, when the number of clients is infinite.
When the number of clients is limited (e.g. just a single client), CO does not happen.
Regards,
Vladimir Sitnikov
> I've been harping for a while now about a common measurement technique problem I call "Coordinated Omission", which can often render percentile data useless. You can find examples of me talking about this, with some detailed explanation of the problem, in my "How Not to Measure Latency" talk (the Coordinated Omission part starts at around 33:50).
> Actually, Vladimir, it's the opposite. The higher the thread count is in your test system, the longer the natural "think time" that each thread will normally model will be, and the less impactful coordinated omission will be.
Ok. Good point. I meant the following: CO happens when you try to model an open system with a closed system. Lots of load testers use a limited number of threads, and are thus closed.
>A single client system is the most likely to exhibit this problem, but most multi-threaded testers exhibit it as well.
If your _real_ system has a single client, no CO happens. Say a specific task is performed by a single person. If his request gets stuck, there is no one to fire similar requests, thus no omissions, thus no CO. The case of 'a single person is not enough to generate the required load' is clear, and it has nothing to do with CO compensation.
As far as I understand, there is no rule of thumb saying 'always use CO compensation in HdrHistogram'.
If the actual load rate matches the required one, no compensation is needed.
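For reference, a minimal sketch of what that compensation looks like with HdrHistogram (the interval and latency values here are illustrative, not from any real test):

import org.HdrHistogram.Histogram;

public class CoCompensationSketch {
    public static void main(String[] args) {
        // Track latencies from 1 ns up to 1 hour at 3 significant digits.
        Histogram histogram = new Histogram(3_600_000_000_000L, 3);

        long expectedIntervalNanos = 1_000_000L;   // intended rate: one request per ms (illustrative)
        long measuredLatencyNanos = 350_000_000L;  // one stalled 350 ms response (illustrative)

        // Records the measured value, plus synthesized samples for each request
        // the tester skipped while it was blocked waiting for this response.
        histogram.recordValueWithExpectedInterval(measuredLatencyNanos, expectedIntervalNanos);

        System.out.println("samples recorded: " + histogram.getTotalCount()); // 350, not 1
    }
}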
Regards,
Vladimir Sitnikov
Your observation on CO is really interesting, and I have to admit I've made the mistake many times myself in the past. More recently I've been trying to be more aware of what can potentially get hidden in measurement, and it occurred to me that while your view of CO does surface a whole class of missing observations when measuring, it also tends to assign the best-case scenario to them. Let me try and explain this.

In the case of the Disruptor latency tests, for which I'm guilty :-), we inject a new event once per microsecond and then measure the end-to-end time averaged per hop over a 3-stage pipeline. The CO issue arises when something stalls the injector from sending its next event at the 1us boundary. This could be GC, TCP back pressure, or any number of things. If I understand CO as you put it correctly, any events that did not get sent because the injector is stalled should be included in the percentiles. This does get us a lot closer to true percentiles. However, it does not take into account the effect those missing events would have imposed on the system. For example, what about the queueing effects, cache pressures, potential buffer exhaustion, etc.? If in reality the full number of events had been injected into the system to account for CO, the system may even have collapsed under the load.

The more I dig into this subject, the more I see evidence that not only is CO happening at the points where a system stalls during latency testing, we are also deluding ourselves about how good our systems actually are compared to reality. To me the evidence suggests that when we do latency testing (with load testers) on our systems, our measurements reflect a much better picture than reality can actually be. If we measure actual latencies for all events in our production systems, especially with multiple points of injection, we get a much more realistic picture.

While on the subject of confessing measurement sins: I, and I'm sure many on this list, have measured the cost of calling System.nanoTime() by repeatedly calling it in a tight loop on one thread. If you do this you get 35-65ns between calls, depending on processor speed and which recent version of Linux and the JVM you are using. On Windows it often does not advance for tens of calls at a time. This is very misleading, because you typically need to add at least a "dirty hit" cache snoop cost, or much worse on a multi-socket server. In a realistic scenario you need to assume a cost of >100ns per call, with a fair bit of variance.
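In sketch form, that naive single-threaded measurement looks something like this (illustrative, not the actual benchmark):

// Naive single-threaded measurement of System.nanoTime() cost: the misleading
// technique confessed to above. It reports the low 35-65ns figure because the
// counter stays hot on one core and never pays a cross-core/cross-socket snoop cost.
public class NanoTimeCostSketch {
    public static void main(String[] args) {
        final int calls = 10_000_000;
        long sink = 0; // consumed below so the calls cannot be optimized away
        long t0 = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            sink += System.nanoTime();
        }
        long t1 = System.nanoTime();
        System.out.println("apparent cost/call: " + ((t1 - t0) / (double) calls)
                + " ns (sink=" + sink + ")");
    }
}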
On the bright side of this: if the measurement technique, even when flawed, shows a significant improvement between two systems, then things are moving in the right direction, even if the percentiles are a work of fiction. Well done to the log4j v2 folks for taking a big step in the right direction.
Martin...
I've been harping for a while now about a common measurement technique problem I call "Coordinated Omission", which can often render percentile data useless. You can find examples of me talking about this, with some detailed explanation of the problem, in my "How Not to Measure Latency" talk (the Coordinated Omission part starts at around 33:50).
I believe that this problem occurs extremely frequently in test results, but it's usually hard to deduce its existence purely from the final data reported. But every once in a while, I see test results where the data provided is enough to demonstrate the huge percentile-misreporting effect of Coordinated Omission based purely on the summary report.

I ran into just such a case in Attila's cool posting about log4j2's truly amazing performance, so I decided to avoid polluting his thread with an elongated discussion of how to compute 99.9%'ile data, and started this topic here. That thread should really be about how cool log4j2 is, and I'm certain that it really is cool, even after you correct the measurements.
Attila's results are posted at http://logging.apache.org/log4j/2.x/manual/async.html#Performance, and while they demonstrate vastly superior throughput and latency behavior compared to other loggers (including log4j, obviously), I see an issue with the reported numbers for the 99.99% latencies (and probably for the 99%). This gripe probably applies to how the LMAX Disruptor numbers are reported for the 99.99% as well, but there I don't have enough data within what's posted to prove it.
Interesting remark. The skew induced by CO also depends on the type of jitter (locking-induced jitter might hit a single client only, GC-induced jitter hits all clients) and on the frequency of test events vs. outlier duration.
E.g. a manual trading client sending max 500 orders a day won't be able to hit a 400ms GC outlier twice. An algo-trading or quote machine client will be affected much harder.
CO is a testing methodology and reporting problem, and has nothing to do with the system under test and what it is used for, so no type of system is immune; only types of testers are. Even real systems with a single real-world client are susceptible.
The CO methodology problem amounts to dropping or ignoring bad results from your data set before computing summary statistics on them, and reporting very wrong stats as a result. The stats can often be orders of magnitude off. E.g. 35,000x off for the 99.99%'ile, as I show in the example above, or the 99.9%'ile being reported as better than the real 99%'ile, etc. CO happens for all testers that avoid sending requests when any form of back-pressure occurs (usually in the form of some previous request not completing before a new one was supposed to be sent according to the testing model).

A simple way to demonstrate the CO problem on a "real system with a single client" would be this hypothetical:
Imagine that you have a high-end concierge business with a single client, and that single client typically calls you on the phone about 10 times a day to perform some transaction (e.g. trade a stock, check his account balances, buy a shirt, check the weather). In order to keep your customer happy, and avoid losing them to the internet businesses you compete with, you decide that you want to provide them with good customer service, which to you amounts to an actual person answering the phone within 3 rings, 99% of the time, as long as they call any time between 9AM and 5PM Pacific time.
You decide to regularly measure your business performance to establish whether or not your behavior meets your goals (of 99% response within 3 rings), and to help you decide whether you need to hire additional people to answer the phone, or maybe replace someone if they are lazy.

So you build a test system. The test system is simple: during a day that your customer is away on vacation and won't be calling you, you ring the business once every minute during the entire 9AM to 5PM period, and check how many rings it took before someone answered each time. You then compute the 99%'ile of that set of samples, and if that 99%'ile is 3 rings or better, you are performing within expectations. If it's not, you know that you need to improve the way your business works somehow (replace someone, or hire additional people to cover each other).

You do the test, and it shows that your business really does answer the phone within 3 rings more than 99% of the time. In fact, most of the time the phone was answered in 1 or 2 rings, and of all the times your test system called, it took more than 3 rings only once. You feel happy. You tell your wife things are going great. You give your employees bonuses for over-performing.

The next day your client fires you. He tried to call during the lunch hour, and nobody was there. In fact, this has been happening for a week now, and he just can't believe your outright dishonesty and false advertisement of your services.

What happened in the above scenario is simple: your testing methodology experienced Coordinated Omission. You dialed the business once a minute for the entire day, and in 420 out of the 421 dialing tests you made, the phone was promptly answered within 3 rings or less. That's 99.76%! That's great. What you missed is that your single switch operator, the one that started last week and didn't get proper training, thought that she gets 1 hour off for lunch every day, and at 12 noon she left her desk and went to have lunch with her friends across the street. Being a conscientious worker, she was back at her desk promptly at 1PM, answering the phone that had been ringing for a while.

When your test system encountered this, it recorded a single, 1800-ring phone call attempt at noon, followed by a 2-ring call at 1PM. Because it was busy waiting for the phone to be answered between 12 and 1, the test system missed 59 opportunities to call during lunch. Had it made those calls, it would have found that they all took longer than 3 rings to answer, that your better-than-3-rings call-answering percentile is only 87.5%, and that your 99%'ile answering time is actually 1,650 rings, and not 3.

And had you known that, you probably would have added capacity to your business, so that when employees go out to lunch (or take bathroom breaks, or pause to take out the garbage), there is someone there to cover for them and answer the phone.
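For the curious, here is a minimal sketch reconstructing the corrected numbers in this hypothetical (my own back-of-envelope code, assuming 30 rings per minute as implied by the 1800-ring hour; the percentile convention used is illustrative):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ConciergeSketch {
    public static void main(String[] args) {
        List<Integer> rings = new ArrayList<>();
        // 420 calls outside the lunch hour, answered promptly.
        for (int i = 0; i < 420; i++) rings.add(2);
        // Corrected lunch-hour samples: the recorded 12:00 call waited the full
        // hour (1800 rings, i.e. 30 rings/minute), and the 59 omitted calls at
        // 12:01..12:59 would each have waited correspondingly less.
        for (int i = 0; i < 60; i++) rings.add(1800 - 30 * i);
        Collections.sort(rings);

        long within3 = rings.stream().filter(r -> r <= 3).count();
        System.out.printf("answered within 3 rings: %.1f%%%n",
                100.0 * within3 / rings.size());                              // 87.5%
        System.out.println("99%'ile: "
                + rings.get((int) (0.99 * (rings.size() - 1))) + " rings");   // 1650
    }
}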
On 2013-08-04, at 12:30 PM, Rüdiger Möller <moru...@gmail.com> wrote:

> Interesting remark. The skew induced by CO also depends on the type of jitter (locking-induced jitter might hit a single client only, GC-induced jitter hits all clients) and on the frequency of test events vs. outlier duration.

I would restrict this to behaviour in the test bed, not the component/app/whatever is being tested.

> E.g. a manual trading client sending max 500 orders a day won't be able to hit a 400ms GC outlier twice. An algo-trading or quote machine client will be affected much harder.
-- Kirk
> Why would a quote machine be hit?

Manual order entry cannot send faster than one order every ~1-2 seconds. A quote machine often sends several quotes per second.

So if I test for a manual order entry client with a single-threaded program sending one synchronous order every 5 seconds, I don't have to adjust the results (if the max outlier is <5s). If I test for a quote machine client with a single-threaded program sending 10 synchronous quotes/second, I have to add 'missing test cases' for the quotes the blocked test program failed to send while waiting. Otherwise the data would report 1 latency incident instead of something like 20.
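A minimal sketch of that adjustment, which mirrors what HdrHistogram's recordValueWithExpectedInterval does internally (class and method names here are illustrative):

import org.HdrHistogram.Histogram;

public class BackfillSketch {
    static final long EXPECTED_INTERVAL_NANOS = 100_000_000L; // 10 quotes/second

    // Record one measured latency plus synthesized samples for every send the
    // blocked single-threaded tester skipped while it was waiting.
    static void record(Histogram histogram, long measuredLatencyNanos) {
        histogram.recordValue(measuredLatencyNanos);
        for (long missed = measuredLatencyNanos - EXPECTED_INTERVAL_NANOS;
             missed >= EXPECTED_INTERVAL_NANOS; missed -= EXPECTED_INTERVAL_NANOS) {
            histogram.recordValue(missed);
        }
    }

    public static void main(String[] args) {
        Histogram histogram = new Histogram(3_600_000_000_000L, 3);
        record(histogram, 2_000_000_000L); // one 2-second stall
        System.out.println("samples recorded: " + histogram.getTotalCount()); // 20, not 1
    }
}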
> While on the subject of confessing measurement sins: I, and I'm sure many on this list, have measured the cost of calling System.nanoTime() by repeatedly calling it in a tight loop on one thread. [...]

Yes, you have an overhead error on the front end and an error on the back end of any timed interval. Since the typical use case is getTime(); doSomething(); getTime(), this is the equivalent of the cost of a single call to the timer. The error comes from backing out the cost of getting the timer value in the first call plus getting it in the second call. Do you have any idea of the effect of the distance between the two timing events in this use case?
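In sketch form, the idiom in question (doSomething() is a placeholder):

// The typical timed-interval idiom: the measured interval carries roughly one
// timer-call cost (the tail of the first nanoTime() plus the head of the second).
public class TimedIntervalSketch {
    static void doSomething() { /* placeholder for the work being timed */ }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        doSomething();
        long t1 = System.nanoTime();
        System.out.println("elapsed: " + (t1 - t0)
                + " ns, including ~one timer-call cost (>100ns on real hardware)");
    }
}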
First of all, I would like to clarify that those measurements were not done by me but rather by some Log4j committers. I just found the page (by somebody linking to it) and ran the tests myself, because I wanted to see if there is room for improvement (TL;DR: this rate of logging is reaching the limits of memory bandwidth, but latencies could be made more consistent; please reply in the other thread if you want to discuss this).
Getting back to CO:
- you can see the actual test class here: https://svn.apache.org/repos/asf/logging/log4j/log4j2/trunk/core/src/test/java/org/apache/logging/log4j/core/async/perftest/RunLog4j2.java
- there are two methods (runThroughputTest and runLatencyTest), so I assume the throughput graphs and latency graphs are independent (they were collected in separate runs)
- for the latency case it is indeed measuring the cost of calling logger.log, and also adjusting for the cost of calling nanoTime. Waiting after each logging statement is accomplished using a busy-wait (see the sketch after this message).

Now getting back to your description of CO: do I understand correctly that the basic problem can be described as "percentiles are not guarantees for maximum values"? And isn't the solution as simple as just including the maximum value in the discussion? I.e. we say "if 99.99% of the time we respond within 100 msec, and we always respond in < 1 sec, we make money"? I also think this is in line with Martin's methodology of talking with clients :-)
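For reference, the busy-wait pacing pattern mentioned in the list above looks roughly like this. This is my sketch, not the actual RunLog4j2 code; the method shape and parameter names are illustrative:

import org.HdrHistogram.Histogram;
import org.apache.logging.log4j.Logger;

public class PacedLatencyTestSketch {
    // Measure each logging call, adjust for the pre-measured nanoTime() cost,
    // and busy-wait until the next scheduled send time instead of sleeping.
    static void runLatencyTest(Logger logger, Histogram histogram, int iterations,
                               long intervalNanos, long nanoTimeCostNanos) {
        long next = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            logger.info("latency test message");
            long latency = System.nanoTime() - start - nanoTimeCostNanos;
            histogram.recordValue(Math.max(0, latency)); // guard against over-subtraction
            next += intervalNanos;
            while (System.nanoTime() < next) { /* busy-wait to hold the pacing rate */ }
        }
    }
}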