Gil,
Do you think coordinated omission can ever happen for a closed system?
I guess CO happens only for open systems, where the number of clients is effectively infinite.
If the number of clients is limited (e.g. just a single client), CO does not happen.
Regards,
Vladimir Sitnikov
I've been harping for a while now about a common measurement technique problem I call "Coordinated Omission", which can often render percentile data useless. You can find examples of me talking about this, with some detailed explanation of the problem, in my "How Not to Measure Latency" talk (the Coordinated Omission part starts at around 33:50).
> Actually, Vladimir, it's the opposite. The higher the thread count is in your test system, the longer the natural "think time" that each thread will normally model will be, and the less impactful coordinated omission will be.
Ok. Good point. I meant the following: CO happens when you try to model an open system with a closed system. Lots of load testers use a limited number of threads, and are thus closed.
>A single client system is the most likely to exhibit this problem, but most multi-threaded testers exhibit it as well.
If your _real_ system has a single client, no CO happens. Say a specific task is performed by a single person. If his request gets stuck, there is no one to fire similar requests, so there are no omissions, and thus no CO. The case where a single person is not enough to generate the required load is clear, and it has nothing to do with CO compensation.
As far as I understand, there is no rule of thumb saying 'always use CO compensation in HdrHistogram'.
If the actual load rate matches the required one, no compensation is required.
Regards,
Vladimir Sitnikov
Your observation on CO is really interesting, and I have to admit I've made the mistake many times myself in the past. More recently I've been trying to be more aware of what can potentially get hidden in measurement, and it occurred to me that while your view of CO does surface a whole class of missing observations when measuring, it also tends to assign the best-case scenario to them. Let me try and explain this.

In the case of the Disruptor latency tests, for which I'm guilty :-), we inject a new event once per microsecond and then measure the end-to-end time averaged per hop over a 3-stage pipeline. The CO issue arises when something stalls the injector from sending its next event at the 1us boundary. This could be GC, TCP back pressure, or any number of things. If I understand CO as you put it correctly, any events that did not get sent because the injector was stalled should be included in the percentiles. This does get us a lot closer to true percentiles. However, it does not take into account the effect those missing events would have imposed on the system. For example, what about the queueing effects, cache pressures, potential buffer exhaustion, etc.? If in reality the actual number of events had been injected into the system to account for CO, then the system may even have collapsed under the load.

The more I dig into this subject, the more I see evidence that not only is CO happening at the point of stalls in a system when doing latency testing, we are also deluding ourselves about how good our systems actually are compared to reality. To me the evidence suggests that when we do latency testing (with load testers) on our systems, our measurements paint a much better picture than reality actually can be. If we measure actual latencies for all events in our production systems, especially with multiple points of injection, then we get a much more realistic picture.

While on the subject of confessing measurement sins: I, and I'm sure many on this list, have measured the cost of calling System.nanoTime() by repeatedly calling it in a tight loop on one thread. If you do this you get 35-65ns between calls, depending on processor speed and which version of Linux and JVM you are using. On Windows it often does not advance for 10s of calls at a time. This is very misleading, because you can typically add at least a "dirty hit" cache snoop cost, or much worse on a multi-socket server. In a realistic scenario you need to assume a >100ns cost per call, with a fair bit of variance.
On the bright side of this, if the measurement technique, even when flawed, is showing a significant improvement between 2 systems then things are moving in the right direction even if the percentiles are a work of fiction. Well done to the log4j v2 folk for taking a big step in the right direction.
Martin...
I've been harping for a while now about a common measurement technique problem I call "Coordinated Omission", which can often render percentile data useless. You can find examples of me talking about this, with some detailed explanation of the problem, in my "How Not to Measure Latency" talk (the Coordinated Omission part starts at around 33:50).

I believe that this problem occurs extremely frequently in test results, but it's usually hard to deduce its existence purely from the final data reported. But every once in a while, I see test results where the data provided is enough to demonstrate the huge percentile-misreporting effect of Coordinated Omission based purely on the summary report. I ran into just such a case in Attila's cool posting about log4j2's truly amazing performance, so I decided to avoid polluting his thread with an elongated discussion of how to compute 99.9%'ile data, and started this topic here. That thread should really be about how cool log4j2 is, and I'm certain that it really is cool, even after you correct the measurements.

Attila's results are posted at http://logging.apache.org/log4j/2.x/manual/async.html#Performance, and while they demonstrate vastly superior throughput and latency behavior compared to other loggers (including log4j, obviously), I see an issue with the reported numbers for the 99.99% latencies (and probably for the 99%). This gripe probably applies to how the LMAX Disruptor numbers are reported for 99.99% as well, but there I don't have enough data within what's posted to prove it.
Interesting remark. The skew induced by CO also depends on the type of jitter (locking-induced jitter might hit a single client only, while GC-induced jitter hits all clients) and on the frequency of test events vs. outlier duration.
E.g. a manual trading client sending at most 500 orders a day won't be able to hit a 400ms GC outlier twice. An algo-trading or quote-machine client will be affected much harder.
CO is a testing methodology and reporting problem, and has nothing to do with the system under test and what it is used for, so no type of system is immune; only types of testers are. Even real systems with a single real-world client are susceptible.
The CO methodology problem amounts to dropping or ignoring bad results from your data set before computing summary statistics on them, and reporting very wrong stats as a result. The stats can often be orders of magnitude off. E.g. 35,000x off for the 99.99%'ile as I show in the example above, or the 99.9%'ile being reported as better than the real 99%'ile, etc. CO happens for all testers that avoid sending requests when any form of back-pressure occurs (usually in the form of some previous request not completing before a new one was supposed to be sent according to the testing model).

A simple way to demonstrate the CO problem on a "real system with a single client" would be this hypothetical:
Imagine that you have a high end concierge business with a single client, and that single client typically calls you on the phone about 10 times a day to perform some transaction (e.g. trade a stock, or check his account balances, buy a shirt, check the weather). In order to keep your customer happy, and avoid losing them to the internet businesses you compete with, you decide that you want to provide them with good customer service, which to you amounts to an actual person answering the phone within 3 rings, 99% of the time, as long as they call any time between 9AM and 5PM pacific time.
You decide to regularly measure your business performance to establish whether or not your behavior meets your goals (of 99% response within 3 rings), and to help you decide whether you need to hire additional people to answer the phone, or maybe replace someone if they are lazy.

So you build a test system. The test system is simple: during a day that your customer is away on vacation and won't be calling you, you ring the business once every minute during the entire 9AM to 5PM period, and check how many rings it took before someone answered each time. You then compute the 99%'ile of that set of samples, and if that 99%'ile is 3 rings or better, you are performing within expectations. If it's not, you know that you need to improve the way your business works somehow (replace someone, or hire additional people to cover each other).

You do the test, and it shows that your business really does answer the phone within 3 rings more than 99% of the time. In fact, most of the time the phone was answered in 1 or 2 rings, and of all the times your test system called, it took more than 3 rings only once. You feel happy. You tell your wife things are going great. You give your employees bonuses for over-performing.

The next day your client fires you. He tried to call during the lunch hour, and nobody was there. In fact, this has been happening for a week now, and he just can't believe your outright dishonesty and false advertisement of your services.

What happened in the above scenario is simple: your testing methodology experienced Coordinated Omission. You dialed the business once a minute for the entire day, and in 420 out of the 421 dialing tests you made, the phone was promptly answered within 3 rings or less. That's 99.76%! That's great. What you missed is that your single switch operator, the one that started last week and didn't get proper training, thought that she gets 1 hour off for lunch every day, and at 12 noon she left her desk and went to have lunch with her friends across the street. Being a conscientious worker, she was back at her desk promptly at 1PM, answering the phone that had been ringing for a while.

When your test system encountered this, it recorded a single, 1800-ring phone call attempt at noon, followed by a 2-ring call at 1PM. Because it was busy waiting for the phone to be answered between 12 and 1, the test system missed 59 opportunities to call during lunch. Had it made those calls, it would have found that they all took longer than 3 rings to answer, that your better-than-3-rings call-answering percentile is only 87.5%, and that your 99%'ile answering time is actually 1,650 rings, and not 3.

And had you known that, you probably would have added capacity to your business so that when employees go out to lunch (or take bathroom breaks, or pause to take out the garbage), there is someone there to cover for them and answer the phone.

On Saturday, August 3, 2013 11:20:01 PM UTC-7, Vladimir Sitnikov wrote:
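To make the arithmetic above concrete, here is a minimal illustrative sketch (not from the original post; it just uses the numbers from the hypothetical) that reconstructs the 59 omitted lunch-hour calls and recomputes the percentile. The linear back-fill of missing samples is the same idea HdrHistogram's correcting record form uses:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ConciergeExample {
    public static void main(String[] args) {
        List<Integer> rings = new ArrayList<>();
        // The 420 "good" samples (answered in ~2 rings) plus the single 1800-ring call the tester saw:
        for (int i = 0; i < 420; i++) rings.add(2);
        rings.add(1800);
        System.out.println("Naive 99%'ile:     " + percentile(rings, 0.99) + " rings");

        // Back-fill the 59 calls the tester would have made during the hour-long lunch
        // stall had it not been stuck waiting: each would have rung roughly 30 rings
        // (one minute) less than the previous one, i.e. ~1770, 1740, ... down to ~30.
        for (int missed = 1; missed <= 59; missed++) rings.add(1800 - missed * 30);
        System.out.println("Corrected 99%'ile: " + percentile(rings, 0.99) + " rings");
    }

    static int percentile(List<Integer> values, double p) {
        List<Integer> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        return sorted.get((int) Math.ceil(p * sorted.size()) - 1);
    }
}

The naive computation reports ~2 rings; with the omitted samples restored, the 99%'ile lands above 1,600 rings, in line with the numbers described above.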
On 2013-08-04, at 12:30 PM, Rüdiger Möller <moru...@gmail.com> wrote:

> Interesting remark. The skew induced by CO also depends on the type of jitter (locking-induced jitter might hit a single client only, GC-induced jitter hits all clients) and the frequency of test events vs. outlier duration.

I would restrict this to behaviour in the test bed, not the component/app/whatever being tested.

> E.g. a manual trading client sending max 500 orders a day won't be able to hit a GC outlier of 400ms twice. An algotrading or quote machine client will be affected much harder.
-- Kirk
> Why would a quote machine be hit?
Manual order entry cannot send faster than one order every ~1-2 seconds. A quote machine often sends several quotes per second.
So if I test for a manual order entry client with a single-threaded program sending an order synchronously every 5 seconds, I don't have to adjust the results (if the max outlier is <5s). If I test for a quote machine client with a single-threaded program sending 10 quotes/second synchronously, I have to add the 'missing test cases' for the quotes the blocked test program did not send while waiting. Otherwise the data would report 1 latency incident instead of something like 20.
On Sunday, August 4, 2013 4:03:33 PM UTC+2, Kirk Pepperdine wrote:
On 2013-08-04, at 12:30 PM, Rüdiger Möller <moru...@gmail.com> wrote:
Interesting remark.
The skew induced by CO also depends on the type of jitter (locking-induced jitter might hit a single client only, GC-induced jitter hits all clients) and the frequency of test events vs. outlier duration.

I would restrict this to behaviour in the test bed, not the component/app/whatever being tested.

E.g. a manual trading client sending max 500 orders a day won't be able to hit a GC outlier of 400ms twice. An algotrading or quote machine client will be affected much harder.
-- Kirk
> While on the subject of confessing measurement sins: I, and I'm sure many on this list, have measured the cost of calling System.nanoTime() by repeatedly calling it in a tight loop on one thread. If you do this you get 35-65ns between calls depending on processor speed and which version of Linux and JVM you are using. On Windows it often does not advance for 10s of calls at a time. This is very misleading because you can typically add at least a "dirty hit" cache snoop cost, or much worse on a multi-socket server. In a realistic scenario you need to assume a >100ns cost per call, with a fair bit of variance.

Yes, you have an overhead error on the front end and an error on the back end of any timed interval. Since the typical use case is getTime(); doSomething(); getTime(), this is the equivalent of the cost of a single call to the timer. The error is in backing out the cost of getting the timer value in the first call plus getting the timer value in the second call. Do you have any idea of the effects of the distance between the two timing events in this use case?
First of all, I would like to clarify that those measurements were not done by me but rather by some Log4j committers. I just found the page (by somebody linking to it) and ran the tests myself because I wanted to see if there is room for improvement (TL;DR - this rate of logging is reaching the limit of the memory bandwidth, however latencies could be made more consistent - please reply in the other thread if you want to discuss this).

Getting back to CO:
- you can see the actual test class here: https://svn.apache.org/repos/asf/logging/log4j/log4j2/trunk/core/src/test/java/org/apache/logging/log4j/core/async/perftest/RunLog4j2.java
- there are two methods (runThroughputTest and runLatencyTest), so I assume that the throughput graphs and latency graphs are independent (were collected in separate runs)
- for the latency case it is indeed measuring the cost of calling logger.log and also adjusting for the cost of calling nanoTime. Waiting after each logging statement is accomplished using busy-wait.

Now getting back to your description of CO: do I understand correctly that the basic problem can be described as "percentiles are not guarantees for maximum values"? And isn't the solution as simple as just including the maximum value in the discussion? I.e. we say "if 99.99% of the time we respond in 100 msec and we always respond in < 1 sec, we make money"? I also think this is in line with Martin's methodology of talking with clients :-)
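For context, here is a minimal illustrative sketch of the measure-then-busy-wait idiom being described (not the actual RunLog4j2 code; the class, method and parameter names are hypothetical):

import org.HdrHistogram.Histogram;
import org.apache.logging.log4j.Logger;

// Illustrative sketch only - not the actual RunLog4j2 test code.
final class LatencyLoopSketch {
    static void runLatencyTest(Logger logger, Histogram histogram,
                               int iterations, int threadCount, long nanoTimeCostNanos) {
        long pauseNanos = 10_000L * threadCount; // "10 microseconds * threadCount" pacing
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            logger.info("Latency test message");            // the call being measured
            long latency = (System.nanoTime() - start) - nanoTimeCostNanos;
            histogram.recordValue(Math.max(latency, 0));    // adjust for the timer's own cost
            long pauseUntil = System.nanoTime() + pauseNanos;
            while (System.nanoTime() < pauseUntil) {
                // busy-wait to pace the next logging call
            }
        }
    }
}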
I've been harping for a while now about a common measurement technique problem I call "Coordinated Omission", which can often render percentile data useless. You can find examples of me talking about this, with some detailed explanation of the problem, in my "How Not to Measure Latency" talk (the Coordinated Omission part starts at around 33:50).

I believe that this problem occurs extremely frequently in test results, but it's usually hard to deduce its existence purely from the final data reported. But every once in a while, I see test results where the data provided is enough to demonstrate the huge percentile-misreporting effect of Coordinated Omission based purely on the summary report. I ran into just such a case in Attila's cool posting about log4j2's truly amazing performance, so I decided to avoid polluting his thread with an elongated discussion of how to compute 99.9%'ile data, and started this topic here. That thread should really be about how cool log4j2 is, and I'm certain that it really is cool, even after you correct the measurements.

Attila's results are posted at http://logging.apache.org/log4j/2.x/manual/async.html#Performance, and while they demonstrate vastly superior throughput and latency behavior compared to other loggers (including log4j, obviously), I see an issue with the reported numbers for the 99.99% latencies (and probably for the 99%). This gripe probably applies to how the LMAX Disruptor numbers are reported for 99.99% as well, but there I don't have enough data within what's posted to prove it.

Basically, I think that the 99.99% observation computation is wrong, and demonstrably (using the graph data posted) exhibits the classic "coordinated omission" measurement problem I've been preaching about. This test is not alone in exhibiting this, and there is nothing to be ashamed of when you find yourself making this mistake. I only figured it out after doing it myself many many times, and then I noticed that everyone else seems to also be doing it but most of them haven't yet figured it out. In fact, I run into this issue so often in percentile reporting and load testing that I'm starting to wonder if coordinated omission is there in 99.9% of latency tests ;-)

Here are the basic assumptions, and the basic math:

Assumptions:
1. I am *assuming* that in the graphs titled "Async Logging Latency Histogram" and "Async Logging Max Latency of 99.99% of Observations", the latency being measured is the latency of the call to the logger, i.e. the time that the thread spent inside the call (as opposed to a time measured between the call and some asynchronous event recorded concurrently with the thread's execution). This assumption is key to the following math, so if it's wrong, the rest of the below is basically male cow dung.
2. The test description notes that "After each call to Logger.log, the test waits for 10 microseconds * threadCount before continuing", but doesn't mention how this wait is achieved (e.g. Thread.sleep() or spin). In any case, since 64 * 10 is 640usec, I'll round up and assume that a wait is for 1msec (the analysis below will be even worse if the wait is shorter than that).

Analysis:
- The normal observation interval for this test is ~1 msec. I.e. each thread running the test would "normally" issue a new logging message once every ~1 msec (the ~1msec wait plus the very tight ~1-5usec average Logger.log() call latency).
- As long as the Logger.log() call latency is lower than ~1msec (which is the vast majority of the time), things are "fine". However, when Logger.log() call latencies much larger than ~1msec occur (e.g. the graph shows 10 recorded calls that took 536msec each), each of those calls also "freezes" the testing thread for a duration much longer than the normal interval between two observations.
- Omitted observations: Effectively, each such "freeze" omits observations that would have normally happened had it not been for the back-pressure created by the call latency. I.e. had an observer continued to issue logging calls at the normal observation interval during this freeze, additional results would have been included in the overall stats computations. E.g. for each of those ~536msec freezes, there are ~535 missing observations.
- Coordinated Omission: Omission on its own is not a big problem, as long as the omission is random. E.g. if we randomly threw away 3 million of those 5 million latency measurements each logging thread was doing, the statistical results would probably not be materially affected. However, when omission is coordinated with observed events, it can dramatically skew the statistical analysis of the remaining results. In the case of this sort of coordinated omission, we are only omitting the "very bad results" (results that are larger than 1msec, i.e. larger than 200x the average latency). This has a huge impact on the eventual percentile calculation.
- Based on the 10 observations of 536msec each alone (and ignoring the 50+ other huge results next to them), we can deduce that an actual observer would have seen 5,350 additional results [10 x (536 - 1)] ranging linearly between 1 msec and 535 msec (with a median of ~268msec).
- If we add those 5,350 results to the 5,000,000 results that the thread recorded, they would represent >0.1% of the overall data, and halfway into that additional data (at 0.05% of overall data) we would see the median result of the missing set, which is 268msec.
- Therefore, based on the 10 outliers in the graph alone, your 99.95%'ile is at least 268msec.
- Similarly, the 99.99%'ile is at least 582 msec (the 90%'ile of the missing data that represents 0.01% of the total observations).

That's at least 35,000x higher than reported...

There is good news here, too. HdrHistogram has a recording mode that automatically corrects for coordinated omission if you know what your expected interval between measurements is (and here it is ~1msec). If you add an HdrHistogram to such a test and record your call latencies using the recordValue(value, expectedIntervalBetweenValueSamples) form, the Histogram will automatically re-create the missing measurements. It is often useful to compare the output of two HdrHistograms, one of which uses the correcting form and one of which just records the raw values, as the differences in output at the higher percentiles are often "impressive".

Footnote: anyone want to guess what that 582 msec 99.99%'ile is dominated by?
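For reference, a minimal illustrative sketch of that recording idiom, keeping a raw and a CO-corrected histogram side by side (the two-argument recordValue form follows the HdrHistogram version discussed in this thread; newer releases expose it as recordValueWithExpectedInterval):

import org.HdrHistogram.Histogram;

// A raw and a CO-corrected view of the same measurements, assuming an expected
// interval of ~1 msec between observations (as in the analysis above).
final class CoCorrectionSketch {
    static final Histogram raw = new Histogram(3600L * 1000 * 1000 * 1000, 3);       // up to 1 hour, 3 decimal digits
    static final Histogram corrected = new Histogram(3600L * 1000 * 1000 * 1000, 3);
    static final long EXPECTED_INTERVAL_NANOS = 1_000_000L;

    static void record(long latencyNanos) {
        raw.recordValue(latencyNanos);
        // The correcting form also back-fills the observations a non-stalled observer
        // would have made whenever a latency exceeds the expected interval:
        corrected.recordValue(latencyNanos, EXPECTED_INTERVAL_NANOS);
    }
}

Printing both histograms with outputPercentileDistribution (as shown later in this thread) makes the difference at the higher percentiles immediately visible.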
On Thursday, August 8, 2013 2:02:36 AM UTC-7, Remko Popma wrote:...
A few questions:
- I copied a lot of performance test ideas from the Disruptor test code, including the idea to measure the cost of System.nanoTime() and subtract that cost from the measurement. I did not see that idea discussed, so I assume that is still valid, correct?
I think such corrections are valid, but that one should be careful with establishing the estimated cost of nanoTime that is actually safe to subtract. Specifically, I believe that it is valid and safe to subtract the minimum measured cost of nanoTime from all measurements, but not the average cost. So in establishing the cost, I'd run a long enough loop of two back-to-back calls to nanoTime, and look for the minimum experienced gap between them. And yes, the minimum could end up being 0, which would mean that's the maximum safe amount to shave off your latency measurement, because it would mean that you don't actually know that the measurement cost was not zero at your sample point.
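A minimal illustrative sketch of that calibration loop (the loop length is an arbitrary choice, not from the original post):

// Estimate the safe-to-subtract cost of System.nanoTime() by taking the minimum
// observed gap between back-to-back calls, rather than the average.
static long minNanoTimeGap() {
    long minGapNanos = Long.MAX_VALUE;
    for (int i = 0; i < 10_000_000; i++) {
        long t1 = System.nanoTime();
        long t2 = System.nanoTime();
        if (t2 - t1 < minGapNanos) {
            minGapNanos = t2 - t1;
        }
    }
    return minGapNanos; // possibly 0 - the most that is safe to shave off each measurement
}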
- I actually considered only using the histogram diagram to show Log4j2 latency and not including the "average latency" and "Latency of 99.99% of Observations" diagrams. Unfortunately the histogram gets too busy when comparing more than two results, where the "average" and 99.99% diagrams allowed me to compare the disruptor-based async loggers to a number of other loggers (logback, log4j-1.x and log4j2's queue-based async appender). Thoughts?
Take a look at the percentile distribution diagram style I use in jHiccup. Rather than choosing one or two percentiles to report, I like to plot the entire percentile spectrum. The Excel (or other plotting setup) work to make this chart happen can be a bit tricky (I can detail it and provide an Excel sheet example for anyone interested), but it's worth it presentation-wise. While showing the distribution in counts-per-bucket is useful (keep doing that), I find that the most informative "relevant to operational and business requirements" chart is a percentile distribution chart that looks like the below, as it allows you to directly compare the reported behavior against requirements (an SLA, for example, would be represented as a simple step-function thing on such a graph).

The output from HdrHistogram's histogram.getHistogramData().outputPercentileDistribution() was specifically designed so that a few tens of percentile report lines still provide enough dynamic range and percentile resolution that it can be taken straight into a plot like the above, and keep the plot nice, smooth and accurate. This is exactly where reports from linear or logarithmic histogram buckets fail, as quantization kills the graphs, and that (I think) is why most people report histograms in counts-per-bucket format instead of like the above...

Note: This specific plot may be recognized by some on this list, as it is someone else's actual test, data and plot (go ahead and identify yourself if you want ;-). It was not prepared by me, except that I anonymized all the labels.
Attila, thanks for re-running the test. Apologies that I have not committed the code that prints out the detailed numbers you need to create the latency histogram; at the moment it only prints the percentiles. I will try to rectify this. Gil, I will try to run the 64-thread latency test twice, once with and once without the CO correction you suggested. I will post the results here. If possible I'll also do a comparison between the Oracle JVM and Zing.

The cleanest way to do a CO-correction vs. non-corrected comparison (like Mike B. did) is to collect and report two separate histograms (one with CO-correction and one without) in the same run, on the same data. I.e. measure only once but record twice. This will take any cross-test noise (of which there is usually plenty) out of the picture, and makes the CO (and CO correction) effect very clear and easy to do math on. Text output from histogram.getHistogramData().outputPercentileDistribution(System.out, 5 /* steps per half */, 1000.0 /* value unit scaling */); is enough to see the difference, and a plot like the above would make it very clear visually. I'd obviously love to see Zing numbers too, of course. I think those will be much more relevant after you settle on whether or not the CO correction is something that is needed.
> - percentiles are not guarantees for maximum values. Measure maximum values
> if that's what you need.
>
> - if you measure percentiles, measuring them this way (send a request / wait /
> send another request) can give you an overly optimistic estimation of the
> percentile, because it is a more "kind" distribution than the random one you
> will have in production. Or to put it otherwise: if you use such benchmarks
> for SLAs, your clients will quickly prove you wrong.
On the other hand, if you run the same test but measure throughput instead of latency (that is, how many sends you can do per unit of time, perhaps with wait = 0), then the result does not suffer from CO: if you have a GC pause between 2 requests, the total number of requests will be less and the throughput will be less (not measuring percentiles here).

Also, if you measure latency asynchronously, so that your test has one thread running the requests in a send + wait loop, but other threads measuring the response arrival time, then also in that case there is no CO: a GC pause in the sends will not affect the roundtrip time measured, and a GC pause in the responses will be included in the measurements, and the percentiles will be right.
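A minimal illustrative sketch of that asynchronous arrangement (not from any particular load tester; the names are hypothetical): the sending thread stamps each request, and a separate response-handling thread records the latency on arrival, so a stall in the send loop does not suppress the bad samples.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import org.HdrHistogram.Histogram;

final class AsyncLatencySketch {
    private final ConcurrentMap<Long, Long> sendTimes = new ConcurrentHashMap<>();
    // Recorded only from the single response-handling thread, so a plain Histogram is fine.
    private final Histogram histogram = new Histogram(3600L * 1000 * 1000 * 1000, 3);

    void onSend(long requestId) {              // called from the sending thread
        sendTimes.put(requestId, System.nanoTime());
    }

    void onResponse(long requestId) {          // called from the response-handling thread
        Long sentAt = sendTimes.remove(requestId);
        if (sentAt != null) {
            histogram.recordValue(System.nanoTime() - sentAt);
        }
    }
}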
Hi Gil,

I watched the presentation - really good. I have some thoughts though to check. In the video at minute 35 (just to ref. back), co-ordinated omission can be caused by a load generator like JMeter or Grinder (~37:45), both of which I've used in the past. The test requirement as I understand it is to test a request rate of 100 requests/sec.

Absolutely agree with the analysis in terms of (1) identifying there is an issue with the data and (2) fixing it where possible. I come at it from a slightly different direction though.

(1) The workload model: "1 user" frames it as a closed (as in feedback loop) workload model where, as you said, each subsequent request is blocked by the current one. The requirement is an arrival rate, which is an open workload model where each request is independent. From there it's not going to work, as the load tool selected will be wrong - e.g. most (web) load generators are vuser-per-thread implementations. This paper, which I stumbled across funnily enough from the JMeter plugins website, seems to sum up in similar terms the catastrophic impact of getting the workload model wrong, for latency (which you have covered) and for system utilisation/other effects.

(2) In terms of how to fix the situation described - ideally get an open workload model load generator (the paper identifies some; see the sketch below). Given the difficulty of implementing a truly open generator, and the fact that you have identified that some (spec) are not in fact open at all, I suspect that some thorough testing of the tool would be needed before trusting the results and knowing what bounds they are reliable within. A lot of systems/components service open workloads, so it seems important to get this right.
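For illustration, a minimal sketch of an open-model arrival loop (not from any of the tools mentioned; the names are hypothetical): requests are dispatched at scheduled arrival times regardless of whether earlier responses have returned, and each latency is measured from its intended start time, so one stalled response cannot suppress the samples behind it.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.HdrHistogram.SynchronizedHistogram;

// Open workload model sketch: fire requests on a schedule, never waiting for
// previous responses; measure each latency from its *intended* start time.
final class OpenModelLoadSketch {
    public static void main(String[] args) {
        final SynchronizedHistogram histogram =
                new SynchronizedHistogram(3600L * 1000 * 1000 * 1000, 3);
        ExecutorService workers = Executors.newFixedThreadPool(64);
        long intervalNanos = 10_000_000L;                  // e.g. 100 requests/sec
        long nextIntendedStart = System.nanoTime();

        for (int i = 0; i < 100_000; i++) {
            final long intendedStart = nextIntendedStart;
            nextIntendedStart += intervalNanos;
            while (System.nanoTime() < intendedStart) { }  // wait for the scheduled arrival time
            workers.submit(() -> {
                doRequest();                               // placeholder for the actual request
                histogram.recordValue(System.nanoTime() - intendedStart);
            });
        }
        workers.shutdown();
    }

    static void doRequest() { /* hypothetical request call */ }
}

Measuring from the intended start (rather than the actual send time) is what makes the queueing delay caused by a stall show up in the percentiles.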
Related - on your HdrHistogram - how do you think it compares with Ted Dunning's t-digest? I think they are both trying to achieve similar goals(?) but with different implementations; I guess there will be different performance characteristics when measuring at high rates. Accuracy-wise I would be interested in a comparison: https://github.com/tdunning/t-digest

Thanks,
Alex
On Thu, Apr 17, 2014 at 4:17 PM, Gil Tene <g...@azulsystems.com> wrote:
Not yet (for academic papers). Some folks at UIUC are potentially working on one.
You can find some discussion in my Latency and Response time related presentations (e.g. http://www.infoq.com/presentations/latency-pitfalls), in the posting here and elsewhere.
On Thursday, April 17, 2014 6:35:08 AM UTC-7, Alex Averbuch wrote:
Are there any published papers about Coordinated Omission?
Sorry for the multi-day pause preceding my response (traveling and bogged down in work). See some commentary inline.
On Monday, April 21, 2014 3:14:16 PM UTC-7, alex bagehot wrote:

> Hi Gil, I watched the presentation - really good. I have some thoughts though to check. In the video at minute 35 (just to ref. back), co-ordinated omission can be caused by a load generator like JMeter or Grinder (~37:45), both of which I've used in the past. The test requirement as I understand it is to test a request rate of 100 requests/sec.

Not quite. The key thing is that this is not about a test, or test conditions. It's about describing the behavior of a system, and about how the description resulting from measuring with coordinated omission can fall very far away from the actual behavior as it happens. I describe a system exhibiting a specific, known, well defined behavior, whose proper description can be easily agreed on. I then ask a simple question: how will your measurement tools describe the behavior of this system?

The specific known behavior description is simple: the system is able to perfectly handle 100 requests per second, answering each in 1msec, but it is artificially stalled for a period of 100 seconds every 200 seconds (i.e. 100 seconds of perfect operation, followed by 100 seconds of a complete stall, repeat). The easy-to-agree-upon description of the behavior of such a system by any external observer (open or closed would not matter) is this: the 50%'ile is 1 msec, the max and the 99.99%'ile will both be very close to ~100 seconds, and the 75%'ile would be ~50 seconds. In addition, it's easy to see that the mean should be around 25 seconds. Any measurement that describes this system in significantly more positive (or negative) terms would be flat out lying about what is going on.

I then demonstrate how coordinated omission, which amounts to preferentially throwing away bad results (as opposed to good results) that would/should have been measured or kept if a valid description of system behavior is to be made, results in dramatically swayed results. As in reporting both the 99.99%'ile and 75%'ile as 1msec (which is 100,000x and 50,000x off from reality), and reporting a mean of 10.9msec (which is ~2300x off from reality).
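As a sanity check on those numbers, a small illustrative sketch (not from the original post; it just uses the figures from the thought experiment above) that builds one 200-second cycle of the system's true behavior next to the coordinately-omitted view of it, and reports the resulting stats:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// One 200-second cycle: 100 seconds of perfect 1 msec responses at 100 req/s,
// followed by a 100-second complete stall. Values are in milliseconds.
public class StallExample {
    public static void main(String[] args) {
        List<Double> actual = new ArrayList<>();      // what an uncoordinated observer sees
        List<Double> coordinated = new ArrayList<>(); // what a stalled, synchronous tester records

        for (int i = 0; i < 10_000; i++) {            // the good 100 seconds
            actual.add(1.0);
            coordinated.add(1.0);
        }
        for (int i = 0; i < 10_000; i++) {            // requests arriving during the stall
            actual.add(100_000.0 - i * 10.0);         // waits spread linearly from ~100s down to ~0
        }
        coordinated.add(100_000.0);                   // the stalled tester records a single ~100s sample

        report("actual     ", actual);
        report("coordinated", coordinated);
    }

    static void report(String label, List<Double> values) {
        System.out.printf("%s: mean=%.1f ms, 75%%'ile=%.0f ms, 99.9%%'ile=%.0f ms%n",
                label, mean(values), percentile(values, 0.75), percentile(values, 0.999));
    }

    static double mean(List<Double> values) {
        double sum = 0;
        for (double v : values) sum += v;
        return sum / values.size();
    }

    static double percentile(List<Double> values, double p) {
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        return sorted.get((int) Math.ceil(p * sorted.size()) - 1);
    }
}

The uncoordinated view comes out around a ~25 second mean, ~50 second 75%'ile and ~100 second 99.9%'ile, while the coordinated view reports a mean of about 11 msec and both percentiles at 1 msec, matching the discrepancy described above.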
> Absolutely agree with the analysis in terms of (1) identifying there is an issue with the data and (2) fixing it where possible. I come at it from a slightly different direction though. (1) The workload model: "1 user" frames it as a closed (as in feedback loop) workload model where, as you said, each subsequent request is blocked by the current one. The requirement is an arrival rate, which is an open workload model where each request is independent.

There is no requirement stated, or needed, for a test system. There is a description of a known system behavior, and an expected sensible report on its performance. The test system is supposed to be designed to supply that description, and this discussion is a demonstration of how it does not. A "correct" test system can certainly be built to report on this system's behavior. Any test system that fails to collect the same number of measurements per second during the stall period as it does outside of it will always fail to describe the system behavior. But pretty much any test setup which will be able to keep 10,000 requests in flight would work for this example. Why 10,000? Because that is the number of requests (in this example) that must be in flight during the 100 second stall, if the system is tested with 100 requests per second (closed or open does not matter) outside of the stall.
In the specific example, I demonstrated the problem with a single client tester. But it would separately happen for each client of any multi-client tester that does not issue requests asynchronously, without waiting for responses first.
E.g. if I change the description to "able to handle 100 clients, with 100 requests per second from each client", the numbers would remain exactly the same. And if I changed it to "handles 10 clients totaling 100 requests per second", the numbers of a classic load generator would get 10x closer to reality. Which would still be meaningless when facing being 100,000x off. Basically, each client will be off by roughly the factor between its expected interval between issuing requests and the system's stall time. If the stall time is larger than the expected interval, the results will be off. By a lot.
And no, stating that there should be enough clients to sustain an "open" system test does not help avoid this mis-representation, unless the world you live in is one where each client sends exactly one request, and never follows up. Very few real world applications are like that. Certainly no web application looks like that.
For example, a tcpdump of the https session re-loading this discussion page involves many tens of serial round trips, spread across more than 10 separate https socket connections, and spanning more than 4 seconds. There is no valid "open" load tester way of describing what the experienced response time of a user of this google-groups page would be.
> From there it's not going to work as the load tool selected will be wrong - e.g. most (web) load generators are vuser-per-thread implementations. This paper, which I stumbled across funnily enough from the JMeter plugins website, seems to sum up in similar terms the catastrophic impact of getting the workload model wrong, for latency (which you have covered) and for system utilisation/other effects.

The paper does a good job of describing the difference between open and closed load testing systems, and demonstrates how the results for mean response times can vary widely between them. But it focuses on load testers and on mean measurements, neither of which are really of interest to someone looking at an SLA report, at a bunch of angry users, or at failed trades.
// Sanity check 1: compare the reported average latency against the smallest
// average that is consistent with the reported maximum latency.
// Logic: if the max latency was related to a process freeze of some sort (commonly
// true for Java apps, for example), and if taking measurements was not coordinated
// to avoid measuring during the freeze, then even if all results outside the pause
// were 0, the average during the pause would have been max/2, and the overall
// average would be at least (max/2) * (max/totalTestLength).
// Note: Usually, when this test triggers due to CO, it will not be by a small margin...
static boolean suspectCoordinatedOmission(double reportedAverageLatency,
                                          double maxReportedLatency,
                                          double totalTestLength) {
    return reportedAverageLatency <
            (maxReportedLatency * maxReportedLatency) / (2 * totalTestLength);
}

// Sanity check 2: test reportedLatencyAtPercentileLevel for a given percentileLevel
// (a fraction between 0.0 and 1.0), maxReportedLatency, and totalTestLength.
static boolean suspectCoordinatedOmissionAtPercentile(double reportedLatencyAtPercentileLevel,
                                                      double percentileLevel,
                                                      double maxReportedLatency,
                                                      double totalTestLength) {
    double fractionOfTimeInPause = maxReportedLatency / totalTestLength;
    double fractionOfTimeAbovePercentileLevel = 1.0 - percentileLevel;
    double smallestPossibleLatencyAtPercentileLevel =
            maxReportedLatency * (1.0 - (fractionOfTimeAbovePercentileLevel / fractionOfTimeInPause));
    return smallestPossibleLatencyAtPercentileLevel > reportedLatencyAtPercentileLevel;
}
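As a rough illustration, here are hypothetical numbers of my own (all latencies and the test length in seconds) plugged into the checks above:

// A 600-second test reporting a max latency of 3s and an average of 0.002s:
// the average is below 3*3/(2*600) = 0.0075s, so suspect CO.
boolean check1 = suspectCoordinatedOmission(0.002, 3.0, 600.0);                      // true

// The same test reporting a 99.9%'ile of 0.005s: the pause alone occupies 0.5% of the
// test, so the 99.9%'ile could not be smaller than 3 * (1 - 0.001/0.005) = 2.4s.
boolean check2 = suspectCoordinatedOmissionAtPercentile(0.005, 0.999, 3.0, 600.0);   // true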
Hello Gil
First of all, let me salute you for the amount of writing you have done on this topic and how patiently you explain the CO concept over and over again. It took me at least 6 hours just to read (and try to make sense of) it all in this thread. I can imagine how much time it took you to write it. Therefore I appreciate in advance the time you put into your response (if you decide to respond).
I agree with many things you said. But since no "new value" can be created by agreeing with you I will try to "poke holes" in your theory or define some boundaries for it.
Let me start with this unpretentious :-) statement "Coordinated omission (the way you described it) does not exist in production environments".
I let myself be blunt with a statement like this as I could not read you with 100% certainty. In some posts you say that CO is entirely a testing-technique problem; in other cases I think you say that CO (and, as a result, compensating for CO) does have its merits in the context of production monitoring.
I am approaching this discussion from a production performance monitoring angle (which might be an important distinction). In my opinion the production environment is the true source of the events that are happening in it, and therefore there can be nothing (or at least very little) coordinated in it. The number of use cases in a production environment is so large that even processes that may seem to fall under the "coordinated omission" scenario may very well be application features. For example, even if periodic/scheduled processes are paused and delayed, it can very well be a business requirement that they skip the next execution. Web requests from users are pretty much always uncoordinated.
As far as "omission" goes, I feel that as long as you account for every single monitored event that has happened in the production environment (as opposed to event sampling, for example) then there is no omission.
You correctly pointed out in other threads that it is fatal for the results of your monitoring to assume latencies are normally distributed. This is true for the rate of events too. You can never assume any particular distribution of rates, and they are rarely uniform in production environments. In addition, rate distributions can vary during the day, and they can vary due to business cycles (announcement of a new product or feature), etc., etc.
Trying to compensate for CO in the context of production environment monitoring makes little sense to me, because in order to compensate you have to predict the rate of events that might or might not have happened.
Which is impossible.
In my opinion, spending your time trying to correctly compensate for CO inside a production codebase is time that is better spent understanding what the appropriate reporting/alerting is, and how to correctly align monitoring with SLAs. You have to work with what has really happened in the system and build your monitoring and alerting tools around it.
The problem that does exist is that performance monitoring is either insufficient or is disconnected from SLAs.
Let's take your wonderful concierge business example.
If the business owner had had monitoring tools that were correctly aligned with his service level agreement, he would not have missed the problem during testing. The disconnect that happened in this example is that the business owner promised to answer the phone within 3 rings 99% of the time for all calls placed during the day. I would like to emphasize the "during the day" part of the last sentence, as I think it got neglected in your example. In other words, the business owner promised to answer within 3 rings 99% of the time for every 10 calls the customer may have made. In his test, however, the business owner calculated the 99th percentile over 420 calls, which is 2 months' worth of calls for his business. This is the disconnect between his performance reporting and his own SLA. What he should have done is calculate the 99th percentile for every 10 test calls that were made. Or he should have been more aggressive, built in some cushion, and calculated the 99th percentile for every 5-6 calls.
The other way to approach his reporting would be to calculate the 99th percentile for every 30-minute time frame (which at the most would have 6 calls in every 30-minute set of results) and plot it as a time series. This is essentially how reporting is normally done in software production environments.
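For illustration, a minimal sketch of that per-window reporting idiom (not from the original post; depending on the HdrHistogram version, getValueAtPercentile may live on histogram.getHistogramData() instead of on the histogram itself):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.HdrHistogram.SynchronizedHistogram;

// Record latencies into the current window's histogram and, every 30 minutes,
// report that window's 99th percentile and start a fresh window.
final class WindowedPercentileSketch {
    private volatile SynchronizedHistogram window =
            new SynchronizedHistogram(3600L * 1000 * 1000 * 1000, 3);
    private final ScheduledExecutorService reporter = Executors.newSingleThreadScheduledExecutor();

    void start() {
        reporter.scheduleAtFixedRate(() -> {
            SynchronizedHistogram finished = window;
            window = new SynchronizedHistogram(3600L * 1000 * 1000 * 1000, 3);
            // A recording racing with the swap may land in the previous window; fine for a sketch.
            System.out.println("99%'ile (last 30 min): " + finished.getValueAtPercentile(99.0) + " ns");
        }, 30, 30, TimeUnit.MINUTES);
    }

    void recordLatency(long latencyNanos) {
        window.recordValue(latencyNanos);
    }
}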
In addition (however this is off-topic), the business owner and all his personnel should have been paged every time the phone rings more than 6-8 times. He should have had a list of all missed calls and proactively called his customer back, he should have plotted a time series of the call throughput his business handles, etc., etc.
To summarize: attempting to solve the "Coordinated Omission" problem in the context of production monitoring is akin to misdirecting your effort. Since CO compensation cannot be applied reliably and universally, it will lead to misleading and confusing results.
I started talking about Coordinated Omission in test systems mostly, but I kept finding it everywhere, including in many production monitoring systems.
For anyone who doubts this, here is a simple exercise: go watch the %'ile outputs in Cassandra monitoring output. Manually ^Z the Cassandra servers for 6 seconds every minute, and see what your console reports about the cluster's 99%'ile... Now think about what the ops and capacity planning people responsible for service level management of these clusters are thinking when they look at that output.
[Oh, and for those who doubt that this ^Z is a valid test for Cassandra: Read your GC logs. ;-) ]
More comments inline below.
On Sunday, May 4, 2014 3:27:47 PM UTC-7, Nick Yantikov wrote:
Hello Gil
First of all, let me salute you for the amount of writing you have done on this topic and how patiently you explain the CO concept over and over again. It took me at least 6 hours just to read (and try to make sense of) it all in this thread. I can imagine how much time it took you to write it. Therefore I appreciate in advance the time you put into your response (if you decide to respond).
I agree with many things you said. But since no "new value" can be created by agreeing with you I will try to "poke holes" in your theory or define some boundaries for it.
Let me start with this unpretentious :-) statement "Coordinated omission (the way you described it) does not exist in production environments".
I let myself be blunt with a statement like this as I could not read you with 100% certainty. In some posts you say that CO is entirely a testing-technique problem; in other cases I think you say that CO (and, as a result, compensating for CO) does have its merits in the context of production monitoring.
I am approaching this discussion from a production performance monitoring angle (which might be an important distinction). In my opinion the production environment is the true source of the events that are happening in it, and therefore there can be nothing (or at least very little) coordinated in it. The number of use cases in a production environment is so large that even processes that may seem to fall under the "coordinated omission" scenario may very well be application features. For example, even if periodic/scheduled processes are paused and delayed, it can very well be a business requirement that they skip the next execution. Web requests from users are pretty much always uncoordinated.
As far as "omission" goes, I feel that as long as you account for every single monitored event that has happened in the production environment (as opposed to event sampling, for example) then there is no omission.
Actually, sampling would be fine, and omission would be fine, as long as no coordination happens.
But unfortunately, production monitoring of event latency is just as susceptible to coordination as test measurement is. Even if you collected all actual production client-observed response times, you would remain exposed. E.g. if your data indicated that during each period of 1,000 seconds, each production client reported 1 response taking 500 seconds and 9,999 responses taking 20 msec each, and you did not recognize that there is coordinated omission in the incoming data set, you would erroneously report that the 99.99%'ile observed by clients is 20msec (which would be wrong by a factor of 2,500x). Natural noise and variability in these numbers would not make the conclusion any less wrong.
Attached is a chart of an actual production system doing exactly this sort of reporting (at a different time scale, but with a similar ~2,500x discrepancy), plotted against the reporting of two different levels of CO-correction of the same data. All three sets of data were collected at the same time, on the same system, in production. The corrected data sets include a correction technique that is now part of the LatencyStats class, meant to drop straight into this sort of measurement idiom (see http://latencyutils.org).
You correctly pointed out in other threads that it is fatal for the results of your monitoring to assume latencies are normally distributed. This is true for the rate of events too. You can never assume any particular distribution of rates, and they are rarely uniform in production environments. In addition, rate distributions can vary during the day, and they can vary due to business cycles (announcement of a new product or feature), etc., etc.
Trying to compensate for CO in the context of production environment monitoring makes little sense to me, because in order to compensate you have to predict the rate of events that might or might not have happened.
Which is impossible.
For in-process latency measurement in production, that's what I thought, too. But when I tell myself something is "impossible", I often find it hard to resist the urge to prove myself wrong. So I did; that's what the LatencyUtils library is about.

It turns out that while correcting for all causes of CO is "very hard", correcting for process freezes during in-process latency measurement is not that hard. And in much of the world I see, CO reporting errors are dominated by such process freezes. The technique I use under the hood in a LatencyStats object leverages a process-wide pause detector with a per-recorded-stats-object interval estimator. The pause detector is pluggable; the provided default detector is based on consensus pause observation across a number of detection threads. It uses settable thread counts and thresholds. It is very reliable at detecting pauses of 1msec or more using three sleeping threads (which is nearly free to do), and seems to detect process-wide pauses as small as 50usec reliably if you are willing to burn 3-4 cpus spinning non-stop to do so (which most people will likely not be willing to do). The interval estimator is also pluggable, but the one provided is basically a time-capped moving average estimator, which does a pretty good job.

The answer to the "in order to compensate you have to predict the rate of events that might or might not have happened" point you make above (which is a good way of stating the problem) is that when a known process freeze occurs, projecting from the reality that preceded the freeze into the freeze period is a very good guess. Certainly much better than ignoring the fact that the freeze erased a whole bunch of known-to-be-terrible results, destroying the validity of the data set. Since we just need counts at various latencies, all we need to know is how many missing measurements there were, and what their magnitude would have been, and this is trivial to reconstruct given the freeze size and the expected interval estimation. Using a constant rate projection into the freeze works fine, with no need to model rate variability *within* the freeze, since it would not make much of a difference to the impact of correction on the percentiles picture.

With all this, LatencyUtils only provides partial correction, but that partial correction is often of 99%+ of the error. The bigger and/or more frequent the actual process freezes are in your actual system, the more important the correction becomes, especially when freezes are a significant multiple of the typical measured latency (in which case the correction may often become the only valid signal above ~99%).
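For reference, a minimal sketch of recording through a LatencyStats object as described above (constructor and method names as I recall the LatencyUtils API; see http://latencyutils.org for the authoritative version):

import org.HdrHistogram.Histogram;
import org.LatencyUtils.LatencyStats;

// Record in-process latencies through LatencyStats, which pairs a process-wide
// pause detector with a per-object interval estimator and back-fills the
// measurements that a detected process freeze would otherwise have swallowed.
final class LatencyStatsSketch {
    private final LatencyStats stats = new LatencyStats(); // default pause detector and interval estimator

    void timeOperation(Runnable operation) {
        long start = System.nanoTime();
        operation.run();
        stats.recordLatency(System.nanoTime() - start);
    }

    Histogram snapshot() {
        return stats.getIntervalHistogram();               // includes the pause-corrected measurements
    }
}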
In my opinion spending your time to try to correctly compensate for CO inside production codebase is just a time that is better be spend to understand what the appropriate reporting/alerting is, how to correctly align monitoring with SLA. You have to work with what has really happened in the system and build your monitoring and alerting tools around it.
The problem that does exist is that performance monitoring is either insufficient or is disconnected from SLAs.
Let's take your wonderful concierge business example.
If business owner did have monitoring tools that are correctly aligned with his service level agreement he would have not missed a problem during testing. The disconnect that happened in this example is that business owner promised to answer phone within 3 rings 99% of the time for all calls placed during the day. I would like to emphasize the "during the day" part of last sentence as I think it got neglected in your example. In other words business owner promised to answer within 3 rings 99% of the time for every 10 calls customer may have made. In his test however business owner calculated 99th percentile for 420 calls which is 2 months worth of calls for his business. This is the disconnect between his performance reporting and his own SLA. What he should have done is to calculate 99th percentile for every 10 test calls that were made. Or he should have been more aggressive and have built in some cushion and calculated 99th percentile for every 5-6 calls.
In the specific scenario, I don't see a need for groups-of-10-calls testing. While it will probably work, a very simple test system, as well as a very simple production-time-sanity-checking system can easily measure the right thing (the right thing is "what would your customer's perception of the percentile be?"): just poll the business at random intervals some number of times per day (which should also vary somewhat), and report of the stats of what you see. Just make sure not to coordinate between polling attempts, and your percentile numbers will match what your customer would consider to be a valid representation, and therefore a useful way to gauge the quality of your system.
A scenario like the one I describe (the concierge service) is actually one where you would want you rproduction-time validation to be happening much more frequently than the customer's actual use, so that you will be likely to detect (and potentially correct) issues before your customer does. This sort of "failure is very expensive" driven continuous production validation is actually common in the real world as well. E.g. of algo trading system that only trade a few times a day, it is common practice to do sanity test trades (small or irrelevant) at a significanlty higher rate than the actual trades that matter happen, just to maintain an evaluation of what expected trade behavior may be like (some systems can even loop this knowledge back into the actual trading decisions).
The other way to approach his reporting would be to calculate 99th percentile for every 30 minute time frame (which at the most would have 6 calls in every 30 minute set of results) and plot it as time series. This is essentially how reporting is normally done in software production environments.
I agree that percentiles should usually be reported in time intervals, as that's how they are most often stated. When requirements are set, they are usually stated about a worst case time interval. E.g. "The 99.9%'lie should be 200usec or better in any given 15 minute period during the day" as opposed to "in at least one 5 minute interval during the day" or "across the entire day, even if it is much worse during some 15 minute period."
In addition (however this is off-topic) business owner and all his personnel should have been paged every time phone rings for more than 6-8 times. He should have had a list of all missed calls and proactively should have called his customer back, he should have plotted time series of the calls throughput his business handles, etc. etc.
To summarize. Attempt to solve "Coordinated Omission" problem in context of production monitoring is akin to misdirecting your effort. Since CO compensating cannot be applied reliably and universally it will lead to misleading and confusing results.
Look at LatencyUtils.org, think of how using LatencyStats would apply to something like Cassandra reporting it's production-time latency stats, and let me know if that changes your mind...
Hi Attila,

I'll attempt to explain. Gil has a presentation on it, and I'm sure there are a few analogies in these email threads. CO is something to do with your testing methodology, and I'll give an "obvious" example that shows why your tests might not show the real-world problems if you're suffering CO.

The challenge is that, in the real world, your consumers are not coordinated with your system. The consumers are independent. They will send a request whether you have responded to someone else's request or not. So ..
Suppose you are using N consumer threads in a tester, and each is in a tight loop for 1 million "events". E.g.:

    for (int i = 0; i < 1000000; i++) {
        long start = System.nanoTime();
        log.info("xxxxxxx");
        long duration = System.nanoTime() - start;
        addToHisto(duration);
    }

Here's where CO comes in: suppose that during the middle of the run, something causes the log.info() call to take longer. It could be a hashmap rebuild, a file system buffer flush, a GC run; it doesn't really matter what it is, just that for some reason processing is stalled. In your test harness, the "consumer" threads wait for the system to recover before sending more requests (they are coordinated). So only N requests would show the full pause time (e.g. 1 second); all other requests would show un-paused time (e.g. 1 ms).

However, in a _real world_ situation, it's unlikely your consumers are going to recognise that your system is under load and stop sending requests for that full second. They're going to keep firing them, probably a lot of them, during that 1 second. So in the real world, all of these consumers will ALSO see that pause. The issue of CO is how many requests are _truly_ impacted by that 1 second pause versus the N that you've shown in your test report (due to omitted results).

Hope this helps (?).
Thanks
Jason

On Thu, Aug 8, 2013 at 5:43 AM, Attila-Mihaly Balazs <dify...@gmail.com> wrote:
Hello all,

I had a chance to rerun the test with Gil's modifications and also some changes thought up by me: https://gist.github.com/cdman/6177268

The gist of it (sorry for the pun) is that I don't seem to get differences with this test.

And sorry to harp on this, but I still don't get why CO is something to worry about, so I would appreciate a "for dummies" explanation:

- I get that no percentile is a guarantee for the maximum value (as demonstrated by Gil's customer service metaphor) and you should always agree on both with the business sponsors when creating a system.

- On the other hand, I find it to be just a flaw in a theoretical model where we insist that we want to "poke" the system every X ms (or maybe us) and failing to do so deviates from our test plan. However, what would such a test model? Are we trying to write a hard-realtime system? :-) (or maybe a trading system as Peter has described, where we have X ms to process each network packet). In this case we should be worried about maximums. Other than that, systems are very nondeterministic and we're not showing anything by trying to create a very rigid benchmark runner. The only use case I can think of is if for some reason (marketing?) we are trying to show that the system can handle "a sustained load of X events / second" - btw, if we make such a claim, what is the difference between shoving down X events at once and waiting until the next second vs trickling them? The math works out the same.

- Finally, an observation: I'm no math person (just ask any of my university math teachers :-)), but AFAIK StdDev / Mean are only useful if we suppose that the underlying distribution is a normal one. However, latency distributions (again, AFAIK) usually are not normal, and as such mean / stddev are not useful. As such I am surprised that HdrHistogram prints them out.

Attila
To my understanding the CO issue fires iff (if and only if) there's a global effect that delays (almost) all tasks. By 'global effect' I mean kind of a 'stop the world event' which affects not just one (or a few) thread but all the threads in the system at some point in time. Otherwise, if only one (or a few) thread is affected then it is not an issue, the unaltered stat will work.

I don't think this is correct. CO can be the result of other aspects of the system; e.g. a single thread (perhaps the network one) could cause TCP back pressure on the load generator, delaying the sending of requests. Thus creating coordinated omission.
Hi Jason,
Although it's a bit of an old thread, let me add a few thoughts/questions. From the previous explanation(s)...
To my understanding the CO issue fires iff (if and only if) there's a global effect that delays (almost) all tasks. By 'global effect' I mean kind of a 'stop the world event' which affects not just one (or a few) thread but all the threads in the system at some point in time. Otherwise, if only one (or a few) thread is affected then it is not an issue, the unaltered stat will work.
Hence, besides the testing method, this might be related to the (probably unknown) reasons causing the unexpected latencies (high above the average). Iff the effect is global then CO fires and breaks the unaltered stats. If the effect is local (delays only one or some threads) then probably CO doesn't, and the unaltered stat might work.
Although the event 'as is' might not be observable, its effect (whether it is global or local) might be detected, at least in a statistical way. We just have to check whether the high latency points of different threads are correlated to each other or not. If they tend to happen at the same time then one would suspect global effects; if not, then probably they represent local events.
After all, I guess that before altering the original stats, it would be reasonable to check whether the delays were correlated or not. If this makes sense :-) then I would like to ask whether there's any (practical or lab) experience regarding this? That is: Is it typical that events causing delays are global ones? Are there any local events at all?
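(As a rough illustration of the check described above, here is a minimal Java sketch, entirely my own and with hypothetical inputs: bucket each thread's high-latency timestamps into fixed windows and compare the joint occurrence rate to what independence would predict, i.e. test whether P(AB) > P(A)*P(B).)

    import java.util.HashSet;
    import java.util.Set;

    // Sketch (illustrative only) of checking whether two threads' high-latency events
    // tend to co-occur in time. Inputs are the millisecond timestamps at which each
    // thread observed an outlier, plus the window size and the total run length.
    public class OutlierCorrelationCheck {

        // Returns P(AB) / (P(A) * P(B)): ~1.0 suggests independent ("local") causes,
        // substantially > 1.0 suggests a shared ("global") cause such as a process-wide pause.
        // Assumes both threads observed at least one outlier.
        static double correlationRatio(long[] outlierTimesA, long[] outlierTimesB,
                                       long windowMillis, long totalMillis) {
            long windowCount = totalMillis / windowMillis;
            Set<Long> a = toWindows(outlierTimesA, windowMillis);
            Set<Long> b = toWindows(outlierTimesB, windowMillis);
            long joint = 0;
            for (long w : a) {
                if (b.contains(w)) joint++;
            }
            double pA = (double) a.size() / windowCount;
            double pB = (double) b.size() / windowCount;
            double pAB = (double) joint / windowCount;
            return pAB / (pA * pB);
        }

        private static Set<Long> toWindows(long[] timestamps, long windowMillis) {
            Set<Long> windows = new HashSet<>();
            for (long t : timestamps) {
                windows.add(t / windowMillis);
            }
            return windows;
        }
    }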
I've had interesting discussion with people who say that the reported percentiles are "honest" because those are the time stats for operations that actually happened
I started talking about Coordinated Omission in test systems mostly, but I kept finding it everywhere, including in many production monitoring systems. Coordinated Omission is nothing more than a measurement methodology problem. It exists in both test and production environments, and occurs whenever data is omitted ahead of analysis in a way that is coordinated with the [usually bad] behavior of the system. Coordinated omission basically guarantees a huge garbage-in-garbage-out relationship between the data feeding into [downstream] analysis and most operational conclusions that people make based on that data.

It is true that CO is most easily demonstrated in test environments, because test environments commonly have a "plan" they are supposed to execute, and it is easy to demonstrate that the expected plan was not actually executed, and that the results collected are therefore bogus. Avoiding CO in test environments can sometimes be done if a valid asynchronous tester can be built. This is NOT possible for many applications, e.g. ones that include synchronous request-response semantics in their protocols, like all those that use HTTP, and most of those that use TCP. Correcting for CO is a separate thing: it can be done to varying degrees depending on what is known and what can be assumed about the system. Test environments with known expected scenario plans usually include enough information for pretty good correction to be possible.

But CO happens in production. All the time. And production monitoring tools that attempt to report on latency behavior often encounter it. It is certainly harder to correct for CO in production, since there is no "plan" that you can use for correction like you can in a test system. But it is easy to show that CO happens (or at least demonstrate an outrageously high likelihood that it happened). E.g. whenever a system freeze seems to magically coincide with a suspicious gap in recorded information, you should suspect CO. The existence of CO in production is sometimes easy to demonstrate even in summary latency reports, whenever you notice that reported stats (e.g. Max, 99.9%, and average) are "obviously impossible". It usually takes CO plus significant actual production-time pauses to make the mess big enough that it is obviously wrong in the summaries, and most of the time production %'ile reporting will be 2-4 orders of magnitude off from reality before the fallacy is easily demonstrated in summary reports. But I see this so often that I've started doing simple math tests on the common metrics people give me, and I often get a hit with one of them. E.g.:

When your summary report includes a Max latency, an Average latency, and a total test time, it is easy to apply the following pseudo-logic:
if (reportedAverageLatency < ((maxReportedLatency * maxReportedLatency) / (2 * totalTestLength))) {
    // Suspect Coordinated Omission.
    // Note: Usually, when this test triggers due to CO, it will not be by a small margin...
    // Logic:
    // If the Max latency was related to a process freeze of some sort (commonly true
    // for Java apps, for example), and if taking measurements was not coordinated
    // to avoid measuring during the freeze, then even if all results outside the pause
    // were 0, the average during the pause would have been Max/2, and the overall
    // average would be at least as high as stated above.
}

This sanity test is valid for any system (test or production) as long as the reported Max times are related to system freezes. In Java-based systems, this is easy to glean by looking at the max GC pause time in the GC logs. If it correlates in magnitude with the Max reported latency, the test above will reliably tell you whether your measurements include CO.

In addition, if the report includes a Max latency, a total test time, and some percentile level, you can apply the following pseudo-logic:
// Sanity test reportedLatencyAtPercentileLevel for a given
// percentileLevel, maxReportedLatency, and totalTestLength.
double fractionOfTimeInPause = maxReportedLatency / totalTestLength;
double fractionOfTimeAbovePercentileLevel = 1.0 - percentileLevel; // Assumes percentileLevel is a fraction, between 0.0 and 1.0
double smallestPossibleLatencyAtPercentileLevel = maxReportedLatency * (1.0 - (fractionOfTimeAbovePercentileLevel / fractionOfTimeInPause));
if (smallestPossibleLatencyAtPercentileLevel > reportedLatencyAtPercentileLevel) {
    // Suspect Coordinated Omission.
}
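(To make those two checks concrete, here is a small self-contained Java sketch; the class name, method names, and example numbers are mine, chosen purely for illustration, and all inputs just need to share one time unit:)

    // CoSanityChecks.java - a minimal sketch of the two sanity tests described above.
    // All inputs are assumed to be in the same time unit (milliseconds in the example).
    public class CoSanityChecks {

        // Test 1: if the Max latency came from a process freeze, the average over the
        // whole run cannot be lower than max^2 / (2 * totalTime), even if every
        // measurement outside the freeze were zero.
        static boolean suspectCoFromAverage(double reportedAverage, double maxLatency, double totalTime) {
            return reportedAverage < (maxLatency * maxLatency) / (2.0 * totalTime);
        }

        // Test 2: a lower bound on what the given percentile could possibly be,
        // assuming the Max latency came from a single freeze of that length.
        static boolean suspectCoFromPercentile(double reportedAtPercentile, double percentile,
                                               double maxLatency, double totalTime) {
            double fractionOfTimeInPause = maxLatency / totalTime;
            double fractionAbovePercentile = 1.0 - percentile; // percentile as a fraction, e.g. 0.999
            double smallestPossible = maxLatency * (1.0 - (fractionAbovePercentile / fractionOfTimeInPause));
            return smallestPossible > reportedAtPercentile;
        }

        public static void main(String[] args) {
            // Hypothetical report: 10-minute run, 6-second max, 2 ms average, 20 ms 99.9%'ile.
            double totalTimeMs = 600_000, maxMs = 6_000, avgMs = 2, p999Ms = 20;
            // Triggers: the average could not be below 30 ms if the 6-second max was a freeze.
            System.out.println("Average test suspects CO: " + suspectCoFromAverage(avgMs, maxMs, totalTimeMs));
            // Triggers: the 99.9%'ile could not be below 5,400 ms under the same assumption.
            System.out.println("99.9%'ile test suspects CO: " + suspectCoFromPercentile(p999Ms, 0.999, maxMs, totalTimeMs));
        }
    }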
A common cause for production-time CO has to do with the fact that production-time latencies are often reported not based on client perception, but based on internally collected information in what we would (outside of production) call the "system under test". E.g. many servers will collect internal operation latency stats and float them up to consoles. E.g. Cassandra reports the internally observed 99%'lies of read and write operations, and those numbers are floated up and shown across clusters in monitoring dashboards. The times are collected by measuring time before and after each operation and recording it into a stats bucket, and that measurement technique has classic CO: it will basically ignore or dramatically diminish the reporting of any "freeze" in the process.

I've had interesting discussions with people who say that the reported percentiles are "honest" because those are the time stats for operations that actually happened. But where such arguments miss the mark is when you look at what people reading the dashboards *think* the %'lies are telling them. I've tested this by asking ops people (without raising suspicion of a "trap") what they think dashboard reports mean, and how they can use them. In my samples, all ops people, developers, and testers I've asked assume that when a console says "99%'lie", it is effectively saying something along the lines of "99% of reads took 5msec or less", and that they can use that as a way to either describe the behavior the system has exhibited, or to project how it may behave. E.g. they may use this information in capacity planning, or in pre-deployment verification of SLA readiness, or look at current production numbers to judge operational readiness ahead of expected load increases [think "planning for holiday season shopping"]. In all these cases, the numbers they are looking at often drive very bad decisions. Unfortunately, in systems that display this sort of %'lie data (like Cassandra's read and write latency reports for the 99%'lie), the actual meaning of the number displayed for 99% is: "1% of reads took at least this much time, but it could be 1,000x as much as what is shown here. We only measured a lower bound on how bad things were in the remaining %'lie (1-X), not an upper bound on how bad things were in the % we show."
For anyone who doubts this, here is a simple exercise: go watch the %'lie outputs in Cassandra monitoring output. Manually ^Z the Cassandra servers for 6 seconds every minute, and see what your console reports about the cluster's 99%'ile... Now think about what the ops and capacity planning people responsible for service level management of these clusters are thinking when they look at that output.
[Oh, and for those who doubt that this ^Z is a valid test for Cassandra: Read your GC logs. ;-) ]
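(For concreteness, here is a minimal sketch, entirely my own and not any particular server's code, of the "measure before and after, record what completed" idiom described above; it is exactly this pattern that silently drops everything that should have been measured while the process was frozen:)

    import org.HdrHistogram.Histogram;

    // The common internal measurement idiom: every operation that completes gets recorded;
    // operations that *would* have run during a process freeze simply never appear.
    public class NaiveInternalLatencyRecorder {
        // Track up to 1 hour in nanoseconds, with 3 significant digits of precision.
        private final Histogram histogram = new Histogram(3_600_000_000_000L, 3);

        public void timedOperation(Runnable operation) {
            long start = System.nanoTime();
            operation.run();    // a process freeze here is seen only by the sample(s) in flight;
                                // everything that should have run during it is never recorded
            long duration = System.nanoTime() - start;
            histogram.recordValue(duration);
        }

        public double reported99thPercentileMs() {
            return histogram.getValueAtPercentile(99.0) / 1_000_000.0;
        }
    }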
More comments inline below.
On Sunday, May 4, 2014 3:27:47 PM UTC-7, Nick Yantikov wrote:
Hello Gil
First of all, let me salute you for the amount of writing you have done on this topic and how patiently you explain the CO concept over and over again. It took me at least 6 hours just to read (and try to make sense of) it all in this thread. I can imagine how much time it took you to write it. Therefore I appreciate in advance the time you put into your response (if you decide to respond).
I agree with many things you said. But since no "new value" can be created by agreeing with you I will try to "poke holes" in your theory or define some boundaries for it.
Let me start with this unpretentious :-) statement "Coordinated omission (the way you described it) does not exist in production environments".
I let myself be blunt with a statement like this as I could not read you with 100% certainty. In some posts you say that CO is entirely a testing technique problem; in other cases I think you say that CO (and as a result compensating for CO) does have its merits in the context of production monitoring.
I am approaching this discussion from a production performance monitoring angle (which might be an important distinction). In my opinion, the production environment is the true source of the events that are happening in it, and therefore there could be nothing (or at least very little) coordinated in it. The number of use cases in a production environment is so large that even processes that may seem to fall under the "coordinated omission" scenario may very well be application features. For example, even if periodic/scheduled processes are paused and delayed, it can very well be a business requirement that they skip the next execution. Web requests from users are pretty much always uncoordinated.
As far as "omission" goes, I feel that as long as you do account for every single monitored event that has happened in the production environment (as opposed to event sampling, for example) then there is no omission.
Actually, sampling would be fine, and omission would be fine, as long as no coordination happens.
But unfortunately, production monitoring of event latency is just as susceptible to coordination as test measurement is.

Even if you collected all actual production client-observed response times, you would remain exposed. E.g. if your data indicated that during each period of 1,000 seconds, each production client reported 1 response taking 500 seconds and 9,999 responses taking 20 msec each, and you did not recognize that there is coordinated omission in the incoming data set, you would erroneously report that the 99.99%'lie observed by clients is 20msec (which would be wrong by a factor of 2,500x). Natural noise and variability in these numbers would not make the conclusion any less wrong.
Attached is a chart of an actual production system doing exactly this sort of reporting (at a different time scale, but with a similar ~2,500x discrepancy), plotted against the reporting of two different level of CO-correction of the same data. All three sets of data were collected at the same time, on the same system, in production. The corrected data sets include a correction technique that is now part of LatencyStats class, meant to drop straight into this sort of measurement idiom (see http://latencyutils.org)
You correctly pointed out in other threads that it is fatal for the results of your monitoring to assume latencies are normally distributed. This is true for the rate of events too. You can never assume any particular distribution of rates, and they are rarely uniform in production environments. In addition, rate distributions can vary during the day, they can vary due to business cycles (announcement of a new product or feature), etc., etc.
Trying to compensate for CO in the context of production environment monitoring makes little sense to me, because in order to compensate you have to predict the rate of events that might or might not have happened.
Which is impossible.

For in-process latency measurement in production, that's what I thought, too. But when I tell myself something is "impossible", I often find it hard to resist the urge to prove myself wrong. So I did, and that's what the LatencyUtils library is about.

It turns out that while correcting for all causes of CO is "very hard", correcting for process freezes during in-process latency measurement is not that hard. And in much of the world I see, CO reporting errors are dominated by such process freezes. The technique I use under the hood in a LatencyStats object leverages a process-wide pause detector with a per-recorded-stats-object interval estimator. The pause detector is pluggable; the provided default detector is based on consensus pause observation across a number of detection threads. It uses settable thread counts and thresholds. It is very reliable at detecting pauses of 1msec or more using three sleeping threads (which is nearly free to do), and seems to detect process-wide pauses as small as 50usec reliably if you are willing to burn 3-4 cpus spinning non-stop to do so (which most people will likely not be willing to do). The interval estimator is also pluggable, but the one provided is basically a time-capped moving average estimator, which does a pretty good job.

The answer to the "in order to compensate you have to predict rate of events that might or might not have happened" point you make above (which is a good way of stating the problem) is that when a known process freeze occurs, projecting from the reality that preceded the freeze into the freeze period is a very good guess. Certainly much better than ignoring the fact that the freeze erased a whole bunch of known-to-be-terrible results, destroying the validity of the data set. Since we just need counts at various latencies, all we need to know is how many missing measurements there were, and what their magnitude would have been, and this is trivial to reconstruct given the freeze size and the expected interval estimation. Using a constant rate projection into the freeze works fine, with no need to model rate variability *within* the freeze, since it would not make much of a difference to the impact of correction on the percentiles picture.

With all this, LatencyUtils only provides partial correction, but that partial correction is often of 99%+ of the error. The bigger and/or more frequent the actual process freezes are in your actual system, the more important the correction becomes, especially when freezes are a significant multiple of the typical measured latency (in which case the correction may often become the only valid signal above ~99%).
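(For illustration, here is roughly what dropping this into the in-process measurement idiom might look like; this is my own sketch based on latencyutils.org, using what I understand to be the default pause detector and interval estimator, so treat the exact class and method names as assumptions to verify against the library:)

    import org.HdrHistogram.Histogram;
    import org.LatencyUtils.LatencyStats;
    import org.LatencyUtils.SimplePauseDetector;

    // Sketch of in-process latency recording with LatencyStats. The pause detector is
    // process-wide and shared; each LatencyStats instance keeps its own interval estimator.
    public class PauseCorrectedLatencyRecorder {
        static {
            // One process-wide pause detector, shared by all LatencyStats instances
            // (assumed default configuration; thresholds and thread counts are settable).
            LatencyStats.setDefaultPauseDetector(new SimplePauseDetector());
        }

        private final LatencyStats stats = new LatencyStats();

        public void timedOperation(Runnable operation) {
            long start = System.nanoTime();
            operation.run();
            stats.recordLatency(System.nanoTime() - start);
        }

        public Histogram intervalHistogram() {
            // Includes the synthesized samples projected into any detected process freezes.
            return stats.getIntervalHistogram();
        }
    }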
In my opinion, spending your time trying to correctly compensate for CO inside a production codebase is time that would be better spent understanding what the appropriate reporting/alerting is, and how to correctly align monitoring with SLAs. You have to work with what has really happened in the system and build your monitoring and alerting tools around it.
The problem that does exist is that performance monitoring is either insufficient or is disconnected from SLAs.
Let's take your wonderful concierge business example.
If the business owner had had monitoring tools that were correctly aligned with his service level agreement, he would not have missed the problem during testing. The disconnect that happened in this example is that the business owner promised to answer the phone within 3 rings 99% of the time for all calls placed during the day. I would like to emphasize the "during the day" part of the last sentence, as I think it got neglected in your example. In other words, the business owner promised to answer within 3 rings 99% of the time for every 10 calls a customer may have made. In his test, however, the business owner calculated the 99th percentile for 420 calls, which is 2 months' worth of calls for his business. This is the disconnect between his performance reporting and his own SLA. What he should have done is calculate the 99th percentile for every 10 test calls that were made. Or he should have been more aggressive, built in some cushion, and calculated the 99th percentile for every 5-6 calls.
In the specific scenario, I don't see a need for groups-of-10-calls testing. While it will probably work, a very simple test system, as well as a very simple production-time sanity-checking system, can easily measure the right thing (the right thing is "what would your customer's perception of the percentile be?"): just poll the business at random intervals some number of times per day (which should also vary somewhat), and report on the stats of what you see. Just make sure not to coordinate between polling attempts, and your percentile numbers will match what your customer would consider to be a valid representation, and therefore a useful way to gauge the quality of your system.
A scenario like the one I describe (the concierge service) is actually one where you would want your production-time validation to be happening much more frequently than the customer's actual use, so that you will be likely to detect (and potentially correct) issues before your customer does. This sort of "failure is very expensive" driven continuous production validation is actually common in the real world as well. E.g. for algo trading systems that only trade a few times a day, it is common practice to do sanity test trades (small or irrelevant) at a significantly higher rate than the actual trades that matter, just to maintain an evaluation of what expected trade behavior may be like (some systems can even loop this knowledge back into the actual trading decisions).
The other way to approach his reporting would be to calculate 99th percentile for every 30 minute time frame (which at the most would have 6 calls in every 30 minute set of results) and plot it as time series. This is essentially how reporting is normally done in software production environments.
I agree that percentiles should usually be reported in time intervals, as that's how they are most often stated. When requirements are set, they are usually stated about a worst case time interval. E.g. "The 99.9%'lie should be 200usec or better in any given 15 minute period during the day" as opposed to "in at least one 5 minute interval during the day" or "across the entire day, even if it is much worse during some 15 minute period."
In addition (however this is off-topic), the business owner and all his personnel should have been paged every time the phone rang more than 6-8 times. He should have had a list of all missed calls and proactively called his customers back, he should have plotted a time series of the call throughput his business handles, etc. etc.
To summarize: attempting to solve the "Coordinated Omission" problem in the context of production monitoring is misdirected effort. Since CO compensation cannot be applied reliably and universally, it will lead to misleading and confusing results.
Look at LatencyUtils.org, think of how using LatencyStats would apply to something like Cassandra reporting its production-time latency stats, and let me know if that changes your mind...
Thanks for your response!
I think I got the meaning... at least some piece of it...
To my understanding, you mean that one must test (in a statistical sense) whether (nearby) low latency events (as random variables) are correlated with each other or not. (For instance, to put it simply, for two threads the 'correlation' would mean that P(AB) > P(A)*P(B), where P(A), P(B) is the probability of a low latency event for the A and B thread, and P(AB) is the probability of two 'nearby' low latency events.
If I understood it well, then my intuition says that the CO-related correction should be somehow dependent on the factor of correlation... That is, CO might not be a binary, 0 or 1 thing.)
Meanwhile, as I understand your thoughts... the correlation does not necessarily mean 'common cause'; that is, the higher likelihood of nearby high latency events could be caused by non-global events/effects.
I've had interesting discussion with people who say that the reported percentiles are "honest" because those are the time stats for operations that actually happened

As one of the people who've had a spirited hour+ long debate with Gil on this topic, I would like to defend the position briefly: coordinated omission detection is far from perfect (it will by no means detect all sources), and requires various assumptions such as a steady rate of message arrival, a symmetrical distribution of message size/workload, pauses being system-wide...
These things are not always true, but even if they were, the same assumptions could be applied as a post process of the raw data from a stream of imperfect measurements (it would be extraordinarily rare to have lengthy pauses be omitted completely from such a steady state system, so the event will ordinarily be present in a non-lossy histogram of latencies),
by detecting temporally adjacent outliers and padding their number. In fact LatencyUtils essentially does this, only with explicitly detected and assumed-to-be-global pause overlays instead of applying those assumptions to the raw data (only the manipulation of the data).
The nice thing about this is that we (Cassandra) don't have to explain what our concept of truth is: it may not be perfect, but it is well defined, and can be modified for interpretation however any operator wants.
In reality, simply looking at max latency over a given time horizon is usually more than sufficient to help operators spot problems.
...
I should make it clear at this point that I speak only for myself, and not for the project as a whole.
bq. What you call "extraordinarily rare" is measured as ~20-60% likelihood in my experience

1. I am unconvinced of this number for cross-cluster (QUORUM) reads measured at the coordinator on common systems (having only 10s of reads is a very light cluster load though, so perhaps...), since one of the coordinators sending work to the node in GC would not be in GC, so the number of missed occurrences cluster-wide should be few. Further, a coordinator request spends a majority of its time in flight to, from, or at the owning nodes, so if either it or one of the owning nodes enters a GC cycle this is most likely to occur whilst requests are in flight and being measured. So even with a small cluster this holds; with a large cluster and a hash partitioner, at least one coordinator not in GC will have a request out to the paused node in a majority of cases, and will register it. For local-only reads (smart-routed CL.ONE) I accept this may be a possibility, but you made this claim when we last spoke, when such features weren't around, so I suspect you are looking at the internal latency rather than coordinator latency, or are extrapolating from experience on a single node.
2. Even if this were true, this would still result in enough data points to paint a near-to-true picture.
bq. Then that's what you should be showing

Cassandra doesn't show anything at all. The nodetool offers a complete histogram of the latency distribution, measured both internally and at the coordinator.
The tool we recommend for testing how your cluster responds to representative workloads is called cassandra-stress, and this displays the .5, .95, .99, .999 and max latency as measured at the client. I stand by this being the best way to establish latencies.
For a live system, if you have SLAs, you should be measuring it in your application, and not relying on your data store. If you really want to, though, we provide you with enough information to estimate coordinated omission.
That all said, I'm not unsympathetic to introducing some of these capacities to Cassandra, so operators that want this view can see it. Last we spoke you were planning to investigate plugging into yammer/codahale, so that it could easily be offered as a config option. As yet, no movement :)
...
The coordinator is just as susceptible to pauses (and related omissions) as anything else is...
I am staring at a cassandra-stress output right now that shows the following:
--
You received this message because you are subscribed to a topic in the Google Groups "mechanical-sympathy" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/mechanical-sympathy/icNZJejUHfE/unsubscribe.
To unsubscribe from this group and all its topics, send an email to mechanical-symp...@googlegroups.com.
Thanks for your response!
I think I got the meaning... at least some piece of it...
To my understanding, you mean that one must test (in a statistical sense) whether (nearby) low latency events (as random variables) are correlated with each other or not. (For instance, to put it simply, for two threads the 'correlation' would mean that P(AB) > P(A)*P(B), where P(A), P(B) is the probability of a low latency event for the A and B thread, and P(AB) is the probability of two 'nearby' low latency events.

Replace "low" with "high" (which is what I think you meant anyway) and we agree. And yes, you can say that P(high) = 1 - P(low) and it already means that...

If I understood it well, then my intuition says that the CO-related correction should be somehow dependent on the factor of correlation... That is, CO might not be a binary, 0 or 1 thing.)

In a test system, the first goal should be to avoid coordinated omission, not to correct for it. And in a test system, this is very possible. [Unfortunately most test systems don't do that, and have coordinated omission galore.]

If correction must be done (mostly because the test system has no viable way to deduce the expected start time for operations), then the only reliable corrections I've played with so far are ones that take a 0 or 1 approach. Yes, there are probabilistic levels that can fill in the gaps somewhat beyond the 0/1 level ("correcting more" than the certain corrections that are safe to do when 0/1 conditions are detected), but those get very hard to detect, estimate, and apply.
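(To illustrate the "avoid rather than correct" point: a constant-rate tester can sidestep CO by deriving each operation's start time from the test plan rather than from when the previous operation happened to finish, and by measuring latency from that intended start time. A minimal single-threaded Java sketch, with names and structure of my own choosing:)

    import java.util.concurrent.locks.LockSupport;
    import org.HdrHistogram.Histogram;

    // Plan-driven issuer: the i-th operation is *due* at testStart + i * interval, regardless
    // of how long earlier operations took, and latency is measured from that intended start.
    // A stall therefore shows up in every operation that was due during it, instead of being
    // silently omitted.
    public class PlanDrivenIssuer {
        public static Histogram run(Runnable operation, long intervalNanos, int operationCount) {
            Histogram histogram = new Histogram(3_600_000_000_000L, 3);
            long testStart = System.nanoTime();
            for (int i = 0; i < operationCount; i++) {
                long intendedStart = testStart + i * intervalNanos;
                long now;
                while ((now = System.nanoTime()) < intendedStart) {
                    LockSupport.parkNanos(intendedStart - now); // wait until the operation is due
                }
                operation.run();
                histogram.recordValue(System.nanoTime() - intendedStart); // from intended, not actual, start
            }
            return histogram;
        }
    }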
The coordinator is just as susceptible to pauses (and related omissions) as anything else is...

Sure, but for every coordinator to be affected simultaneously would be less likely (sure, there would be correlations between node GC incidence); since requests are ~randomly proxied on from any given coordinator, it will be extremely unlikely to see a real system fail to spot as much as 60% (or even 20%) of lengthy events. The chance of missing an event reduces exponentially with the size of the cluster, although the exact behaviour would depend on the correlation between GC events. I would change my position if presented with a common cluster configuration in which this is reproducible, but it seems to me you are basing your position on generalized experience with single-node application behaviours.
There is still a reporting issue for making use of this information, I agree. My position is that this isn't really an issue the database should try to solve for you. To me this is a problem with monitoring analysis tools, which could apply a coordinated omission overlay with the raw data already available. I don't like doctoring the raw data reported by an application when it can be doctored in post processing with conscious intent, especially when accurate measurement by the application is still the gold standard.
That said, I would really like to see better application level access to STW times so that these could be better reported, as currently it is pretty insufficient, and this would help those tools correctly model this (if any are trying).
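(For what it's worth, HdrHistogram already exposes a post-processing hook roughly along these lines; a minimal sketch, with the expected interval between samples being exactly the kind of assumption discussed above:)

    import org.HdrHistogram.Histogram;

    // Post-processing overlay sketch: given a raw (uncorrected) histogram and an assumed
    // expected interval between samples, produce a corrected copy in which each recorded
    // outlier also back-fills the samples it implies were omitted during the stall.
    public class PostProcessCorrection {
        public static Histogram correctedCopy(Histogram raw, long expectedIntervalNanos) {
            return raw.copyCorrectedForCoordinatedOmission(expectedIntervalNanos);
        }
    }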
I am staring at a cassandra-stress output right now that shows the following:

Each simulated client awaits the result of the prior operation before performing another. If you have very few such clients then there will be very few operations being performed, and those %iles are accurate.
If you want to measure the performance for an operation arrival rate, more clients than the number of concurrent ops are needed, and a rate limit should be specified.
Admittedly there is some work to be done to ensure this would report correctly for arbitrarily large pauses, and to improve the documentation (amongst a hundred other things).
...
Accurate? Really?
...
Accurate? Really?

Yes, really. cassandra-stress is not a constant rate test; it only runs a steady number of clients.
These clients perform as much work as they can without constraint (and the work can be highly non-uniform also), but each act as an independent sequence of synchronous database actions.
Without a target operation rate, this is a throughput test.
To test for latencies, you need to specify a rate limit (target rate); to overcome current limitations with stress, you need to oversubscribe the client (thread) count so that during a pause the target rate can still be met.
This would then give a true picture up to the limit of oversubscription of clients (i.e. a long enough pause would exhaust the threads used by the simulator, at which point no more measurements would be made). The "improvement" I discuss is to untie the client count from the operation rate, so that there is no threshold beyond which it would fail (although realistically there is always such a threshold, it would just be higher); this is a hangover of legacy access modes (which are all synchronous).
This does not in any way make it inaccurate. You are simply using it incorrectly for the thing you would like to measure,
which is why I highlight documentation as a known improvement area, as well as improvements to make it harder to go wrong.
...
I'm just feeding it a rate
I haven't seen anywhere where it suggests you need to do that for your latency percentile reports to be accurate, and I don't see anything that invalidates the latency percentile reports if you fail to have enough threads to make that happen.
You can’t simply put as many unthrottled clients on the system and then call it a throughput test. At some load throughput will back off because the system will start thrashing and throughput will start ramping down.
And just because there isn’t any throttling on the injector doesn’t mean there isn’t any CO.
I'm just feeding it a rate

Actually, you're feeding it a rate limit.
To model a target rate requires the method I outlined. With the 320ms max pause seen, a total of 6400 threads are needed, and validating you had enough requires only some very limited math. I agree this is suboptimal, and is the precise aspect I was referring to that needs improvement.
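(Making the "limited math" explicit, and inferring the target rate implied by those numbers: the thread count has to cover every operation that falls due during the longest pause, i.e.

    threadsNeeded ≈ targetRate × maxPauseTime
    e.g. 20,000 ops/sec × 0.32 sec ≈ 6,400 threads

so 6,400 synchronous clients can keep a 20,000 ops/sec target going through a 320 ms stall, and any longer pause would exhaust them.)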
Whilst your suggestion is one possible solution, the other is to move to clients submitting requests asynchronously - which is the encouraged modern access method, so is something we need to migrate to anyway.
However, using 6.4k threads (even 20k) is workable in the meantime, and does achieve the desired ends, with some slight inefficiency on the client side.
I haven't seen anywhere where it suggests you need to do that for your latency percentile reports to be accurate, and I don't see anything that invalidates the latency percentile reports if you fail to have enough threads to make that happen.

Like I said, our documentation is imperfect - in fact, it is entirely missing. The tool also needs improvement in a number of areas, and this is indeed one of them. Right now, using it requires care, and an understanding of its limitations. Of course, if you knock up a version that delivers this improvement in the meantime, that's also great, and I'd be delighted to incorporate it.
...
Creating 10,000 concurrent requests when the real world you are testing for only has 20 concurrent clients operating at a high rate does not end up testing the same scenario.
That rate limit parameter translates to the average sustained rate of actual execution in the actual code, not the peak rate.
I think perhaps this is another issue of semantics: the precise meaning of "coordinated omission." Since Gil coined the term, I guess he can define it for us, but since I've never seen a dictionary definition I have taken it to mean the omission of samples measuring work, not the omission of work itself. The former is a failure of monitoring or measurement, and the latter is an issue of defining your experimental setup.
On 14 January 2015 at 21:25, Kirk Pepperdine <ki...@kodewerk.com> wrote:
On Jan 14, 2015, at 9:27 PM, Benedict Elliott Smith <bellio...@datastax.com> wrote:

Accurate? Really?

Yes, really. cassandra-stress is not a constant rate test, it only runs a steady number of clients. These clients perform as much work as they can without constraint (and the work can be highly non-uniform also), but each act as an independent sequence of synchronous database actions. Without a target operation rate, this is a throughput test.

You can’t simply put as many unthrottled clients on the system and then call it a throughput test. At some load throughput will back off because the system will start thrashing and throughput will start ramping down. The more load you attempt under those conditions the lower the throughput will get. To do a proper throughput test you need to understand what the limiting resource is and then load to that level. And just because there isn’t any throttling on the injector doesn’t mean there isn’t any CO.

To test for latencies, you need to specify a rate limit (target rate); to overcome current limitations with stress, you need to oversubscribe the client (thread) count so that during a pause the target rate can still be met. This would then give a true picture up to the limit of oversubscription of clients

You need to differentiate between operational load and stress. At the point of oversubscription to resources in the server latency will mostly be comprised of dead-time.

Kind regards,
Kirk Pepperdine