Cases when Metrics is inaccurate, confusing, and outright wrong.


Nick Yantikov

Apr 30, 2014, 11:49:20 PM
to metric...@googlegroups.com
I created a sample project with some edge cases where Metrics does not report reliable results.

More details with descriptions, results, and charts are here:
https://github.com/ovonick/metrics-edge-cases

Would someone be able to comment on those? Are these bugs in implementation, deficiencies in algorithms, or my misunderstanding and incorrect usage?

Thank you very much
-Nick

Ryan Tenney

May 1, 2014, 12:06:26 PM
to metric...@googlegroups.com
Hi Nick,

Thanks for putting this together!  I'll do my best to address each of the scenarios.

Scenario 1:  You're correct that the behavior of SlidingTimeWindowReservoir is more intuitive.  That doesn't make ExponentiallyDecayingReservoir any less correct; it just needs to be interpreted in the context of the rate. Perhaps we should consider changing the default reservoir type?

Scenario 2:  If I understand the behavior of the emulator, the first 9960/10000 executions last 30ms and the remaining 40/10000 last 15000ms?  If so, then the observed behavior is correct.  You will have seen a spike in the 99.9th percentile, but the 95th percentile will not move unless you change 9960 on line 135 to 9500 or less.  The 95th percentile is simply less sensitive to such spikes.  Let's say you have 3 readings: 30, 30, 15000; the median (50th percentile) is 30.  Similarly, if you have 100 readings, the first 96 of which are 30 and the 97th-100th are 15000, the 95th percentile is also 30.
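Ryan's arithmetic can be checked directly with a nearest-rank percentile (a simplified sketch; Metrics' Snapshot actually interpolates between samples, but the insensitivity argument is the same):

```python
import math

def percentile_nearest_rank(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the data is less than or equal to it."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# Three readings: the median ignores the single 15000 ms outlier.
print(percentile_nearest_rank([30, 30, 15000], 50))   # 30

# 100 readings, 96 of them 30 ms and 4 of them 15000 ms:
readings = [30] * 96 + [15000] * 4
print(percentile_nearest_rank(readings, 95))          # 30 -- the spike is invisible
print(percentile_nearest_rank(readings, 99.9))        # 15000 -- only the far tail moves
```

A percentile only moves once the fraction of outliers exceeds the fraction of the distribution above it, which is exactly why changing 9960 to 9500 or less would make the 95th percentile react.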

Scenario 3:  I'll need to parse this a bit more.  I can't speak to the accuracy of the sampling, but I will say that it isn't intended to be 100% accurate; it is intended to be highly performant while being accurate enough for the purpose of monitoring a system's responsiveness.


Ryan

Justin Mason

May 1, 2014, 12:27:25 PM
to metric...@googlegroups.com
Scenario 1 is the issue I've mentioned before, with a workaround, in this thread: https://groups.google.com/d/msg/metrics-user/0O7Uyd2kx7c/S1mrdaUVHLYJ
This blog post goes into more detail: http://taint.org/2014/01/16/145944a.html

https://groups.google.com/forum/#!msg/mechanical-sympathy/I4JfZQ1GYi8/ocuzIyC3N9EJ also suggests some more issues with the operation of ExponentiallyDecayingReservoir if the incoming data doesn't follow a normal distribution.  It may be worth repeating these tests with Gil Tene's LatencyUtils library to see if it helps.
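For readers following along, ExponentiallyDecayingReservoir implements the forward-decay priority sampling of Cormode et al. A toy model of the idea (my own sketch, not the library's code, which also rescales weights periodically and defaults to size 1028 with alpha 0.015):

```python
import heapq
import math
import random

def forward_decay_sample(stream, size, alpha):
    """Keep the `size` items with the highest priority, where
    priority = exp(alpha * arrival_time) / uniform(0, 1).
    Newer items get exponentially larger weights, so old items are
    evicted probabilistically rather than by a hard time cutoff."""
    heap = []  # min-heap of (priority, value); the root is the eviction candidate
    for t, value in stream:
        priority = math.exp(alpha * t) / random.random()
        if len(heap) < size:
            heapq.heappush(heap, (priority, value))
        elif priority > heap[0][0]:
            heapq.heapreplace(heap, (priority, value))
    return [v for _, v in heap]

random.seed(1)
# A 15000 ms spike in the first minute, then nine minutes of 30 ms readings.
stream = [(t, 15000) for t in range(60)] + [(t, 30) for t in range(60, 600)]
kept = forward_decay_sample(stream, size=128, alpha=0.015)
old = sum(1 for v in kept if v == 15000)
print(f"{old} of {len(kept)} reservoir slots still hold first-minute samples")
```

The point of the exponential weights is that a nine-minute-old sample must draw a very small uniform value to outlive a fresh one, so old spikes fade out statistically rather than at a deterministic cutoff, which is why behavior can surprise people expecting a sliding window.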

--j.



Nick Yantikov

May 2, 2014, 1:38:29 AM
to metric...@googlegroups.com
Hello Ryan. Thank you for looking into it.


On Thursday, May 1, 2014 9:06:26 AM UTC-7, Ryan Tenney wrote:
> Hi Nick,
>
> Thanks for putting this together!  I'll do my best to address each of the scenarios.
>
> Scenario 1:  You're correct that the behavior of SlidingTimeWindowReservoir is more intuitive.  That doesn't make ExponentiallyDecayingReservoir any less correct; it just needs to be interpreted in the context of the rate. Perhaps we should consider changing the default reservoir type?

I don't know. The disadvantage of SlidingTimeWindowReservoir is that it is an unbounded structure. With hundreds of high-rate metrics it may become very expensive.
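A back-of-envelope sketch of that cost (the 128 bytes per entry is my assumption for a skip-list node plus boxed longs on a 64-bit JVM, not a measured figure):

```python
def sliding_window_bytes(rate_per_sec, window_sec, bytes_per_entry=128):
    """Rough memory for a reservoir that keeps every measurement in the
    window, as SlidingTimeWindowReservoir does."""
    return rate_per_sec * window_sec * bytes_per_entry

# One metric at 10,000 updates/sec over a 60 s window:
print(sliding_window_bytes(10_000, 60) / 1e6, "MB")   # 76.8 MB
# Hundreds of such metrics scale this linearly.
```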
 

> Scenario 2:  If I understand the behavior of the emulator, the first 9960/10000 executions last 30ms and the remaining 40/10000 last 15000ms?  If so, then the observed behavior is correct.  You will have seen a spike in the 99.9th percentile, but the 95th percentile will not move unless you change 9960 on line 135 to 9500 or less.  The 95th percentile is simply less sensitive to such spikes.  Let's say you have 3 readings: 30, 30, 15000; the median (50th percentile) is 30.  Similarly, if you have 100 readings, the first 96 of which are 30 and the 97th-100th are 15000, the 95th percentile is also 30.


Yes, indeed I tailored this scenario to fail on the 95th percentile; I could also tailor it to fail on the 99th. The point, however, is that I, as a consumer of the Metrics framework, should not have to know about the reservoir, its size, or any other internal implementation detail. My perspective is that I report response-time measurements every minute, and I expect the 95th percentile to reflect the set of values collected during the last minute (more or less). With that perspective in mind, the 95th percentile for every one-minute interval during the last 10 minutes of the process run should have been 15000ms, not 30ms.

Additionally, the exponentially decaying histogram documentation states that "A histogram with an exponentially decaying reservoir produces quantiles which are representative of (roughly) the last five minutes of data". I do not see any decay over a 5-minute interval either. I am curious whether this is a bug in the implementation or a deficiency of the underlying algorithm.
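The per-interval semantics described above can be sketched as a reservoir that is drained on every report (a toy model of the reset-on-snapshot workaround discussed in Justin's linked thread, not an existing Metrics class):

```python
import math

class IntervalHistogram:
    """Records every value since the last report; each snapshot covers
    exactly one reporting interval, then starts fresh."""

    def __init__(self):
        self._values = []

    def update(self, value):
        self._values.append(value)

    def snapshot_percentile(self, p):
        """Nearest-rank percentile over this interval, then reset."""
        ordered = sorted(self._values)
        self._values = []
        if not ordered:
            return None
        rank = math.ceil(p / 100 * len(ordered))
        return ordered[min(rank, len(ordered)) - 1]

h = IntervalHistogram()
for v in [30] * 90 + [15000] * 10:   # one minute in which 10% of calls are slow
    h.update(v)
print(h.snapshot_percentile(95))     # 15000 -- the slow minute shows in its own report
```

The trade-off is the same one raised earlier in the thread: keeping every value in the interval is unbounded memory at high rates, which is precisely what the decaying reservoir avoids.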

Nick Yantikov

May 2, 2014, 11:19:15 AM
to metric...@googlegroups.com, j...@jmason.org
Hello Justin.

I just finished reading through the links you posted. Wow. It all very much chimes with my own thoughts and findings on the subject.

Thank you for the links

Take care
-Nick

Nick Yantikov

May 3, 2014, 4:17:38 AM
to metric...@googlegroups.com
I updated the output and charts in scenario 3 with the 99th and 99.9th percentiles.  The results are not very reassuring: a 30% error in the 99th percentile and a 75% error in the 99.9th.
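The size of those errors is what you'd expect once you count how few reservoir samples the tail quantiles actually rest on (assuming the default 1028-sample reservoir):

```python
reservoir_size = 1028  # ExponentiallyDecayingReservoir's default size
for p in (95, 99, 99.9):
    tail_samples = reservoir_size * (1 - p / 100)
    print(f"p{p}: about {tail_samples:.0f} sample(s) above the quantile")
# The 99.9th percentile estimate hangs on roughly one sample, so a single
# retained or evicted outlier swings it wildly.
```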

Lance N.

May 3, 2014, 10:44:18 PM
to metric...@googlegroups.com

Nick Yantikov

May 8, 2014, 12:57:11 AM
to metric...@googlegroups.com
Thank you for the links, Lance.

-Nick

Justin Mason

May 8, 2014, 5:16:03 AM
to metric...@googlegroups.com
Seconding that!  Those are a gold mine.

Would it be worth considering adopting Ted Dunning's code as a new reservoir implementation?  The licensing is compatible...



Stephen Connolly

May 8, 2014, 6:08:36 AM
to metric...@googlegroups.com
Third that! I had to star that mail in Gmail.