Do percentile metrics follow the rules of summation?


Gaurav Abbi

Dec 21, 2016, 5:53:45 AM
to mechanical-sympathy
Hi,
We are collecting certain metrics with Graphite and Grafana and using them to monitor system health and performance.

For one of the latency metrics, we get the total time as well as the latencies of all the sub-components it is composed of.

We display the 99th percentile for all of these values. However, if we sum up the 99th percentiles of the sub-component latencies, the result does not equal the 99th percentile of the total time.

Essentially it comes down to whether percentiles follow summation rules, i.e.

if 
a + b + c + d = s

then,
p99(a) + p99(b) + p99(c) + p99(d) = p99(s) ?

Will this hold?

Greg Young

Dec 21, 2016, 6:09:09 AM
to mechanica...@googlegroups.com
No, because the 99th percentiles do not necessarily happen at the same time.



--
Studying for the Turing test

Avi Kivity

Dec 21, 2016, 6:12:12 AM
to mechanica...@googlegroups.com
Right; if the distributions are completely random, then

p99.99(a then b) = p99(a) + p99(b)
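
As a quick sanity check of that relation, here is a minimal simulation sketch. Everything in it is invented for illustration: two independent, bounded, featureless "latencies" with no real-world structure.

import numpy as np

rng = np.random.default_rng(0)
n = 5_000_000

# Idealized case: two independent, uniformly distributed component latencies (ms).
a = rng.uniform(0.0, 10.0, n)
b = rng.uniform(0.0, 10.0, n)

print("p99(a) + p99(b) =", np.percentile(a, 99) + np.percentile(b, 99))
print("p99.99(a + b)   =", np.percentile(a + b, 99.99))

With independent, well-behaved inputs like these, the two sides land within a fraction of a percent of each other; how closely they agree depends entirely on the shapes of the distributions.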

Gil Tene

Dec 21, 2016, 11:42:12 AM
to mechanical-sympathy
Yup. ***IF***. And in the real world they never are. Not even close.

Gil Tene

Dec 21, 2016, 12:24:02 PM
to mechanica...@googlegroups.com
The right way to deal with percentiles (especially when it comes to latency) is to assume nothing more than what it says on the label.

The right way to read "99%'ile latency of a" is "1 out of 100 occurrences of 'a' took longer than this. And we have no idea how much longer". That is the only information captured by that metric. It can be used to roughly deduce "what is the likelihood that 'a' will take longer than that?". But deducing other stuff from it usually simply doesn't work.

Specifically things for which projections don't work include:
(A) the likelihoods of higher or lower percentiles of the same metric a
(B) the likelihood of similar values in neighboring metrics (b, c, or d)
(C) the likelihood of a certain percentile of the composite operation (a + b + c + d in your example) including the same percentile of a

The reasons for A have to do with the sad fact that latency distributions are usually strongly multi-modal, and tend to not exhibit any form of normal distribution. A given percentile means what it means and nothing more, and projecting from one percentile measurement to another (unmeasured but extrapolated) is usually a silly act of wishful thinking. No amount of wishing that the "shape" of latency distribution was roughly known (and hopefully something close to a normal bell curve) will make it so. Not even close.
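
To make that concrete, here is a small sketch (all values invented) of a strongly bimodal latency profile -- a fast path plus an occasional stall -- where knowing p99 tells you essentially nothing about p99.9:

import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Invented bimodal profile: a ~2 ms fast path, plus a ~100 ms stall
# (think GC pause, page fault, etc.) hitting 0.5% of requests.
fast     = rng.normal(loc=2.0, scale=0.2, size=n)
stall    = rng.uniform(80.0, 120.0, size=n)
glitched = rng.random(n) < 0.005
latency  = np.where(glitched, fast + stall, fast)

for p in (50, 90, 99, 99.9):
    print(f"p{p:>4} = {np.percentile(latency, p):8.2f} ms")

Here p99 sits around 2.5 ms while p99.9 is around 100 ms; no bell-curve extrapolation from the lower percentiles gets anywhere near that.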

The reasons for B should be obvious.

The reasons for C usually have to do with the fact that the things that shape latency distributions in multiple related metrics (e.g. a, b, c, d) often exhibit correlation or anti-correlation.

A common cause for high correlations in higher percentiles is that the things being measured may be commonly impacted by infrastructure or system resource artifacts that dominate the causes of their higher latencies. E.g. if a, b, and c are running on the same system and that system experiences some sort of momentary "glitch" (e.g. a periodic internal bookkeeping operation), their higher percentiles may be highly correlated. Similarly when momentary concentrations and spikes in arrival rates cause higher latencies due to queue buildups, and similarly when the cause of the longer latency is the complexity or size of the specific operation.
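
A toy illustration of that kind of correlation (all numbers invented): three components on one host share a periodic "glitch", and their high percentiles then land on the same requests far more often than independence would predict.

import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# A shared host-level glitch (e.g. periodic bookkeeping) that hits
# a, b and c on the same requests.
glitch = np.where(rng.random(n) < 0.01, 50.0, 0.0)
a = rng.exponential(2.0, n) + glitch
b = rng.exponential(3.0, n) + glitch
c = rng.exponential(1.0, n) + glitch

joint = ((a > np.percentile(a, 99)) &
         (b > np.percentile(b, 99)) &
         (c > np.percentile(c, 99))).mean()
print("P(a, b and c all above their own p99):", joint)
print("expected if they were independent    :", 0.01 ** 3)

In this sketch the joint probability comes out on the order of 1 in 100, rather than the 1 in 1,000,000 that independence would give.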

Anti-correlation is often seen when the occurrence of a higher latency in one component makes a higher latency in another component in the same sequence less likely than it normally would be. The causes for anti-correlation can vary widely, but one common example I see is when the things performing a, b, c, d utilize some cached state services, and high latencies are dominated by "misses" in those caches. In systems that work and behave like that, it is common to see one of the steps effectively "constructively prefetch" state for the others, making the likelihood of a high-percentile-causing "miss" in the cache on "a" much higher than that of a similar miss in b, c, or d. This "constructive pre-fetching" effect occurs naturally with all sorts of caches, from memcache to disk and networked storage system caches to OS file caches to CPU caches.
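
A toy sketch of that constructive-prefetch effect (a shared cache, with invented miss rates and penalties): a miss on "a" warms the cache, so "b" almost never pays the same penalty on the same request, and summing the two p99s badly overestimates the p99 of the combined operation.

import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

# Steps a and b share a cache; a ~40 ms miss on 'a' leaves the entry warm,
# so 'b' only ever misses on requests where 'a' did not miss.
miss_a = rng.random(n) < 0.02
miss_b = ~miss_a & (rng.random(n) < 0.02)
a = rng.exponential(1.0, n) + np.where(miss_a, 40.0, 0.0)
b = rng.exponential(1.0, n) + np.where(miss_b, 40.0, 0.0)

slow_a = a > np.percentile(a, 99)
slow_b = b > np.percentile(b, 99)
print("P(b slow)          :", slow_b.mean())
print("P(b slow | a slow) :", slow_b[slow_a].mean())
print("p99(a) + p99(b)    :", np.percentile(a, 99) + np.percentile(b, 99))
print("p99(a + b)         :", np.percentile(a + b, 99))

In this sketch the conditional probability of a slow "b" given a slow "a" drops to essentially zero, and p99(a) + p99(b) comes out at roughly twice p99(a + b).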