measuring / monitoring latency


isaiah perumalla

Nov 1, 2013, 6:14:42 AM
to mechanica...@googlegroups.com
What do people usually use to monitor latency in a prod-like environment, with minimal overhead?
For my current project I was considering simply timestamping incoming/outgoing messages on the server, recording the latency numbers in an HdrHistogram, and having an embedded HTTP server send the histogram data when requested. What do people think?
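Roughly along these lines (an illustrative sketch only; it assumes HdrHistogram's org.HdrHistogram.Histogram and the JDK's built-in com.sun.net.httpserver, and the port, endpoint path, and histogram bounds are placeholders):

import com.sun.net.httpserver.HttpServer;
import org.HdrHistogram.Histogram;

import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class LatencyEndpoint {
    // Track values from 1 ns up to 1 minute, with 3 significant decimal digits.
    private static final Histogram HISTOGRAM = new Histogram(60_000_000_000L, 3);

    // Wrap the operation being measured.
    static void timed(Runnable operation) {
        long start = System.nanoTime();
        operation.run();
        synchronized (HISTOGRAM) {            // plain Histogram is not thread-safe
            HISTOGRAM.recordValue(System.nanoTime() - start);
        }
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/latency", exchange -> {
            String report;
            synchronized (HISTOGRAM) {
                report = String.format(
                        "count=%d p50=%dus p90=%dus p99=%dus p99.9=%dus max=%dus%n",
                        HISTOGRAM.getTotalCount(),
                        HISTOGRAM.getValueAtPercentile(50.0) / 1000,
                        HISTOGRAM.getValueAtPercentile(90.0) / 1000,
                        HISTOGRAM.getValueAtPercentile(99.0) / 1000,
                        HISTOGRAM.getValueAtPercentile(99.9) / 1000,
                        HISTOGRAM.getMaxValue() / 1000);
            }
            byte[] body = report.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}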

Justin Mason

Nov 1, 2013, 6:54:04 AM
to mechanica...@googlegroups.com
This ties in with the other thread on monitoring ;)  The Yammer Metrics classes include a Timer class which maintains an efficient histogram of recorded latency percentiles. I'd then suggest sending those percentiles to Graphite to record them over time. We use this and it works well, particularly when you need to monitor things like the 99th-percentile latency for a particular API call.


There are a couple of issues, though:

1. It's efficient, but it still adds some overhead, with two calls to System.currentTimeMillis() and multiple AtomicLongs being set on each timer update (IIRC). For hot paths, it would be wise to record only a sample of measurements instead of measuring every call. I offered a pull request to add a sampling timer implementation, but it was turned down -- so you'd have to do that yourself, I'm afraid :( (see the sketch after this list).

2. The decay behaviour of an exponentially decaying reservoir has an impedance mismatch when you are recording latencies into a time-series data store like Graphite, since you wind up recording measurements at time T which represent a measurement taken at time T-1. I worked around this by clearing the timer histograms each time they are written to Graphite, which makes the behaviour more intuitive.
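For point 1, a rough sketch of the sampling idea (not the actual pull request; it assumes the Metrics 3.x Timer API, a Java 8 Supplier for brevity, and an arbitrary sampling rate):

import com.codahale.metrics.Timer;

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Times only ~1 in sampleRate calls, to keep measurement overhead off the hot path. */
public class SamplingTimer {
    private final Timer timer;
    private final int sampleRate;

    public SamplingTimer(Timer timer, int sampleRate) {
        this.timer = timer;
        this.sampleRate = sampleRate;
    }

    public <T> T time(Supplier<T> call) {
        if (ThreadLocalRandom.current().nextInt(sampleRate) != 0) {
            return call.get();                       // untimed fast path
        }
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            timer.update(System.nanoTime() - start, TimeUnit.NANOSECONDS);
        }
    }
}

Of course, sampling like this lowers overhead by discarding measurements, including outliers, which matters for the tail percentiles discussed later in this thread.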

--j.






Peter Lawrey

Nov 1, 2013, 8:02:53 AM
to mechanica...@googlegroups.com
I would recommend using System.nanoTime() for timings. Millisecond timings are okay for long requests, but nanoTime() will also work in this situation.
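For illustration, the practical difference between the two calls (doWork() is just a placeholder; the comments describe typical behaviour, not guarantees):

public class TimingCalls {
    static void doWork() { /* placeholder for the operation being timed */ }

    public static void main(String[] args) {
        // currentTimeMillis(): wall-clock time, roughly millisecond granularity,
        // and it can jump when the clock is adjusted (NTP, manual changes).
        long startMs = System.currentTimeMillis();
        doWork();
        long elapsedMs = System.currentTimeMillis() - startMs;

        // nanoTime(): monotonic and intended specifically for measuring elapsed
        // time, so it works for both short and long requests.
        long startNs = System.nanoTime();
        doWork();
        long elapsedNs = System.nanoTime() - startNs;

        System.out.println(elapsedMs + " ms vs " + elapsedNs + " ns");
    }
}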

Justin Mason

Nov 1, 2013, 8:13:46 AM
to mechanica...@googlegroups.com
oops, my mistake -- Timer uses System.nanoTime().

Nathan Fisher

Nov 1, 2013, 10:53:30 AM
to mechanica...@googlegroups.com
I generally route through a reverse proxy and then track latency external to the JVM, either online using collectd, Zabbix, etc., or offline/delayed from central log-processing servers. If you're doing it online from your server, give careful consideration to your log format and extraction method. I previously used collectd to monitor online using the log tail module and some regexes. The performance tests revealed the regex was backtracking, which resulted in an increase of load proportional to the inbound requests. It was fixed by moving the fixed-width fields like response code, bytes
--
Nathan Fisher

Gil Tene

Nov 1, 2013, 12:04:42 PM
to mechanica...@googlegroups.com, j...@jmason.org
Putting aside the orthogonal [and dominant] issue of Coordinated Omission, which makes all other measurement issues irrelevant unless fixed, and looking at the Yammer Metrics classes [mentioned below] for data that is presumably otherwise correct:

I'm obviously a bit biased and opinionated, but I see huge issues with measuring latency with any lossy or decaying sampling technique. The purpose of measuring latency is invariably to understand how it behaves (as opposed to what a single measurement showed). A proper description of latency behavior always needs to include and take into account "rare" outliers, which is exactly what lossy, averaging, or decaying techniques throw away. Most of the modeling techniques start with a simple wishful-thinking assumption about how latency might behave, and then try to fit the real data stream to that [always wrong] assumption. Latency averages are a nonsense number, and standard deviation is meaningless for describing latency. Anyone not looking at their 99%'ile+ and their actual (not lossily sampled) max values is effectively putting on a blindfold. Unless, of course, they don't care about 1% of their stuff failing all the time. My recommendation is to collect and plot the entire percentile spectrum. That's what HdrHistogram was built to allow.

From reading the links below, the Yammer Metrics meter and histogram set seems to have one histogram type (SlidingTimeWindowReservoir) that is not lossy and does not decay or apply some other statistical-model assumption to the data. But it comes with a warning about being unbounded in size, which may limit its practical use. It seems like fitting HdrHistogram in as a histogram type in this set would be useful for both the time-window and the uniform (represent-the-entire-lifetime) use cases, as it would provide a fixed-space, non-lossy, and non-decaying solution for both. I'd love to connect with whoever is maintaining that stuff; I just wish I had more hours in the day...
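For what it's worth, getting the entire percentile spectrum out of an HdrHistogram is a one-liner once values are recorded (dummy data below, values in nanoseconds, scaled to microseconds on output):

import org.HdrHistogram.Histogram;

public class PercentileSpectrum {
    public static void main(String[] args) {
        // Latencies recorded in nanoseconds (as in the earlier sketches).
        Histogram histogram = new Histogram(60_000_000_000L, 3);
        for (int i = 0; i < 100_000; i++) {
            histogram.recordValue(50_000 + (i % 97) * 1_000);   // dummy data
        }
        // Prints the full percentile distribution (value, percentile, total count,
        // 1/(1-percentile)), scaled ns -> us, in a plottable text format.
        histogram.outputPercentileDistribution(System.out, 1000.0);
    }
}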


Gil Tene

Nov 1, 2013, 12:25:17 PM
to mechanica...@googlegroups.com
If you can wait a week or two, things will be changing for this question...

While HdrHistogram will let you track your System.nanoTime() measurements efficiently and report on their entire percentile spectrum, you have Coordinated Omission to worry about here. And when measuring and monitoring from within a prod server, it's a doozy. There are two major problems here:

1. HdrHistogram lets you correct for CO when you know what the expected interval between measurements is, but while that's "easy" in synthetic environments where the expected behavior is known, estimating it in a prod environment is harder. It can be solved, though. E.g. when I have to, I usually use a time-capped moving-window average estimator for inter-measurement intervals and feed the estimated interval to HdrHistogram along with every measurement using the recordValueWithExpectedInterval() call (see the sketch below).

2. HdrHistogram's CO correction only works when you observe a large latency. The problem is that when measuring inside a server to monitor the latency of some operation, there is a [potentially very high] likelihood that various stalls (swapping, scheduling contention, GC) will occur outside of the timing window. E.g. if you are monitoring the latency of read operations, and read operations only constitute 15% of the work the server does, you have an 85% chance of completely missing the existence of a stall. [Current Cassandra read-latency reporting is a great example of this problem.]

So while #1 can be addressed, #2 is "hard".
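To make #1 concrete, a rough sketch of that kind of interval estimation (a simplified count-based moving window rather than the time-capped one described above; the window size is arbitrary, and recordValueWithExpectedInterval() is the actual HdrHistogram call):

import org.HdrHistogram.Histogram;

/** Estimates the expected inter-measurement interval from a small moving window of
 *  recent arrivals and feeds it to HdrHistogram's coordinated-omission correction. */
public class CorrectedRecorder {
    private final Histogram histogram = new Histogram(60_000_000_000L, 3);
    private final long[] recentIntervals = new long[128];   // arbitrary window size
    private int count;
    private long lastArrivalNanos = -1;

    public synchronized void record(long latencyNanos) {
        long now = System.nanoTime();
        if (lastArrivalNanos >= 0) {
            recentIntervals[count++ % recentIntervals.length] = now - lastArrivalNanos;
        }
        lastArrivalNanos = now;

        long expectedInterval = averageOfWindow();
        if (expectedInterval > 0) {
            // HdrHistogram back-fills synthetic values for the measurements that
            // "should" have happened when a recorded value exceeds the interval.
            histogram.recordValueWithExpectedInterval(latencyNanos, expectedInterval);
        } else {
            histogram.recordValue(latencyNanos);
        }
    }

    private long averageOfWindow() {
        int n = Math.min(count, recentIntervals.length);
        if (n == 0) return 0;
        long sum = 0;
        for (int i = 0; i < n; i++) sum += recentIntervals[i];
        return sum / n;
    }
}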

The good news is that I'm working on a solution that I hope will be fairly generic and useful in most cases of in-server monitoring. I'm calling it "LatencyStats", and I hope to be ready to put something sort-of-working up on GitHub in the next week or so. The solution includes LatencyStats tracking objects that register with a common underlying infrastructure, which consists of pluggable pause detectors, interval estimators, and separate HdrHistograms to track recorded latencies and corrections for each LatencyStats object, both on a per-interval basis and in an accumulated fashion. All this comes together under the hood to give you a trivial latencyStats.recordLatency(x) API call and easy access to consolidated HdrHistogram data for your selected latency stat, both for settable time intervals and for accumulated whole-run stats.

LatencyStats. Coming Soon.

Jamie Allen

Nov 2, 2013, 12:06:47 PM
to mechanica...@googlegroups.com
The primary maintainer is Coda Hale. @coda on Twitter. coda...@gmail.com

pron

Nov 2, 2013, 1:43:28 PM
to mechanica...@googlegroups.com, j...@jmason.org
Metrics is not meant to spit out a complete execution profile, but to emit current performance values (i.e., latency in the last minute). You can then pipe Metrics's output into performance analytics software. So Metrics is not concerned with old values at all, because it's intended to help measure performance and produce data -- not analyze it.

Gil Tene

Nov 2, 2013, 3:19:44 PM
to <mechanical-sympathy@googlegroups.com>, j...@jmason.org
Unfortunately, there is no such thing as "latency in the last minute". There are metrics that can be computed over the set of latencies collected in the last minute -- metrics like the "average latency in the last minute", and similarly the "median latency", "90%'ile", and "99.9%'ile". And for each and every one of these computations, when they are based on a false assumption (e.g. that the distribution of latency is normal), the reported output will be off. As in way, way off.

The actual Metrics package is a fine package, and it does what it says it does, but its use for latency metering is usually completely misguided and plain wrong, unless your actual latency behavior exhibits a normal distribution pattern, or you use one of the non-default reservoir modes (like the unbounded-size SlidingTimeWindowReservoir) that do not suffer from the "let's assume the distribution is normal" syndrome. The problem with the way Metrics is used in many places (e.g. by later versions of Cassandra) is that a default-mode Timer() [1] is often used to track latency, and then stats like the median, 95%'ile, 99%'ile, etc. are extracted from the timer and used for reporting. The default internal reservoir used for tracking latency values in Timer is an exponentially decaying one (ExponentiallyDecayingReservoir [2]), which "Uses Cormode et al's forward-decaying priority reservoir sampling method to produce a statistically representative sampling reservoir, exponentially biased towards newer entries." The code is very clear about the assumptions under which this decaying method is useful: "Creates a new ExponentiallyDecayingReservoir of 1028 elements, which offers a 99.9% confidence level with a 5% margin of error assuming a normal distribution, and an alpha factor of 0.015, which heavily biases the reservoir to the past 5 minutes of measurements."
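To make the distinction concrete, a sketch of the two reservoir choices side by side (assuming the Metrics 3.x API; the metric names and the one-minute window are arbitrary):

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.SlidingTimeWindowReservoir;
import com.codahale.metrics.Timer;

import java.util.concurrent.TimeUnit;

public class ReservoirChoice {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();

        // Default: ExponentiallyDecayingReservoir -- a lossy 1028-entry sample whose
        // confidence figures assume a normal distribution of the measured values.
        Timer defaultTimer = registry.timer("reads.default");

        // Non-lossy alternative: keeps every measurement from the last minute,
        // at the cost of unbounded memory if the update rate is high.
        Timer windowedTimer = new Timer(new SlidingTimeWindowReservoir(1, TimeUnit.MINUTES));
        registry.register("reads.windowed", windowedTimer);

        // Both are updated the same way:
        defaultTimer.update(42, TimeUnit.MILLISECONDS);
        windowedTimer.update(42, TimeUnit.MILLISECONDS);
    }
}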

The problem is simple: actual latency behavior virtually NEVER has a normal distribution. Not even close. Latency behavior is usually strongly multi-modal, and terrible for curve-fitting attempts. It is one of the best examples of a naturally occurring data set that is as far from "normal" as you can imagine. Using the 95%'ile, 99%'ile, averages (or any other stat) computed from a sampling mechanism that ONLY provides good results for a normal distribution to describe latency behavior is exactly what I was referring to in my "Most of the modeling techniques start with a simple wishful-thinking assumption about how latency might behave, and then try to fit the real data stream to that [always wrong] assumption" statement below. You feed it a nonsense assumption and maybe-good data, and you'll always get nonsense results out the other side. They'll be nice, smoothed-out nonsense results, but they'll be very disjoint from the reality you are trying to report on.

This is then further compounded by coordinated omission. It's easy to put the overall behavior to the test with my recommended "^Z test". E.g. ^Z a system for 12 seconds and see what it reports. In a system that normally records 100 latencies a second, you'll find that the metrics coming out do nothing to describe "the latency in the last minute". A latency metric being tracked with a Yammer Timer, when encountering such a 12-second stall in a 100-per-second thing, will report what in reality was the last minute's actual 90%'ile as the 99.9%'ile, and show a 90%'ile that behaves as if no stall has happened. In addition, the average latency it shows will have shifted off of the "typical non-stalling" levels by only ~10 msec, when in reality the average latency in such a scenario would be at least 2.4 seconds. Unfortunately, I have plenty of documented real-world cases showing how real-world results will similarly end up being reported multiple orders of magnitude off from reality.


Ron Pressler

Nov 2, 2013, 3:19:50 PM
to Gil Tene, mechanica...@googlegroups.com, j...@jmason.org

Could you please explain what important information Metrics can miss?

I understand what you’re saying, but if my history is one minute, or one second even, does it make a difference? Suppose I occasionally get an unacceptable latency - say 3 seconds instead of 50 millis. That number is going to show up as “max” in the Metrics data. If Metrics reports every second, and the effect happens every 10 seconds, I will get an undesired max value every ten seconds. 

If bad latency happens more than once a second, then it will not only affect the max value, but maybe the 99.9th percentile as well.


I get a new histogram that describes the latency distribution of, say, 10K operations every second. So even if the histogram for those 10K operations is off, I still get a new histogram every second. Could I miss a serious problem in the app? After all, that’s what Metrics is for: production performance monitoring that’s supposed to alert me of trouble; it is not a benchmarking tool. 

Gil Tene

Nov 2, 2013, 7:14:56 PM
to mechanica...@googlegroups.com, Gil Tene, j...@jmason.org
See inline.


On Saturday, November 2, 2013 12:19:50 PM UTC-7, pron wrote:

Could you please explain what important information Metrics can miss?


Pretty much all of it. The max, the min, the average, the 90%'ile, the 99%'ile. They are all wrong unless latency behavior happens to meet your wish that it look like a normal distribution. I don't know of any information in there that is right... Since it's a lossy sampling filter, the more "rare" the metric is (i.e. max and 99.9% as opposed to median), the higher the likelihood that it is off by an order of magnitude or more. See the math for your example case below.
 

I understand what you’re saying, but if my history is one minute, or one second even, does it make a difference? Suppose I occasionally get an unacceptable latency - say 3 seconds instead of 50 millis. That number is going to show up as “max” in the Metrics data. If Metrics reports every second, and the effect happens every 10 seconds, I will get an undesired max value every ten seconds.


That's just the thing. You can have one of those happen every 10 seconds, and many of them simply won't show up in the results. Not in the max or anywhere else. It'll be like they never happened. Unless you actually don't care about how often outliers occur, that's a pretty bad thing.

Timer() [by default] uses a lossy statistical sampler, and it doesn't store or remember each latency result. And even if it did decide to store a given result, there is a good chance it will overwrite the abnormal max value before your once-per-second sample of the histogram ever sees it. There'll be plenty of outliers that you will never see because it just doesn't keep them. Look at the code for update() in the ExponentiallyDecayingReservoir for details.

ExponentiallyDecayingReservoir is a perfectly legitimate way to sample a data set that has a roughly normal distribution. But it's also a perfectly assured way to get junk out of data that doesn't. You'll be calling max(), plotting it in your nice graphs, and believing that the occasional blips on the graph represent what is actually going on, but all the while things likely behave much worse than you think they do.
 

 

If bad latency happens more than once a second, then it will not only affect the max value, but maybe the 99.9th percentile as well.


Again, that all depends on luck, literally, since they use a random number generator to decide whether or not to keep your rare outliers, and being rare, they don't get that many chances to be kept. Many of your outliers will be unlucky enough to never be captured, or to never survive long enough to be reported.

Separately, since the internal sampling buffer for latency results has only 1028 latency entries, the reported 99.9% and max values are pretty close to the same thing (see how Snapshot computes each of them).
 

I get a new histogram that describes the latency distribution of, say, 10K operations every second. So even if the histogram for those 10K operations is off, I still get a new histogram every second. Could I miss a serious problem in the app? After all, that’s what Metrics is for: production performance monitoring that’s supposed to alert me of trouble; it is not a benchmarking tool. 


Good example. Let's work out what Timer() actually does here:

The Timer's internal reservoir only keeps 1028 results. The way it chooses whether or not to place a sample in the set, and which one to throw away, is based on a priority number chosen from a weighted random number (weighted to give more recent results a higher chance). But since they are aiming at swaying towards the past 5 *minutes*, they use a 1-second granularity in selecting the weight. That means all 10,000 results collected in the same [wall clock] second are weighted the same, and have exactly the same chance of making it into the data set.

When it comes to staying around in the data set for a whole second while others come in, it's not an even chance. There will be more sway towards more recent results, as they have to stand in front of the Russian roulette for fewer turns than the older ones in the same second. So when you come in for a sample every 1 second, the more recent part of that second has a higher chance of being represented than the earlier part of the second.

Rather than trying to deduce all these Russian-roulette-dependent probabilities from a sequence of 10,000 random events, it's pretty easy to do the "outside" math: assuming there is no coordination between when outliers happen in the second and when in that same second you sample the histogram, and knowing that 10,000 results/sec are flowing in and out of a 1,028-entry buffer, the chance of any single measured result still being in that buffer when you sample is the same for all of them: any given result has a ~90% probability of not showing up anywhere in your samples. That obviously applies to outliers just as much as to any other result.

In your example of a system logging 10K operations/sec with the histogram being sampled every second, you'll be missing 9 out of every 10 actual outliers. You can have an outlier every second and think you have one roughly every 10 seconds. You can have a huge, business-affecting outlier happening every hour, and think that it only occurs once a day.

That's worse than having a blindfold on. I call this mode "a blindfold with a lollipop". The lollipop makes you feel good about the results, since you do see the occasional outlier here and there and think outliers are actually being reported. Without it, you would have immediately known something was badly wrong.
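If you'd rather not take the ~90% figure on faith, a quick experiment against the real reservoir shows roughly the same loss rate (a sketch assuming the Metrics 3.x ExponentiallyDecayingReservoir; one injected outlier per 10,000 ordinary values, with a fresh reservoir per round so the rounds stay independent):

import com.codahale.metrics.ExponentiallyDecayingReservoir;
import com.codahale.metrics.Reservoir;

import java.util.concurrent.ThreadLocalRandom;

public class OutlierLossDemo {
    public static void main(String[] args) {
        final int rounds = 500;
        final int valuesPerRound = 10_000;      // ~1 second's worth of results
        final long outlier = 3_000;             // a 3 s outlier among ~50 ms values
        int outlierSeen = 0;

        for (int r = 0; r < rounds; r++) {
            Reservoir reservoir = new ExponentiallyDecayingReservoir(); // 1028 entries
            int outlierPosition = ThreadLocalRandom.current().nextInt(valuesPerRound);
            for (int i = 0; i < valuesPerRound; i++) {
                reservoir.update(i == outlierPosition
                        ? outlier
                        : 40 + ThreadLocalRandom.current().nextInt(20));
            }
            // The once-per-second "scrape": did the outlier survive into the snapshot?
            if (reservoir.getSnapshot().getMax() == outlier) {
                outlierSeen++;
            }
        }
        // Expect roughly 1028/10000, i.e. ~10% of rounds, to have seen the outlier.
        System.out.printf("outlier visible in %d of %d rounds (~%.0f%%)%n",
                outlierSeen, rounds, 100.0 * outlierSeen / rounds);
    }
}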


Ron Pressler

Nov 3, 2013, 1:16:10 AM
to mechanica...@googlegroups.com, Gil Tene, j...@jmason.org
Huh. I had no idea that the max value is just the largest sample in the buffer, but I looked at the code and you're absolutely right.
So, wouldn't an acceptable solution be to simply hold the largest seen value separately from the samples? You wouldn't know the distribution, but at least you'd know that something is wrong, and you'd have a sense of the occurrence frequency if you log the metrics often (like, every second or so).
But I guess that in that case you don't even need Metrics's complicated histogram (which adds contention, too). Just keep track of the max (and maybe the mean), and collect the results often.

For benchmarking, I use a striped HdrHistogram.



pron

Nov 3, 2013, 1:42:43 AM
to mechanica...@googlegroups.com, Gil Tene, j...@jmason.org
BTW, LongMaxUpdater in jsr166e (or java.util.concurrent in JDK 8) would be a good way to keep track of the max value (and LongAdder can be used to record the average).
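A sketch of that idea; LongMaxUpdater itself didn't make it into the JDK under that name, so this uses the JDK 8 equivalents (LongAccumulator for the max, LongAdder for sum and count), with everything reset on each scrape:

import java.util.concurrent.atomic.LongAccumulator;
import java.util.concurrent.atomic.LongAdder;

/** Cheap "worst case + mean" latency tracker, reset on every reporting interval. */
public class MaxAndMeanRecorder {
    private final LongAccumulator max = new LongAccumulator(Long::max, 0L);
    private final LongAdder sum = new LongAdder();
    private final LongAdder count = new LongAdder();

    public void record(long latencyNanos) {
        max.accumulate(latencyNanos);
        sum.add(latencyNanos);
        count.increment();
    }

    /** Called by the once-per-second reporter. Note: the three resets are not
     *  atomic as a group, so values recorded mid-report may straddle intervals;
     *  that's usually fine for monitoring. */
    public String reportAndReset() {
        long n = count.sumThenReset();
        long total = sum.sumThenReset();
        long worst = max.getThenReset();
        return String.format("count=%d mean=%dns max=%dns",
                n, n == 0 ? 0 : total / n, worst);
    }
}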

Gil Tene

Nov 16, 2013, 12:13:45 AM
to mechanica...@googlegroups.com
As promised below, here is my first cut at the magic coordinated-omission-correcting LatencyStats thing: http://latencyutils.github.io/LatencyUtils/

I'll start a separate thread on the subject, I think.
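For anyone trying it out, basic usage looks roughly like this. recordLatency() is the call mentioned above; the SimplePauseDetector wiring and getIntervalHistogram() shown here may differ across LatencyUtils versions, so check the project page above for the authoritative interface:

import org.HdrHistogram.Histogram;
import org.LatencyUtils.LatencyStats;
import org.LatencyUtils.SimplePauseDetector;

public class LatencyStatsExample {
    public static void main(String[] args) {
        // A pause detector shared by all LatencyStats instances in the process.
        LatencyStats.setDefaultPauseDetector(new SimplePauseDetector());
        LatencyStats stats = new LatencyStats();

        for (int i = 0; i < 100_000; i++) {
            long start = System.nanoTime();
            // ... the operation under measurement ...
            stats.recordLatency(System.nanoTime() - start);   // the "trivial" call
        }

        // Interval histogram: an HdrHistogram covering measurements since the last
        // call, with coordinated-omission corrections already folded in.
        Histogram interval = stats.getIntervalHistogram();
        interval.outputPercentileDistribution(System.out, 1000.0);
    }
}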

Chris Shellenbarger

Nov 24, 2013, 1:39:36 PM
to mechanica...@googlegroups.com, j...@jmason.org
Gil, thank you for spending the time to post all of this information.  I first heard about Coordinated Omission on the first day of JavaOne this year during your talk and had suspected that the Metrics library suffered from these issues but hadn't had the time to do the appropriate analysis.  I'm now looking forward to running your LatencyUtils side by side with Metrics data in order to see just how big the difference can be.

Adrian Ivan

May 25, 2015, 3:35:28 AM
to mechanica...@googlegroups.com, j...@jmason.org

Hi Chris,

I know it is 2 years later, but did you manage to do the comparison?

Thanks