Anyone use LatencyUtils or HdrHistogram for measuring latency on a distributed system?



Oct 6, 2015, 4:32:23 PM10/6/15
to mechanical-sympathy
Interested in seeing if others were successful in doing so and what approaches were taken.

Particularly, looking to see if folks have used it to benchmark a distributed cache on multiple nodes.


Gil Tene

Oct 6, 2015, 8:35:43 PM10/6/15
to mechanical-sympathy
I'll let others chime in... I'm biased ;-)


Oct 6, 2015, 10:21:56 PM10/6/15
to mechanical-sympathy
Actually wouldn't mind hearing your input. The examples in the LatencyUtils package work well for a single node but I can't grasp using it across multiple machines.

Marshall Pierce

Oct 6, 2015, 11:06:51 PM10/6/15
Several years ago we did some work on Seagate’s Kinetic drives, and part of that was writing the benchmarking suite used internally to track performance as we tuned the on-drive software. We used HdrHistogram to record latencies, though in our case we didn’t need to go multi-node to generate suitable load. It did pretty much what it says on the tin: it records latency values quite well.

If you’re going for a distributed approach, one way you could do it would be to use a Recorder in each node. Every so often, flip the phase on the Recorder and encode the snapshot histogram into a ByteBuffer, then send that over the wire to whatever is aggregating the results. You could then merge that snapshot into a master Histogram. Gil, any reason why that wouldn’t work?
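Marshall's flip-and-ship pattern can be sketched without the library at all. This is a minimal, library-free illustration (the class and function names here are invented for the example; the real HdrHistogram `Recorder` does the swap wait-free, and uses a compact compressed binary wire format rather than JSON):

```python
import json
import threading
from collections import Counter

class IntervalRecorder:
    """Library-free sketch of the Recorder pattern: record on the hot
    path, periodically swap out a snapshot histogram to ship to an
    aggregator. (A lock is used here for simplicity; the real
    HdrHistogram Recorder is wait-free on the recording path.)"""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = Counter()  # value -> count

    def record(self, value):
        with self._lock:
            self._active[value] += 1

    def get_interval_histogram(self):
        # Swap in a fresh histogram and return the finished interval.
        with self._lock:
            snapshot, self._active = self._active, Counter()
        return snapshot

def encode_snapshot(snapshot):
    # Wire format is illustrative only (JSON); the real library
    # encodes into a compressed ByteBuffer, often base64'd.
    return json.dumps(sorted(snapshot.items()))

rec = IntervalRecorder()
for v in (5, 5, 9):
    rec.record(v)
payload = encode_snapshot(rec.get_interval_histogram())
# payload now travels over the wire to whatever aggregates the results
```

Each flip leaves the recorder with a clean histogram for the next interval, so the aggregator always receives non-overlapping interval data.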


Alec Hothan

Oct 7, 2015, 12:55:52 AM10/7/15
to mechanical-sympathy
I'm a little bit biased as well but I have direct experience in that area.

I have been using HdrHistogram in the context of OpenStack to measure how well an OpenStack cloud behaves under load. For those not familiar, OpenStack is an open source cloud OS that manages compute nodes and applications that run in virtual machines, such as a company's web server (similar to Amazon Web Services, but using open source software).
I was in charge of measuring how far an OpenStack cloud deployment can scale, and developed a scale application that would load the cloud with a configurable number of web servers and HTTP traffic generators, then pass traffic and measure things like how many requests/sec and how much throughput could be sustained, and with what latency. On our largest cloud, we had hundreds of HTTP servers and hundreds of HTTP traffic generators simulating hundreds of thousands of users across different physical compute nodes. The challenge was to summarize the latency seen by these thousands of emulated users distributed all over the place.
The traffic generator I used was Gil's wrk2 (an open-source fork of wrk on GitHub), which uses the C version of HdrHistogram (written by Mike Barker).
That version simply displayed percentile latency information on the console.
What we did was fork it to use the latest HdrHistogram and write the encoded histogram (in compressed/base64 form) to stdout, then have our agent on each compute node pick up that encoded histogram and forward it to our aggregator (in our case we use Redis as the transport).
The aggregator/main orchestration app is Python based, which is why I ended up porting Gil's HdrHistogram to Python (it took much more time than I expected, but it was worth the effort). The aggregator receives all these encoded histograms, decodes them, and aggregates them on the fly.
In terms of scale, we have tested up to several hundred traffic generators.
In simple runs we would receive one histogram per traffic generator at the end of the scale run (e.g. after 10 minutes).
In monitoring runs, the traffic generators would send histograms every 5 seconds, resulting in an avalanche of histograms arriving at the aggregator every 5 seconds.
Because the HdrHistogram decoder has been optimized for speed, we could easily handle a few hundred decodes per 5-second period.
I have no doubt this can scale to thousands of distributed histogram collection points if the transport used is adequate.
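The aggregation side of this pipeline can be sketched in a few lines. This is a self-contained illustration, not the real HdrHistogram encoding: the stand-in wire format here is JSON → zlib → base64 (the real library has its own compact compressed binary format, also typically base64'd for transport, and the actual Python port's decoder would be used instead of `decode_histogram`):

```python
import base64
import json
import zlib
from collections import Counter

def encode_histogram(counts):
    """Illustrative stand-in for the HdrHistogram wire encoding:
    JSON -> zlib -> base64."""
    raw = json.dumps(sorted(counts.items())).encode()
    return base64.b64encode(zlib.compress(raw)).decode()

def decode_histogram(payload):
    raw = zlib.decompress(base64.b64decode(payload))
    return Counter(dict(json.loads(raw)))

def aggregate(payloads):
    """What the aggregator does on the fly: decode each incoming
    histogram and add it into the running total."""
    total = Counter()
    for p in payloads:
        total += decode_histogram(p)
    return total

# Two traffic generators report their interval histograms (value -> count):
gen_a = encode_histogram(Counter({3: 10, 7: 2}))
gen_b = encode_histogram(Counter({3: 5, 12: 1}))
merged = aggregate([gen_a, gen_b])
```

Because decoding and adding are cheap relative to a 5-second reporting interval, the aggregator can keep up with hundreds of reporters.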

Continuing on this we are considering extending this scale app to support distributed storage access (still in the context of cloud). As far as I know there are very few (if any) tools that do distributed performance measurement with histograms. I think this is extremely useful and provides invaluable information on the behavior of any distributed system under load.

So to answer that question, this works pretty well ;-)

Gil Tene

Oct 7, 2015, 2:37:46 AM10/7/15
to mechanical-sympathy
The key thing to realize is that histograms are additive, unlike other forms of latency summary stats (like percentiles). A histogram keeps track of the counts of occurrences of various values. So you can add all the histograms for all the 5-second intervals within an hour, and you get the actual histogram for the hour. And you can add all the histograms for a given time period across a cluster of 100 nodes, and you have the actual histogram for the entire cluster for that time period.
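Gil's additivity point can be shown with a toy value→count histogram (a library-free sketch; the `percentile` helper and the example counts are invented for illustration). Adding histograms gives the exact cluster-wide percentile, whereas averaging per-node percentiles does not:

```python
from collections import Counter

def percentile(hist, pct):
    """Exact percentile from a value -> count histogram."""
    total = sum(hist.values())
    rank = pct / 100.0 * total
    seen = 0
    for value in sorted(hist):
        seen += hist[value]
        if seen >= rank:
            return value
    return max(hist)

# Two nodes' histograms for the same interval (latency value -> count):
node_a = Counter({1: 98, 100: 2})   # mostly fast, a couple of outliers
node_b = Counter({1: 50, 100: 50})  # half slow

# Histograms add: the sum IS the true cluster-wide histogram.
cluster = node_a + node_b
p90_cluster = percentile(cluster, 90)

# Averaging per-node percentiles instead gives a number that describes
# neither node nor the cluster:
p90_avg = (percentile(node_a, 90) + percentile(node_b, 90)) / 2
```

Here the true cluster-wide p90 is 100, while the average of the two per-node p90s is 50.5, badly understating the slow half of the traffic.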

The various forms of HdrHistogram (Java, C, Python, Erlang...) support a common lossless (and pretty compact, compressed) encoding format that can be used for transporting histograms between machines, for logging them, and even for storing them in a time series database. So you can aggregate histograms whenever you want to. Since the footprint isn't that big, keeping them in non-aggregated form until you actually need to sum them up (for a time period, or for a set of machines, or both) is useful. There is also a common logging format for interval histograms (e.g. HistogramLogWriter), tools for extracting summaries from them (HistogramLogProcessor), and some tools people have built to view them.

As for patterns for actually capturing the data, LatencyStats and the Recorder in HdrHistogram both use a common getIntervalHistogram() pattern. You record your stats in a wait-free manner on your critical path, and get clean interval histograms for the recording on a non-critical path. You'd then either log those interval histograms (using a HistogramLogWriter; jHiccup is a good example of how this is done), or send them somewhere else (encode them into a compressed ByteBuffer and send over the network, either in binary form or in base64 encoding).

One note I'd put in if you are planning to benchmark a distributed system (as opposed to monitoring it): LatencyStats is useful for correcting Coordinated Omission when it happens and when you can actually detect the dominant causes (like process pauses), and this correction capability is very useful for monitoring situations. However, when benchmarking you don't have to correct for Coordinated Omission, because you can cleanly avoid Coordinated Omission to begin with. CO correction will invariably "fill in the blanks" for time periods where we know measurements were skipped, which means it "makes up" data based on good guesses (e.g. projecting the reality that was measured at nearby times into the time gap where no measurement was done), but CO avoidance (which is possible in benchmarking because there is always supposed to be "a plan" for what should have happened) does not need to project or make up anything. You can see some good examples of CO avoidance in wrk2, cassandra-stress2, and YCSB. In all of these, LatencyStats is NOT used; instead you just use HdrHistograms (or a histogram Recorder) to record the correctly measured response time.
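The "plan" Gil mentions can be made concrete with a small simulation (a library-free sketch; the function name and the timing numbers are invented for illustration). A CO-avoiding load generator such as wrk2 charges each request its latency from the time it was *scheduled* to start under the constant-rate plan, not from when it actually went on the wire, so queueing behind a stall is counted rather than silently omitted:

```python
# Sketch of Coordinated Omission *avoidance*: measure latency from the
# intended (planned) start time, so delay caused by a stalled earlier
# request is charged to the requests that had to wait.

def measure_with_co_avoidance(start, interval, service_times):
    """Issue requests back-to-back on one connection at a planned
    constant rate; record latency = completion_time - intended_start."""
    latencies = []
    clock = start
    for i, service in enumerate(service_times):
        intended = start + i * interval          # the "plan"
        clock = max(clock, intended) + service   # actual completion
        latencies.append(clock - intended)       # includes queueing delay
    return latencies

# Plan: one request every 10ms. The 2nd request stalls for 100ms, so
# the 3rd and 4th queue behind it and their waiting time is recorded.
lat = measure_with_co_avoidance(start=0, interval=10,
                                service_times=[1, 100, 1, 1])
```

A naive generator that timestamps each request only when it is actually sent would report the 3rd and 4th requests as fast (1ms), hiding the stall; measuring against the plan reports their true 91ms and 82ms latencies with nothing projected or made up.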

Sent from Gil's iPhone