More detail on outliers with HdrHistogram


Richard Cole

unread,
May 12, 2016, 12:59:22 PM5/12/16
to mechanical-sympathy
Hi - I'm using HdrHistogram after watching a number of talks and blog posts by Gil, Martin and Nitsan, and I have a question (first post, so I hope this is the right forum).  Let's assume I need to run 10,000,000+ requests (for whatever a "request" is in this context).  It's easy enough to spot the outliers using the data produced by the histogram, but what is the standard pattern for storing more information about them?  It's all well and good having the latency values of the last few percentiles, but I'm missing the link between those values and being able to dig into what the outlying requests actually were.

If I'm running 1,000 or 10,000 requests I can just log the details of every request and diagnose the outliers afterwards, but once I get up to 1,000,000 or 10,000,000 that becomes a different story.

Does anyone have any ideas or experience solving this kind of thing?

Many thanks in advance,

Richard.

Nitsan Wakart

unread,
May 15, 2016, 5:19:18 AM5/15/16
to mechanica...@googlegroups.com
I'm not sure what you mean by extra information, but I assume you want to know at least when the worst of the lot happened.
HdrHistogram does not know the 'origin' of a value and does not track a timestamp for every latency. This is something you'd have to add yourself. Given that you probably want to log other info as well, it makes sense to put some 'log details for values larger than X' logic in your code.
HdrHistogram does support a compact logging format which you can use to log histogram intervals. This can be done every second (or at whatever interval you like), and the logs can be used to narrow your search for the worst offenders down to a given interval. You then have to correlate that interval with other logs (application/GC/OS monitoring) to find out more details.
We see people doing either one, and often both.
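The 'log details for values larger than X' idea can be sketched as a small helper that sits next to the histogram recording call. Everything here (the ThresholdLogger name, the log-entry format) is an illustrative assumption, not HdrHistogram API; the real histogram.recordValue(duration) call would happen alongside record().

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch, not HdrHistogram API: record every value into the
// histogram as usual, but also capture details for values above a threshold.
class ThresholdLogger {
    private final long thresholdNanos;
    private final List<String> slowLog = new ArrayList<>();

    ThresholdLogger(long thresholdNanos) {
        this.thresholdNanos = thresholdNanos;
    }

    // Called alongside histogram.recordValue(durationNanos).
    void record(long durationNanos, String jobId) {
        if (durationNanos > thresholdNanos) {
            // In a real system you'd also log a timestamp and any other
            // context needed to correlate with application/GC/OS logs.
            slowLog.add(jobId + " took " + durationNanos + "ns");
        }
    }

    List<String> slowEntries() {
        return slowLog;
    }
}
```

The threshold X here is fixed, which is the weakness discussed later in the thread: if the traffic pattern shifts, a fixed X can suddenly match far more events than intended.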

Nathan Fisher

unread,
May 15, 2016, 10:45:11 AM5/15/16
to mechanica...@googlegroups.com
When you say "extra information", do you mean you're using a single HdrHistogram across all endpoints in a server and want to know the particular call that was made to that server, so you can classify the outliers by call path?

Nakamura

unread,
May 15, 2016, 12:20:39 PM5/15/16
to mechanica...@googlegroups.com
An idea I heard recently is to store fine-grained event data locally, in memory, for a short period of time.  This lets consumers sample after the fact, instead of immediately when we don't yet have full information.  Then you can say something like: "Once a minute, sample the histogram, grab the p99, and then get all event data for traces whose latency was longer than the histogram's p99."

I haven't done this before, but the idea is that with this strategy you can filter based on how interesting the data is, rather than on how much of it there is (i.e. as with probabilistic sampling).

The usefulness of doing the aggregation later instead of immediately is that if you want full trace data for a request, you might have to wait a little while for it to finish before you can collect the rest.
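The buffered-events idea above can be sketched as a small capacity-bounded buffer; all names here are illustrative assumptions, not any existing library's API.

```java
import java.util.ArrayDeque;

// Illustrative sketch: keep fine-grained event data in memory for a short
// window so a consumer can sample it after the fact.
class RecentEventBuffer {
    static final class Event {
        final String traceId;
        final long durationNanos;

        Event(String traceId, long durationNanos) {
            this.traceId = traceId;
            this.durationNanos = durationNanos;
        }
    }

    private final ArrayDeque<Event> buffer = new ArrayDeque<>();
    private final int capacity;

    RecentEventBuffer(int capacity) {
        this.capacity = capacity;
    }

    void record(String traceId, long durationNanos) {
        if (buffer.size() == capacity) {
            buffer.removeFirst(); // evict the oldest event once full
        }
        buffer.addLast(new Event(traceId, durationNanos));
    }

    // A periodic consumer takes a snapshot and filters it however it likes,
    // e.g. keeping only events slower than the histogram's current p99.
    Event[] snapshot() {
        return buffer.toArray(new Event[0]);
    }
}
```

The bounded capacity is what keeps memory use fixed even at 10,000,000+ requests: only a recent window of fine-grained data is ever held.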

Richard Cole

unread,
May 15, 2016, 12:49:27 PM5/15/16
to mechanical-sympathy
Thanks all for the responses.   I think Nitsan's pretty much nailed it, but on rereading my original post I see it's a bit unclear, so I'll try to clarify.

Let's assume we are recording the response times for 1,000 jobs.  A job in this sense is application specific, but each job has a unique job ID.  Once the load test is complete I have called recordValue() 1,000 times, each with a long representing the duration of a job.  At this point I can use the histogram to ascertain the durations of the outlier jobs, but I don't (without following Nitsan's recommendation) have a way to know which jobs they were.

I don't know whether someone would say it's outside the scope of HdrHistogram, but what might help is the ability to say:

recordValue(duration, context);

where context is a simple way of storing more information about the recorded duration for later use.  In my example above you could store the job ID, and also something like the timestamp mentioned below.  Both would then be useful for grepping system logs etc. to track down the path of the outlier.  You could address the memory concern this might raise by only storing the context for the outliers and discarding the lower percentiles.

Richard.

Richard Cole

unread,
May 15, 2016, 12:50:56 PM5/15/16
to mechanical-sympathy
If anyone thinks there's any utility in what I'm suggesting I'd be more than happy to have a go at a PoC.



Nakamura

unread,
May 15, 2016, 1:14:11 PM5/15/16
to mechanica...@googlegroups.com
What you're describing is a tracing system.  It's fine to have a rule for "trace everything slower than X" unless your traffic pattern ever changes: if suddenly X is the p50 instead of the p999, then you're now tracing 500x what you expected to before.


Richard Cole

unread,
May 15, 2016, 1:29:42 PM5/15/16
to mechanica...@googlegroups.com
Ah, not quite.  You would only ever record 2-3 values, as it would be the top percentiles, so it's not linked to a threshold per se.


Gil Tene

unread,
May 18, 2016, 11:28:58 AM5/18/16
to mechanical-sympathy
The core problem with "only recording the top percentiles" is that you have no idea whether or not something is in the top percentiles at the time you record it. There are probably several ways to address that problem.

Something like this can certainly be useful. Here are a couple of possible approaches:

A. As Moses Nakamura suggests in his post, you can keep a recording of all recent events, and process it every interval to keep details on events above a certain %'ile (for a %'ile within the interval), or on the top N events. This can be done with a simple double-pass mechanism: first process everything to establish percentiles, then use those percentile levels to filter the list in a second pass. This mechanism can trace and report on all events above a certain percentile without "sampling".
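A minimal sketch of that double-pass mechanism (class and method names are made up for illustration): pass one sorts the interval's values to establish the percentile cutoff, pass two filters the raw events against it.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative double-pass filter over one interval's worth of events:
// pass 1 establishes the percentile cutoff, pass 2 keeps every event
// above it - no sampling involved.
class DoublePassFilter {
    static final class Event {
        final long value;
        final String context;

        Event(long value, String context) {
            this.value = value;
            this.context = context;
        }
    }

    static List<Event> eventsAbovePercentile(List<Event> interval, double pct) {
        if (interval.isEmpty()) {
            return new ArrayList<>();
        }
        // Pass 1: sort a copy of the values and pick the cutoff at pct.
        long[] values = interval.stream().mapToLong(e -> e.value).sorted().toArray();
        int idx = Math.max(0, (int) Math.ceil(pct * values.length) - 1);
        long cutoff = values[idx];
        // Pass 2: keep the full details of every event above the cutoff.
        List<Event> out = new ArrayList<>();
        for (Event e : interval) {
            if (e.value > cutoff) {
                out.add(e);
            }
        }
        return out;
    }
}
```

Because the cutoff is recomputed from each interval's own data, this avoids the fixed-threshold problem raised earlier: a traffic shift moves the cutoff with it.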

B. If a sampling of one context per latency level is sufficient (which it may often be), we could create a variant of HdrHistogram that stores a single context ID per value level (which only doubles the storage requirements but keeps them fixed and capped). This would allow you to extract a context per value with something like a getSampledContextAtValue() call. [Note that in this mode only one context is retained for a specific value (within the accuracy level spec'ed for the histogram), but as values often vary and the resolution can be fairly high, you may find multiple samples spread across a wide range of high values.]

C. I can see a use for a variant of (B) that keeps two (or more?) context info items per value. Specifically, a timestamp and a context ID (i.e. useful for "tell me what and when for the top N events"). While this can be done by muxing the information into a single 64-bit context ID (e.g. time and ID in one word), having a total of 128 bits is probably nicer to work with. [This is a good example of where value types would be really nice to have.]

If people think that value-based sampling of event IDs is a useful enough thing, I'd be happy to look at adding it to HdrHistogram, e.g. as a ValueContextHistogram class that would support recordValue(long value, long context) and getSampledContextAtValue(long value) methods. I can see possible context retention policies of "keep last", "keep first", or "keep random", as long as we keep one value that is not valid as a context (e.g. Long.MIN_VALUE).
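The proposed ValueContextHistogram could look something like the sketch below. This is a hypothetical, simplified version: real HdrHistogram uses logarithmic bucketing with configurable precision, whereas this uses linear buckets to keep the sketch short. It shows the doubled-but-fixed storage, the "keep last" retention policy, and Long.MIN_VALUE as the no-context sentinel.

```java
import java.util.Arrays;

// Hypothetical sketch of the proposed ValueContextHistogram: a parallel
// context array alongside the count buckets, "keep last" retention, and
// Long.MIN_VALUE as the "no context recorded" sentinel. Linear buckets
// stand in for HdrHistogram's logarithmic bucketing.
class ValueContextHistogram {
    static final long NO_CONTEXT = Long.MIN_VALUE;

    private final long bucketWidth;
    private final long[] counts;
    private final long[] contexts;

    ValueContextHistogram(long maxValue, int numBuckets) {
        this.bucketWidth = Math.max(1, maxValue / numBuckets);
        this.counts = new long[numBuckets + 1];
        this.contexts = new long[numBuckets + 1];
        Arrays.fill(contexts, NO_CONTEXT);
    }

    private int bucketFor(long value) {
        return (int) Math.min(counts.length - 1, value / bucketWidth);
    }

    void recordValue(long value, long context) {
        int b = bucketFor(value);
        counts[b]++;
        contexts[b] = context; // "keep last" retention policy
    }

    long getSampledContextAtValue(long value) {
        return contexts[bucketFor(value)];
    }
}
```

Storage stays capped regardless of how many values are recorded, which is the point of approach (B): one context slot per value level, not per event.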

