Percentiles in summary of Java client


Abhi

Apr 1, 2015, 3:18:49 AM
to prometheus...@googlegroups.com
I played around with the Java client and went through the code too. It looks like percentile calculation is not part of Summary in the Java client. Can someone please confirm this? If it is not available, is it in the pipeline anytime soon? :)

Brian Brazil

Apr 1, 2015, 3:43:46 AM
to Abhi, prometheus-developers
On 1 April 2015 at 08:18, Abhi <abhy...@gmail.com> wrote:
I played around with the Java client and went through the code too. It looks like percentile calculation is not part of Summary in the Java client. Can someone please confirm this? If it is not available, is it in the pipeline anytime soon? :)

They aren't in the Java simpleclient. Have you looked at using Histograms and the histogram_quantile() function in Prometheus?

Brian
 


Abhirama

Apr 1, 2015, 8:50:48 AM
to Brian Brazil, prometheus-developers
According to http://prometheus.io/docs/practices/histograms/, histograms and summaries are not the same. Am I missing something?
--
Cheers,
Abhi

Brian Brazil

Apr 1, 2015, 8:53:22 AM
to Abhirama, prometheus-developers
On 1 April 2015 at 13:50, Abhirama <abhy...@gmail.com> wrote:
According to http://prometheus.io/docs/practices/histograms/, histograms and summaries are not the same. Am I missing something?

That's correct. What's your use case? There are advantages and disadvantages to each.

Brian

Julius Volz

Apr 1, 2015, 11:08:28 AM
to Brian Brazil, Abhirama, prometheus-developers
It should be pointed out that there are two Java client library versions, both living in the same repository:


The first one actually supports summaries with quantiles. You might want to try that. I've heard different opinions about each library, but the simpleclient is probably what most people will want to use in the future, unless they need summary quantiles. It would be great if the new library added support for quantiles in summaries as well. At SoundCloud, we are still using the old client library for quantiles, as we haven't fully introduced the new "histogram" metric type yet.

The fundamental tradeoff between quantiles based on histograms and summary quantiles is:

- histograms require more computation and more stored time series on the server side to generate useful quantiles, but they allow aggregation and server-side quantile calculation
- summaries pre-calculate quantiles on the client side, making the server's job easier, but it means that you cannot get a statistically valid aggregation over them (as in, the 90th percentile over *all* instances, not just one). That's what histograms enable.
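
To illustrate the aggregation point in the list above, here is a minimal Java sketch. It does not use any Prometheus client API: the class name, the bucket bounds, and the per-instance counts are all made up for illustration, and the quantile estimate uses simple linear interpolation inside a bucket, which is an assumption about how a server-side estimate might work rather than a copy of Prometheus's implementation.

```java
// Hypothetical illustration; none of these names come from a Prometheus client library.
class AggregationSketch {

    // Upper bounds of the finite buckets, in seconds. An implicit +Inf bucket follows.
    static final double[] BOUNDS = {0.1, 0.2, 0.4, 0.8};

    // Estimates quantile q from cumulative bucket counts by interpolating
    // linearly inside the bucket that contains the target rank.
    // `cumulative` has BOUNDS.length + 1 entries; the last one is the +Inf bucket.
    static double quantile(double q, long[] cumulative) {
        long total = cumulative[cumulative.length - 1];
        if (total == 0) {
            return Double.NaN;
        }
        double rank = q * total;
        for (int i = 0; i < BOUNDS.length; i++) {
            if (cumulative[i] >= rank) {
                double lower = (i == 0) ? 0.0 : BOUNDS[i - 1];
                long below = (i == 0) ? 0 : cumulative[i - 1];
                long inBucket = cumulative[i] - below;
                if (inBucket == 0) {
                    return lower;
                }
                return lower + (BOUNDS[i] - lower) * (rank - below) / inBucket;
            }
        }
        // The rank falls into the +Inf bucket; report the highest finite bound.
        return BOUNDS[BOUNDS.length - 1];
    }

    // Bucket counts from several instances can simply be added together.
    static long[] merge(long[]... instances) {
        long[] sum = new long[BOUNDS.length + 1];
        for (long[] instance : instances) {
            for (int i = 0; i < sum.length; i++) {
                sum[i] += instance[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] fast = {900, 1000, 1000, 1000, 1000}; // busy instance, mostly quick requests
        long[] slow = {0, 10, 50, 100, 100};         // quiet instance, mostly slow requests

        double averaged = (quantile(0.9, fast) + quantile(0.9, slow)) / 2;
        double merged = quantile(0.9, merge(fast, slow));

        System.out.printf("average of per-instance p90: %.3f s%n", averaged);
        System.out.printf("p90 from merged buckets:     %.3f s%n", merged);
    }
}
```

With these made-up counts, averaging the two per-instance 90th percentiles gives 0.41 s, while the estimate from the merged buckets is roughly 0.18 s, because the busy instance serves ten times as many requests. Pre-computed summary quantiles only offer the first option; bucket counts can be added before the quantile is estimated.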

For details, see:


The last link explains in great detail the tradeoffs of both metric types.

Abhirama

Apr 1, 2015, 11:33:07 PM
to Julius Volz, Brian Brazil, prometheus-developers
Brian Brazil,

I want to compute the median, 90th, 95th and 99th percentile, and max response time of my APIs.

My understanding of histograms is that if I want to use one for response times, I have to define the buckets for the expected range when I instantiate the client. For this, I first need an idea of the response time percentiles. What happens if they swing wildly, say during high load or some other circumstance? Please correct me if my understanding is off the mark.

Julius Volz,

Yes, I saw the other client too. The simpleclient looked simple (surprise, surprise :)) compared to the other one, so I started digging into it.
--
Cheers,
Abhi

Brian Brazil

Apr 2, 2015, 3:58:50 AM
to Abhirama, Julius Volz, prometheus-developers
On 2 April 2015 at 04:33, Abhirama <abhy...@gmail.com> wrote:
Brian Brazil,

I want to compute the median, 90th, 95th and 99th percentile, and max response time of my APIs.

Do you need these percentiles to check that e.g. you're below 300ms latency at the 90th percentile?
 
My understanding of histograms is that if I want to use one for response times, I have to define the buckets for the expected range when I instantiate the client. For this, I first need an idea of the response time percentiles. What happens if they swing wildly, say during high load or some other circumstance?

If latencies fall outside the buckets you've specified, you won't get a reasonable idea of what the actual percentile is. On the other hand, if that happens, what you generally want to know is that the latency is really bad, and the histogram is sufficient to tell you that.

Note that if you want 100% accurate percentiles, an online monitoring system like Prometheus isn't going to do that for you, as it'd be computationally prohibitive. What you usually want (and get) from a system like Prometheus is a general idea of what the percentile is and which direction it's moving in, and, in the case of a Histogram, in a form that you can aggregate and more easily reason about.
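
The out-of-range behaviour described above can be sketched in a few lines of Java. This is not the Java client's Histogram: the class, its methods, and the bucket bounds are hypothetical, and the assumption that an estimate landing in the overflow bucket is reported as the highest finite bound is stated in the code comments.

```java
// Hypothetical illustration; this is not the Prometheus Java client's Histogram class.
class BucketRangeSketch {
    private final double[] bounds; // finite upper bounds in seconds, ascending
    private final long[] counts;   // per-bucket counts; the last slot is the +Inf bucket

    BucketRangeSketch(double... bounds) {
        this.bounds = bounds.clone();
        this.counts = new long[bounds.length + 1];
    }

    // Counts the observation in the first bucket whose upper bound is >= value.
    void observe(double value) {
        int i = 0;
        while (i < bounds.length && value > bounds[i]) {
            i++;
        }
        counts[i]++;
    }

    // Estimates quantile q by linear interpolation inside the matching bucket.
    double quantile(double q) {
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        if (total == 0) {
            return Double.NaN;
        }
        double rank = q * total;
        long seen = 0;
        for (int i = 0; i < bounds.length; i++) {
            if (counts[i] > 0 && seen + counts[i] >= rank) {
                double lower = (i == 0) ? 0.0 : bounds[i - 1];
                return lower + (bounds[i] - lower) * (rank - seen) / counts[i];
            }
            seen += counts[i];
        }
        // Assumption: when the rank lies in the +Inf bucket, the best available
        // answer is the highest finite bound, i.e. "at least this slow".
        return bounds[bounds.length - 1];
    }

    public static void main(String[] args) {
        BucketRangeSketch histogram = new BucketRangeSketch(0.1, 0.3, 1.0);
        for (int i = 0; i < 100; i++) {
            histogram.observe(5.0); // overload: every request takes 5 s
        }
        System.out.println("estimated p90: " + histogram.quantile(0.9) + " s");
    }
}
```

In this sketch the estimate saturates at 1.0 s even though every request took 5 s. The number is wrong in absolute terms, but it still signals that latency has left the expected range, which is the point made above.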

Brian

Abhirama

Apr 2, 2015, 5:59:10 AM
to Brian Brazil, Julius Volz, prometheus-developers
We currently use Graphite to graph response times and are used to seeing percentile latencies in the graphs. Reading about the histogram/summary error in Prometheus got me wondering how accurate the percentiles we see today are. Any clue about that?

I understand what you are saying about accuracy, it is just that we are used to seeing latencies.

What is the general guideline for choosing buckets for histograms? Is it based just on your SLAs, or is there something else to it?

--
Cheers,
Abhi

Brian Brazil

Apr 2, 2015, 6:19:59 AM
to Abhirama, Julius Volz, prometheus-developers
On 2 April 2015 at 10:59, Abhirama <abhy...@gmail.com> wrote:
We currently use Graphite to graph response times and are used to seeing percentile latencies in the graphs. Reading about the histogram/summary error in Prometheus got me wondering how accurate the percentiles we see today are. Any clue about that?

Instrumentation systems tend to take a Summary approach; http://prometheus.io/docs/practices/histograms/#quantiles should cover the differences. If you want to know the exact accuracy, you'd need to dig into your instrumentation's implementation and into how Graphite is aggregating the data.

The only system other than Prometheus that I'm aware of that appears to take the Histogram approach is New Relic.

Overall, both approaches should be "good enough" in terms of accuracy for most practical online monitoring purposes.


I understand what you are saying about accuracy, it is just that we are used to seeing latencies.

What is the general guideline for choosing buckets for histograms? Is it based just on your SLAs, or is there something else to it?

Usually you'd choose 10-20 buckets that include your SLA points and cover the typical distribution of latencies.
The defaults are intended to cover latencies for typical web services in the 10ms to 10s range.
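
One way to follow this guideline is sketched below in Java. The helper is hypothetical and not part of the Java client; exponential spacing and rounding to whole milliseconds are assumptions chosen for illustration, and the 0.3 s SLA point is borrowed from the 300ms example earlier in the thread.

```java
import java.util.TreeSet;

// Hypothetical helper; not part of the Prometheus Java client.
class BucketChoiceSketch {

    // Returns `count` upper bounds (in seconds) spaced exponentially from `start`
    // to `end`, plus any SLA thresholds, sorted ascending without duplicates.
    static double[] buckets(double start, double end, int count, double... slaPoints) {
        if (count < 2 || start <= 0 || end <= start) {
            throw new IllegalArgumentException("need count >= 2 and 0 < start < end");
        }
        TreeSet<Double> bounds = new TreeSet<>();
        double factor = Math.pow(end / start, 1.0 / (count - 1));
        for (int i = 0; i < count; i++) {
            bounds.add(roundToMillis(start * Math.pow(factor, i)));
        }
        for (double sla : slaPoints) {
            bounds.add(sla); // make sure each SLA threshold is an exact bucket boundary
        }
        return bounds.stream().mapToDouble(Double::doubleValue).toArray();
    }

    // Rounds a value in seconds to whole milliseconds to keep the bounds readable.
    static double roundToMillis(double seconds) {
        return Math.round(seconds * 1000.0) / 1000.0;
    }

    public static void main(String[] args) {
        // Ten exponential buckets from 10 ms to 10 s, plus a 300 ms SLA point.
        for (double bound : buckets(0.01, 10.0, 10, 0.3)) {
            System.out.println(bound);
        }
    }
}
```

Making each SLA threshold an exact bucket boundary means the question "what fraction of requests met the SLA?" can be answered from a single bucket count, without interpolation.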

Brian