We have multiple ways of computing quantiles now. Let me explain the different ones:
1) Client-side quantiles via summaries.
If you measure many latencies of the same kind (like request latencies on an HTTP server instance) and the observations are more frequent than the scrape interval, you can use summaries. The upside: no quantile computation is needed in the Prometheus server, and no buckets need to be chosen manually as with histograms. The downside: you can't aggregate precomputed quantiles across dimensions, so this is only useful if you care about individual instances and no sub-dimensions, as opposed to e.g. the latency of an entire service.
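To make that concrete, here is what querying a client-side summary quantile could look like (metric and label names are illustrative, assuming a summary configured with a 0.9 quantile objective):

```
# 90th-percentile request latency, precomputed by the client library.
# Note: averaging or summing this across instances would be statistically wrong.
http_request_duration_seconds{quantile="0.9"}
```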
2) histogram_quantile() based on histograms
With histograms, the client only counts observations falling into a set of predefined buckets, and the Prometheus server estimates the x-th quantile from those bucket counts at query time. Unlike summary quantiles, bucket counts can be aggregated across instances and other dimensions before computing the quantile; the tradeoff is that you have to choose suitable buckets up front, and the accuracy of the estimate depends on that choice.
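A typical query could look like this (metric name is illustrative; the "le" label is the bucket boundary label that histograms expose):

```
# Estimated 90th-percentile request latency over the last 5m,
# aggregated across all instances before computing the quantile:
histogram_quantile(0.9, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```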
3) quantile_over_time()
This returns the x-th quantile *over time* for each input series. It answers questions like "what was the 90th-percentile run time of my batch job over the last 7d?", where the run time is stored in a single gauge rather than a summary or histogram (because the job runs much less often than the scrape interval, so there is no need to cram multiple observations into one scrape interval via some client-side aggregation).
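For that example question, the query could look like this (the gauge name is illustrative):

```
# 90th-percentile batch job run time over the last 7 days,
# with each run's duration stored in a plain gauge:
quantile_over_time(0.9, my_batch_job_runtime_seconds[7d])
```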
4) quantile()
This returns the x-th quantile at *one point in time* across *multiple* series. Say you have 100 nodes and want to know the 90th-percentile CPU usage across all of them; then you could use "quantile by(job) (0.9, rate(my_cpu_usage_seconds[5m]))".
Does that make sense?