A Summary with max and min within the window?


hug...@gmail.com

Feb 25, 2016, 7:25:15 PM
to Prometheus Developers
Hi!

I'm looking for a good way to deal with a gauge with significantly higher sampling rate than my Prometheus scrape interval.

I've played with histograms, but I'm mostly interested in mean anyway, so histograms are overkill for my purposes. This gauge also doesn't change very fast - I just want to know the full range it covered between scrapes. I want to know about the worst outliers.

Consequently I'm thinking that providing a Summary with a mean, max, and min would be perfect. Setting MaxAge to the scrape interval, plus one additional bucket, would, I suspect, do the trick of always representing the full range, if scrapes are never late.

In principle I want the 0th and 100th percentile, but these are fairly easy to calculate with perfect accuracy for a given time window. (I'm thinking a queue of relevant samples: the first sample at the front of the queue that's still within MaxAge is returned when scraped, and each new sample joins the back of the queue, kicking out any samples before it that are no longer relevant.)

I suspect this would need a new type, implementing Summary, and is probably fairly easy to add? In the meantime, I suppose I could experiment to see if the 0th and 100th percentiles work almost how I want them to. Or read the code to understand the algorithm. :)

Thanks,
Hugo

Brian Brazil

Feb 25, 2016, 7:32:56 PM
to hug...@gmail.com, Prometheus Developers
On 26 February 2016 at 00:25, <hug...@gmail.com> wrote:

> Hi!
>
> I'm looking for a good way to deal with a gauge with significantly higher sampling rate than my Prometheus scrape interval.

The scrape interval is the sampling rate with Prometheus.

> I've played with histograms, but I'm mostly interested in mean anyway, so histograms are overkill for my purposes. This gauge also doesn't change very fast - I just want to know the full range it covered between scrapes. I want to know about the worst outliers.
>
> Consequently I'm thinking that providing a Summary with a mean, max, and min would be perfect. Setting MaxAge to the scrape interval, plus one additional bucket, would, I suspect, do the trick of always representing the full range, if scrapes are never late.

The problem is that instrumentation has no knowledge as to what the scrape interval is, so this isn't something that can be offered generally.
> In principle I want the 0th and 100th percentile, but these are fairly easy to calculate with perfect accuracy for a given time window. (I'm thinking a queue of relevant samples: the first sample at the front of the queue that's still within MaxAge is returned when scraped, and each new sample joins the back of the queue, kicking out any samples before it that are no longer relevant.)

Our experience with similar structures for quantiles is that they're relatively inefficient, as they require mutexes rather than more efficient ways of handling concurrency. Histograms and quantile-less Summaries do not have this problem.

If you're looking for this sort of information, it's probably best to get it via your logs rather than trying to hack it into the Prometheus instrumentation model.

Brian
> I suspect this would need a new type, implementing Summary, and is probably fairly easy to add? In the meantime, I suppose I could experiment to see if the 0th and 100th percentiles work almost how I want them to. Or read the code to understand the algorithm. :)
>
> Thanks,
> Hugo

Björn Rabenstein

Feb 26, 2016, 12:30:18 PM
to Brian Brazil, hug...@gmail.com, Prometheus Developers
Hi Hugo,

As Brian said, the scrape interval is by design unknown to the instrumented process, and more importantly, so is the phase of the scrape. With just one additional bucket, you would need to align your bucket switch to a scrape, but often you have multiple scrapers (like a second Prometheus server for HA, or a test server).

If you are using the Go client, using the summary with ε=1.0 and ε=0.0
will pretty much do what you want. It's relatively expensive, but the
quantile estimation algorithm is quite efficient, so manually
implementing a max/min estimator for that special use case wouldn't
help a lot. Most time would still be burned in mutex acquisition and
such.
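One reading of the ε values above is a summary whose target quantiles are 0.0 and 1.0 with zero allowed error. Assuming client_golang, such a summary might be declared as follows; the metric name and MaxAge are illustrative, not from the thread:

```go
// Sketch only; requires github.com/prometheus/client_golang/prometheus.
readings := prometheus.NewSummary(prometheus.SummaryOpts{
	Name:   "my_gauge_readings",
	Help:   "High-frequency readings of the gauge.",
	MaxAge: 2 * time.Minute, // comfortably longer than the scrape interval
	// Quantile 0.0 is the minimum, 1.0 the maximum; the extremes
	// can be tracked with zero allowed error.
	Objectives: map[float64]float64{0.0: 0, 1.0: 0},
})
```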

If you don't have crazily high observation frequencies (say, less than a thousand observations per second), the summary is actually fine.

--
Björn Rabenstein, Engineer
http://soundcloud.com/brabenstein

SoundCloud Ltd. | Rheinsberger Str. 76/77, 10115 Berlin, Germany
Managing Director: Alexander Ljung | Incorporated in England & Wales
with Company No. 6343600 | Local Branch Office | AG Charlottenburg |
HRB 110657B

hug...@gmail.com

Feb 28, 2016, 9:18:11 AM
to Prometheus Developers, brian....@robustperception.io, hug...@gmail.com
Thanks for the responses. Pardon, my vocabulary was indeed not entirely accurate: when I talked about the "sampling rate" of my gauge, I meant the Histogram or Summary Observe() rate. And I could use a convention (perhaps exporting the window size alongside the metric) to alert if my Prometheus scrape interval gets too long relative to the window size.

But never mind - for the time being I've decided to just increase my Prometheus scrape rate to sample the gauge more often! :) (Even a 5-second scrape interval should be good for me, whereas I was previously considering a minute.)

Another alternative I considered is turning the gauge into a counter by summing up all observations and exporting how many observations were made (to provide the denominator), moving the mean calculation to Prometheus, but this probably adds more complexity than necessary. A high scrape frequency, and then calculating whatever I need on the Prometheus side, is the route I'll take for now.

Thanks for the tips!
Hugo

Björn Rabenstein

Feb 29, 2016, 8:15:40 AM
to hug...@gmail.com, Prometheus Developers, Brian Brazil
On 28 February 2016 at 15:18, <hug...@gmail.com> wrote:
> Another alternative I considered is turning a gauge into a counter by summing up all observations, and exporting how many observations were made (to provide the denominator), moving the mean calculation to prometheus

Note that this is exactly what both histogram and summary give you "for free" (on top of the buckets of a histogram and the precalculated quantiles of a summary). If you want that, just create a quantile-less summary and use `Observe()` as usual.
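With the Go client, the quantile-less summary would be declared roughly like this (a sketch; requires github.com/prometheus/client_golang/prometheus):

```go
foo := prometheus.NewSummary(prometheus.SummaryOpts{
	Name:       "foo",
	Help:       "Observations of the value.",
	Objectives: map[float64]float64{}, // no quantiles: only foo_sum and foo_count
})
prometheus.MustRegister(foo)

// In the hot path:
foo.Observe(42.0)
```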

In more detail:

If you have a summary or histogram called `foo`, you'll see the
following time series on the server:

- `foo_count` : Counts the number of `Observe()` calls.
- `foo_sum` : The sum of all observed values.

On the prometheus server, you'd calculate the average observation
value during the last 10m as `rate(foo_sum[10m]) /
rate(foo_count[10m])`.

This all is very "Promethean" and standard procedure.