Client-side "min"/"max" tracking

Paul "LeoNerd" Evans

Oct 26, 2016, 11:25:25 AM
to Prometheus Developers
I have a monitoring probe that currently manually generates a "/metrics"
page using custom code, that I'm converting to use a standard library
(via first writing said standard library for Perl - more on that in a
later email).

The main metric that this probe exports is a histogram of "send" and
"receive" roundtrips for a messaging system. As well as exposing the
count, sum, and bounded buckets, it also keeps track of a
locally-calculated "maximum over the past 1 minute", which I can then
use with the max_over_time() function in Prometheus or Grafana to plot
larger graphs of maximum roundtrip times, as well as the average.

This feels like a useful-enough feature that I'm considering adding it
to my client library. Perhaps as an option that can be enabled on a
Summary or Histogram metric, allowing it to track min/max/both of the
observation over a short period of time, and add that to the output
format, perhaps looking something like:

recv_rtt_count 3
recv_rtt_sum 3.397684
recv_rtt_bucket{le="0.01"} 0
recv_rtt_bucket{le="0.1"} 1
...
recv_rtt_max_1m 1.476539

"Best Practice" would encourage keeping that horizon as short as
possible, as it blurs out the graphs and also uses more memory in the
exporter, having to remember those values; but not so short as to risk
missing a collection. I keep mine at 1 minute because with a scrape
interval of 20s, each scrape should cover about 3 observations, which
I feel is appropriate.
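
A minimal sketch of how such a short-horizon maximum might be tracked, written in Python to stay library-neutral. The class name `WindowedMax` and its `horizon` and `clock` parameters are hypothetical and not taken from any existing client library. It keeps only the observations that could still become the maximum, which limits the memory cost mentioned above:

```python
import time
from collections import deque


class WindowedMax:
    """Track the largest observation seen within the last `horizon` seconds.

    Hypothetical helper for illustration; not part of any Prometheus client.
    """

    def __init__(self, horizon=60.0, clock=time.monotonic):
        self.horizon = horizon
        self.clock = clock
        # (timestamp, value) pairs whose values strictly decrease from left
        # to right, so the current maximum is always the leftmost entry.
        self._window = deque()

    def _expire(self, now):
        # Drop entries that have aged out of the horizon.
        while self._window and self._window[0][0] <= now - self.horizon:
            self._window.popleft()

    def observe(self, value):
        now = self.clock()
        # Older entries that are no larger can never be the maximum again.
        while self._window and self._window[-1][1] <= value:
            self._window.pop()
        self._window.append((now, value))
        self._expire(now)

    def current(self):
        """Return the windowed maximum, or None if nothing is recent enough."""
        self._expire(self.clock())
        return self._window[0][1] if self._window else None
```

With a 60-second horizon and a 20-second scrape interval, a value observed at time 0 would be reported by the scrapes at roughly 20s and 40s and dropped once it is 60s old.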

Is there any precedent in existing client libraries for doing this?
Something I can steal naming ideas out of?

Barring any other idea, I was thinking something along the lines of
two new constructor arguments, something like

aggregate => "max" # To request the aggregation at all
aggregate_horizon => "1m" # To set the duration of time that is stored

How does that sit with people?
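
As a rough illustration of how those two arguments could be validated and turned into the extra sample name, here is a sketch in Python. The function names `parse_horizon` and `aggregate_sample_name` are hypothetical, the set of accepted units is an assumption, and the output follows the `recv_rtt_max_1m` form shown earlier in this message:

```python
import re

# Assumed unit suffixes; a real library might accept more or fewer.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_horizon(text):
    """Convert a duration string such as "30s" or "1m" into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if match is None:
        raise ValueError(f"unrecognised horizon: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def aggregate_sample_name(metric, aggregate, horizon):
    """Build the name of the extra sample, e.g. recv_rtt_max_1m."""
    if aggregate not in ("min", "max"):
        raise ValueError(f"unsupported aggregate: {aggregate!r}")
    parse_horizon(horizon)  # Reject malformed horizons before using them.
    return f"{metric}_{aggregate}_{horizon}"


print(aggregate_sample_name("recv_rtt", "max", "1m"))
```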

--
Paul "LeoNerd" Evans

leo...@leonerd.org.uk | https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/

Brian Brazil

Oct 26, 2016, 11:34:22 AM
to Paul LeoNerd Evans, Prometheus Developers
The only official guidance we currently have on these is https://prometheus.io/docs/instrumenting/writing_exporters/#drop-less-useful-statistics which discourages such metrics. If a client were to offer it, it should be with associated warnings and discouragement.

On the naming side max1m would be more in line I think.

Paul "LeoNerd" Evans

Oct 26, 2016, 11:39:39 AM
to prometheus...@googlegroups.com
On Wed, 26 Oct 2016 16:34:20 +0100
Brian Brazil <brian....@robustperception.io> wrote:

> The only official guidance we currently have on these is
> https://prometheus.io/docs/instrumenting/writing_exporters/#drop-less-useful-statistics
> which discourages such metrics.

Hm. This says:

> These should all be dropped, as they’re not very useful and add
> clutter. Prometheus can calculate rates itself, and usually more
> accurately (these are usually exponentially decaying averages).

In this case it's not a rate, but a maximum. Prometheus has no way to
determine what the maximum value observed has been, based on just the
sum + count. I'm interested in knowing what the peak RTT has been,
which is something Prometheus itself can't determine without at least a
little help from the probe's instrumentation.
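
A small worked example of that point, sketched in Python. The bucket bounds and the observation values are made up for illustration. Two sets of observations produce identical count, sum, and bucket counts, yet have different maxima, so the maximum cannot be recovered from the exported histogram alone:

```python
# Assumed bucket upper bounds, ending with +Inf as a histogram would.
BOUNDS = [0.01, 0.1, 1.0, float("inf")]


def summarise(observations):
    """Return what a histogram exposes: count, sum, cumulative bucket counts."""
    buckets = [sum(1 for v in observations if v <= bound) for bound in BOUNDS]
    return len(observations), sum(observations), buckets


steady = [2.0, 2.0, 2.0]  # every roundtrip took 2 seconds
spiky = [1.5, 1.5, 3.0]   # same total, but one roundtrip took 3 seconds

# The exported data is indistinguishable...
print(summarise(steady) == summarise(spiky))
# ...while the peaks differ.
print(max(steady), max(spiky))
```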

> You don’t know what time the min/max were calculated over,

Which is why in this case it's suffixed with the duration, and the
duration is deliberately kept quite short so restarts don't overly
affect its accuracy.

> If a client were to offer it, it should be with associated warnings
> and discouragement.

That I can certainly manage.

> On the naming side max1m would be more in line I think.

Ah, without the '_'? OK.