synthetic histograms in Prometheus


Johny

unread,
Aug 7, 2022, 3:23:29 AM
to Prometheus Users
We are migrating our telemetry backend from a legacy database to Prometheus and need to estimate percentiles on gauge metrics published by user applications. Estimating percentiles on a gauge metric in Prometheus is not directly feasible, and for a number of reasons the client applications will be difficult to modify to start publishing histograms.

I am exploring the feasibility of creating a histogram in a Prometheus recording rule based on the metrics published by users. The partial work put in so far seems inefficient and illegible. Is there a recommended approach to this problem? As stated earlier, it will be extremely hard to solve on the client side, so I am looking for a solution within Prometheus.

The current metric is a gauge whose values represent request latency.
http_duration_milliseconds_gauge{instance="instance1:port1"}[1h]
1659752188  100
1659752068  120
..
1659751708  150
1659751588  160

Desired histogram after conversion -
http_duration_milliseconds_hist_bucket{instance="instance1:port1", le="100"}  133
http_duration_milliseconds_hist_bucket{instance="instance1:port1", le="120"}  222
http_duration_milliseconds_hist_bucket{instance="instance1:port1", le="140"}  311
http_duration_milliseconds_hist_bucket{instance="instance1:port1", le="160"}  330
http_duration_milliseconds_hist_bucket{instance="instance1:port1", le="180"}  339
http_duration_milliseconds_hist_bucket{instance="instance1:port1", le="200"}  340
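The shape I have tried so far is one recording rule per bucket, counting the gauge samples at or below each bound with a subquery (the bucket bounds, group name, and 1m step below are illustrative, not our real config):

groups:
  - name: synthetic_histogram
    rules:
      - record: http_duration_milliseconds_hist_bucket
        labels:
          le: "100"
        expr: sum_over_time((http_duration_milliseconds_gauge <= bool 100)[1h:1m])
      - record: http_duration_milliseconds_hist_bucket
        labels:
          le: "200"
        expr: sum_over_time((http_duration_milliseconds_gauge <= bool 200)[1h:1m])

Needing a separate rule and subquery per bucket is part of why this feels inefficient and illegible.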

Stuart Clark

unread,
Aug 7, 2022, 6:11:46 AM
to Johny, Prometheus Users
On 07/08/2022 08:23, Johny wrote:
> We are migrating our telemetry backend from a legacy database to Prometheus and need to estimate percentiles on gauge metrics published by user applications. Estimating percentiles on a gauge metric in Prometheus is not directly feasible, and for a number of reasons the client applications will be difficult to modify to start publishing histograms.
>
> I am exploring the feasibility of creating a histogram in a Prometheus recording rule based on the metrics published by users. The partial work put in so far seems inefficient and illegible. Is there a recommended approach to this problem? As stated earlier, it will be extremely hard to solve on the client side, so I am looking for a solution within Prometheus.
>
> The current metric is a gauge whose values represent request latency.
> http_duration_milliseconds_gauge{instance="instance1:port1"}[1h]
> 1659752188  100
> 1659752068  120
> ..
> 1659751708  150
> 1659751588  160

I'm not really sure what you mean by this metric.

A histogram of request latencies needs access to all the events that occur, with the latency value of every single event. It can then increment the counter for a particular set of range buckets to map the distribution over time. I don't really understand what the single gauge represents. Is it the latency of the most recent event? Some average over the last hour?

Without access to the underlying events I can't see how this is possible. That access only exists in the application itself, or in a tool that connects to your event store if you record the events elsewhere (e.g. in log files).

-- 
Stuart Clark

Ben Kochie

unread,
Aug 7, 2022, 7:49:05 AM
to Johny, Prometheus Users
So, let's take a step back and find out some more information, because this question is sounding a lot like an XY Problem.

How are the current applications generating their metrics right now?
How are you getting the data to create these histograms?

--

Johny

unread,
Aug 7, 2022, 1:14:59 PM
to Prometheus Users
The gauge contains the most recent value of a metric, sampled every minute or so and exported by a user application - e.g. some latency sampled at 1-minute intervals by a client application. Let's presume this time series (scraped by Prometheus or sent via remote write) is authoritative, containing all the information we need to calculate derived statistics. In the rawest form, you can fetch the data points, sort them, and calculate a percentile. Incidentally, the legacy backend has efficient mechanisms for calculating percentiles by scanning and reducing data using map-reduce.

Stuart Clark

unread,
Aug 7, 2022, 2:18:42 PM
to Johny, Prometheus Users
On 07/08/2022 18:14, Johny wrote:
> Gauge contains most recent values of a metric, sampled every 1 min or
> so, and exported by a user application, e.g. some latency sampled at 1
> minute intervals by a client application. Lets presume this time
> series (scraped by Prometheus or sent via remote write) is absolute
> containing all the information we need for calculating derived
> statistics. In the most raw form, you can fetch the data points, sort
> them and calculate percentile. Incidentally, legacy backend has
> efficient mechanisms to calculate percentiles by scanning and reducing
> data using map-reduce.

I'm presuming there is more than one request/event every minute or so?

If that is the case it would mean that you can't make a histogram that
shows what you actually want to know. While in theory you could look at
the 60 samples per hour and plot those on a histogram it would be pretty
meaningless. If we assumed 1 request per second, sampling the latest
latency value every minute would mean that 59/60 events are being
discarded - so you have no idea what is actually happening from looking
at that single sampled latency. Your samples could all be returning
"low" values, which makes you believe that everything is working fine,
but in actual fact the other 59 events per minute are "high" and you
would never know.

This is the reason why histograms exist, and why more generally counters
are more useful than gauges. A gauge can only tell you about "now" which
may or may not be representative of what has actually been happening
since the last scrape. A counter however will tell you the absolute
change since the last scrape (e.g. the total number of requests since
the previous scrape, or the sum of the latencies of all events since the
scrape) meaning you never lose information (a counter that represents
total latency won't let you know if there was one spike or everything
was slow, but it will give you an average since the last scrape instead
of losing data).
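For example, with a typical sum/count counter pair (metric names invented here for illustration), the average latency between scrapes is just a ratio of two rates:

rate(http_request_duration_milliseconds_sum[5m])
  / rate(http_request_duration_milliseconds_count[5m])

Every event contributes to both counters, so nothing is discarded between scrapes the way it is with a sampled gauge.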

It would be worth understanding why you aren't able to produce a
histogram in the application (or externally, by processing an event
feed such as logs). By design a simple histogram is pretty low impact,
being just a set of counters, one per bucket.

--
Stuart Clark

Johny

unread,
Aug 7, 2022, 2:32:34 PM
to Prometheus Users
Thanks. While I understand the limitations of a gauge, the objective here is to backport existing reports to the new backend, then integrate and optimize later. There is a period during which we need to maintain backward compatibility due to the high barrier to change in the clients. The time windows used to calculate percentiles are biweekly or monthly, so taking the last/average value within a 1-minute (or, in some cases, a few-second) window is not too far-fetched, and is accepted by users. In light of this, is there a reasonable approach to recreating histograms/summaries from existing metrics within Prometheus?
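One alternative I have been weighing is skipping histograms entirely and computing the percentile directly over the raw gauge samples with quantile_over_time (the 0.99 quantile and 2w window below are placeholders for our actual report settings):

quantile_over_time(0.99, http_duration_milliseconds_gauge{instance="instance1:port1"}[2w])

If that is sound over such long windows, would it be preferable to a synthetic histogram?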

Ben Kochie

unread,
Aug 7, 2022, 3:47:03 PM
to Johny, Prometheus Users
Right, but more basically: how do you get this information from the application right now? Are you reading logs? Does it emit statsd data?

You're saying what, but not how.

Ben Kochie

unread,
Aug 7, 2022, 3:48:44 PM
to Johny, Prometheus Users
To put it another way. If you can read every event raw from a log line, like every request has a "took X milliseconds", there are better ways to reconstruct metrics for your use case.

Johny

unread,
Aug 7, 2022, 4:11:03 PM
to Prometheus Users
The application currently publishes metrics to a remote-write endpoint on a Prometheus shard. In future we plan to migrate to the pull model as much as possible, after building service discovery for native deployments, but for backward compatibility we are adopting this approach for now.
