[I'll give this a try, even though it is likely going to be marked as spam and left as such.]
On Saturday, March 31, 2018 at 8:27:00 AM UTC+2, Brian Brazil wrote:
On 26 March 2018 at 14:38, Alin Sînpălean
<alin.si...@gmail.com> wrote:
You can do one of two things. I am also of the opinion that rate()/increase() should not extrapolate, but it doesn't look like that will change anytime soon, so both of these are workarounds to current Prometheus limitations.
- Use foo - foo offset 1m instead of increase(foo[1m]). It will not account for counter resets (you could handle those by evaluating every collection interval and correcting for any drops, if you actually care about that) and it takes twice as much CPU (two series lookups instead of one), but it will give you an accurate increase, with no extrapolation.
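For reference, the reset accounting mentioned above can be sketched in a few lines of Python. This is illustrative only, not Prometheus code: like Prometheus, it treats any drop in a counter's value as a reset, so the post-reset value is the increase since the reset.

```python
def increase_no_extrapolation(values):
    """Exact increase over a list of successive counter samples,
    with Prometheus-style reset correction: a drop in value is
    treated as a counter reset (the counter restarted from 0)."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        if cur >= prev:
            total += cur - prev   # normal monotonic step
        else:
            total += cur          # reset: count the post-reset value
    return total

# Counter goes 0 -> 9, resets, then climbs to 6: true increase is 9 + 6 = 15.
print(increase_no_extrapolation([0, 5, 9, 2, 6]))  # 15.0
```

A plain foo - foo offset 1m over the same data would report 6 - 0 = 6, which is what the reset correction is for.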
This is incorrect: it will not be accurate, because metrics can't be accurate. It's just a different approximation, not a 100% accurate one.
Yes, it is not going to be perfectly accurate but, as the OP states -- "able to prove this by [preventing] the extrapolation code from running" -- it will do the job for them. "Accurate" was the term used by the OP, BTW, to describe the results they got without increase() extrapolation.
- If you want to take advantage of Prometheus' counter reset handling, use increase(foo[70s]) * 60 / 70 wherever you would normally use increase(foo[60s]) (assuming a collection interval of 10s). It basically computes the increase over 6 successive collections (7 successive points), then undoes the extrapolation. Ugly and requires you to take into account both collection and evaluation intervals (and hope they never change), but it works.
This is not resilient to jitter, and is not a good approach. Generally this will overestimate by 16% as you're multiplying by 1.16.
No, Prometheus in general is not resilient to jitter. The exception is /range_query, which actually is, under the right conditions -- i.e. no rate()/increase() extrapolation.
Prometheus could be resilient to (eval) jitter if it wanted to, e.g. by delaying evaluation until all scrapes in progress were complete and then running the evaluation the way /range_query does, at exactly spaced intervals. But no one is asking for that here, AFAICT.
As I said, if the OP wants an accurate result they need logs.
Umm, no. As the OP said, they want to prevent extrapolation to get "accurate enough" results for their needs. They never said they need perfect results.
The only material difference between logs and metrics is that logs have (in theory) infinite resolution, whereas metrics (in the Prometheus world) have some fixed time resolution, decided ahead of time, plus scrape jitter. But as long as you don't fail a large number of successive scrapes (which is in many respects similar to a logs collector losing lots of log records on the way), you can still compute an increase over some interval. It may not be the exact interval you want (either because of scrape resolution or because of missed scrapes), but it is an exact increase over some interval. (In the logs case, if some log records go missing, you can't even get that.)
In particular, if you do foo - foo offset 5m exactly every 5 minutes (the way /range_query does) and you have at least one successful scrape every 5 minutes, you will get a perfectly accurate increase, which you can then aggregate over time to get an accurate increase over e.g. 24 hours. It won't handle task restarts perfectly, but neither will logs.
If someone thinks that extrapolation is a problem then metrics cannot meet their use case, as scrapes won't be perfectly aligned with the data window of interest.
I am someone who thinks extrapolation is a problem while being sure metrics can meet my use case, because it has nothing to do with perfectly aligned scrapes. I wouldn't mind if (because of scrape interval jitter) I ended up with a timeseries [0, 1, 2, 2, 4, 5] (instead of an ideal [0, 1, 2, 3, 4, 5]) and a total increase of 5. I do mind that from this imperfect timeseries Prometheus guesstimates an increase of 6, though. Or, to be more precise, some value between 5.0 and 6.0 (extrapolation to the right but not to the left, due to the 0), depending solely on when the kernel scheduler decides to schedule the evaluation.
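The "between 5.0 and 6.0" behaviour can be reproduced with a simplified model of Prometheus's extrapolatedRate (a sketch of the logic in promql/functions.go, restricted to a monotonic counter with no resets; not the actual implementation):

```python
def prom_increase(samples, range_start, range_end):
    """Simplified model of Prometheus's extrapolatedRate for increase():
    scales the raw delta by the extrapolated interval, capping left
    extrapolation where the counter would have gone below zero."""
    (first_t, first_v), (last_t, last_v) = samples[0], samples[-1]
    result = last_v - first_v
    sampled = last_t - first_t
    avg = sampled / (len(samples) - 1)      # avg spacing between samples
    dur_start = first_t - range_start       # gap before first sample
    dur_end = range_end - last_t            # gap after last sample
    # A counter can't be negative: don't extrapolate left past the point
    # where it would have been (here first_v == 0 kills left extrapolation).
    if result > 0 and first_v >= 0:
        dur_start = min(dur_start, sampled * first_v / result)
    threshold = avg * 1.1
    interval = sampled
    interval += dur_start if dur_start < threshold else avg / 2
    interval += dur_end if dur_end < threshold else avg / 2
    return result * interval / sampled

# The [0, 1, 2, 2, 4, 5] series above, scraped every 10s:
samples = list(zip(range(0, 60, 10), [0, 1, 2, 2, 4, 5]))
print(prom_increase(samples, 0, 60))    # 60s window ending after the data: 6.0
print(prom_increase(samples, -10, 50))  # window ending at the last sample: 5.0
print(prom_increase(samples, -5, 55))   # evaluation lands in between: 5.5
```

Same data, same query, and the result slides between 5.0 and 6.0 purely with the window's alignment relative to the scrapes, which is the objection.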
Cheers,
Alin.
On Monday, March 5, 2018 at 6:14:37 PM UTC+1, emai...@gmail.com wrote:
We have a requirement to calculate accurate availability figures for our applications. We have found that the metrics we need to make the calculations are already contained in the Prometheus databases that our components use. However, we are only able to get the results we need if we use the 'increase' function without the extrapolation. We were able to prove this by manipulating the data to make sure the time range boundary was far enough away from the first and last sample to prevent the extrapolation code from running.
So we are considering options to export the data from Prometheus and replicate the increase function but without the extrapolation.
This raises the question: would you accept a PR to add a new increase function that does 'rate' instead of 'extrapolatedRate'? The user would then be able to decide which one fits their needs.
--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.