delta/increase on a counter return the wrong value


Jérôme Loyet

Jan 18, 2024, 12:56:25 PM
to Prometheus Users
Hello,

I have a counter and I want to count the number of occurrences over a duration (let's say 15m). I'm using delta() or increase() but I'm not getting the result I'm expecting.

value @t0: 30242494
value @t0+15m: 30609457
calculated diff: 366963
round(max_over_time(metric[15m])) - round(min_over_time(metric[15m])): 366963
round(delta(metric[15m])): 373183
round(increase(metric[15m])): 373183

increase() and delta() both return the same value, but it appears to be wrong (+6220), while max_over_time - min_over_time returns the expected value.

I do not understand this behaviour. I must have missed something.

Any help is appreciated, thx a lot.

++ Jerome

Alexander Wilke

Jan 18, 2024, 1:14:18 PM
to Jérôme Loyet, Prometheus Users
Maybe you are looking for

count_over_time



Jérôme Loyet

Jan 18, 2024, 1:27:07 PM
to Alexander Wilke, Prometheus Users
Hello,

my previous message was not clear, sorry for that. I don't want to count the number of samples (count_over_time); I want to calculate the difference (delta) or the increase (increase) of the metric value over the range (15 minutes).

As the metric is a counter that only grows (it counts the number of requests the service has handled), it should be the last value in the sample range minus the first one.

here is a screenshot of the corresponding metric:
[screenshot of the metric, not reproduced here]

There must be some black magic around increase/delta that I do not understand and that is giving me unexpected results.

Alexander Wilke

Jan 18, 2024, 1:30:54 PM
to Jérôme Loyet, Prometheus Users
You may use

rate(metric{}[15m])
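
(For context: increase() is documented as rate() multiplied by the number of seconds in the range, so a rate()-based variant such as

    rate(metric[15m]) * (15 * 60)

will show the same extrapolation behaviour that is discussed further down in this thread.)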

Chris Siebenmann

Jan 18, 2024, 4:26:42 PM
to Jérôme Loyet, Prometheus Users, Chris Siebenmann
I suspect that you may be running into delta() and increase() time range
extrapolation. To selectively quote from the delta() documentation
(there's similar wording for increase()):

    The delta is extrapolated to cover the full time range as
    specified in the range vector selector, so that it is possible
    to get a non-integer result even if the sample values are all
    integers.

As far as I know, what matters here is the times when the first and last
time series points in the range were recorded by Prometheus. If the
first time series point was actually scraped 35 seconds after the start
of the range and the last time series point was scraped 20 seconds
before its end, Prometheus will extrapolate each end out to cover those
missing 55 seconds. As far as I know there's currently no way of
disabling this extrapolation; you just have to hope that its effects are
small.
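
As a rough sketch of the arithmetic involved (this ignores the half-sample-interval clamping that Prometheus applies near the window edges, so it is an approximation rather than the exact implementation):

    extrapolated_delta ≈ (v_last - v_first) * range_seconds / (t_last - t_first)

where v_first/v_last and t_first/t_last are the values and timestamps of the first and last samples that actually fall inside the range.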

Unfortunately these true first and last values and timestamps are very
hard to observe. If you ask for the value at t0, the start of the range,
as a single value (for example issuing an instant query for 'metric
@<time>'), Prometheus will actually look back before the start of the
range for the most recently scraped value. The timestamp of the most
recently observed value is 'timestamp(metric)', and you can make that
'the most recently observed metric at some time' with 'timestamp(metric
@<time>)' (and then use 'date -d @<number>' to convert that to a
human-readable time string; 'date -d "2024-01-18 13:00" +%s' will go
the other way). If you know your scrape interval, it's possible to
deduce the likely timestamp of the first time series point within a
range from getting the timestamp of the most recent point at the start
of the range (it's likely to be that time plus your scrape interval,
more or less).
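
For example, via promtool (as the queries later in this thread are done), with the same $url and metric placeholders; the exact timestamp used here is only illustrative:

    # epoch timestamp of the most recent sample at or before 14:00:00 UTC
    promtool query instant --time "$(date -d'2024-01-18 14:00:00 UTC' +%s)" $url 'timestamp(metric)'

    # convert an epoch timestamp back to a human-readable date
    date -d @1705586400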

(The brute force way to find this information is to issue an instant
query for 'metric[15m]', which in the Prometheus web interface will
return a list of measurements and timestamps; you can then look at the
first and last timestamps.)

- cks

Brian Candler

Jan 18, 2024, 6:06:33 PM
to Prometheus Users
If you are not worried too much about what happens if the counter resets during that period, then you can use:

    (metric - metric offset 15m) >= 0
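
A possible variant (not part of the original suggestion): if you'd rather keep the series pinned at 0 across a counter reset instead of having the >= 0 filter drop it entirely, PromQL's clamp_min() does that:

    clamp_min(metric - metric offset 15m, 0)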

Jérôme Loyet

Jan 19, 2024, 4:00:54 AM
to Prometheus Users
I understand that there can be some extrapolation at the boundaries, but the value is not changing around the boundaries; it only changes in the middle of the time range. Scraping is done every 15s and the value of the metric is constant for more than 1 minute before and after the boundaries. I deliberately chose a range with constant values around the boundaries to avoid any boundary extrapolation effects.

I'm using promtool to check values, here are my results:
promtool query instant --time "$(date -d'2024-01-18 14:00:00 UTC' +%s)" $url 'metric'
metric 9732212 @[1705586400]

promtool query instant --time "$(date -d'2024-01-18 14:15:00 UTC' +%s)" $url 'metric'
metric 9848219 @[1705587300]

promtool query range --start "$(date -d'2024-01-18 14:00:00 UTC' +%s)" --end "$(date -d'2024-01-18 14:15:00 UTC' +%s)" $url 'metric'
metric 9732212 @[1705586400]
...
metric 9848219 @[1705587300]
--> it returns 302 samples

promtool query instant --time "$(date -d'2024-01-18 14:15:00 UTC' +%s)" $url 'metric[15m]'
9732212 @[1705586407.092]
9732212 @[1705586422.092]
9732212 @[1705586437.092]
9732212 @[1705586452.092]
9732212 @[1705586467.092]
9732212 @[1705586482.092]
9732212 @[1705586497.092]
...
9848219 @[1705587142.092]
9848219 @[1705587157.092]
9848219 @[1705587172.092]
9848219 @[1705587187.092]
9848219 @[1705587202.092]
9848219 @[1705587217.092]
9848219 @[1705587232.092]
9848219 @[1705587247.092]
9848219 @[1705587262.092]
9848219 @[1705587277.092]
9848219 @[1705587292.092]

--> it returns 61 samples
The timestamps are a bit different but the values are right. However, it returns far fewer samples than the range query; I would expect the same number of samples from the two queries.

now let's compute the delta:
promtool query instant --time "$(date -d'2024-01-18 14:15:00 UTC' +%s)" $url 'metric - metric offset 15m'
{} => 116007 @[1705587300]
--> this matches 9848219 - 9732212

promtool query instant --time "$(date -d'2024-01-18 14:15:00 UTC' +%s)" $url 'delta(metric [15m])'
{} => 117973.22033898304 @[1705587300]
--> this does not match

promtool query instant --time "$(date -d'2024-01-18 14:15:00 UTC' +%s)" $url 'increase(metric [15m])'
{} => 117973.22033898304 @[1705587300]
--> this does not match, but it's exactly the same as the delta() result, which is expected



If I do the same exercise with a metric with lower values, the increase/delta results are closer to what is expected:
promtool query instant --time "$(date -d'2024-01-18 14:45:00 UTC' +%s)" $url 'metric2 - metric2 offset 15m'
{} =>
1 @[1705589100]

promtool query instant --time "$(date -d'2024-01-18 14:45:00 UTC' +%s)" $url 'delta(metric2 [15m])'
{} => 1.0169491525423728 @[1705589100]


promtool query instant --time "$(date -d'2024-01-18 14:45:00 UTC' +%s)" $url 'increase(metric2 [15m])'
{} => 1.0169491525423728 @[1705589100]

It looks like the higher the value of the counter, the more the increase/delta result differs from the true value. I will try other metrics with different kinds of values, but I have a feeling that something is not working as expected.

And the advice from Brian works fine (as stated above), but it seems hard to implement in Grafana.

Chris Siebenmann

Jan 19, 2024, 11:53:26 AM
to Jérôme Loyet, Prometheus Users, Chris Siebenmann
> I understand that there can be some extrapolation at the boundaries, but
> the value is not changing around the boundaries; it only changes in the
> middle of the time range. Scraping is done every 15s and the value of the
> metric is constant for more than 1 minute before and after the boundaries.
> I deliberately chose a range with constant values around the boundaries
> to avoid any boundary extrapolation effects.
>
> I'm using promtool to check values, here are my results:
[...]
This is a (natural) misunderstanding of what range queries do. A range
query evaluates your query term at every step through the range from
start to end, and returns a list of those results, each one with the
timestamp it was evaluated at. If you don't provide a step, Prometheus
works out a default one based on the time range, and experimentally the
default step for a 15 minute time range is 3 seconds (you can see this
in the Prometheus web interface), which roughly matches the number of
results you got from your range query. A range query explicitly does
not restrict itself to the number of time series points that are
actually in the time range; it will freely re-use the same points across
multiple instant queries within the range.

(This is commonly visible if you use rate() with a time range larger
than the step size. If you query for 'rate(metric[2m])' for a range
query with a step of 3s, you will get a lot of duplicate results.)
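
(If you want the range query's results to line up with the actual scrapes, promtool accepts an explicit step; something along these lines, with the --start/--end arguments from the earlier range query elided here:

    promtool query range --step 15s --start ... --end ... $url 'metric'

should return roughly one result per 15-second scrape instead of one per default 3-second step.)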

The instant query for a time range gives a range vector as the result,
which contains the true set of time series points with their true
timestamps. Since you're querying a fifteen minute time range for a
metric scraped every 15 seconds, 61 samples is about right for what
you'd expect as the time series points scraped over that amount of time.
We can also see here that the first point was collected at 14:00:07 UTC
and the last one at 14:14:52. This means that a delta() of this same
range will extrapolate out to cover an additional 15 seconds (covering 7
missing seconds from the start and 8 missing seconds from the end).

> now let's compute the delta:
> promtool query instant --time "$(date -d'2024-01-18 14:15:00 UTC' +%s)"
> $url 'metric - metric offset 15m'
> {} => 116007 @[1705587300]
> --> this matches 9848219 - 9732212
>
> promtool query instant --time "$(date -d'2024-01-18 14:15:00 UTC' +%s)"
> $url 'delta(metric [15m])'
> {} => 117973.22033898304 @[1705587300]
> --> this does not match

This delta() result is actually very close to the result that 'bc -l'
gives me for a manual extrapolation of those extra 15 seconds. The true
difference between the first and last points within the range vector is
116007, the entire range vector covers 15 minutes less fifteen seconds,
or 885 seconds, and we're extrapolating it to 15 minutes:

$ bc -l
(15*60) / 885
1.01694915254237288135

(( 15 * 60 ) / 885 ) * 116007
117973.22033898305084676945

It feels weird that a total of a fifteen second gap at the start and end
of the range can have such a big effect, but as we can see the extra
time is not trivial at the level of this calculation. If the absolute
numbers are smaller the absolute difference between them will also be
smaller, but the relative difference would always be the same.
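
(The same factor also accounts for the second, low-valued example: assuming that window likewise spanned 885 seconds between its first and last samples, a true difference of 1 scales to

$ bc -l
1 * (15*60) / 885
1.01694915254237288135

which matches the 1.0169491525423728 reported there.)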

- cks

Jérôme Loyet

Jan 21, 2024, 10:41:33 AM
to Prometheus Users
Thank you Chris for the explanation, this is crystal clear to me now :)

but I feel like there's something missing here. I totally understand why the delta/increase/rate functions need extrapolation at the boundaries, but I feel like some use cases need a version of those functions without extrapolation. That is why I proposed PR #13436 to add adelta/aincrease/arate functions to PromQL.

hope this helps

++ Jerome