> The idea behind the rate() implementation is to not give the impression that the counter has actually been consistently increasing by 2 per second over the entire 5 minute input window if the series likely just started somewhere under the window (meaning that the rate was 0 or non-existent before that).
Hmm... so I guess it's a kind of histogram thing, where the area of a bar (= width x height) implies the total quantity - in this case meaning increase(), which is rate() times the period. It's estimating "how much did X increase" while taking into account that the counter cannot have been negative.
However, looking at the other example where the counter goes from 1017 to 1137 in 60 seconds (an increase of 120, with no zero crossing in the window): it extends by half a sample interval on each side, giving a range of 120 seconds, and proportionately scales up the increase from 120 to 240. It then assigns that entire increase to the window period of 300 seconds. Using rate = increase / window size, calculating the rate gives 240 / 300 = 0.8, rather than what I was expecting (2, which is the slope).
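To check my own understanding, here's a tiny sketch of the extrapolation as I've described it above. This is my simplification, not Prometheus's actual extrapolatedRate() (which also clamps the extrapolation at the window boundaries and near a plausible counter start at zero); the function name and structure are mine.

```python
def simple_rate(timestamps, values, window):
    """Estimate a per-second rate from counter samples inside a window.

    Hypothetical simplification of the extrapolation discussed above:
    extend by half a sample interval on each side, then divide the
    scaled increase by the full window.
    """
    raw_increase = values[-1] - values[0]          # 1137 - 1017 = 120
    covered = timestamps[-1] - timestamps[0]       # 60 s in the example
    avg_interval = covered / (len(values) - 1)     # 60 s (only two samples)
    # Extend by half a sample interval on each side: 60 + 30 + 30 = 120 s.
    extrapolated = covered + avg_interval
    # Scale the increase proportionally: 120 * (120 / 60) = 240.
    scaled_increase = raw_increase * extrapolated / covered
    # ...but divide by the full window: 240 / 300 = 0.8, not the slope of 2.
    return scaled_increase / window

print(simple_rate([0, 60], [1017, 1137], 300))  # -> 0.8
```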
That logic is, well, surprising to me. I guess the question is, would I be more surprised to see increase(foo[5m]) equal to 600, given only those two data points?
In the past I have noticed rate graphs in Grafana behaving strangely for the first few samples of a new timeseries (being scraped from an SNMP device), and now I kind of understand it.
> Btw. rate() hasn't always behaved like this. Here's a super old issue (that I actually made a lengthy comment on) and a PR by Björn to address it:
Thanks for the links. I can understand the issue there: if a counter only increments occasionally, e.g.
(0 0 0 0) 0 1 1 1 1 1 2 2 2 2 2
and you are unlucky enough to pick up only the "0 1" at the start of the timeseries, you incorrectly extrapolate the rate to 1 / (sample interval). I don't think you'd need to worry about this if the difference between the values is 2 or more: if the counter has incremented by N, then the average interval between those events must be somewhere between (sample interval / (N-1)) and (sample interval / N).