Efficiently calculating rate of change over different durations in recording rules

James Luck

Apr 28, 2022, 9:38:39 AM
to Prometheus Users
Hey all, I've got a problem I'm trying to tackle and I would appreciate any ideas or feedback.

I'm developing a set of recording rules that do different calculations over various durations (e.g. rate_2m, rate_30m, rate_1h, rate_6h, rate_12h, rate_1d, rate_3d):

# Example for 1 hour lookback
record: my_rule:rate_1h
expr: sum(rate(my_large_metric[1h]))

When the underlying metric has many labels and very high cardinality, the cost of re-aggregating the metric becomes significant. I'm trying to offset this cost by aggregating shorter-duration recording rules over time, e.g.:

# Aggregate + rate large metric 
record: my_rule:rate_1h
expr: sum(rate(my_large_metric[1h])) 
# Combine 1h samples together, avoiding cost of sum()
record: my_rule:rate_1d
expr: avg_over_time(my_rule:rate_1h[24h:1h])

Now this leads to inaccuracy between the recording rule values and the equivalent rate()-based expression, because rate() misses increases that happen between successive rule invocations (effectively the problem described here: https://stackoverflow.com/questions/70829895/using-sum-over-time-for-a-promql-increase-function-recorded-using-recording-rule).

Is there a way to avoid the performance hit and maintain accuracy? I'm hoping that pre-aggregating the counter values by instance might do it:

# Tracks the total count of events per scrape target 
record: instance:my_rule:sum 
expr: sum by (instance)(my_large_metric) 
# Use this count of events to calculate rate of change over any durations with greatly reduced aggregation costs 
record: my_rule:rate_1h 
expr: sum(rate(instance:my_rule:sum[1h])) 
record: my_rule:rate_1d 
expr: sum(rate(instance:my_rule:sum[1d]))
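
For context, here's roughly how I'd expect these to sit together in a rule file (the group name and evaluation interval below are just placeholders, not something I've settled on):

groups:
  - name: my_rule_rates        # placeholder group name
    interval: 1m               # placeholder evaluation interval
    rules:
      - record: instance:my_rule:sum
        expr: sum by (instance) (my_large_metric)
      - record: my_rule:rate_1h
        expr: sum(rate(instance:my_rule:sum[1h]))
      - record: my_rule:rate_1d
        expr: sum(rate(instance:my_rule:sum[1d]))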

Now this does violate the principle outlined in https://www.robustperception.io/rate-then-sum-never-sum-then-rate, but I believe aggregating per instance avoids the usual counter-reset problems, since all of an instance's series reset together when the process restarts. Other potential issues I can see with this:
- A labeled series disappearing (e.g. due to label churn) would make the per-instance sum drop, which rate() would interpret as a counter reset and over-count
- Slightly increased risk of precision loss if the summed values exceed 2^53, since float64 can no longer represent every integer beyond that point

Curious to know what people's thoughts on this are.

Bjoern Rabenstein

May 5, 2022, 12:49:14 PM
to James Luck, Prometheus Users
On 27.04.22 17:27, 'James Luck' via Prometheus Users wrote:
>
> # Example for 1 hour lookback
> record: my_rule:rate_1h
> expr: sum(rate(my_large_metric[1h]))
>
> When the underlying metric has many labels and a very high cardinality, the
> cost of re-aggregating the metric becomes significant. I'm trying to offset
> this cost using an approach where I aggregate recording rules of a shorter
> duration over time, e.g:
>
> # Aggregate + rate large metric
> record: my_rule:rate_1h
> expr: sum(rate(my_large_metric[1h]))
> # Combine 1h samples together, avoiding cost of sum()
> record: my_rule:rate_1d
> expr: avg_over_time(my_rule:rate_1h[24h:1h])
>
> Now this leads to inaccuracy between the recording rule values and the
> equivalent rate()-based expression due to the fact that rate will miss
> increases that happen between prior invocations (effectively the problem
> mentioned
> here: https://stackoverflow.com/questions/70829895/using-sum-over-time-for-a-promql-increase-function-recorded-using-recording-rule).

I would just do `avg_over_time(my_rule:rate_1h[24h])`. It will include
many "overlapping" evaluations in the average: Instead of using 24
data points from your recording rule, it will use as many as there are
(depending on your rule evaluation interval). The performance impact
should be manageable, and it avoids the problem of missing increases
in between perfectly spaced ranges. It introduces another error,
though: At the beginning and end of the 24h range, you get fewer
overlapping evaluations, so the first and last 1h of the total range
(which is actually 25h long, if you look at it precisely) are
weighted progressively less than the rest. You can further reduce this
error by having a large delta between the short and the long
range. For example, if you have a 15s scrape interval and rule
evaluation interval, you could record a `my_rule:rate_1m` without
problem. Then the error in `avg_over_time(my_rule:rate_1m[1d])` will
be very small.
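
In recording-rule form, that suggestion would look roughly like this (assuming the 15s scrape and rule evaluation interval from above):

# Short inner range, evaluated frequently
record: my_rule:rate_1m
expr: sum(rate(my_large_metric[1m]))
# Plain range selector: averages every available rate_1m sample from the last day
record: my_rule:rate_1d
expr: avg_over_time(my_rule:rate_1m[1d])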

Another approach would be to keep the subquery and shorten the inner
range by one evaluation interval. For example, assuming a 1m rule
evaluation interval, you could do
`avg_over_time(my_rule:rate_1h[24h:59m])`. As long as your rule has
always been evaluated at the right point in time, this should be
mathematically precise. However, data points might be missing, or the
evaluation time might have a jitter, and there might even be weird
things happening in the original time series you have calculated
`my_rule:rate_1h` from. So very generally, I'd go with the first
approach as the more robust one.
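
For completeness, the subquery variant as a recording rule would be something like this (again assuming a 1m rule evaluation interval):

record: my_rule:rate_1d
expr: avg_over_time(my_rule:rate_1h[24h:59m])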

--
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email] bjo...@rabenste.in

James Luck

May 30, 2022, 7:56:23 PM
to Prometheus Users
Hey Bjorn,

Firstly, apologies for the slow response; I very much appreciate you taking the time to respond.

I like the overlapping evaluations idea; I will increase the evaluation interval and give this a try. I think the reduced precision at the start/end of an evaluation period would be acceptable in my case, and I like the idea of modifying the rule duration to account for this.

Cheers!

