On 27.04.22 17:27, 'James Luck' via Prometheus Users wrote:
>
> # Example for 1 hour lookback
> record: my_rule:rate_1h
> expr: sum(rate(my_large_metric[1h]))
>
> When the underlying metric has many labels and a very high cardinality, the
> cost of re-aggregating the metric becomes significant. I'm trying to offset
> this cost using an approach where I aggregate recording rules of a shorter
> duration over time, e.g:
>
> # Aggregate + rate large metric
> record: my_rule:rate_1h
> expr: sum(rate(my_large_metric[1h]))
> # Combine 1h samples together, avoiding cost of sum()
> record: my_rule:rate_1d
> expr: avg_over_time(my_rule:rate_1h[24h:1h])
>
> Now this leads to inaccuracy between the recording rule values and the
> equivalent rate()-based expression due to the fact that rate will miss
> increases that happen between prior invocations (effectively the problem
> mentioned
> here:
> https://stackoverflow.com/questions/70829895/using-sum-over-time-for-a-promql-increase-function-recorded-using-recording-rule).

I would just do `avg_over_time(my_rule:rate_1h[24h])`. It will include
many "overlapping" evaluations in the average: instead of using 24
data points from your recording rule, it will use as many as there are
(depending on your rule evaluation interval). The performance impact
should be manageable, and it avoids the problem of missing increases
in between perfectly spaced ranges. It introduces another error,
though: at the beginning and end of the 24h range, you get fewer
overlapping evaluations, so the first and last 1h of the total range
(which is actually 25h long, if you look at it precisely) carry
progressively less weight than the rest. You can further reduce this
error by having a large delta between the short and the long
range. For example, if you have a 15s scrape interval and rule
evaluation interval, you could record a `my_rule:rate_1m` without
problems. Then the error in `avg_over_time(my_rule:rate_1m[1d])` will
be very small.
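To make that concrete, here is a sketch of the rule pair using the
names from your example (assuming the 15s scrape and rule evaluation
interval mentioned above, and the usual rule-group boilerplate around
it):

# Cheap short-range rule, evaluated every rule evaluation interval.
record: my_rule:rate_1m
expr: sum(rate(my_large_metric[1m]))

# Plain range instead of a subquery: every stored my_rule:rate_1m
# sample within the last 24h goes into the average.
record: my_rule:rate_1d
expr: avg_over_time(my_rule:rate_1m[24h])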
Another approach would be to keep the subquery and shorten its step
(the resolution after the colon) by one rule evaluation interval. For
example, assuming a 1m rule evaluation interval, you could do
`avg_over_time(my_rule:rate_1h[24h:59m])`. Adjacent 1h windows then
overlap by exactly one evaluation interval, so consecutive windows
share their boundary sample and no increase between samples falls into
a gap. As long as your rule has always been evaluated at the right
point in time, this should be mathematically precise. However, data
points might be missing, the evaluation time might have jitter, and
there might even be weird things happening in the original time series
you have calculated `my_rule:rate_1h` from. So very generally, I'd go
with the first approach as the more robust one.
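For reference, the subquery variant would be recorded along these
lines (again just a sketch, assuming the 1m rule evaluation interval
from above and your original rule names):

# Step of 59m = 1h minus one evaluation interval, so adjacent
# 1h windows share a boundary sample.
record: my_rule:rate_1d
expr: avg_over_time(my_rule:rate_1h[24h:59m])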
--
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email] bjo...@rabenste.in