[Feature/Proposal] Concurrent evaluation of independent rules

Danny Kopping

Nov 2, 2023, 4:15:22 AM
to Prometheus Developers
Team,

I discussed this idea briefly at the last dev summit, and we agreed I'd raise a thread for this topic here. I'm proposing a mechanism to evaluate rules concurrently, under certain conditions, to improve rule reliability.

Rule groups execute concurrently, but the rules within a group execute sequentially, because a rule can use the output of a preceding rule as its input. However, if there is no detectable dependency between rules, there is no reason to run them sequentially.
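For illustration, consider a hypothetical rule group (the names and expressions here are invented for this example): the alert reads the series produced by the recording rule above it, so those two must run in order, while the last rule is independent of both and could safely be evaluated concurrently:

groups:
  - name: example
    interval: 1m
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - alert: HighRequestRate
        # Depends on the recording rule above, so it must run after it.
        expr: job:http_requests:rate5m > 100
        for: 5m
      - alert: HighMemory
        # Independent of the other rules in the group.
        expr: process_resident_memory_bytes > 2e9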

A missed group iteration occurs when the cumulative time to evaluate all of a group's rules exceeds the interval defined for that group. When this happens, alert expressions are not evaluated on the next iteration and, likewise, recording rules produce no new series; this is a significant reliability problem.
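As an illustrative (made-up) example: a group with interval: 1m containing 30 rules that each take 3s to evaluate needs roughly 90s per iteration, so under sequential evaluation every other iteration is missed; with a concurrency of 4, the same group finishes in roughly 23s and misses none.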

By evaluating rules concurrently, the likelihood of missed group iterations is reduced.

Of course, the trade-off here is more concurrent query load on the query engine. This can be ameliorated by bounding the concurrency using a global weighted semaphore. This feature would be opt-in, and the concurrency configurable.
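To make the bounding concrete, here is a minimal Go sketch of the idea, not the actual code in the PR; the Rule type, EvalGroup, evalRule and the Independent field are placeholders, and maxConcurrent stands for the value of --rules.max-concurrent-evals. It uses golang.org/x/sync/semaphore to cap how many rules are in flight at once, falling back to sequential evaluation whenever a rule has dependencies or the semaphore is exhausted:

package rules

import (
	"context"
	"sync"

	"golang.org/x/sync/semaphore"
)

// Rule is a placeholder for a compiled alerting or recording rule.
type Rule struct {
	Name        string
	Independent bool // no dependencies on, and no dependents among, the other rules in the group
}

// EvalGroup evaluates one group's rules. Independent rules run
// concurrently as long as a slot can be acquired from sem; everything
// else keeps the current sequential behaviour, so dependency chains
// are still evaluated in order.
func EvalGroup(ctx context.Context, rs []Rule, sem *semaphore.Weighted, evalRule func(context.Context, Rule)) {
	var wg sync.WaitGroup
	for _, r := range rs {
		if r.Independent && sem != nil && sem.TryAcquire(1) {
			wg.Add(1)
			go func(r Rule) {
				defer wg.Done()
				defer sem.Release(1)
				evalRule(ctx, r)
			}(r)
			continue
		}
		// Dependent rule, or concurrency budget exhausted: evaluate inline.
		evalRule(ctx, r)
	}
	wg.Wait()
}

The semaphore would be created once and shared by all groups, e.g. sem := semaphore.NewWeighted(maxConcurrent), so the flag bounds in-flight rule evaluations globally rather than per group.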

Here is my implementation:

The feature is hidden behind a feature flag, but I would argue that we can drop the flag and simply set --rules.max-concurrent-evals=0 as the default, which is functionally equivalent to having no concurrency at all (the current behaviour); double opt-in feels unnecessary.

As an aside, this feature will be quite useful for Grafana Loki (for which I'm a maintainer). We vendor in the Prometheus rule engine for our rule evaluation, and we now have a mode where rules can be evaluated in a distributed fashion. Our rules run sequentially, but they don't need to (since our rules cannot have interdependencies), and being able to run a certain number of rules concurrently would massively improve our rule evaluation reliability.

Thanks!

Bjoern Rabenstein

Nov 8, 2023, 11:34:50 AM
to Danny Kopping, Prometheus Developers
On 28.10.23 04:32, Danny Kopping wrote:
>
> The feature is hidden behind a feature-flag, but I would argue that we can
> drop the flag and simply set --rules.max-concurrent-evals=0 as default which
> is functionally equivalent to not having any concurrency at all (the
> current behaviour); double opt-in feels unnecessary.

Just a high-level note about feature flags: The opt-in part is only
one reason to use a feature flag. The other is that it clearly marks a
feature as experimental. If we just introduced
`--rules.max-concurrent-evals`, people would inevitably use it,
assuming it's a stable feature. Now imagine it turns out that the
whole thing was a bad idea and we remove the feature again; those
users would see an unexpected breaking change.

--
Björn Rabenstein
PGP-ID: 0x851C3DA17D748D03
Email: bjo...@rabenste.in

Danny Kopping

Nov 8, 2023, 12:03:37 PM
to Bjoern Rabenstein, Prometheus Developers
Thanks for the context, Björn, makes sense

Danny Kopping
(+27) 84 941 4422

Danny Kopping

Jan 3, 2024, 11:23:19 AM
to Prometheus Developers
Hey folks

I'd like to ask again for a review on this, please.
This feature will be very useful to both Loki & Mimir, and it's a relatively simple change with extensive tests.

Thanks

Danny Kopping
(+27) 84 941 4422
