Optimizing query with many duplicate calls

David Leibovic

Sep 18, 2023, 1:58:01 AM
to Prometheus Users
Hi there, I'm trying to optimize a slow query of this form:

(1 * avg_over_time(foo{instance=~"$i"}[$interval]) <= 10) or 
(2 * avg_over_time(foo{instance=~"$i"}[$interval]) <= 20) or 
(3 * avg_over_time(foo{instance=~"$i"}[$interval]) <= 30) or 
(10 * avg_over_time(foo{instance=~"$i"}[$interval]))

I suspect it's slow because of the many duplicate calls to avg_over_time(foo{instance=~"$i"}[$interval]).

Is there some way to only call the avg_over_time function once and re-use the results subsequently? I'm using Prometheus in conjunction with Grafana, in case it's relevant.

The full query I'm trying to optimize is much more complicated, but I figured the above would be enough to understand the problem. In case it's helpful, here is the full query (it's an Air Quality Index computation):

((50 - 0) / (12 - 0) * ((avg_over_time(ambient_pm25_env{instance=~"$room.*"}[$aqi_interval]) <= 12) - 0) + 0) or
((100 - 51) / (35.4 - 12.1) * ((avg_over_time(ambient_pm25_env{instance=~"$room.*"}[$aqi_interval]) > 12 and avg_over_time(ambient_pm25_env{instance=~"$room.*"}[$aqi_interval]) <= 35.4) - 12.1) + 51) or
((150 - 101) / (55.4 - 35.5) * ((avg_over_time(ambient_pm25_env{instance=~"$room.*"}[$aqi_interval]) > 35.4 and avg_over_time(ambient_pm25_env{instance=~"$room.*"}[$aqi_interval]) <= 55.4) - 35.5) + 101) or
((200 - 151) / (150.4 - 55.5) * ((avg_over_time(ambient_pm25_env{instance=~"$room.*"}[$aqi_interval]) > 55.4 and avg_over_time(ambient_pm25_env{instance=~"$room.*"}[$aqi_interval]) <= 150.4) - 55.5) + 151) or
((300 - 201) / (250.4 - 150.5) * ((avg_over_time(ambient_pm25_env{instance=~"$room.*"}[$aqi_interval]) > 150.4 and avg_over_time(ambient_pm25_env{instance=~"$room.*"}[$aqi_interval]) <= 250.4) - 150.5) + 201) or
((400 - 301) / (350.4 - 250.5) * ((avg_over_time(ambient_pm25_env{instance=~"$room.*"}[$aqi_interval]) > 250.4 and avg_over_time(ambient_pm25_env{instance=~"$room.*"}[$aqi_interval]) <= 350.4) - 250.5) + 301) or
((500 - 401) / (500.4 - 350.5) * ((avg_over_time(ambient_pm25_env{instance=~"$room.*"}[$aqi_interval]) > 350.4 and avg_over_time(ambient_pm25_env{instance=~"$room.*"}[$aqi_interval]) <= 500.4) - 350.5) + 401) or
clamp_max(avg_over_time(ambient_pm25_env{instance=~"$room.*"}[$aqi_interval]), 600)

Thanks for any help you can provide!

Ben Kochie

Sep 18, 2023, 2:01:50 AM
to David Leibovic, Prometheus Users
One thing you can do to speed things up is to eliminate the `=~` in your query. Regexp matching means Prometheus has to run a string match against every instance value for each metric, whereas exact matching (`=`) will speed things up a lot. The downside is that exact matching won't work if you want multi-value selection in your dashboard variables.

Brian Candler

Sep 18, 2023, 3:23:43 AM
to Prometheus Users
One possibility is to use a recording rule for the expensive repeated query.

If I rewrite avg_over_time(ambient_pm25_env{instance=~"$room.*"}[$aqi_interval]) to X, I get:

((50 - 0) / (12 - 0) * ((X <= 12) - 0) + 0) or
((100 - 51) / (35.4 - 12.1) * ((X > 12 and X <= 35.4) - 12.1) + 51) or
((150 - 101) / (55.4 - 35.5) * ((X > 35.4 and X <= 55.4) - 35.5) + 101) or
((200 - 151) / (150.4 - 55.5) * ((X > 55.4 and X <= 150.4) - 55.5) + 151) or
((300 - 201) / (250.4 - 150.5) * ((X > 150.4 and X <= 250.4) - 150.5) + 201) or
((400 - 301) / (350.4 - 250.5) * ((X > 250.4 and X <= 350.4) - 250.5) + 301) or
((500 - 401) / (500.4 - 350.5) * ((X > 350.4 and X <= 500.4) - 350.5) + 401) or
clamp_max(X, 600)

I guess you're trying to apply different scaling for different ranges of X:
- if X is between 0 and 12 (or negative) then rescale to 0 to 50
- if X is between 12 and 35.4 then rescale to 50(?) to 100
- if X is between 35.4 and 55.4 then rescale to 100(?) to 150
etc. (although there seem to be some small discontinuities at the boundaries, e.g. 12 versus 12.1, 50 versus 51)

"A or B" will suppress elements in the B vector where the A vector has a value (i.e. with a matching label set). That means it's unnecessary to test the lower bounds, and I think your expression could simplify to something like this:

(X <= 12) * k1 + o1 or
(X <= 35.4) * k2 + o2 or
(X <= 55.4) * k3 + o3 or
(X <= 150.4) * k4 + o4 or
(X <= 250.4) * k5 + o5 or
(X <= 350.4) * k6 + o6 or
(X <= 500.4) * k7 + o7 or
clamp_max(X, 600)

That would roughly halve the number of X subexpressions.
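
To see why the lower bounds are redundant, here is a small Python model of a single series passing through the `or` chain. It is only a sketch: `None` stands in for a sample dropped by a PromQL comparison filter, the function names are made up for illustration, and the breakpoints are copied from the query above.

```python
# Toy model of the PromQL `or` chain for ONE series (not Prometheus code).
# A comparison such as `X <= 12` acts as a filter: the sample is kept or
# dropped. `None` stands in for "dropped" here.

# (x_low, x_high, aqi_low, aqi_high), copied from the query in this thread.
BANDS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]


def rescale(x, band):
    """(aqi_high - aqi_low) / (x_high - x_low) * (x - x_low) + aqi_low."""
    x_low, x_high, aqi_low, aqi_high = band
    return (aqi_high - aqi_low) / (x_high - x_low) * (x - x_low) + aqi_low


def first_present(branches):
    """`A or B or ...` for one series: the first branch that kept the sample wins."""
    for value in branches:
        if value is not None:
            return value
    return None


def aqi_two_sided(x):
    """Original form: every band after the first also tests X > previous upper bound."""
    branches = []
    prev_high = None
    for band in BANDS:
        above_lower = prev_high is None or x > prev_high
        in_band = above_lower and x <= band[1]
        branches.append(rescale(x, band) if in_band else None)
        prev_high = band[1]
    branches.append(min(x, 600))  # clamp_max(X, 600) fallback
    return first_present(branches)


def aqi_upper_only(x):
    """Simplified form: upper bounds only; earlier branches shadow later ones."""
    branches = [rescale(x, band) if x <= band[1] else None for band in BANDS]
    branches.append(min(x, 600))  # clamp_max(X, 600) fallback
    return first_present(branches)


samples = [0, 5, 12, 12.05, 20, 35.4, 35.45, 100, 250.4, 400, 500.4, 550, 700]
for x in samples:
    assert aqi_two_sided(x) == aqi_upper_only(x)
```

Both forms pick the same branch for every sample, because any value that would fail a lower-bound test has already been claimed by an earlier branch of the chain.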

David Leibovic

Sep 19, 2023, 11:24:58 PM
to Prometheus Users
Thanks very much to you both for the suggestions! Changing the regexp matching to exact string matching didn't help noticeably in my case, perhaps because I have fewer than 5 instances over which it has to do a regexp match. But removing the unnecessary lower-bound checks from my inequalities reduced loading time by about 3/5.

It's too bad that Prometheus doesn't support something like variables that could be repeatedly referenced in PromQL - that would probably speed things up even more.

David Leibovic

Sep 19, 2023, 11:27:42 PM
to Prometheus Users
Btw, here's the new PromQL I ended up with, in case it's helpful to anyone else:

((50 - 0) / (12 - 0) * ((avg_over_time(ambient_pm25_env{instance=~"$room.*"}[$aqi_interval]) <= 12) - 0) + 0) or
((100 - 51) / (35.4 - 12.1) * ((avg_over_time(ambient_pm25_env{instance=~"$room.*"}[$aqi_interval]) <= 35.4) - 12.1) + 51) or
((150 - 101) / (55.4 - 35.5) * ((avg_over_time(ambient_pm25_env{instance=~"$room.*"}[$aqi_interval]) <= 55.4) - 35.5) + 101) or
((200 - 151) / (150.4 - 55.5) * ((avg_over_time(ambient_pm25_env{instance=~"$room.*"}[$aqi_interval]) <= 150.4) - 55.5) + 151) or
((300 - 201) / (250.4 - 150.5) * ((avg_over_time(ambient_pm25_env{instance=~"$room.*"}[$aqi_interval]) <= 250.4) - 150.5) + 201) or
((400 - 301) / (350.4 - 250.5) * ((avg_over_time(ambient_pm25_env{instance=~"$room.*"}[$aqi_interval]) <= 350.4) - 250.5) + 301) or
((500 - 401) / (500.4 - 350.5) * ((avg_over_time(ambient_pm25_env{instance=~"$room.*"}[$aqi_interval]) <= 500.4) - 350.5) + 401) or
clamp_max(avg_over_time(ambient_pm25_env{instance=~"$room.*"}[$aqi_interval]), 600)

Ben Kochie

Sep 20, 2023, 12:49:16 AM
to David Leibovic, Prometheus Users
On Wed, Sep 20, 2023 at 5:25 AM David Leibovic <david.l...@gmail.com> wrote:
> Thanks very much to you both for the suggestions! Changing the regexp matching to exact string matching didn't help noticeably in my case, perhaps because I have less than 5 instances over which it has to do a regexp match. But removing the unnecessary lower bounds checks from my inequality checks reduced loading time by about 3/5.
>
> It's too bad that prometheus doesn't support something like variables that could be repeatedly referenced in promql - that would probably speed things up even more.

This has been discussed in the past, but it would possibly be complicated to implement.

Another way to improve this: I think we could add a query optimization that automatically detects identical expressions and evaluates them only once. This would be transparent to the user and would not require changing the language.
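
The idea can be illustrated outside Prometheus with a toy evaluator that caches results by expression text, so a repeated subexpression is computed once. This is a sketch of the general technique (common subexpression elimination via memoization), not of any existing Prometheus feature; the `Evaluator` class and the stand-in `expensive_avg` function are hypothetical.

```python
# Toy illustration of evaluating identical subexpressions only once.
# Nothing here is Prometheus code; all names are hypothetical.

class Evaluator:
    def __init__(self, leaf_fn):
        self.leaf_fn = leaf_fn   # stand-in for an expensive selector/function
        self.cache = {}          # expression text -> computed value
        self.leaf_calls = 0      # how many times the expensive part really ran

    def eval_leaf(self, expr_text):
        """Evaluate an expensive expression, reusing a cached result if present."""
        if expr_text not in self.cache:
            self.leaf_calls += 1
            self.cache[expr_text] = self.leaf_fn(expr_text)
        return self.cache[expr_text]


def expensive_avg(expr_text):
    # Pretend this scans a day of samples; here it just returns a constant.
    return 20.0


ev = Evaluator(expensive_avg)
X = 'avg_over_time(foo{instance=~"a.*"}[1d])'

# The same text appears four times, as in the simplified query at the top of
# the thread, but the expensive function runs only once.
terms = [k * ev.eval_leaf(X) for k in (1, 2, 3, 10)]

print(terms)          # [20.0, 40.0, 60.0, 200.0]
print(ev.leaf_calls)  # 1
```

A real optimizer would presumably have to key on the parsed expression and the evaluation time rather than on raw text; the string key here is only the simplest thing that shows the effect.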
 

Ben Kochie

Sep 20, 2023, 12:57:19 AM
to David Leibovic, Prometheus Users
One more question:

How wide is `$aqi_interval` here? I'm guessing this is one of a few drop-down options, since it's not one of the dynamic values like $__interval.

If this value is hours or days, it's likely one of the main sources of slowness, since it needs to load a lot of samples. If you pre-recorded just that, it would be much faster to query.

- record: instance:ambient_pm25_env:avg1d
  expr: avg_over_time(ambient_pm25_env[1d])

Then your query would be instance:ambient_pm25_env:avg$aqi_interval{instance=~"$room.*"}. This is a plain instant vector selector over precomputed samples, so it would be much faster. You could also do a relabel at the same time to extract the room label without the full hostname:port number, which I'm guessing is why you're using the regexp.
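
Because the rule name embeds the window, instance:ambient_pm25_env:avg$aqi_interval only resolves if a rule exists for each value the $aqi_interval drop-down can take, spelled the same way (`1d` and `24h` would be different names). A minimal Python sketch that prints one rule per interval follows; the interval list and the group name `aqi_averages` are assumptions for illustration.

```python
# Emit a Prometheus rule file with one recording rule per drop-down interval.
# INTERVALS and the group name are illustrative assumptions, not values from
# the thread; only the metric name and the avg<interval> naming scheme are.

INTERVALS = ["1h", "6h", "1d"]
METRIC = "ambient_pm25_env"


def rule_file(intervals, metric=METRIC, group="aqi_averages"):
    """Return rule-file YAML text using the instance:<metric>:avg<interval> naming."""
    lines = ["groups:", f"  - name: {group}", "    rules:"]
    for interval in intervals:
        lines.append(f"      - record: instance:{metric}:avg{interval}")
        lines.append(f"        expr: avg_over_time({metric}[{interval}])")
    return "\n".join(lines) + "\n"


print(rule_file(INTERVALS), end="")
```

The generated file can be validated with `promtool check rules` before Prometheus is pointed at it.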

David Leibovic

Sep 21, 2023, 6:28:49 AM
to Ben Kochie, Prometheus Users
> Another way to improve this. I think what we could do is a query optimization that automatically detects identical expressions and only evaluates them once. This would make it transparent to the user without having to change the language.

I like this idea!

> How wide is `$aqi_interval` here?

It can vary between 10s and 24h. Most of the time I use 24h. The larger the interval, the slower the query.

> I'm guessing this is one of a few drop-down options, since it's not one of the dynamic values like $__interval.

Correct.

> If you pre-recorded just that, it would be much faster to query.

Good point. I had actually considered doing this, but I had a slight aversion to pre-recording because I would need to pre-record the averages for each $aqi_interval I was interested in looking at, or else pre-record the longer intervals and fall back to unrecorded calculations for the shorter intervals. If there is currently no better solution, I may do this.

Thanks!