How to prevent increasing prometheus_rule_group_iterations_missed_total


John Bryan Sazon

Feb 15, 2020, 9:52:10 AM
to Prometheus Users
I've been seeing this metric (prometheus_rule_group_iterations_missed_total) increment all the time. In the rules UI, I don't see any errors. Every rule is OK. I also don't see anything unusual from the logs.

Julien Pivotto

Feb 15, 2020, 10:02:49 AM
to John Bryan Sazon, Prometheus Users
On 15 Feb 06:52, John Bryan Sazon wrote:
> I've been seeing this metric
> (prometheus_rule_group_iterations_missed_total) increment all the time. In
> the rules UI, I don't see any errors. Every rule is *OK*. I also don't see
> anything unusual from the logs.

It means that the rules take too long to evaluate, or that your
Prometheus server is overloaded.

You can use prometheus_rule_group_last_duration_seconds >
prometheus_rule_group_interval_seconds to see if you have an issue with
the duration of the group evaluation.
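If you want to be notified when this happens, something along these lines works. This is only a sketch: the group name, alert name, and "for" duration are placeholders, not anything specific to your setup.

groups:
  - name: prometheus-meta          # placeholder group name
    rules:
      - alert: RuleGroupEvaluationTooSlow
        # fires when a group's last evaluation took longer than its interval
        expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
        for: 10m
        annotations:
          summary: 'Rule group {{ $labels.rule_group }} takes longer than its evaluation interval'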



--
(o- Julien Pivotto
//\ Open-Source Consultant
V_/_ Inuits - https://www.inuits.eu

John Bryan Sazon

Feb 15, 2020, 10:11:29 AM
to Prometheus Users
Thanks! I saw something with this query:

prometheus_rule_group_last_duration_seconds >
prometheus_rule_group_interval_seconds

I had issues with recording rule errors previously and had to increase the following defaults to resolve them:

        - --query.lookback-delta=6m # default is 5m
        - --query.max-samples=100000000 # default is 50000000
        - --query.timeout=4m # default is 2m

If I move the long-running recording rule into its own group, will that help?

On Saturday, 15 February 2020 16:02:49 UTC+1, Julien Pivotto wrote:
On 15 Feb 06:52, John Bryan Sazon wrote:
> I've been seeing this metric
> (prometheus_rule_group_iterations_missed_total) increment all the time. In
> the rules UI, I don't see any errors. Every rule is *OK*. I also don't see
> anything unusual from the logs.

It means that the rules take too long to evaluate, or that your
Prometheus server is overloaded.

You can use prometheus_rule_group_last_duration_seconds >
prometheus_rule_group_interval_seconds to see if you have an issue with
the duration of the group evaluation.


Julien Pivotto

Feb 15, 2020, 10:25:27 AM
to John Bryan Sazon, Prometheus Users
On 15 Feb 07:11, John Bryan Sazon wrote:
> Thanks! I saw something with this query:
>
> prometheus_rule_group_last_duration_seconds >
> prometheus_rule_group_interval_seconds
>
> I had issues with recording rule errors previously and had to increase
> the following defaults to resolve them:
>
> - --query.lookback-delta=6m # default is 5m
> - --query.max-samples=100000000 # default is 50000000
> - --query.timeout=4m # default is 2m
>
> If I move the long-running recording rule into its own group, will that
> help?

At this point it is difficult to help you further without more information
about the kind of queries you run. The options you changed are internal
tuning options that should normally not be touched.

My experience is that you probably have PromQL queries that could be
split and/or rewritten more efficiently.
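As a sketch of what I mean by splitting (the metric and rule names below are invented, not taken from your setup): record the heavy intermediate aggregation once, then build the final rule on top of the recorded series instead of re-aggregating the raw series on every evaluation.

groups:
  - name: intermediate-aggregations
    rules:
      # heavy step: aggregate the raw series once per evaluation
      - record: instance_path:http_requests:rate5m
        expr: sum by (instance, path) (rate(http_requests_total[5m]))
  - name: final-aggregations
    rules:
      # cheap step: reuse the recorded series instead of the raw ones
      - record: path:http_requests:rate5m
        expr: sum by (path) (instance_path:http_requests:rate5m)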

Ben Kochie

Feb 15, 2020, 10:26:32 AM
to John Bryan Sazon, Prometheus Users
Yes, moving an expensive rule to a separate group can help. Each rule group is a separate goroutine, so you can distribute the work among many cores.

We have a couple of expensive rules where I had to hand-hack some sharding to reduce the evaluation time. Not a great solution, but it works.
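Roughly, the layout ends up looking like this (the metric names and the label-based split are made up for the example): the expensive rule gets its own group, or even a few groups that each cover a slice of the series, and since every group runs in its own goroutine the shards evaluate in parallel.

groups:
  - name: cheap-rules
    rules:
      - record: job:up:count
        expr: count by (job) (up)
  # hand-sharded version of one expensive rule; the regexes split the
  # instances into disjoint halves so the recorded series never collide
  - name: expensive-rule-shard-a
    rules:
      - record: instance:heavy_requests:rate5m
        expr: sum by (instance) (rate(heavy_requests_total{instance=~"[a-m].*"}[5m]))
  - name: expensive-rule-shard-b
    rules:
      - record: instance:heavy_requests:rate5m
        expr: sum by (instance) (rate(heavy_requests_total{instance=~"[^a-m].*"}[5m]))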



John Bryan Sazon

Feb 15, 2020, 10:29:48 AM
to Prometheus Users
> My experience is that you probably have PromQL queries that could be
> split and/or rewritten more efficiently.

I also think I may need to split and rewrite the long-running queries I have.

John Bryan Sazon

Feb 15, 2020, 10:30:42 AM
to Prometheus Users
> Each rule group is a separate goroutine, so you can distribute the work among many cores.

Thanks for that useful information!


On Saturday, 15 February 2020 16:26:32 UTC+1, Ben Kochie wrote:
Yes, moving an expensive rule to a separate group can help. Each rule group is a separate goroutine, so you can distribute the work among many cores.

We have a couple of expensive rules where I had to hand-hack some sharding to reduce the evaluation time. Not a great solution, but it works.

