Alert Manager Wrong Resolution

Arnav Bose

May 26, 2020, 12:21:57 PM
to Prometheus Users

I have this scenario with Prometheus Alertmanager alerting:


alert condition:  K_status != 3


The alert triggers fine, and when the value changes back to 3 it resolves correctly as well.


The problem starts when the alert has already fired on K_status != 3 and then the telemetry goes missing. At that point a resolution is sent, which is not correct.


If I add a condition to check whether the metric exists, along with the main condition, will it alert and resolve correctly even when telemetry goes missing? E.g.:

K_status != 3 and on (port) K_status

So it triggers an alert when both conditions are satisfied. Now, if telemetry goes missing, the second condition will no longer be satisfied, correct? In that case, will the alert resolve?
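
For reference, the two variants above would look roughly like this as alerting rules (a minimal sketch: the metric name K_status and the port label come from the question; the rule names and the for: duration are placeholders):

groups:
  - name: k_status_alerts
    rules:
      # Original rule: fires one alert per series while K_status != 3; the
      # alert "resolves" as soon as that series drops out of the result,
      # including when the series disappears entirely.
      - alert: KStatusNotOk        # placeholder name
        expr: K_status != 3
        for: 5m                    # placeholder duration

      # Variant from the question: also requires a K_status series to exist
      # for the same port. Since the left-hand side already returns nothing
      # when K_status is missing, this does not change the resolve-on-missing
      # behaviour (see the replies below).
      - alert: KStatusNotOkWithPresenceCheck        # placeholder name
        expr: K_status != 3 and on (port) K_status
        for: 5m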

Murali Krishna Kanagala

May 26, 2020, 1:39:11 PM
to Arnav Bose, Prometheus Users
The alert should resolve if the query does not give any results.

Arnav Bose

May 26, 2020, 2:25:42 PM
to Prometheus Users
I understand. But in this case, when the telemetry goes missing, the actual condition has not really cleared, so resolving the alert is not correct. How do I stop that from happening?


On Tuesday, May 26, 2020 at 1:39:11 PM UTC-4, Murali Krishna Kanagala wrote:
The alert should resolve if the query does not give any results.

Brian Candler

May 26, 2020, 4:58:43 PM
to Prometheus Users
Alerts "resolve" when they no longer exist.  Consider for example:

expr: foo < 3

The result of this PromQL expression is all series with __name__="foo" and a value < 3. So, given as input:

foo{instance="bar"} 1
foo{instance="baz"} 2
foo{instance="qux"} 4

then expr: foo < 3 gives

foo{instance="bar"} 1
foo{instance="baz"} 2

Hence the alert fires for two series. If either of them goes back to 3 or higher, it vanishes from the expression results.

An alert "resolving" is simply when it ceases to exist.
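
Written as an alerting rule (a minimal sketch based on the foo example above; the alert name is a placeholder), that looks like:

groups:
  - name: example
    rules:
      - alert: FooTooLow        # placeholder name
        # Fires once for foo{instance="bar"} and once for foo{instance="baz"},
        # because both series appear in the result of the expression. Each
        # alert resolves when its series no longer appears in the result,
        # whether the value went back to 3 or higher or the series vanished.
        expr: foo < 3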

Arnav Bose

May 26, 2020, 5:17:03 PM
to Brian Candler, Prometheus Users
I understood this part.

I am facing a unique situation here. The alert based on the condition fires fine. Then, after some time, the telemetry on which the alert was based goes missing. This does not happen often, but when it does, the alert should not resolve.

Because if the alert fired on foo < 3, it should resolve only when foo goes back to 3 or higher, not when the foo series goes missing entirely.

If this is expected behavior, can you suggest how I can prevent an active alert from resolving when its telemetry goes missing?

Brian Candler

May 26, 2020, 5:25:53 PM
to Prometheus Users
You have to alert on the condition "foo < 3, or foo does not exist".

To make the "does not exist" condition, you must have some other metric which *does* exist, for the relevant set of labels.

Then you can alert on something like "(foo < 3) or (bar unless foo)"

- with appropriate on(..) or ignoring(..) between bar and foo, if required.
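
As a concrete sketch of that suggestion, assuming "up" can stand in for bar (i.e. it always exists for the targets that expose foo) and that the series share an instance label (both are assumptions, so substitute whatever metric and labels actually apply):

- alert: FooLowOrMissing        # placeholder name
  # Fires while foo < 3, and keeps firing if foo disappears while the
  # target itself (represented here by "up") is still present.
  # "up" and on (instance) are assumptions; use a metric and label set
  # that reliably exist alongside foo.
  expr: (foo < 3) or (up unless on (instance) foo)

Note that the two sides of the "or" carry different label sets (foo's labels versus up's), so the "low" and "missing" cases may show up as separate alerts.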

Arnav Bose

May 26, 2020, 5:37:49 PM
to Brian Candler, Prometheus Users
Will it work if I alert on foo < 3 and foo exists? 

Instead of 'or', what if I use 'and'? That way, when telemetry goes missing after the alert fires, the second condition, which checks whether the metric is available, would prevent the resolution from being sent.


Brian Candler

May 27, 2020, 3:06:33 AM
to Prometheus Users
No.  Perhaps I didn't explain it very clearly:

- alerts are PromQL expressions, just the same as those you draw graphs of
- when the PromQL expression returns a value (any value) it sends an alert for that metric + set of labels
- when the PromQL expression no longer returns a value for that metric + set of labels, a resolution is sent.

That's it.  There is no separate "resolution" condition.  Resolution = alert no longer exists.

There is no way to distinguish "the metric which generated the alert has vanished" from "the metric which generated the alert still exists but its value no longer matches the expression", because the expression is a filter: if the value no longer matches the filter, the value vanishes.

In Prometheus, operators like "<" do not work the way they do in normal programming languages. They are *not* boolean operators. Rather, they apply a filter to the set of values in an instant vector.

Try drawing a graph of an expression like "node_load1 > 1" in the Prometheus expression browser. You'll see values where the condition is true, and gaps where the condition is false:

[img1.png: graph of node_load1 > 1, showing values where the condition is true and gaps where it is false]

This is exactly how alerting works.  A PromQL expression with any value generates an alert.  If no value is present, that's the resolution of the alert.  Alertmanager has no way of knowing the *reason* why the value no longer appears in the result set of the alerting expression.
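
For instance, with illustrative values, if the raw data is:

node_load1{instance="a"} 0.4
node_load1{instance="b"} 2.7

then node_load1 > 1 returns only:

node_load1{instance="b"} 2.7

There is no "false" entry for instance="a"; that series is simply absent from the result, which is exactly what a resolved alert looks like to Alertmanager.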