Discrepancy in Resolved Alerts.


Yagyansh S. Kumar

Apr 14, 2020, 7:04:27 AM
to Prometheus Users
Hi. I am using Alertmanager version 0.16.0. The Resolved alerts I am receiving are wrong: Alertmanager fires the resolved notification as soon as the value decreases even slightly, i.e. it does not wait for the value to drop below the threshold. And this is happening for every alert.
Example:
[1] Resolved
Labels
alertname = HighCpuUtilisationCrit
cluster = ANALYTICS
instance = 172.20.8.186:9100
description = Current Value = *95.73*
summary = CPU Utilisation on *172.20.8.186:9100* - *dh4-k2-analytics-ga-ping-n1.dailyhunt.in* is more than 90%.

Here, my threshold is 90% but I am receiving the resolved alert at 95.73%.
Can someone help?
Thanks!

Stuart Clark

Apr 14, 2020, 7:11:43 AM
to Yagyansh S. Kumar, Prometheus Users
That looks like you are putting the current value in a label. As a result, any time it changes a new alert will be created. Try moving that to an annotation instead.
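For reference, the distinction looks like this (a generic sketch; `foo` and the threshold are placeholders, not from the original rule). Labels define an alert's identity, so a changing value in a label makes Alertmanager see a "new" alert on every evaluation; annotations can change freely without affecting identity:

```yaml
# Problematic: value in a label -> the label set changes on every
# evaluation, so each change creates a distinct alert.
- alert: HighFoo
  expr: foo > 85
  labels:
    value: "{{ $value }}"    # avoid this

# Better: value in an annotation -> alert identity stays stable.
- alert: HighFoo
  expr: foo > 85
  annotations:
    description: "Current value = {{ $value | humanize }}"
```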

--
Stuart Clark

Yagyansh S. Kumar

Apr 14, 2020, 8:08:31 AM
to Prometheus Users
Hi Stuart. Thanks for the quick response, but I am already using the current value only in annotations.

Here is an example of my Alert Rule:
  - alert: HighCpuUtilisationCrit
    expr: (sum by (instance,node,cluster) (sum(irate(node_cpu_seconds_total{mode!~"idle"} [5m])) without (cpu) / count(node_cpu_seconds_total{mode!~"idle"}) without (cpu) * 100) > 85) * on (instance) group_left(nodename) node_uname_info
    for: 2m
    labels:
      severity: "CRITICAL"
    annotations:
      summary: "CPU Utilisation on *{{ $labels.instance }}* - *{{ $labels.nodename }}* is more than 90%."
      description: "Current Value = *{{ $value | humanize }}*"
      identifier: "*Cluster:* `{{ $labels.cluster }}`, *node:* `{{ $labels.node }}` "

Brian Candler

Apr 18, 2020, 5:46:56 AM
to Prometheus Users
I can see two possible issues here.

Firstly, the value of the annotation you see in the resolved message is always the value at the time *before* the alert resolved, not the value which is now below the threshold.

Let me simplify your expression to:

    foo > 85

This is a PromQL filter.  In general there could be many timeseries for metric "foo".  If you have ten timeseries, and two of them have values over 85, then the result of this expression is those two timeseries, with their labels and those two values above 85.  But if all the timeseries are below 85, then this expression returns no timeseries, and therefore it has no values.
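Concretely, the filter behaves like this (an illustrative sketch; the series and values are made up):

```
# Suppose metric "foo" has these series at evaluation time:
#   foo{instance="a"} = 90
#   foo{instance="b"} = 95
#   foo{instance="c"} = 70

foo > 85
# returns only the series above the threshold, keeping their
# labels and values:
#   foo{instance="a"} = 90
#   foo{instance="b"} = 95

# If every series later drops below 85, the same expression
# returns an empty result: no series, hence no values and no
# labels to render into an annotation.
```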

So: suppose one "foo" timeseries goes up to 90 for long enough to trigger the alert (for: 2m).  You will get an alert with annotation:

description: Current value = 90

Maybe then it goes up to 95 for a while.  You don't get a new notification except in certain circumstances (group_interval etc).

When the value of foo drops below the threshold, say to 70, then the alert ceases to exist.  Alertmanager sends out a "resolved" message with all the labels and annotations of the alert as it was when it last existed, i.e.

description: Current value = 95

There's nothing else it can do.  The "expr" in the alerting rule returns no timeseries, which means no values and no labels.  You can't create an annotation for an alert that doesn't exist.

It's for this reason that I removed all my alert annotations which had $value in them, since the Resolved messages are confusing.  However you could instead change them to something more verbose, e.g.

description: Most recent triggering value = 95

The second issue is, is it possible the value dipped below the threshold for one rule evaluation interval?

Prometheus does debouncing in one direction (the alert must be constantly active "for: 2m" before it goes from Pending into Firing), but not in the other direction. A single dip below the threshold and it will resolve immediately, and then it could go into Pending then Firing again.  You would see that as a resolved followed by a new alert.
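If flapping turns out to be the problem, one common rule-side workaround (a sketch, not from the original thread; `foo`, the window, and the threshold are placeholders) is to smooth the expression with `max_over_time`, so a single evaluation below the threshold does not resolve the alert:

```yaml
# Hypothetical rule sketch: the alert keeps firing as long as the
# 10-minute peak is above the threshold, so a brief dip below 85
# does not immediately resolve it.
- alert: HighFooCrit
  expr: max_over_time(foo[10m]) > 85
  for: 2m
  annotations:
    description: "10m peak value = {{ $value | humanize }}"
```

The trade-off is that resolution is delayed by up to the window length; this is smoothing, not true flap detection.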

There is a closed issue for alertmanager debouncing / flap detection here:

Personally I think prometheus itself should have a "Resolving" state analogous to "Pending", so a brief trip below the threshold doesn't instantly resolve - but like I say, that issue is closed.

HTH,

Brian.

Yagyansh S. Kumar

Apr 18, 2020, 10:26:47 AM
to Prometheus Users
Thanks a lot for the detailed explanation, Brian.
I guess I need to monitor the resolved alerts a bit more closely and then take a call.

Yagyansh S. Kumar

Apr 18, 2020, 10:28:02 AM
to Prometheus Users
I know this cannot be called a bug, but I find it a little odd that you cannot see the value the metric dropped to once the alert has resolved.

Brian Candler

Apr 18, 2020, 10:35:03 AM
to Prometheus Users
Because it's the presence of a value which triggers an alert, and the absence of a value which means the end of an alert.