False Positive Alerts with wmi_cpu_time_total


Vitalii Ludanenkov

Feb 20, 2022, 11:38:10 AM
to Prometheus Users
Hello everybody.
We are facing some issues with CPU monitoring.
Our graphs never show the value reaching the threshold even once, let alone staying there for 3m.
All the info and screenshots are below.
The alert is configured to fire at 20%. It relates only to the blue graph.

Screenshot 2022-02-18 133538.png

Screenshot 2022-02-18 133642.png

Prometheus creates a massive number of alerts in our Opsgenie. There are no issues with other alerts, or even with a threshold of 60%.
Screenshot 2022-02-18 133820.png

Alert query:

Screenshot 2022-02-18 134142.png

Do you have any suggestions as to what could cause this flapping and trigger the alert?
I have already checked the graphs at 1, 2, 5 and 10 minute ranges, by the hour, and so on; nothing there should result in an alert.
Also, there are no such alerts from CloudWatch monitoring.


Brian Candler

Feb 20, 2022, 11:56:44 AM
to Prometheus Users
As far as I can see, you haven't shown your actual alerting rule.

However, it's straightforward to debug this: paste your entire alerting "expr" into the PromQL query interface.  Wherever the line is present, it means an alert will fire.  You can then work backwards from that to find the problem with your expr.

For example, say you have this rule:
    expr: avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) < 0.8

Paste exactly "avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) < 0.8" into the PromQL browser to see if and when it fires.

In PromQL, the expression "foo" generates a vector: the set of all timeseries whose metric name is "foo".  Then "foo < 0.8" is a filter, not a boolean.  It filters the vector to only those whose value is less than 0.8.  When used as an alerting expression, you get an alert if the vector is not empty.
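
To make the filter behaviour concrete, here is a minimal sketch of a rules file. The group name, the alert names and the "for: 3m" duration are made up for illustration; only the first expr is the example from above. The second rule uses the "bool" modifier to show the contrast: it turns the comparison into a 0/1 result for every series instead of filtering, so the vector is never empty while the metric exists.

```yaml
groups:
  - name: example-cpu            # hypothetical group name
    rules:
      # Filter comparison: the result holds only the instances whose idle
      # fraction is below 0.8. If no instance matches, the result is an
      # empty vector and no alert is generated.
      - alert: ExampleHighCpu    # hypothetical alert name
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) < 0.8
        for: 3m                  # assumed duration, not taken from a real config

      # "bool" comparison: every instance is returned with value 0 or 1.
      # The vector is not empty, so this alerts for every instance
      # regardless of CPU usage. Shown only as what NOT to write.
      - alert: ExampleAlwaysFiring   # hypothetical alert name
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) < bool 0.8
```

The point is that an alert depends on whether the expression returns any series at all, not on the value those series carry.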

Vitalii Ludanenkov

Feb 20, 2022, 2:42:09 PM
to Prometheus Users

I already attached screenshots of the rule and of the actual query results (the screenshot is without "> 20", because with it the query doesn't show anything). The threshold is 20%, but the graph doesn't reach it; nonetheless, an alert fires.

Brian Candler

Feb 21, 2022, 5:51:24 AM
to Prometheus Users
To summarize:

1. You're 100% positive that the alerting rule has
        expr: (blah) > 20

2. If you put "(blah) > 20" in the PromQL browser and switch to graph mode, then it's blank

3. But alerts are still firing

In that case, you need to go into the Prometheus web interface and click on "Alerts" at the top.  It will show you which alerts are currently firing, and the triggering label sets and values.

In short, it's impossible for expression "(blah) > 20" to fire if this expression returns an empty instant vector.  So either it's *not* an empty instant vector; or else some other alert expression is firing.  You didn't show any details from your OpsGenie messages, so it is at least possible that it's some other alerting rule that is causing the alerts.

You showed a graph of ALERTS{alertname="CPUSQLUtilizationWarning"} but no binding between that alert name and your alerting ruleset, since you didn't show the alert rule.

I believe you can have multiple alert rules with the same name.  Maybe there's a copy-paste issue when you were duplicating an existing rule?  So actually it's a different alert which is triggering with this name?
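
As a sketch of what such a copy-paste situation could look like: the group name, the metric example_cpu_percent and the job label below are hypothetical placeholders, not your real config. Only the alert name and the 20/60 thresholds come from this thread, and I'm assuming Prometheus loads both rules without complaint.

```yaml
groups:
  - name: example-sql-cpu        # hypothetical group name
    rules:
      # The rule you believe is responsible for the alerts.
      - alert: CPUSQLUtilizationWarning
        expr: example_cpu_percent{job="sql"} > 20    # hypothetical metric and label
        for: 3m

      # A duplicated rule that kept the same alert name but has a
      # different expression (here: a different selector and no "for").
      # Its alerts appear under the same name in ALERTS and in OpsGenie.
      - alert: CPUSQLUtilizationWarning
        expr: example_cpu_percent{job="other"} > 20  # hypothetical metric and label
```

If something like this exists, the "Alerts" page will list both rules under the same name, each with its own expr, so it is easy to see which one is actually firing.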

Finally: use promtool to check your config:

    promtool check config /etc/prometheus/prometheus.yml