How to debug possible false positive alarm?


kumuthi...@gmail.com

Oct 2, 2018, 10:14:38 AM
to Prometheus Users
Hi,

I have 2 CPU usage alerts set up in Prometheus:

alert: Cpu_Usage_Greater_Than_70_Pct
expr: cpu:usage > 70
labels:
  severity: warning
annotations:
  description: CPU Usage on these nodes is greater than 70 pct (over 5m)
  severity: warning
  summary: 'WARNING: CPU Usage is greater than 70 pct'

alert: Cpu_Usage_Greater_Than_90_Pct
expr: cpu:usage > 90
labels:
  severity: danger
annotations:
  description: CPU Usage on these nodes is greater than 90 pct (over 5m)
  severity: danger
  summary: 'DANGER: CPU Usage is greater than 90 pct'


Where cpu:usage is defined as:

File: recording_rules.yml; Group name: Cpu Usage Percentage (over 5m)
-------
record: cpu:usage
expr: 100 * (1 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) BY (instance, job)))



This morning, the "cpu usage greater than 90 pct" alert fired (and was sent to Alertmanager, which emailed several people), but the 70% one did not fire. Upon further investigation of the Prometheus DB (via the /graph GUI), I see that CPU usage was never greater than 40% on any node for several days.
This seems to be a false positive alarm.

Is there a way for me to debug the root cause? 


Simon Pasquier

Oct 3, 2018, 4:32:30 AM
to kumuthi...@gmail.com, Prometheus Users
To be sure that cpu:usage wasn't ever above 90, you would need the raw data (for instance, "cpu:usage[24h]" to get all samples for the last 24 hours).
You can also check the ALERTS time series and check whether, when and how long your alerts have fired.
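
As a rough sketch, both checks could be entered in the expression browser like this. The alert name and cpu:usage come from the rules above; the 24h window is only an example and should cover the time the notification was sent. Range selectors such as [24h] return raw samples, so they have to be run in the Console (table) view rather than the Graph view.

```promql
# All raw samples of the recorded series over the last 24 hours.
cpu:usage[24h]

# State of the 90% alert: one series per alert instance,
# labelled alertstate="pending" or alertstate="firing", with value 1 while active.
ALERTS{alertname="Cpu_Usage_Greater_Than_90_Pct"}

# The same series as raw samples, to read off when the alert started and how long it stayed active.
ALERTS{alertname="Cpu_Usage_Greater_Than_90_Pct"}[24h]
```

The instance and job labels on the ALERTS series show which node triggered the alert, which narrows down which cpu:usage series to inspect.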

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To post to this group, send email to promethe...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/944c00d8-6fb1-4b27-82bc-16da9f957659%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

James S

Jul 8, 2021, 1:47:57 PM
to Prometheus Users
I have the same issue with CPU usage on one node: Prometheus is firing a false positive.
I checked in GCP monitoring, and CPU usage there is not over 80%.

Julius Volz

Jul 10, 2021, 5:41:06 PM
to James S, Prometheus Users
Hi,

Could it be that when graphing the CPU usage, the graph resolution was just low and thus it might have skipped over a short spike in the rate?

Try:

   max_over_time(cpu:usage[3d])

...or something like that to make sure that you are really looking at all samples within a given time range, not just a subset depending on the graph resolution.
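
A filtered variant can make that check quicker to read. This is only a sketch: the 90 threshold is taken from the alert in the original post, and the 3d window is arbitrary and needs to cover the time the alert fired.

```promql
# Peak recorded value per series over the last 3 days.
max_over_time(cpu:usage[3d])

# Only the series whose peak exceeded the 90% alert threshold.
# An empty result means no recorded sample in the window was above 90.
max_over_time(cpu:usage[3d]) > 90
```

Both are instant queries, so the result does not depend on the graph resolution.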

Not sure though why the 70% one wouldn't have fired if the 90% did, assuming the alerts were in the same rule group with same intervals (and thus evaluation timestamps).

Btw., you most likely want to have some "for" duration on that alert to make it less sensitive and/or also use rate() vs. irate() in the underlying recording rule to actually look at 5m worth of CPU usage vs. just at the last two samples of the 5m window.
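
To illustrate the rate() vs. irate() point, here is the recording rule expression from the original post next to a rate()-based variant. The second expression is a sketch of the suggestion, not a tested replacement for anyone's setup.

```promql
# Original: irate() only uses the last two samples in each 5m window,
# so a single short spike between two scrapes can push the result above 90.
100 * (1 - avg by (instance, job) (irate(node_cpu_seconds_total{mode="idle"}[5m])))

# Variant: rate() averages the per-second increase over the whole 5m window,
# which matches the "(over 5m)" wording in the alert descriptions.
100 * (1 - avg by (instance, job) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
```

The "for" duration is a separate field on each alerting rule (for example for: 5m, where the value is an assumption to tune), and it requires the condition to hold across consecutive evaluations before the alert moves from pending to firing.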

Regards,
Julius


--
Julius Volz
PromLabs - promlabs.com