node_exporter CPU underutilized alert

mel

Jun 22, 2024, 12:01:58 PM
to Prometheus Users
I have this CPU underutilized alert for virtual machines.

expr: '(100 - (rate(node_cpu_seconds_total{mode="idle"}[30m]) * 100) < 20) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'

for: 1w

The problem is that I get alerted even when the VM has only 1 CPU, in which case I cannot reduce it any further. I want the alert to fire only when the number of CPUs is > 1.

Brian Candler

Jun 23, 2024, 5:57:40 AM
to Prometheus Users
node_cpu_seconds_total gives you a separate time series for each CPU, so with an 8 vCPU VM you'll get 8 alerts (if they're all under 20%).

You're saying that you're happy with all these alerts, but want to suppress them where the VM has only one vCPU?  In that case:

    count by (instance) (node_cpu_seconds_total{mode="idle"})

will give you the number of CPUs per instance, and hence you can modify your alert to something like

    expr: ( ......... ) unless on (instance) (count by (instance) (node_cpu_seconds_total{mode="idle"}) == 1)

which would give something like this (the == 1 is applied to the result of count, not to the raw counter values):

    (
        (100 - (rate(node_cpu_seconds_total{mode="idle"}[30m]) * 100) < 20)
      unless on (instance)
        (count by (instance) (node_cpu_seconds_total{mode="idle"}) == 1)
    )
    * on (instance) group_left (nodename)
      node_uname_info{nodename=~".+"}

Aside 1: Personally I like to keep ratios as fractions (0 to 1) rather than multiplying them by 100. You can render them as percentages in alert annotations using the humanizePercentage template function.
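
As a rough sketch of Aside 1, the rule could keep the value as a fraction and format it only in the annotation. The alert name, the annotation key and its wording are placeholders made up for illustration; the 0.2 threshold is just the original 20% written as a fraction.

    - alert: CPUUnderutilized   # hypothetical name
      expr: |
        (1 - rate(node_cpu_seconds_total{mode="idle"}[30m]) < 0.2)
        * on (instance) group_left (nodename)
          node_uname_info{nodename=~".+"}
      for: 1w
      annotations:
        # $value is the busy fraction; humanizePercentage renders it as a percentage
        summary: 'CPU {{ $labels.cpu }} on {{ $labels.nodename }} is only {{ $value | humanizePercentage }} busy'

Multiplying by node_uname_info does not change the value, because that series always has the value 1.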

Aside 2: It might be better to aggregate the usage of all CPUs for an instance. Otherwise, if you have 8 mostly-idle CPUs, but each CPU in turn has a short burst of activity, you'll get no alerts.  To do this, you should take the sum of the rates, i.e. sum(rate(...)), not the rate of a sum.
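
A sketch of what Aside 2 could look like, untested and assuming avg by (instance) over the per-CPU rates (which is the sum of the rates divided by the number of CPUs). The aggregation is applied after rate(), so counter resets are still handled per series.

    (
        (100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[30m]))) < 20)
      unless on (instance)
        (count by (instance) (node_cpu_seconds_total{mode="idle"}) == 1)
    )
    * on (instance) group_left (nodename)
      node_uname_info{nodename=~".+"}

This yields at most one alert per instance instead of one per CPU, and a short burst on a single CPU only moves the average slightly.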

mel

Jun 23, 2024, 8:55:44 PM
to Prometheus Users
Ah, good points. Thanks a bunch