I can't build expression that returns data when evaluating node_load1 and machine_cpu_cores metrics

Sven Nebel

Nov 10, 2017, 2:17:38 PM
to Prometheus Users
Hi,
I'm playing with Prometheus alerts, and I can't understand why the following alert expression does not fire when I generate enough load on the machine:

alert: node_load1
expr: node_load1 > (machine_cpu_cores * 0.25)
labels:
  severity: warning
  value: '{{ $value }}'
annotations:
  description: The load of {{ $labels.instance }} with {{ machine_cpu_cores }} cores is {{ $value }}
  summary: The load of {{ $labels.instance }} with {{ machine_cpu_cores }} cores is {{ $value }}


When I replace (machine_cpu_cores * 0.25) with a constant number, the alert works. In the Prometheus query interface, however, any expression that combines node_load1 and machine_cpu_cores with an operator returns no data. Any insight?

Thanks

Sven Nebel

Nov 10, 2017, 6:03:39 PM
to Prometheus Users
After carefully reading the documentation, I think I finally understand the issue: by default, there is no common label that I can use to match the two instant vectors, so the comparison returns nothing. I will try to find a common label, or a different way to compare the load average against the number of available cores.
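
One possible way to obtain a common label is sketched below. It is an assumption, not something stated in this thread: it supposes that both metrics carry an instance label of the form host:port and that only the port differs. In that case label_replace can derive a shared label to match on. The group name, alert name, and the host label are hypothetical.

```yaml
groups:
  - name: load_vs_cores            # hypothetical group name
    rules:
      - alert: NodeLoadHigh        # hypothetical alert name
        # Assumption: instance looks like "host:port" on both metrics.
        # label_replace copies the host part into a new "host" label,
        # and on (host) tells PromQL to match the two vectors on it.
        expr: |
          label_replace(node_load1, "host", "$1", "instance", "(.*):.*")
            > on (host)
          (label_replace(machine_cpu_cores, "host", "$1", "instance", "(.*):.*") * 0.25)
        labels:
          severity: warning
        annotations:
          # With on (host), the result carries only the host label.
          summary: 'Load on {{ $labels.host }} is {{ $value }}'
```

If the two exporters are labelled differently in a given setup, the regular expression and the label names would need to be adapted.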

Tobias Schmidt

Nov 10, 2017, 6:32:34 PM
to Sven Nebel, Prometheus Users
You can calculate the number of cores from the node_cpu metric. We commonly do this in a rule. Then you can use on(instance) for example.
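
A rough sketch of this suggestion could look like the following. The group, rule, and alert names are hypothetical, and counting mode="idle" series assumes that node_cpu exposes exactly one such series per core. The 0.25 factor is taken from the original question.

```yaml
groups:
  - name: node_capacity                    # hypothetical group name
    rules:
      # Record the number of CPU cores per instance
      # (assumes one mode="idle" series per core).
      - record: instance:node_cpus:count   # hypothetical rule name
        expr: count by (instance) (node_cpu{mode="idle"})

      # Compare the 1-minute load against a fraction of the core count,
      # matching the two vectors on the instance label only.
      - alert: NodeLoadHigh                # hypothetical alert name
        expr: node_load1 > on (instance) (instance:node_cpus:count * 0.25)
        labels:
          severity: warning
        annotations:
          summary: 'Load on {{ $labels.instance }} is {{ $value }}'
```

Both sides come from the same exporter here, so the instance label is shared and on (instance) is enough to match them.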

In general, load isn't really a good indicator of issues, and it makes a poor cause-based alert. I wouldn't recommend writing an alert for it. Good alerts cover actual user-facing symptoms like high error rates or latency. Another group covers resource constraints like disk space.

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To post to this group, send email to promethe...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/18c3d06e-8ee1-4465-a908-b017c44e176e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Sven Nebel

Nov 10, 2017, 6:41:27 PM
to Prometheus Users
I think I still have to play a bit more with the query language to achieve everything I want. Thanks for the advice, I will reconsider the alerts I want to put in place :-)

Ben Kochie

Nov 11, 2017, 3:09:06 AM
to Sven Nebel, Prometheus Users
Specifically, Tobias is referring to the ideas here:


This style of alerting drives a lot of the design behind Prometheus.


Sven Nebel

Nov 11, 2017, 3:34:13 AM
to Prometheus Users
Hi,
Thank you for the link, it's really inspiring. To help others who are building similar queries, here is the working query for what I initially intended to achieve:

node_load1 > on (instance) (count by (instance) (rate(node_cpu{name="node-exporter",mode="user"}[5m])) * 0.5)

I guess this could be used to deliver extra information in alerts/pages rather than as an alerting rule expression. I also realized that a superficial read of the documentation (as I initially did) is not enough: all the information is there, but you really have to read it all and be sure you understand the whole... slowly read it again and be sure you understand every step :-)

Thanks!

Sven Nebel

Nov 11, 2017, 3:40:38 AM
to Prometheus Users
Looks like I can't edit my last post, so let me correct the last sentence to make it clearer:

So if something doesn't work and/or makes no sense, slowly read it again and be sure you understand every step :-)