Prometheus alerts with a $value that is below the threshold


J Houer

Apr 24, 2020, 5:18:28 AM
to Prometheus Users

We use a Prometheus alert (and node-exporter) to check whether we are running out of memory on a node.


Issue: In many cases I get an alert with a $value that is below the threshold value in the expression. The expression is:


alert: GettingOutOfMemory
expr: max(sum by(instance) (((node_memory_MemTotal_bytes - (node_memory_MemFree_bytes
  + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes)
  * 100)) >= 90
for: 5m
labels:
  severity: warning
annotations:
  description: Docker Swarm node {{ $labels.instance }} memory usage is at {{ humanize $value }}%.
  summary: Memory is getting low for Swarm node '{{ $labels.node_name }}'


I get alerts saying that we are running out of memory at e.g. 63%, i.e. $value is 63. This is clearly below the 90% threshold.

Why do I get this alert even though the $value is below the threshold?


How can I repair this Prometheus alert rule so that I only get alerts when the $value is above the threshold?

Julius Volz

Apr 24, 2020, 10:04:03 AM
to J Houer, Prometheus Users
This is very strange, to say the least.

With the >= 90 filter on the top level of the expression, the alerting rule should never return series that have a sample value below 90, and thus $value should never be below 90 either. In case this is really not a misconfiguration somewhere (please double-check, and also try the latest Prometheus version), it would be interesting if you can make this a reproducible case somehow for a bug report.
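
For the double-check, "promtool test rules" can pin down whether the rule itself ever fires below 90. Here is a minimal sketch of such a unit test (it assumes the rule is stored in a file named alerts.yml; the instance name and series values are made up for illustration):

rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # 100 bytes total, 40 bytes free+buffers+cached => 60% used, below the threshold.
      - series: 'node_memory_MemTotal_bytes{instance="node1"}'
        values: '100x10'
      - series: 'node_memory_MemFree_bytes{instance="node1"}'
        values: '20x10'
      - series: 'node_memory_Buffers_bytes{instance="node1"}'
        values: '10x10'
      - series: 'node_memory_Cached_bytes{instance="node1"}'
        values: '10x10'
    alert_rule_test:
      - eval_time: 10m
        alertname: GettingOutOfMemory
        exp_alerts: []  # 60% usage must not produce an alert for a >= 90 rule.

Run it with "promtool test rules test.yml"; if the test passes but you still receive sub-90 alerts in production, that points away from the rule and towards the surrounding setup.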


J Houer

Apr 24, 2020, 11:40:20 AM
to Prometheus Users
Is there a way to print the values of the component parts as well, e.g. the value of node_memory_MemTotal_bytes?
Or is it possible to have the evaluation steps logged?



Julius Volz

Apr 25, 2020, 6:18:39 AM
to J Houer, Prometheus Users
On Fri, Apr 24, 2020 at 5:40 PM J Houer <joha...@gmail.com> wrote:
Is there a way to print the values of the component parts as well, e.g. the value of node_memory_MemTotal_bytes?
Or is it possible to have the evaluation steps logged?

It's a bit cumbersome, but you could include a separate query in the annotation template that fetches that metric for the particular host generating the alert. For that you need to preserve the instance label in your original alerting query, so you could e.g. change the max() to topk(1, ...), which doesn't aggregate the label away but just selects the top series. Then the host is available as $labels.instance in the annotation template, and you can use the "query" template function, see: https://prometheus.io/docs/prometheus/latest/configuration/template_examples/#display-one-value

So you'd have something like this in your annotation template:

{{ with printf "node_memory_MemTotal_bytes{instance='%s'}" $labels.instance | query }}
  {{ . | first | value | humanize1024 }}B
{{ end }}
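
And to preserve the instance label, the expression in the rule would change along these lines (a sketch only; it is your original expression with max() swapped for topk(1, ...), everything else unchanged):

# topk(1, ...) returns the single highest series together with its labels,
# whereas max() drops them, so $labels.instance stays usable in annotations.
expr: topk(1, sum by(instance) (((node_memory_MemTotal_bytes - (node_memory_MemFree_bytes
  + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes)
  * 100)) >= 90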
 

J Houer

Apr 25, 2020, 8:23:34 AM
to Prometheus Users
Thank you Julius Volz!!


