Hello everyone!
I'd like to learn about some best practices for handling exception cases in alerts. Let's say we are monitoring node_exporter metrics like system load or disk usage. Most servers typically stay below the alert threshold, but a few (1-2) run above or close to it as part of normal operation.
What is the best way to alert when metric X passes a threshold on most servers, while applying a different rule (or threshold) to the servers that already run close to X as part of normal operation?
In my case, a few servers typically have high CPU usage, while others have high disk usage.
Should I create different rules and filter by job? This doesn't look like it would scale if more servers creep toward the threshold in the future.
Should I increase the threshold for all servers? In that case, some typically idle servers might get overloaded and I wouldn't be notified until it's too late.
Should I put the threshold in a label and maintain it per server? Can I have defaults in a simple way and only use the label for overrides? That should reduce the number of rules I need to maintain (see the sketch below for what I'm imagining).
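To make that last option concrete, here is a rough, untested sketch of what I have in mind: a synthetic per-instance threshold series created by a recording rule, with the default kept in the alert expression itself. The rule/series name instance:node_load1_threshold, the value 15, and the instance busy-host:9100 are made up for illustration:

groups:
  - name: load_thresholds
    rules:
      # Override: this one busy host is allowed a higher load average.
      # Names and values here are illustrative, not from my real config.
      - record: instance:node_load1_threshold
        expr: vector(15)
        labels:
          instance: "busy-host:9100"

  - name: load_alerts
    rules:
      - alert: high_cpu_load
        # Hosts that have an override series are compared against it;
        # all other hosts fall back to the default of 5.
        expr: >
          node_load1{send_alerts="True"} > on(instance) group_left() instance:node_load1_threshold
          or
          (node_load1{send_alerts="True"} > 5 unless on(instance) instance:node_load1_threshold)
        for: 10m
        labels:
          severity: warning

Would that be a reasonable way to keep the default in the alert rule and only maintain the per-server overrides, or is there a more idiomatic approach?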
For reference, these are the rules I'm currently using:
  - alert: high_cpu_load
    expr: node_load1{send_alerts="True"} > 5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Server under high load"
      description: "[{{$labels.job}}] Host is under high load, the avg load 1m is at {{$value}}. Reported by instance {{ $labels.instance }}."
  - alert: high_storage_load
    expr: (node_filesystem_size_bytes{fstype="ext4", send_alerts="True"} - node_filesystem_free_bytes{fstype="ext4", send_alerts="True"}) / node_filesystem_size_bytes{fstype="ext4", send_alerts="True"} * 100 > 85
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Server storage is almost full"
      description: "[{{$labels.job}}] Host storage usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }}."
Thanks for any advice!
Regards,
Adrian