Best practices when handling exceptions in alerts


Adrian Popa

Jan 13, 2022, 2:41:33 AM
to Prometheus Users
Hello everyone!
I'd like to learn some best practices for handling exception cases in alerts. Let's say we are monitoring "node_exporter" metrics like system load or disk space used. Most servers typically stay below the alert threshold, but a few (1-2) run above or close to it as part of normal operation.

What is the best way to alert when metric X passes a threshold on most servers, but set a different rule for the ones that already run close to X?

In my case, a few servers typically have high CPU usage, while others have high disk space usage.

Should I create different rules and filter by job? This looks like it wouldn't scale if I get more servers closer to the threshold in the future.

Should I increase the threshold for all? In this case some typically idle servers might get overloaded and I wouldn't be notified until it's too late.

Should I add the threshold in a label and maintain it per server? Can I have defaults in a simple way and only use the label to do overrides? This should reduce the number of rules I need to maintain.
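
For illustration, something along these lines is what I have in mind. The node_load1_threshold recording rules, the host names and the values below are just placeholders; every instance without an override would fall back to a default of 5:

groups:
  - name: load-thresholds
    rules:
      # Per-host overrides (host names and values are made up).
      - record: node_load1_threshold
        expr: vector(15)
        labels:
          instance: "busy-host-1:9100"
      - record: node_load1_threshold
        expr: vector(12)
        labels:
          instance: "busy-host-2:9100"

  - name: load-alerts
    rules:
      - alert: high_cpu_load
        # Compare each host against its override if one exists,
        # otherwise against a default of 5 derived from the metric itself.
        expr: |
          node_load1{send_alerts="True"}
            > on(instance) group_left()
          (
            node_load1_threshold
            or on(instance)
            (count by(instance) (node_load1{send_alerts="True"}) * 0 + 5)
          )
        for: 10m
        labels:
          severity: warning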

As an example, here are the rules I'm currently using:
  - alert: high_cpu_load
    expr: node_load1{send_alerts="True"} > 5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Server under high load"
      description: "[{{$labels.job}}] Host is under high load, the avg load 1m is at {{$value}}. Reported by instance {{ $labels.instance }}."
  - alert: high_storage_load
    expr: (node_filesystem_size_bytes{fstype="ext4", send_alerts="True"} - node_filesystem_free_bytes{fstype="ext4", send_alerts="True"}) / node_filesystem_size_bytes{fstype="ext4", send_alerts="True"} * 100 > 85
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Server storage is almost full"
      description: "[{{$labels.job}}] Host storage usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }}."

Thanks for any advice!
Regards,
Adrian

Brian Candler

Jan 13, 2022, 3:00:28 AM
to Prometheus Users
On Thursday, 13 January 2022 at 07:41:33 UTC adrian....@gmail.com wrote:
What is the best way to alert when metric X passes a threshold on most servers, but set a different rule for the ones that already run close to X?


You can also alert on trends rather than static thresholds: for disk space, for example, you can use predict_linear to detect when a filesystem looks like it's going to become full. See this thread.
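
For instance, a rule along these lines (the selector, the 6h sample window and the 4h lead time are only illustrative):

  - alert: filesystem_filling_up
    # Fit a linear trend over the last 6h of samples and fire if the
    # filesystem is predicted to run out of space within the next 4h.
    expr: predict_linear(node_filesystem_free_bytes{fstype="ext4"}[6h], 4 * 3600) < 0
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: "Filesystem predicted to fill up"
      description: "[{{$labels.job}}] {{ $labels.mountpoint }} on {{ $labels.instance }} is predicted to run out of space within 4 hours."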

However, I'd also caution you against setting alerts on causes; concentrate your alerting on symptoms instead. You can't avoid all cause-based alerts, but you can minimise them.

"CPU load" for example, is not a particularly useful metric to alert on.  Suppose the CPU load hits 99% at 3am in the morning, but the service is still working fine.  Do you really want to get someone out of bed for this?  And if you do get them out of bed, what exactly are they going to do about it anyway?

This document, which is only a few pages, is well worth reading:

Adrian Popa

Jan 14, 2022, 6:16:16 AM
to Brian Candler, Prometheus Users
Thank you!
I remember seeing it somewhere in the past, but couldn't recall where.

Regarding system load: even if it triggers an alert at 3 AM, as long as it goes to email and gets checked in the morning, I think it's fine. At least you're not missing out on (potentially) abnormal behavior.
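
In Alertmanager that split can be expressed roughly like this (receiver names and the address are placeholders; SMTP and pager settings are omitted):

route:
  receiver: email-team            # default: warnings only go to email
  routes:
    - matchers:
        - severity = "critical"
      receiver: pager             # only critical alerts page someone

receivers:
  - name: email-team
    email_configs:
      - to: "ops@example.com"     # placeholder address
  - name: pager
    # pagerduty_configs / webhook_configs etc. would go here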
