Selective Blacklist on Alert Rules

Dave Cadwallader

Aug 15, 2018, 1:22:06 PM
to Prometheus Users
Hello!

I'd like to selectively apply a blacklist to an alert rule that's noisy and unhelpful in only 1% of cases.

Background:

We're using the predict_linear function to give us early warnings about disks filling up, and it's working great!

  - alert: DiskWillFillIn4Hours
    expr: predict_linear(node_filesystem_avail{job="node"}[1h], 4 * 3600) < 0
    for: 60m

The Problem:

The only problem is that a few specific partitions on a few specific instances run nightly ETL batch jobs, and we expect disk usage on those partitions to spike dramatically while the jobs run.

Let's say that I wanted to exclude:

1. The /etl-data partition on the etl-processor instance
2. The /agg-data partition on the data-aggregator instance

Options:

1. Add exclusions directly into the alert rule:

  - alert: DiskWillFillIn4Hours
    expr: predict_linear(node_filesystem_avail{job="node", instance!~"etl-processor|data-aggregator", mountpoint!~"/etl-data|/agg-data"}[1h], 4 * 3600) < 0
    for: 60m

But this seems unmaintainable and makes the rule hard to reason about. It's also broader than the two pairs I listed, since label matchers AND together (it excludes every partition on those two instances, and those two mountpoints on every instance); a pair-exact variant is sketched after this list.

2. Flip this to be a whitelist. But I'd rather leave this alert in place as the default for the 99% of our instances, and not have to remember to add it for each of them.

3. Add a long-term silence (100y) to my Alertmanagers matching these instances and partitions (an example command follows this list). But this seems hacky, and if I understand correctly it would not survive a restart of the Alertmanagers.
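
For reference, here is a sketch of a pair-exact variant of option 1, using unless against explicit selectors. This is just my own sketch, reusing the metric and labels from above:

  - alert: DiskWillFillIn4Hours
    expr: |
      predict_linear(node_filesystem_avail{job="node"}[1h], 4 * 3600) < 0
      unless on(instance, mountpoint) (
          node_filesystem_avail{instance="etl-processor", mountpoint="/etl-data"}
        or node_filesystem_avail{instance="data-aggregator", mountpoint="/agg-data"}
      )
    for: 60m

And for option 3, the silences could at least be scripted with amtool rather than clicked into the UI, one per excluded pair. The URL, author, and comment below are placeholders; also, as far as I can tell, silences are snapshotted under Alertmanager's storage path, so they may in fact survive restarts:

  amtool silence add \
    --alertmanager.url=http://localhost:9093 \
    --author=dave \
    --comment="nightly ETL batch job fills this partition by design" \
    --duration=876000h \
    alertname=DiskWillFillIn4Hours instance=etl-processor mountpoint=/etl-data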

Are there any cleaner ways that I can maintain a blacklist for an alert? Or is there another strategy you'd recommend for selectively suppressing it?

Thanks!
Dave

Brian Brazil

Aug 15, 2018, 2:22:05 PM
to Dave Cadwallader, Prometheus Users
On 15 August 2018 at 18:22, Dave Cadwallader <dcadwa...@gmail.com> wrote:
Hello!

I'd like to selectively apply a blacklist to an alert rule that's noisy and unhelpful in only 1% of cases.

Background:

We're using the predict_linear function to give us early warnings about disks filling up, and it's working great!

  - alert: DiskWillFillIn4Hours
    expr: predict_linear(node_filesystem_avail{job="node"}[1h], 4 * 3600) < 0
    for: 60m

The Problem:

The only problem is that a few specific partitions on a few specific instances run nightly ETL batch jobs, and we expect disk usage on those partitions to spike dramatically while the jobs run.

Let's say that I wanted to exclude:

1. The /etl-data partition on the etl-processor instance
2. The /agg-data partition on the data-aggregator instance

Options:

1. Add exclusions directly into the alert rule:

  - alert: DiskWillFillIn4Hours
    expr: predict_linear(node_filesystem_avail{job="node", instance!~"etl-processor|data-aggregator", mountpoint!~"/etl-data|/agg-data"}[1h], 4 * 3600) < 0
    for: 60m

But this seems unmaintainable and makes the rule hard to reason about. It's also broader than the two pairs I listed, since label matchers AND together (it excludes every partition on those two instances, and those two mountpoints on every instance).

This is how I'd usually deal with this. In more complex setups you end up doing something like https://www.robustperception.io/using-time-series-as-alert-thresholds, but this case is still on the simple side of things.
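
For completeness, a minimal sketch of that pattern applied to this case, as an exclusion list rather than a threshold (the node_filesystem_alert_excluded metric name is made up for illustration). First, recording rules that mint one series per excluded pair:

  # vector(1) carries no labels of its own, so each rule's labels block
  # defines one (instance, mountpoint) pair to exclude.
  - record: node_filesystem_alert_excluded
    expr: vector(1)
    labels:
      instance: etl-processor
      mountpoint: /etl-data
  - record: node_filesystem_alert_excluded
    expr: vector(1)
    labels:
      instance: data-aggregator
      mountpoint: /agg-data

Then the alert joins against that metric, so the rule itself stays generic:

  - alert: DiskWillFillIn4Hours
    expr: |
      predict_linear(node_filesystem_avail{job="node"}[1h], 4 * 3600) < 0
      unless on(instance, mountpoint) node_filesystem_alert_excluded
    for: 60m

Adding or removing an exclusion is then a one-stanza change to the recording rules, and the metric could equally come from any other source.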

Brian
 

2. Flip this to be a whitelist. But I'd rather leave this alert in place as the default for the 99% of our instances, and not have to remember to add it for each of them.

3. Add a long-term silence (100y) to my Alertmanagers matching these instances and partitions. But this seems hacky, and if I understand correctly it would not survive a restart of the Alertmanagers.

Are there any cleaner ways that I can maintain a blacklist for an alert? Or is there another strategy you'd recommend for selectively suppressing it?

Thanks!
Dave
