Hello!
I'd like to selectively apply a blacklist to an alert rule that is noisy and not useful in about 1% of cases.
Background:
We're using the predict_linear function to give us early warnings about disks filling up, and it's working great!
  - alert: DiskWillFillIn4Hours
    expr: predict_linear(node_filesystem_avail{job="node"}[1h], 4 * 3600) < 0
    for: 60m
The Problem:
A few specific partitions on a few specific instances run nightly ETL batch processing. On those we expect disk usage to spike quickly while the jobs run, so the alert fires even though nothing is wrong.
Let's say that I wanted to exclude:
1. The /etl-data partition on the etl-processor instance
2. The /agg-data partition on the data-aggregator instance
Options:
1. Add exclusions directly into the alert rule:
  - alert: DiskWillFillIn4Hours
    expr: predict_linear(node_filesystem_avail{job="node", instance!~"etl-processor|data-aggregator", mountpoint!~"/etl-data|/agg-data"}[1h], 4 * 3600) < 0
    for: 60m
But this seems unmaintainable, and it makes the rule hard to reason about.
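One thing worth noting about the selectors in option 1: because the two negative matchers are independent, they drop every partition on both instances, and also drop /etl-data and /agg-data on every other instance. A narrower variant, sketched here on the assumption that the instance and mountpoint label values are exactly as written above (in many setups the instance label also carries a port), would use PromQL's unless operator to remove only the two exact pairs:

```yaml
  - alert: DiskWillFillIn4Hours
    # Sketch only: label values are taken from the example above and may
    # not match real instance labels (e.g. "etl-processor:9100").
    # "unless" drops any series on the left whose label set also appears
    # on the right, so only the two listed (instance, mountpoint) pairs
    # are excluded.
    expr: |
      (predict_linear(node_filesystem_avail{job="node"}[1h], 4 * 3600) < 0)
      unless
        node_filesystem_avail{job="node", instance="etl-processor", mountpoint="/etl-data"}
      unless
        node_filesystem_avail{job="node", instance="data-aggregator", mountpoint="/agg-data"}
    for: 60m
```

This is more precise, but it still keeps the blacklist inside the rule itself, so the maintainability concern remains.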
2. Flip this to be a whitelist. But I'd rather leave this alert in place as the default for the 99% of our instances, and not have to remember to add it for each of them.
3. Add a long-term silence (100y) to my Alertmanagers, matching on these partitions and labels. But this seems hacky and, if I understand correctly, would not survive a restart of the Alertmanagers.
Are there any cleaner ways that I can maintain a blacklist for an alert? Or any other strategy you'd recommend for selectively excluding cases like these?
Thanks!
Dave