Simulate alerting rules on historical data

Will Sewell

Sep 28, 2021, 4:20:31 AM

to Prometheus Users

I'm thinking about ways we can reduce noisy alerts. One of the problems is that it's tricky to tweak alert thresholds without any data on the precision and recall of the alert. Getting this data is a non-trivial problem because a human is typically required to classify an alert as a true positive or a false positive, which makes it hard to fully automate. I am considering whether there is a way of obtaining this data using a hybrid approach: a human classifies an alert as a true positive or false positive, for example via a button in the alert body (e.g. in Slack or PagerDuty), and this gets sent to an analytics database, which we can later use to prioritise which alert thresholds need tweaking.
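To make the idea concrete, here is a minimal sketch of the analysis step, assuming the button clicks have already landed somewhere as (alert name, verdict) pairs. The verdict strings "tp" and "fp", the function names, and the alert names in the usage note are all hypothetical. It only estimates precision: recall would need a record of incidents that no alert fired for, which a button on the alert cannot capture.

```python
from collections import Counter


def precision_by_alert(feedback):
    """Return {alert_name: precision} from (alert_name, verdict) pairs.

    verdict is "tp" (true positive) or "fp" (false positive); these
    labels are an assumption of this sketch, not an Alertmanager field.
    """
    true_positives = Counter()
    classified = Counter()
    for alert_name, verdict in feedback:
        if verdict not in ("tp", "fp"):
            raise ValueError(f"unknown verdict: {verdict!r}")
        classified[alert_name] += 1
        if verdict == "tp":
            true_positives[alert_name] += 1
    # Precision = TP / (TP + FP), computed per alert.
    return {
        name: true_positives[name] / classified[name] for name in classified
    }


def noisiest_first(feedback):
    """Alert names sorted by ascending precision (ties broken by name)."""
    scores = precision_by_alert(feedback)
    return sorted(scores, key=lambda name: (scores[name], name))
```

For example, feeding in two "fp" and one "tp" verdict for a made-up HighLatency alert and one "tp" for DiskFull would put HighLatency first in the list of thresholds to review.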

My question is, is there any precedent for this kind of system in the Prometheus/Alertmanager ecosystem? i.e. open source software that does this out of the box, or experience report blog posts?

Many thanks,

Will

sayf.eddi...@gmail.com

Sep 29, 2021, 1:33:55 AM

to Prometheus Users

Hello,

No, I am not aware of such a tool, but it shouldn't be hard to write a simple exporter (maybe using the Python prometheus_client library) to replay historical data and expose it to a Prometheus/Alertmanager setup.

Or, given that the alerts are also stored in the TSDB, you could build something that walks through the data over time, detects when the state of an alert changed to "pending" or "firing", and checks the thresholds.
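As a rough sketch of the second idea: Prometheus records active alerts in the ALERTS series, with an alertstate label of "pending" or "firing". Assuming you have already pulled the sample timestamps for one alert out of a range query at a fixed step (the fetching is left out here, and the function name is made up), consecutive samples can be grouped into firing episodes like this:

```python
def firing_episodes(timestamps, step):
    """Group sample timestamps into contiguous (start, end) episodes.

    timestamps: times (in seconds) at which the alert series had a sample.
    step: the range-query resolution in seconds; a gap larger than this
    is treated as the alert having resolved in between.
    """
    episodes = []
    for ts in sorted(timestamps):
        if episodes and ts - episodes[-1][1] <= step:
            # Still within one step of the previous sample: same episode.
            episodes[-1][1] = ts
        else:
            # First sample, or a gap: the alert started firing again.
            episodes.append([ts, ts])
    return [tuple(episode) for episode in episodes]
```

Counting the episodes per alert over a few weeks gives a first idea of which alerts fire most often, before any human classification is available.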

Brian Candler

Sep 29, 2021, 3:08:55 AM

to Prometheus Users

It's maybe worth mentioning that alert thresholds can come from their own timeseries, as described here:

https://www.robustperception.io/using-time-series-as-alert-thresholds

That can help automate these updates. You don't need to modify any rules; instead you expose the 'current' value of the threshold, let it be scraped as normal, and modify the threshold as required. As a bonus, you get a full history of which thresholds were used and when.
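A minimal sketch of what exposing the threshold could look like, using only the standard library to render the Prometheus text exposition format. The metric name, the "service" label, and the function name are assumptions for illustration; in practice a client library such as prometheus_client would normally do the rendering and serve it over HTTP.

```python
def render_thresholds(metric_name, thresholds):
    """Render {service: threshold} as Prometheus text exposition format."""

    def escape(value):
        # Label values must escape backslash, double quote and newline.
        return (
            value.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )

    lines = [
        f"# HELP {metric_name} Current alert threshold per service.",
        f"# TYPE {metric_name} gauge",
    ]
    for service, value in sorted(thresholds.items()):
        lines.append(f'{metric_name}{{service="{escape(service)}"}} {value}')
    return "\n".join(lines) + "\n"
```

Changing a threshold then means changing the value this endpoint serves, and the alerting rule compares the real metric against the scraped threshold series rather than a hard-coded number.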

But I don't know of a complete out-of-box solution which provides the full feedback loop.

l.mi...@gmail.com

Sep 30, 2021, 10:17:20 AM

to Prometheus Users

https://github.com/cloudflare/pint can try to estimate the number of times an alert would trigger, if you configure it to do so; see the example at https://github.com/cloudflare/pint/blob/main/docs/examples/config.hcl#L58

Not exactly what you're looking for, but it can be useful to find alerts that would fire too often or would never fire (if you know it should fire).

Will Sewell

Sep 30, 2021, 11:47:26 AM

to Prometheus Users

Thanks all. This is all really helpful.

pint looks useful in general, and it also looks like it will be helpful for this particular use case. Thanks for sharing! For future readers, I think the docs here are most helpful: https://github.com/cloudflare/pint/blob/main/docs/CONFIGURATION.md#alerts.
