Alertmanager frequently sending erroneous resolve notifications


Sarah Dundras

May 18, 2024, 2:50:32 PM
to Prometheus Users

Hi, this problem is driving me mad:

I am monitoring backups that log their results to a text file. The file is being picked up and all is well, and the alerts are fine too, BUT Alertmanager frequently sends out odd "resolved" notifications even though the firing status never changed!

Here's one of the alert rules that does this:

- alert: Restic Prune Freshness
  expr: restic_prune_status{uptodate!="1"} and restic_prune_status{alerts!="0"}
  for: 2d
  labels:
    topic: backup
    freshness: outdated
    job: "{{ $labels.restic_backup }}"
    server: "{{ $labels.server }}"
    product: veeam
  annotations:
    description: "Restic Prune for '{{ $labels.backup_name }}' on host '{{ $labels.server_name }}' is not up-to-date (too old)"
    host_url: "https://backups.example.com/d/3be21566-3d15-4238-a4c5-508b059dccec/restic?orgId=2&var-server_name={{ $labels.server_name }}&var-result=0&var-backup_name=All"
    service_url: "https://backups.example.com/d/3be21566-3d15-4238-a4c5-508b059dccec/restic?orgId=2&var-server_name=All&var-result=0&var-backup_name={{ $labels.backup_name }}"
    service: "{{ $labels.job_name }}"

What can be done?

Brian Candler

May 18, 2024, 4:54:07 PM
to Prometheus Users
> What can be done?

Perhaps the alert condition resolved very briefly. With a modern version of Prometheus (v2.42.0 or later), the solution is to add keep_firing_for to the rule:

for: 2d
keep_firing_for: 10m

The alert won't be resolved unless the alert expression has returned nothing *continuously* for 10 minutes. (Of course, this means your "resolved" notifications will be delayed by 10 minutes, but that's basically the whole point: don't send them until you're sure the alert isn't going to retrigger.)
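
For context, here is a sketch of where those two lines would sit in a complete rule file. The group name is a placeholder I made up, and the annotations are left out for brevity; the rest is copied from the rule in the original post.

```yaml
groups:
  - name: restic-backups        # hypothetical group name, not from the original post
    rules:
      - alert: Restic Prune Freshness
        expr: restic_prune_status{uptodate!="1"} and restic_prune_status{alerts!="0"}
        for: 2d                 # condition must hold for 2 days before the alert fires
        keep_firing_for: 10m    # keep firing until expr has returned nothing for 10 minutes
        labels:
          topic: backup
          freshness: outdated
```

The 10m value is only a starting point. It probably needs to be longer than the longest gap you expect in the metric, for example while the text file is being rewritten.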

The other alternative is simply to turn off resolved notifications entirely. This approach sounds odd but has a lot to recommend it:

The point is that if a problem occurred which was serious enough to alert on, then it requires investigation before the case can be "closed": either there's an underlying problem, or, if it was a false positive, the alert condition needs tuning. Sending a resolved message encourages laziness ("oh, it fixed itself, no further work required"). Also, turning off resolved messages instantly cuts your notifications by up to 50%.
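
As a rough sketch, resolved notifications are switched off per receiver in the Alertmanager configuration with send_resolved. The receiver name and webhook URL below are invented for illustration, and I'm assuming a webhook receiver; the same field exists on the other receiver types (email_configs, slack_configs, and so on), though its default differs between them.

```yaml
route:
  receiver: backup-team           # hypothetical receiver name
receivers:
  - name: backup-team
    webhook_configs:
      - url: "http://alert-receiver.example.internal/hook"   # placeholder URL
        send_resolved: false      # only send "firing" notifications
```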