Hi,
I’ve been struggling with this for days and still have not found the root of the problem or a solution.
We have configured Prometheus to send alerts to Alertmanager based on Node Exporter data. Here are the rules defined in Prometheus:
- alert: DiskSpaceLow
  annotations:
    description: '{{ $labels.job }} reports remaining disk space on mountpoint {{ $labels.mountpoint }} is {{ $value }}%'
    summary: Remaining disk space is low
  expr: 100 * node_filesystem_avail_bytes{fstype="ext4"} / node_filesystem_size_bytes{fstype="ext4"} < 15
  for: 15m
  labels:
    severity: warning
- alert: DiskSpaceLow
  annotations:
    description: '{{ $labels.job }} reports remaining disk space on mountpoint {{ $labels.mountpoint }} is {{ $value }}%'
    summary: Remaining disk space is low
  expr: 100 * node_filesystem_avail_bytes{fstype="ext4"} / node_filesystem_size_bytes{fstype="ext4"} < 2
  labels:
    severity: critical
Here is a summarized version of our Alertmanager configuration:
global:
  resolve_timeout: 1m
route:
  group_by: ['alertname', 'severity']
  group_wait: 1m
  group_interval: 5m
  repeat_interval: 1d
  routes:
    - match:
        endpoint: metrics
      group_wait: 10s
      group_interval: 1m
      repeat_interval: 6h
      continue: true
      routes:
        - match:
            team: xxx
          receiver: xxx-team-receiver
    …
receivers:
  …
inhibit_rules:
  - target_match:
      severity: warning
    source_match:
      severity: critical
    equal:
      - alertname
Here is an example of what shows up in the Slack channel that receives these alerts:
17:27:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
17:32:05 — 2 DiskSpaceLow alerts firing (server1 and server2)
17:33:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
17:37:05 — DiskSpaceLow resolved
17:37:35 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
17:42:35 — DiskSpaceLow resolved
17:43:35 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
17:47:35 — DiskSpaceLow resolved
17:48:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
17:48:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
17:53:05 — 2 DiskSpaceLow alerts firing (server1 and server2)
17:54:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
17:58:05 — DiskSpaceLow resolved
17:59:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
18:04:05 — 2 DiskSpaceLow alerts firing (server1 and server2)
18:05:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
It keeps flapping like this forever. Needless to say, the disk space on these servers is not changing during this time, so the alerts should not resolve.
During these “resolved” windows I checked Prometheus’s web interface: the alerts are still firing there and never resolve. They do, however, disappear from Alertmanager’s web interface during those same windows.
I wrote a script that fetches the list of active alerts from Alertmanager’s API every 30 seconds to see what shows up there. Here is the weird thing I saw: at, say, 17:36:30, the alerts are present in the API response and their “endsAt” is set to 17:39:04 (three minutes after their updatedAt), but at 17:37:00 there are no alerts at all. The API does not return any of the previous alerts, even though their endsAt has not been reached yet.
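For context, a minimal sketch of a polling script along these lines, assuming the v2 HTTP API at /api/v2/alerts and a placeholder Alertmanager URL (adjust both for your setup):

#!/usr/bin/env python3
# Poll Alertmanager every 30 seconds and log each active alert's updatedAt/endsAt.
# Sketch only: ALERTMANAGER_URL is a placeholder, not our real host.
import time
from datetime import datetime

import requests

ALERTMANAGER_URL = "http://alertmanager.example.internal:9093"  # placeholder

def poll_once():
    # /api/v2/alerts returns a JSON array of alert objects containing
    # labels, startsAt, endsAt and updatedAt.
    resp = requests.get(f"{ALERTMANAGER_URL}/api/v2/alerts", timeout=10)
    resp.raise_for_status()
    alerts = resp.json()
    now = datetime.utcnow().isoformat(timespec="seconds")
    if not alerts:
        print(f"{now} no active alerts returned")
        return
    for alert in alerts:
        labels = alert.get("labels", {})
        print(f"{now} {labels.get('alertname', '<unknown>')} "
              f"{labels.get('instance', '<unknown>')} "
              f"updatedAt={alert.get('updatedAt')} endsAt={alert.get('endsAt')}")

if __name__ == "__main__":
    while True:
        try:
            poll_once()
        except requests.RequestException as exc:
            print(f"poll failed: {exc}")
        time.sleep(30)

The output of a loop like this is where I saw the alerts disappear from the API well before the endsAt timestamp they had just reported.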
Why does Alertmanager suddenly resolve/remove these alerts before their endsAt is reached?
Any help is appreciated, because I have been struggling with this problem for days. I even read the source code of Prometheus and Alertmanager but could not find anything there that could cause this.
BR,
Mostafa