Alertmanager keeps resolving and reopening alerts

Mostafa Hajizadeh

Jan 19, 2020, 4:11:55 AM
to Prometheus Users
Hi,

I’ve been struggling with this for days and still have not found the root of the problem or a solution.

We have configured Prometheus to send alerts to Alertmanager based on Node Exporter data. Here are the rules defined in Prometheus:

    - alert: DiskSpaceLow
      annotations:
        description: '{{ $labels.job }} reports remaining disk space on mountpoint {{ $labels.mountpoint }} is {{ $value }}%'
        summary: Remaining disk space is low
      expr: 100 * node_filesystem_avail_bytes{fstype="ext4"} / node_filesystem_size_bytes{fstype="ext4"} < 15
      for: 15m
      labels:
        severity: warning
    - alert: DiskSpaceLow
      annotations:
        description: '{{ $labels.job }} reports remaining disk space on mountpoint {{ $labels.mountpoint }} is {{ $value }}%'
        summary: Remaining disk space is low
      expr: 100 * node_filesystem_avail_bytes{fstype="ext4"} / node_filesystem_size_bytes{fstype="ext4"} < 2
      labels:
        severity: critical

Here is a summarized version of our Alertmanager configuration:

global:
  resolve_timeout: 1m

route:
  group_by: ['alertname', 'severity']
  group_wait: 1m
  group_interval: 5m
  repeat_interval: 1d
  routes:
  - match:
      endpoint: metrics
    group_wait: 10s
    group_interval: 1m
    repeat_interval: 6h
    continue: true
    routes:
    - match:
        team: xxx
      receiver: xxx-team-receiver
    …

receivers:

inhibit_rules:
- target_match:
    severity: warning
  source_match:
    severity: critical
  equal:
  - alertname

This is an example of what happens in our Slack channel for receiving alerts:
 
17:27:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
17:32:05 — 2 DiskSpaceLow alerts firing (server1 and server2)
17:33:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
17:37:05 — DiskSpaceLow resolved
17:37:35 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
17:42:35 — DiskSpaceLow resolved
17:43:35 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
17:47:35 — DiskSpaceLow resolved
17:48:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
17:48:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
17:53:05 — 2 DiskSpaceLow alerts firing (server1 and server2)
17:54:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
17:58:05 — DiskSpaceLow resolved
17:59:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)
18:04:05 — 2 DiskSpaceLow alerts firing (server1 and server2)
18:05:05 — 3 DiskSpaceLow alerts firing (server1, server2, and server3)

It keeps flapping like this forever. Needless to say, the disk space on these servers is not changing during this time, so the alerts should not be resolved.

During these “resolved” time ranges, I checked Prometheus’s web interface: the alerts are still firing there and never resolve. But they do disappear from Alertmanager’s web interface during those periods.

I wrote a script to get the list of active alerts from Alertmanager’s API every 30 seconds to see what shows up there. Here’s a weird thing I saw: at, say, 17:36:30, the alerts are present in the API and their “endsAt” is set to 17:39:04 (three minutes after their updatedAt), but at 17:37:00 there are no alerts at all. The API does not return any of the previous alerts, even though their endsAt has not been reached yet.

Why does Alertmanager suddenly resolve/remove these alerts before their endsAt is reached?

Any help is appreciated because I have been struggling with this problem for days. I even read the source code of Prometheus and Alertmanager but could not find anything that could cause this problem there.

BR,
Mostafa

Mostafa Hajizadeh

Jan 19, 2020, 4:16:39 AM
to Prometheus Users
Sorry for so many typos in the last paragraphs. :-)

Simon Pasquier

Jan 23, 2020, 11:49:54 AM
to Mostafa Hajizadeh, Prometheus Users
What's your evaluation_interval and scrape_interval in Prometheus?

Mostafa Hajizadeh

Feb 1, 2020, 3:07:49 AM
to Simon Pasquier, Prometheus Users
Hi,

So sorry for the late response. I did not see your email.

Both are 30s.

This is from Prometheus’s configuration page in its web panel:

global:
  scrape_interval: 30s
  scrape_timeout: 10s
  evaluation_interval: 30s

We have set scrape_interval through our ServiceMonitor for each service, but that is 30s too.
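
For reference, the interval sits on each endpoint of the ServiceMonitor, roughly like this (the resource name and selector below are illustrative, not our real ones):

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: example-service        # illustrative name
    spec:
      selector:
        matchLabels:
          app: example-service     # illustrative selector
      endpoints:
      - port: metrics
        interval: 30s              # same 30s as the global scrape_interval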

BR,
Mostafa

Simon Pasquier

Feb 3, 2020, 11:55:47 AM
to Mostafa Hajizadeh, Prometheus Users
I would check the following query in the graph view to make sure that the alert is constantly firing:

    ALERTS{alertname="DiskSpaceLow"}
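
For reference, the ALERTS series also carries an alertstate label ("pending" or "firing"), so the query can be narrowed to alerts that are actually firing:

    ALERTS{alertname="DiskSpaceLow", alertstate="firing"}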

You can remove resolve_timeout from your Alertmanager configuration (though it's unlikely to be the issue). It shouldn't be needed if you run a recent version of Prometheus.

Other than that, try running Alertmanager with the "--log.level=debug" flag.

Mostafa Hajizadeh

Feb 3, 2020, 11:58:38 AM
to Simon Pasquier, Prometheus Users
Hi,

Thanks for the tips. I will follow them and let you know.

BR,
Mostafa

Rajesh Reddy Nachireddi

Feb 4, 2020, 11:40:16 AM
to Mostafa Hajizadeh, Simon Pasquier, Prometheus Users
Hi Simon,

When we change the resolve timeout from 5m to any other value, it is not honored.

Is there any way we can change this timer? Also, if possible, how can we differentiate a "no data found" alert (maybe a source issue) from "no data matched" (metrics are available but do not meet any of the thresholds)?

Since we have alert rules for a variety of data with different collection frequencies, can we configure a separate resolve timer for each group or alert rule?

Your input is much appreciated.

Regards,

Rajesh


Simon Pasquier

Feb 13, 2020, 9:51:05 AM
to Rajesh Reddy Nachireddi, Mostafa Hajizadeh, Prometheus Users
On Tue, Feb 4, 2020 at 5:40 PM Rajesh Reddy Nachireddi <rajesh...@gmail.com> wrote:
>
> Hi Simon,
>
> When we change the resolve timeout from 5m to any other value, it is not honored.

If you run a recent version of Prometheus, resolve_timeout is ineffective because the firing alert received by Alertmanager will have an endsAt field set to some time in the future (the value depends on the evaluation interval and is at least 3m).

> Is there any way we can change this timer? Also, if possible, how can we differentiate a "no data found" alert (maybe a source issue) from "no data matched" (metrics are available but do not meet any of the thresholds)?

You need alerts detecting that a target is down to cover the first scenario.
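
A minimal sketch of such a rule, using the up metric that Prometheus records for every scrape target (the rule name, duration, and annotation text here are illustrative):

    - alert: TargetDown
      # Fires when a scrape target has been unreachable for 5 minutes.
      expr: up == 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: '{{ $labels.job }} target {{ $labels.instance }} is down'

Giving it a severity label like the disk-space rules lets it flow through the same Alertmanager routing and grouping.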

Joe Devilla

Feb 14, 2020, 1:01:15 PM
to Prometheus Users
Mostafa

Have you had any luck resolving this issue?  I am running into the exact same problem.

Joe