Alertmanager spams messages on Slack


Joe Devilla

Feb 14, 2020, 12:32:43 PM
to Prometheus Users
Hi

I am using Alertmanager to post alerts to Slack.  Here is the configuration of my alert:

expr: <a query that takes 5 seconds>
for: 60m

Here are the settings on my alertmanager:

global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'cluster']
  group_interval: 5m
  group_wait: 30s
  receiver: "slack"
  repeat_interval: 12h

To improve performance, I created a recording rule so that the 5-second query now takes about 100 ms.
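For reference, a recording rule of roughly this shape would do that (a sketch only: the group and record names are illustrative, and the placeholder stands in for the original 5-second expression; the alert's expr would then reference the recorded series instead):

```yaml
groups:
  - name: precomputed        # illustrative group name
    rules:
      # The record name follows the level:metric:operation convention;
      # the actual name and expression depend on the real query.
      - record: job:my_expensive_metric:precomputed
        expr: <the query that takes 5 seconds>
```

The alert rule then evaluates the cheap precomputed series (e.g. `expr: job:my_expensive_metric:precomputed > 0`) rather than re-running the expensive query on every evaluation interval.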

I have two issues:

  1. I was running into "toggling" in the Slack channel: an alert would be in an unresolved state, quickly be resolved, then go back into an unresolved state. The alert was not actually being resolved. When viewing Prometheus, the alert would show up consistently, but when viewing Alertmanager, the alert would periodically disappear and then reappear. Why would Alertmanager lose the alert only to have it reappear seconds later?
  2. When does Alertmanager send messages to Slack? I would assume that it sends a message in the following situations:
    1. An alert goes into alarm
    2. An alert goes out of alarm
    3. The number of firing alerts (num_firing) increases or decreases
When I look at my Slack channel, despite the Alertmanager settings above, I see messages posted at the following times:
  1. 12:02AM
  2. 12:08AM
  3. 1:02AM
  4. 1:08AM
  5. 1:52AM
  6. 2:53AM
  7. 2:58AM
  8. 3:18AM
  9. 3:38AM
  10. 4:23AM
  11. 6:23AM
  12. 6:43AM
  13. 6:48AM
  14. 6:53AM
  15. 6:59AM
  16. 8:39AM
  17. 8:54AM
  18. 9:04AM
  19. 9:19AM
In summary, I have two questions:
  1. Why would Alertmanager be dropping alerts?
  2. Why is Alertmanager sending messages to Slack at seemingly non-deterministic times?

Chris Siebenmann

Feb 18, 2020, 2:26:11 PM
to Joe Devilla, Prometheus Users, cks.prom...@cs.toronto.edu
> I am using alertmanager to post alerts on slack. Here is the configuration
> of my alert:
>
> expr: <a query that takes 5 seconds>
> for: 60m
>
> Here are the settings on my alertmanager:
>
> global:
>   resolve_timeout: 5m
> route:
>   group_by: ['alertname', 'cluster']
>   group_interval: 5m
>   group_wait: 30s
>   receiver: "slack"
>   repeat_interval: 12h
[...]
> 2. When does Alertmanager send messages to Slack? I would assume
> that it sends a message in the following situations:
> 1. An alert goes into alarm
> 2. An alert goes out of alarm
> 3. The number of firing alerts (num_firing) increases or decreases
[...]

One of the things you are likely running into is how group_interval
works. Once an alert group is active, Alertmanager will only send
out further notifications at every group_interval after the initial
trigger, regardless of when alerts in the group resolve or further
alerts are triggered. So if your initial alert goes out at 6:43 AM, the
next notification about the alert group's state will only be sent by
Alertmanager at exactly 6:48 AM, then 6:53 AM, and so on.

If there are no state changes in the alert group at the next
notification time, Alertmanager doesn't send out a new notification.
But if a state change arrives between ticks, it is still not sent out
immediately; it has to wait until the next tick. So if an alert that
you were notified about at 6:43 AM is resolved at 6:49 AM, you will
not get another alert group notification until 6:53 AM.
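This tick behavior can be sketched as a small timing helper (a model of the scheduling described above, not Alertmanager's actual code; function and variable names are illustrative):

```python
from datetime import datetime, timedelta

def next_notification(initial_flush: datetime,
                      group_interval: timedelta,
                      state_change: datetime) -> datetime:
    """Return the earliest tick at or after a state change.

    Ticks occur at initial_flush + k * group_interval for k = 0, 1, 2, ...;
    a state change is only reported at the first tick not before it.
    """
    if state_change <= initial_flush:
        return initial_flush
    elapsed = state_change - initial_flush
    # Ceiling division on timedeltas: whole intervals needed to
    # reach or pass the moment of the state change.
    k = -(-elapsed // group_interval)
    return initial_flush + k * group_interval
```

With an initial flush at 6:43 AM and a 5-minute group_interval, a resolve at 6:49 AM maps to the 6:53 AM tick, matching the example above.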

This may mean that you want a relatively short group_interval time
setting. This can lead to a lot of alert notifications if a bunch of
alerts in an alert group trigger one after another, but this may be
a feature in your environment.
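If a shorter group_interval fits your environment, it is set on the route; a sketch (the 1m value is purely illustrative, not a recommendation):

```yaml
route:
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 1m    # shorter tick: resolves and new alerts are reported sooner
  repeat_interval: 12h
  receiver: "slack"
```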

- cks

Joe Devilla

Feb 18, 2020, 7:55:55 PM
to Prometheus Users
Chris

Thanks for your reply.  The sysadmins on my team noticed that the issue is a corrupted block in our Thanos cluster.  We are working on upgrading the cluster to prevent the drops.

Joe