Alertmanager spams messages on Slack


Joe Devilla

Feb 14, 2020, 12:32:43 PM
to Prometheus Users
Hi

I am using Alertmanager to post alerts to Slack.  Here is the configuration of my alert:

expr: <a query that takes 5 seconds>
for: 60m

Here are the settings on my alertmanager:

global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'cluster']
  group_interval: 5m
  group_wait: 30s
  receiver: "slack"
  repeat_interval: 12h

To improve performance, I created a recording rule so that the 5-second query now takes about 100 ms.
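For reference, a recording rule of roughly this shape would do that (a sketch only: the group and record names are illustrative, and the placeholder stands in for the original 5-second expression; the alert's expr would then reference the recorded series instead):

```yaml
groups:
  - name: precomputed        # illustrative group name
    rules:
      # The record name follows the level:metric:operation convention;
      # the actual name and expression depend on the real query.
      - record: job:my_expensive_metric:precomputed
        expr: <the query that takes 5 seconds>
```

The alert rule then evaluates the cheap precomputed series (e.g. `expr: job:my_expensive_metric:precomputed > 0`) rather than re-running the expensive query on every evaluation interval.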

I have two issues:

  1. I was running into "toggling" in the Slack channel: an alert would be in an unresolved state, quickly be resolved, then go back into an unresolved state. The alert was not actually being resolved. When viewing Prometheus, the alert would show up consistently, but when viewing Alertmanager, the alert would periodically disappear and then reappear. Why would Alertmanager lose the alert only to have it reappear seconds later?
  2. When does Alertmanager send messages to Slack? I would assume that it sends a message in the following situations:
    1. An alert goes into alarm
    2. An alert goes out of alarm
    3. The number of firing alerts (num_firing) increases or decreases
When I look at my Slack channel, despite the Alertmanager settings above, I see messages posted at the following times:
  1. 12:02AM
  2. 12:08AM
  3. 1:02AM
  4. 1:08AM
  5. 1:52AM
  6. 2:53AM
  7. 2:58AM
  8. 3:18AM
  9. 3:38AM
  10. 4:23AM
  11. 6:23AM
  12. 6:43AM
  13. 6:48AM
  14. 6:53AM
  15. 6:59AM
  16. 8:39AM
  17. 8:54AM
  18. 9:04AM
  19. 9:19AM
In summary, I have two questions:
  1. Why would Alertmanager be dropping alerts?
  2. Why is Alertmanager sending messages to Slack at seemingly non-deterministic times?

Chris Siebenmann

Feb 18, 2020, 2:26:11 PM
to Joe Devilla, Prometheus Users, cks.prom...@cs.toronto.edu
> I am using alertmanager to post alerts on slack. Here is the configuration
> of my alert:
>
> expr: <a query that takes 5 seconds>
> for: 60m
>
> Here are the settings on my alertmanager:
>
> global:
>   resolve_timeout: 5m
> route:
>   group_by: ['alertname', 'cluster']
>   group_interval: 5m
>   group_wait: 30s
>   receiver: "slack"
>   repeat_interval: 12h
[...]
> 2. When does Alertmanager send messages to Slack? I would assume
> that it sends a message in the following situations:
> 1. An alert goes into alarm
> 2. An alert goes out of alarm
> 3. The number of firing alerts (num_firing) increases or decreases
[...]

One of the things you are likely running into is how group_interval
works. Once an alert group is active, Alertmanager will only send
out further notifications at every group_interval after the initial
trigger, regardless of when alerts in the group resolve or further
alerts are triggered. So if your initial alert goes out at 6:43 AM, the
next notification about the alert group's state will only be sent by
Alertmanager at exactly 6:48 AM, then 6:53 AM, and so on.

If there are no state changes in the alert group at the next
notification time, Alertmanager doesn't send out a new notification.
But if a state change arrives between ticks, it is still not sent out
immediately; it has to wait until the next tick. So if an alert that
you were notified about at 6:43 AM is resolved at 6:49 AM, you will
not get another alert group notification until 6:53 AM.
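This tick behavior can be sketched as a small timing helper (a model of the scheduling described above, not Alertmanager's actual code; function and variable names are illustrative):

```python
from datetime import datetime, timedelta

def next_notification(initial_flush: datetime,
                      group_interval: timedelta,
                      state_change: datetime) -> datetime:
    """Return the earliest tick at or after a state change.

    Ticks occur at initial_flush + k * group_interval for k = 0, 1, 2, ...;
    a state change is only reported at the first tick not before it.
    """
    if state_change <= initial_flush:
        return initial_flush
    elapsed = state_change - initial_flush
    # Ceiling division on timedeltas: whole intervals needed to
    # reach or pass the moment of the state change.
    k = -(-elapsed // group_interval)
    return initial_flush + k * group_interval
```

With an initial flush at 6:43 AM and a 5-minute group_interval, a resolve at 6:49 AM maps to the 6:53 AM tick, matching the example above.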

This may mean that you want a relatively short group_interval time
setting. This can lead to a lot of alert notifications if a bunch of
alerts in an alert group trigger one after another, but this may be
a feature in your environment.
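If a shorter group_interval fits your environment, it is set on the route; a sketch (the 1m value is purely illustrative, not a recommendation):

```yaml
route:
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 1m    # shorter tick: resolves and new alerts are reported sooner
  repeat_interval: 12h
  receiver: "slack"
```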

- cks

Joe Devilla

Feb 18, 2020, 7:55:55 PM
to Prometheus Users
Chris

Thanks for your reply.  The sysadmins on my team noticed that the issue is a corrupted block in our Thanos cluster.  We are working on upgrading the cluster to prevent the drops.

Joe