Alert notify retries cease after 1 min


Andres Sanchez Smith

Dec 1, 2020, 4:47:17 AM
to Prometheus Users
Does anyone know how Alertmanager can be configured to allow permanent notify retries? If the connection to the webhook target were lost for several hours, with my current setup none of the alerts that occurred during the outage would be sent, and no one would ever know something was amiss.

To add more context: the retries cease after 1 min, with 12 retries in total. I was looking through the Alertmanager code, and it seems that in v0.21 (which is the one we are running) the retries should be endless, capped at 1 min per retry (if I'm reading the backoff timer code correctly), so it seems odd that the retries end after one minute.

Here's a sample of the error I see in the Alertmanager logs:

level=error ts=2020-11-27T13:03:54.660Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=3 err="sd_webhook/webhook[0]: notify retry canceled after 12 attempts: Post \"http://192.168.1.10:4444\": dial tcp 192.168.1.10:4444: connect: connection refused"

b.ca...@pobox.com

Dec 1, 2020, 5:48:44 AM
to Prometheus Users
In notify/notify.go I see:

        for {
                i++
                // Always check the context first to not notify again.
                select {
                case <-ctx.Done():
                        if iErr == nil {
                                iErr = ctx.Err()
                        }

                        return ctx, nil, errors.Wrapf(iErr, "%s/%s: notify retry canceled after %d attempts", r.groupName, r.integration.String(), i)

That is: it keeps retrying at exponentially increasing intervals until the overall context expires - which, according to your measurements, is after 1 minute.

I'm not entirely sure where this limit comes from, but it might be the group_interval - see dispatch/dispatch.go:

                        // Give the notifications time until the next flush to
                        // finish before terminating them.
                        ctx, cancel := context.WithTimeout(ag.ctx, ag.timeout(ag.opts.GroupInterval))

I don't think it's designed to be a long-term queue.  If you have a situation where the webhook endpoint really could be down for hours on end, and you don't want to lose alerts, then I think you should run a local webhook on the same server, which queues the requests and then delivers them to the *real* webhook when it becomes available.

Of course, you'd also have to be happy that you may get a splurge of alerts, many of which may already have been resolved.

Andres Sanchez Smith

Dec 1, 2020, 6:12:45 AM
to Prometheus Users
Thanks for the quick response! That would make sense; my group_interval is also 1m. I'll try that out to see if that's what is limiting it, although as you say, if that's the case we'll probably have to implement some local webhook and alert-storage solution. We would be delighted to get all the alerts, resolved or not :) we need them to keep track of what has happened in the system at different points in time.
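For reference, the relevant knob sits on the route in the Alertmanager config; raising group_interval should widen the retry window accordingly (the values and receiver name here are just illustrative):

```yaml
route:
  receiver: sd_webhook
  group_wait: 30s
  group_interval: 5m    # failing notifications are retried for roughly this long
  repeat_interval: 4h
```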

Thank you for your help!

b.ca...@pobox.com

Dec 1, 2020, 6:17:07 AM
to Prometheus Users
On Tuesday, 1 December 2020 at 11:12:45 UTC andres.sanche...@gmail.com wrote:
> We would be delighted to get all the alerts, resolved or not :) we need them to keep track of what has happened in the system at different points in time.

Note that Prometheus keeps its own history of these in the automatically generated ALERTS and ALERTS_FOR_STATE time series, which you can query later.
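For example, queries along these lines (sketched from memory, worth checking against your own data):

```promql
# Which alerts were firing, and when, over the past day
ALERTS{alertstate="firing"}[1d]

# ALERTS_FOR_STATE's value is the timestamp at which an alert
# entered its "for" pending period
ALERTS_FOR_STATE
```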