Alerts resolved upon prometheus crash


Julien Pivotto

Mar 4, 2020, 6:38:27 AM
to prometheus-users
Hello there,

We are running a pair of HA prometheis and HA alertmanagers.

One prometheus server OOM'd; and restarted. When it was down, we
received alert resolution notifications from the alertmanager:

> resolved (duration: 115h45m0s)

But a few seconds after:

> firing (duration: 115h52m16s)

I would have expected the second prometheus, which had the alert
the whole time and was working as expected, to prevent the alert
from disappearing.

Note that the alert does NOT have a `for` clause.

There is an entry at 9:44:39, then the server drops, and the alert is
firing again at 9:53. Note: we received the new "firing" notification at
9:52, and it reported a duration of 115h52m16s.

Both Prometheus servers send alerts to both alertmanagers.


What could have happened here?

Our evaluation_interval is 1m, and resend-delay is default.

--
(o- Julien Pivotto
//\ Open-Source Consultant
V_/_ Inuits - https://www.inuits.eu

Julien Pivotto

Mar 4, 2020, 6:39:51 AM
to prometheus-users

Note: alertmanagers are 0.20.0 pulled from GH releases and both
prometheus are 2.16.0 pulled from GH releases too.

Julien Pivotto

Mar 4, 2020, 6:45:11 AM
to prometheus-users
When I look at the metrics, it looks like
rate(alertmanager_alerts_received_total[5m]) is showing a lot of
'resolved' at that time. Is it possible that Prometheus somehow sends
resolved alerts when TSDB is not yet ready? And because those rules had
been running for a long time, we tried to restore them?

regards,

Daniel Swarbrick

Mar 5, 2020, 4:17:01 AM
to Prometheus Users
By default, Alertmanager will consider alerts resolved if 5 minutes or more elapse without the alert firing (resolve_timeout config option).

If your Prometheus instance crashes and takes more than 5 minutes to restart, it's highly likely that any previously firing alerts will be "resolved". If the alerting rule conditions still exist after the restart, new alerts will be fired.
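
To make the timeout behaviour described above concrete, here is a minimal Python sketch. It is not Alertmanager's actual code: the function names are hypothetical, the timestamp is only illustrative, and it models just the case where an alert arrives without an end time and is therefore given one based on resolve_timeout.

```python
from datetime import datetime, timedelta

# Default resolve_timeout mentioned above (5 minutes).
RESOLVE_TIMEOUT = timedelta(minutes=5)


def effective_ends_at(received_at, ends_at=None, resolve_timeout=RESOLVE_TIMEOUT):
    """Hypothetical helper: the end time tracked for a received alert.

    If the sender supplied no end time, fall back to
    received_at + resolve_timeout; otherwise keep the sender's value.
    """
    if ends_at is None:
        return received_at + resolve_timeout
    return ends_at


def is_resolved(ends_at, now):
    """An alert counts as resolved once its end time is no longer in the future."""
    return ends_at <= now


# Illustrative only: last time the alert was received before the sender went silent.
last_received = datetime(2020, 3, 4, 9, 44, 39)
ends_at = effective_ends_at(last_received)

print(is_resolved(ends_at, last_received + timedelta(minutes=4)))
print(is_resolved(ends_at, last_received + timedelta(minutes=5)))
```

Each time the alert is received again, the end time moves forward, so the alert only expires if no sender refreshes it for the whole timeout.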

Julien Pivotto

Mar 5, 2020, 4:35:17 AM
to Daniel Swarbrick, Prometheus Users
On 05 Mar 01:17, Daniel Swarbrick wrote:
> By default, Alertmanager will consider alerts resolved if 5 minutes or more
> elapse without the alert firing (resolve_timeout config option).
>
> If your Prometheus instance crashes and takes more than 5 minutes to
> restart, it's highly likely that any previously firing alerts will be
> "resolved". If the alerting rule conditions still exist after the restart,
> new alerts will be fired.

Except that another prometheus server was still sending the alerts, so
that is not likely the explanation :(

But the server was in pretty bad shape, so maybe the alertmanager on
the same host was foobar too during that time.

Chris Siebenmann

Mar 9, 2020, 4:29:03 PM
to Daniel Swarbrick, Prometheus Users, cks.prom...@cs.toronto.edu
These days alerts time out faster than this, and the timeout is
controlled by Prometheus instead of by Alertmanager. If you look
at an active alert in Alertmanager, you'll see an 'endsat' value
(or a similar-sounding label) that's a couple of minutes into the
future. Prometheus sets that in alerts it sends to Alertmanager, and
when that is set in an alert, the Alertmanager resolve_timeout setting
is ignored.

As far as I know there is no straightforward way to lengthen this
automatic default timeout in Prometheus.
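
As a rough sketch of how such an end time could be derived: as far as I understand, Prometheus sets it to the send time plus a small multiple of the larger of the evaluation interval and the resend delay. The factor of 4, the 1m resend delay, and the function name below are assumptions for illustration, not something confirmed in this thread, so check the source of your Prometheus version.

```python
from datetime import datetime, timedelta


def valid_until(sent_at, evaluation_interval, resend_delay, factor=4):
    """Hypothetical model of the end time attached to a firing alert.

    Assumption: end time = send time + factor * max(interval, resend delay).
    The factor of 4 is a guess at the upstream value, not taken from the thread.
    """
    delta = max(evaluation_interval, resend_delay)
    return sent_at + factor * delta


# Values from the thread: evaluation_interval is 1m, resend delay left at its
# default (assumed here to be 1m). The timestamp is the last entry seen
# before the server dropped.
last_sent = datetime(2020, 3, 4, 9, 44, 39)
print(valid_until(last_sent, timedelta(minutes=1), timedelta(minutes=1)))
```

Under these assumptions, an alert last sent at 9:44:39 would stay valid until 9:48:39 unless some Prometheus sends it again before then.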

- cks