Sometimes I receive only a resolved email but no firing email


zhang windy

Feb 15, 2020, 9:11:27 PM
to Prometheus Users
Hi buddies, I've run into a problem: sometimes I receive only a resolved email but never the firing email.
Does anyone know what the problem is? I wanted to confirm it from the relevant logs, so I set the Prometheus log level to debug, but I can't find any log lines about the alert in Prometheus. I can see some relevant logs in Alertmanager. How can I see in Prometheus's log that a firing alert was sent to Alertmanager? Is there something I missed? Could you give me some ideas? Thank you very much.

Prometheus's debug log is below; I can't find any relevant lines about the alert:

level=info ts=2020-02-15T07:00:03.047720487Z caller=head.go:488 component=tsdb msg="head GC completed" duration=117.297372ms
level=info ts=2020-02-15T07:00:08.727679818Z caller=head.go:535 component=tsdb msg="WAL checkpoint complete" low=53 high=65 duration=5.679902199s
level=debug ts=2020-02-15T07:35:46.342968643Z caller=scrape.go:826 component="scrape manager" scrape_pool=monitoring/kube-state-metrics/0 target=https://10.222.2.31:8443/metrics msg="Scrape failed" err="context deadline exceeded"
level=debug ts=2020-02-15T08:00:14.342979729Z caller=scrape.go:826 component="scrape manager" scrape_pool=monitoring/kube-state-metrics/0 target=https://10.222.2.31:8443/metrics msg="Scrape failed" err="context deadline exceeded"
level=info ts=2020-02-15T09:00:02.184258444Z caller=compact.go:398 component=tsdb msg="write block" mint=1581746400000 maxt=1581753600000 ulid=01E1416XN899QYYF84S7KE2Y5V
level=info ts=2020-02-15T09:00:02.472066954Z caller=head.go:488 component=tsdb msg="head GC completed" duration=102.889478ms
level=info ts=2020-02-15T09:00:08.099244646Z caller=head.go:535 component=tsdb msg="WAL checkpoint complete" low=66 high=78 duration=5.627118986s
level=debug ts=2020-02-15T10:10:10.343090205Z caller=scrape.go:826 component="scrape manager" scrape_pool=monitoring/kube-state-metrics/0 target=https://10.222.2.31:8443/metrics msg="Scrape failed" err="context deadline exceeded"
level=info ts=2020-02-15T11:00:02.363236579Z caller=compact.go:398 component=tsdb msg="write block" mint=1581753600000 maxt=1581760800000 ulid=01E1482MXB0Q5QD94MG17PPEXQ
level=info ts=2020-02-15T11:00:02.648408131Z caller=head.go:488 component=tsdb msg="head GC completed" duration=108.614406ms
level=info ts=2020-02-15T11:00:08.250280453Z caller=head.go:535 component=tsdb msg="WAL checkpoint complete" low=79 high=91 duration=5.601822028s

Brian Candler

Feb 16, 2020, 3:47:34 AM
to Prometheus Users
If I understand correctly, prometheus doesn't send any "resolved" message to alertmanager: it just stops sending alerts.  Alertmanager treats the lack of alert as meaning "resolved".

Therefore, if you receive the "resolved" message, then this proves that alertmanager must have received the original alert.

Hence you need to check the alertmanager logs to determine why it couldn't send out the original alert E-mail.
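
For example, a quick way to search the alertmanager logs for this (just a sketch, assuming alertmanager runs under systemd with a unit called "alertmanager"; adjust the unit name and grep patterns to your setup):

$ journalctl -u alertmanager --since today | grep -i 'received alert'
$ journalctl -u alertmanager --since today | grep -i 'failed'

The first shows alerts arriving from prometheus, the second any failed notification attempts; both are debug-level messages, so alertmanager needs --log.level=debug for them to appear.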

Brian Candler

Feb 16, 2020, 5:01:34 AM
to bryan, Prometheus Users
On 16/02/2020 09:52, bryan wrote:
> okay, I checked the alertmanager's log, for example, see below:
>
> level=debug ts=2020-02-16T00:41:46.642829251Z caller=dispatch.go:104
> component=dispatcher msg="Received alert" alert=Watchdog[e1749c6][active]
> level=debug ts=2020-02-16T00:41:49.181048151Z caller=dispatch.go:104
> component=dispatcher msg="Received alert"
> alert=MemeryUsagePercentageHigh[3889cfa][active]
> level=debug ts=2020-02-16T00:41:51.761879845Z caller=nflog.go:538
> component=nflog msg="gossiping new entry"
> entry="entry:<group_key:\"{}/{severity=\\\"warning\\\"}:{alertname=\\\"MemeryUsagePercentageHigh\\\"}\"
> receiver:<group_name:\"webhook\" integration:\"webhook\" >
> timestamp:<seconds:1581813737 nanos:577614310 >
> firing_alerts:6164213166729845817 > expires_at:<seconds:1582245737
> nanos:577614310 > "
> level=debug ts=2020-02-16T00:41:51.761992675Z caller=nflog.go:538
> component=nflog msg="gossiping new entry"
> entry="entry:<group_key:\"{}/{severity=\\\"warning\\\"}:{alertname=\\\"MemeryUsagePercentageHigh\\\"}\"
> receiver:<group_name:\"webhook\" integration:\"webhook\" idx:1 >
> timestamp:<seconds:1581813737 nanos:575755098 >
> firing_alerts:6164213166729845817 > expires_at:<seconds:1582245737
> nanos:575755098 > "
> level=debug ts=2020-02-16T00:42:27.873139582Z caller=cluster.go:295
> component=cluster memberlist="2020/02/16 00:42:27 [DEBUG] memberlist:
> Initiating push/pull sync with: 10.222.5.103:6783\n"
>
> but I don't know which log line means a firing email or a resolved email
> was sent. Maybe msg="gossiping new entry" means an alert email was sent?
> But that can't be exactly right, because I can find a similar message for
> which no email was received, so I'm confused. Could you explain it to me?
> I've attached the log file; maybe you can check it.
>
> In addition, I can't see any log from the Prometheus process about
> sending alerts. How can I confirm that Alertmanager received the
> original alert, or that Prometheus sent the alert to Alertmanager?
> Thank you.

You may have to turn on "debug" level logging.

Are you running an alertmanager cluster?
https://github.com/prometheus/alertmanager#high-availability

That would explain the "gossiping" - and maybe one of the cluster
members is having problems sending E-mail.
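
For what it's worth, this is what I mean by debug logging (the config path here is hypothetical; the flag that matters is --log.level):

$ alertmanager --config.file=/etc/alertmanager/alertmanager.yml --log.level=debug

And on the prometheus side you can check its own notification metrics to confirm it is actually pushing alerts to alertmanager (metric names as exposed by recent prometheus versions; check them against your own /metrics output):

$ curl -s localhost:9090/metrics | grep prometheus_notifications
# prometheus_notifications_sent_total    - alerts sent to alertmanager
# prometheus_notifications_errors_total  - errors while sending
# prometheus_notifications_dropped_total - alerts dropped due to send errors

If sent_total is increasing while errors and dropped stay at 0, the firing alert did leave prometheus.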

Brian Candler

Feb 16, 2020, 6:30:31 AM
to bryan, Prometheus Users
On 16/02/2020 10:09, bryan wrote:
> yes, I'm running an alertmanager cluster, and I have turned on prometheus
> "debug" level logging, but nothing relevant could be found. For details:

Have you set --log.level=debug on the alertmanager processes as well?

I see the following in my (non-clustered) test environment:

Feb 16 11:04:41 prometheus alertmanager[1772]: level=debug
ts=2020-02-16T11:04:41.923Z caller=dispatch.go:135 component=dispatcher
msg="Received alert" alert=UpDown[0f48c03][active]
Feb 16 11:05:56 prometheus alertmanager[1772]: level=debug
ts=2020-02-16T11:05:56.922Z caller=dispatch.go:135 component=dispatcher
msg="Received alert" alert=UpDown[0f48c03][active]
Feb 16 11:06:26 prometheus alertmanager[1772]: level=debug
ts=2020-02-16T11:06:26.952Z caller=dispatch.go:465 component=dispatcher
aggrGroup="{}:{alertname=\"UpDown\"}" msg=flushing
alerts=[UpDown[0f48c03][active]]
Feb 16 11:07:11 prometheus alertmanager[1772]: level=debug
ts=2020-02-16T11:07:11.924Z caller=dispatch.go:135 component=dispatcher
msg="Received alert" alert=UpDown[0f48c03][active]

This shows the alerts being received from prometheus.  However I don't
see any debug logs for the SMTP exchanges when it's sending out mail.
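
If you want to see whether notifications are at least being attempted, the per-integration counters on each alertmanager are another place to look (assuming the default port 9093):

$ curl -s localhost:9093/metrics | grep alertmanager_notifications_total

alertmanager_notifications_total counts attempted notifications per integration; comparing it with the failed counter shown below gives you the successful sends.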

When I resolve the problem, alertmanager logs show:

Feb 16 11:22:11 prometheus alertmanager[1772]: level=debug
ts=2020-02-16T11:22:11.922Z caller=dispatch.go:135 component=dispatcher
msg="Received alert" alert=UpDown[0f48c03][resolved]
Feb 16 11:23:26 prometheus alertmanager[1772]: level=debug
ts=2020-02-16T11:23:26.921Z caller=dispatch.go:135 component=dispatcher
msg="Received alert" alert=UpDown[0f48c03][resolved]

So I was wrong: prometheus *does* actively notify resolved alerts.

If the SMTP server was down, I didn't get any error logged.  But after
restarting the SMTP server, the message was delivered - so it appears
that alertmanager does its own queueing and retrying.

One thing that might be useful to you is the alertmanager metrics for
failed notifications:

$ curl -s localhost:9093/metrics | grep notifications_failed
# HELP alertmanager_notifications_failed_total The total number of
failed notifications.
# TYPE alertmanager_notifications_failed_total counter
alertmanager_notifications_failed_total{integration="email"} 0
alertmanager_notifications_failed_total{integration="hipchat"} 0
alertmanager_notifications_failed_total{integration="opsgenie"} 0
alertmanager_notifications_failed_total{integration="pagerduty"} 0
alertmanager_notifications_failed_total{integration="pushover"} 0
alertmanager_notifications_failed_total{integration="slack"} 0
alertmanager_notifications_failed_total{integration="victorops"} 0
alertmanager_notifications_failed_total{integration="webhook"} 0
alertmanager_notifications_failed_total{integration="wechat"} 0

You could try this on all your alertmanager nodes, and see if a
particular one has problems with E-mail.
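
Something like this, looping over the nodes (the hostnames here are made up; substitute your own):

$ for host in am-0 am-1 am-2; do echo "== $host =="; curl -s "$host:9093/metrics" | grep 'notifications_failed.*email'; done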

Brian Candler

Feb 16, 2020, 7:26:51 AM
to bryan, Prometheus Users
Correction: I've just tried this again, and if I shut down the SMTP
server, I *do* see failed SMTP attempts from alertmanager generating
logs, at increasing retry intervals:

Feb 16 12:18:41 prometheus alertmanager[1772]: level=debug
ts=2020-02-16T12:18:41.928Z caller=dispatch.go:465 component=dispatcher
aggrGroup="{}:{alertname=\"UpDown\"}" msg=flushing
alerts=[UpDown[0f48c03][active]]
Feb 16 12:18:41 prometheus alertmanager[1772]: level=debug
ts=2020-02-16T12:18:41.938Z caller=notify.go:668 component=dispatcher
msg="Notify attempt failed" attempt=1 integration=email receiver=brian
err="establish connection to server: dial tcp [XXXX::34]:25: connect:
connection refused"
Feb 16 12:18:42 prometheus alertmanager[1772]: level=debug
ts=2020-02-16T12:18:42.668Z caller=notify.go:668 component=dispatcher
msg="Notify attempt failed" attempt=2 integration=email receiver=brian
err="establish connection to server: dial tcp [XXXX::34]:25: connect:
connection refused"
...

I don't know why I didn't see them before.

However, when I start the SMTP server again and the message is
successfully delivered, there isn't a log entry showing the successful delivery.

Chris Siebenmann

Feb 18, 2020, 2:07:25 PM
to Brian Candler, Prometheus Users, cks.prom...@cs.toronto.edu
Modern versions of Prometheus do send specific 'resolved' messages to
Alertmanager when an alert ends. I don't know what Alertmanager does
if it receives a 'resolved' message for an alert that it hasn't seen
before, though.

If Prometheus stops periodically sending information about current
alerts to Alertmanager (for example because it's been stopped or has
crashed), the current alerts will normally time out in a few minutes and
be declared as resolved by Alertmanager. But normally this shouldn't
happen; Prometheus re-notifies Alertmanager about any current alerts
on a regular basis.
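
For reference, the re-send interval is a flag on the Prometheus binary (a quick way to look it up, assuming prometheus is on your PATH; 1m is the default as far as I know):

$ prometheus --help 2>&1 | grep -A1 'resend-delay'

Alertmanager's global resolve_timeout only applies to alerts that arrive without an end timestamp; as I understand it, alerts sent by Prometheus always carry one, so the few-minutes timeout described above comes from the end time Prometheus attaches rather than from resolve_timeout.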

- cks

Simon Pasquier

Feb 21, 2020, 9:58:20 AM
to Chris Siebenmann, Brian Candler, Prometheus Users
As Chris says: Prometheus will send a resolved alert to
Alertmanager. If Alertmanager didn't receive any firing alert for the
group, it won't send the resolved notification.
You may get more insights by running Alertmanager with --log.level=debug.
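
Another quick check is to ask each Alertmanager instance which alerts it currently has, either with amtool or via the API (the URL is just an example; point it at each of your nodes):

$ amtool alert query --alertmanager.url=http://localhost:9093
$ curl -s http://localhost:9093/api/v2/alerts

If the firing alert never shows up there, the problem is on the Prometheus-to-Alertmanager path; if it does show up but no email goes out, it's on the notification side.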
