Alertmanager notification problems [group_wait, group_interval, repeat_interval, resolve_timeout]


mikayzo

Jun 20, 2019, 7:10:27 AM6/20/19
to Prometheus Users
Hello people, I am setting up an Alertmanager solution at my workplace, and while testing notifications I ran into some strange behavior of group_interval, repeat_interval and resolve_timeout with webhook and Slack notifications.

Versions:

# prometheus --version
prometheus, version 2.10.0 (branch: HEAD, revision: d20e84d0fb64aff2f62a977adc8cfb656da4e286)
  build user:       root@a49185acd9b0
  build date:       20190525-12:28:13
  go version:       go1.12.5


# alertmanager --version
alertmanager, version 0.17.0 (branch: HEAD, revision: c7551cd75c414dc81df027f691e2eb21d4fd85b2)
  build user:       root@932a86a52b76
  build date:       20190503-09:10:07
  go version:       go1.12.4

And a bit of background about the system:

I receive alerts from a vast number of hosts, each running many different services. To keep the case simple, consider three hosts, each with three services defined in one group (basic_services):

instance:
 - host1
 - host2
 - host3

alert_type:
 - basic services:
   - nginx
   - db
   - squid
 
Alertmanager global configs:

resolve_timeout: 2m

Alertmanager route configs:

group_wait: 10s
group_interval: 1m
repeat_interval: 5m

Alertmanager receiver configs, for both Slack and webhook notifications:

send_resolved: true

----

This is how I understand everything should work:
  - Prometheus is scraping and evaluating queries every 1m.
  - If, say, nginx goes down on host1, Prometheus notices it and, after 2m of evaluation (the duration defined in my alerting_rules.yml), fires the alert to Alertmanager.
    - If it's a new alert, Alertmanager should wait 10s and then send the notification via webhook and Slack.
    - If it's an existing alert and a new entry is added to the same group (for example, squid crashes alongside nginx), the notification should be sent after 1m.
    - If it's an existing alert and no new entries are added, it should be repeatedly sent out every 5m.
  - If all alerts in the group are resolved, Slack and the webhook should get resolved notifications after 2m.
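For reference, these fragments fit together in an alertmanager.yml roughly like this (a minimal sketch: the receiver name, webhook URL, Slack channel and group_by labels here are placeholders, not my real config):

```yaml
global:
  resolve_timeout: 2m

route:
  receiver: default                 # placeholder receiver name
  group_by: ['alert', 'instance']   # assumed grouping labels
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 5m

receivers:
  - name: default
    webhook_configs:
      - url: http://127.0.0.1:5001/   # placeholder endpoint
        send_resolved: true
    slack_configs:
      - channel: '#alerts'            # placeholder channel
        send_resolved: true
```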

[Please correct me if I was wrong somewhere in the run-through of events described above.]

In the #SLACK notification case, everything works pretty much as expected, but the timing is still a bit off. If anyone has an explanation for the questions below, I would appreciate it:

10:29 killed nginx
10:30 prometheus: notices nginx
10:32 nginx: pending -> firing
10:33 alert pushed to slack [why 1m instead of 10s?]
10:39 alert repeated to slack [why 6m instead of 5m?]
10:45 alert repeated to slack [why 6m instead of 5m?]
10:45 killed squid
10:46 prometheus: notices squid
10:48 squid: pending -> firing
10:49 new_alert pushed to slack (because of added new entry: squid to group)
10:50 started nginx and squid
10:50 prometheus: notices nginx and squid
10:50 resolved event sent to slack [expected at 10:52, because of resolve_timeout: 2m, why did I get the resolve right away?]


In the #WEBHOOK case I was left baffled, because nothing made any sense here:

11:26 killed nginx
11:26 prometheus: notices nginx
11:28 nginx: pending -> firing
11:29 alert pushed to webhook [why 1m? group_wait is 10s]
11:30 alert repeated to webhook [why 1m? repeat_interval is 5m]
11:31 alert repeated to webhook [why 1m? repeat_interval is 5m]
11:31 killed squid
11:32 alert repeated to webhook
11:32 prometheus: notices squid
11:33 alert repeated to webhook
11:34 alert repeated to webhook
11:34 squid: pending -> firing
11:35 new_alert pushed to webhook (because of added new entry: squid to group)
11:35 started nginx and squid
11:36 prometheus: notices nginx and squid
11:36 new_alert repeated to webhook
11:45 so where is the resolved notification?

Whatever I've tried, the webhook just ignores my configured parameters; group_interval is used for everything here.

However, I've noticed that the first notification is sent after group_wait + group_interval. Is this the correct/intended behavior?


If anyone has any insight about what I've described here, I would really appreciate it if you could share it.

Simon Pasquier

Jun 20, 2019, 8:35:29 AM6/20/19
to mikayzo, Prometheus Users
If all alerts in a group are resolved, group_interval applies.
resolve_timeout isn't really used anymore since you're running the
latest version of Prometheus.
For a few versions now, Prometheus has set the EndsAt field of the
alert to 3 times the maximum of 1 minute and the evaluation interval.
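As a quick illustration of that rule (a sketch of the arithmetic, not Prometheus's actual code):

```python
from datetime import datetime, timedelta

def ends_at(sent_at: datetime, evaluation_interval: timedelta) -> datetime:
    """EndsAt is set 3 * max(1min, evaluation_interval) after the alert is (re)sent."""
    hold = 3 * max(timedelta(minutes=1), evaluation_interval)
    return sent_at + hold

# With a 60s evaluation interval, EndsAt lands 3 minutes after the
# (re)notification, matching the startsAt/endsAt pairs seen in the API.
sent = datetime(2019, 6, 21, 9, 12, 20)
print(ends_at(sent, timedelta(seconds=60)))  # 2019-06-21 09:15:20
```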

>
> [Please correct me if I was wrong somewhere in the run-through of events described above.]
>
> In #SLACK notification case - everything is pretty much as expected, but the timing is still a bit off. If anyone has an explanation to the [questions] below - I would appreciate it:
>
> 10:29 killed nginx
> 10:30 prometheus: notices nginx
> 10:32 nginx: pending -> firing
> 10:33 alert pushed to slack [why 1m instead of 10s?]
You can turn on the debug log level to track exactly what happens in
AlertManager.

> 10:39 alert repeated to slack [why 6m instead of 5m?]
AlertManager evaluates groups at every group interval, so depending
on the exact timing it may take an additional interval for the
notification to be sent.
> 10:45 alert repeated to slack [why 6m instead of 5m?]
> 10:45 killed squid
> 10:46 prometheus: notices squid
> 10:48 squid: pending -> firing
> 10:49 new_alert pushed to slack (because of added new entry: squid to group)
> 10:50 started nginx and squid
> 10:50 prometheus: notices nginx and squid
> 10:50 resolved event sent to slack [expected at 10:52, because of resolve_timeout: 2m, why did I get the resolve right away?]

See above.

>
>
> In #WEBHOOK case I was left baffled, because nothing makes any sense here:
>
> 11:26 killed nginx
> 11:26 prometheus: notices nginx
> 11:28 nginx: pending -> firing
> 11:29 alert pushed to webhook [why 1m? group_wait is 10s]
> 11:30 alert repeated to webhook [why 1m? repeat_interval is 5m]
> 11:31 alert repeated to webhook [why 1m? repeat_interval is 5m]
> 11:31 killed squid
> 11:32 alert repeated to webhook
> 11:32 prometheus: notices squid
> 11:33 alert repeated to webhook
> 11:34 alert repeated to webhook
> 11:34 squid: pending -> firing
> 11:35 new_alert pushed to webhook (because of added new entry: squid to group)
> 11:35 started nginx and squid
> 11:36 prometheus: notices nginx and squid
> 11:36 new_alert repeated to webhook
> 11:45 sooooooo wheres the fucking resolved notification?

Can you share the AlertManager configuration please?
What do you see on the webhook side? Does it answer with 200 OK to the
AlertManager's requests?

>
> Anything I`ve tried - webhook just ignored my configured parameters, group_interval is used for everything here.
>
> However, I`ve noticed that the first notification is being sent after group_wait + group_interval. Is this the correct/intended behavior?
>
>
> If anyone has any insight about the thing I`ve typed here - I would really appreciate if you could share it.
>
> --
> You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/46ebfeba-de59-48f7-99e0-f1a98ea1f235%40googlegroups.com.

Chris Siebenmann

Jun 20, 2019, 5:08:39 PM6/20/19
to mikayzo, Prometheus Users, cks.prom...@cs.toronto.edu
> This is how everything should be working:
> - Prometheus is scraping and evaluating queries every 1m.
> - If lets say nginx goes down on host1 - prometheus notices it and after
> 2m of evaluation (which is defined in my alerting_rules.yml) - fires the
> alert to alertmanager.
> - If its a new alert - alertmanager should wait for 10s and send the
> notification via webhook and slack
> - If its an existing alert but a new entry is added to the same group
> (for example if alongside nginx - squid crashes) then the notification
> should be sent after 1m
> - If its an existing alert but there are no new entries added - it
> should be repeatedly sent out after 5m
> - If all alerts in the group are resolved - after 2m slack and webhook
> should get resolved notifications.
>
> [Please correct me if I was wrong somewhere in the run-through of events
> described above.]

Although the documentation is not clear about this, and for some things
you have to read the source code, the alert sequence works differently
than what you've written here in two important ways.

First, 'group_interval' is not a wait delay, it is a ticker. Once
the alert group generates its initial alert, Alertmanager will send
out an update every group_interval later if there is one to send, not
send out an update after (at least) a wait of group_interval. So if
your alert fires at 10:45:20, the next possible notification is at
10:46:20, the following one at 10:47:20, and so on. If a new alert in
the group reaches Alertmanager at 10:46:21, it has missed the tick and
must wait until 10:47:20. If I remember correctly, this applies to all
notifications, including the alert-is-resolved one.

(This obviously has a much larger effect on people who have longer
group_interval settings than yours, at 1m.)
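This ticker behavior can be sketched as a toy model (not Alertmanager's actual dispatcher code):

```python
from datetime import datetime, timedelta

def next_flush(first_flush: datetime, group_interval: timedelta,
               now: datetime) -> datetime:
    """Next tick of a group_interval ticker that started at first_flush.

    An update arriving just after a tick waits for the *following* tick;
    group_interval is not a delay measured from the update itself.
    """
    if now <= first_flush:
        return first_flush
    elapsed = now - first_flush
    ticks = -(-elapsed // group_interval)  # ceiling division on timedeltas
    return first_flush + ticks * group_interval

# Alert group first flushed at 10:45:20 with a 1m group_interval;
# a new alert arriving at 10:46:21 just missed the 10:46:20 tick:
first = datetime(2019, 6, 20, 10, 45, 20)
print(next_flush(first, timedelta(minutes=1),
                 datetime(2019, 6, 20, 10, 46, 21)))  # 2019-06-20 10:47:20
```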

Second, when Prometheus determines that an alert has ended, it
immediately notifies Alertmanager about this. This is part of what
makes resolve_timeout irrelevant. If the situation doesn't change,
Alertmanager will then send the 'alert is resolved' notification out
when the group_interval ticker time next comes up, which is anywhere
from a few seconds later to (in your configuration) a minute.

(If the alert re-appears and Prometheus re-notifies Alertmanager about
it, only the last update that Alertmanager receives matters for sending
a resolved notification. It is as if the alert was never resolved
and re-triggered. This is also true for 'group_wait', which is an
important difference between 'for: ...' in Prometheus alert rules and a
'group_wait' in Alertmanager. The alert rules require the condition to
be true through the entire time; the 'group_wait' just requires it to be
true at the end.)

That group_interval is a ticker can interact with alert groups
that come and go, because Alertmanager only cleans up empty alert
groups every so often (in the current code, I believe once every 30
seconds). Until this cleanup happens, the alert group remains running
under its group_interval ticker and as far as I know, if a new alert in
the group appears, the time delay that applies is not group_wait, it is
the group_interval ticker. This is probably not an issue in your
situation, but I mention it for completeness.

As a general note, if you want a detailed look into what Alertmanager
thinks of your current alerts, the magic trick is:

curl -s http://localhost:9093/api/v1/alerts | jq .

This gives you much more information than the web UI does.

- cks

mikayzo

Jun 21, 2019, 7:16:15 AM6/21/19
to Prometheus Users
[Because of sensitive info I have to scramble some details of the alerts.]

I changed some timing configurations, so currently:

scrape_interval: 60s
evaluation_interval: 60s

for: 2m

group_wait: 10s
group_interval: 1m
repeat_interval: 5m

So the next tests with the webhook went as follows:

09:10: http killed

09:12 `curl -s http://localhost:9093/api/v1/alerts | jq .` :

{
  "labels": {
    "alert": "http",
    "instance": "host1",
    "severity": "warning"
  },
  "startsAt": "2019-06-21T09:12:20.867457637Z",
  "endsAt": "2019-06-21T09:15:20.867457637Z",
  "status": {
    "state": "active",
    "silencedBy": [],
    "inhibitedBy": []
  },
  "receivers": [
    "API"
  ],
  "fingerprint": "dfd5398a6c326632"
},

09:15: `curl -s http://localhost:9093/api/v1/alerts | jq .` :

{
  "labels": {
    "alert": "http",
    "instance": "host1",
    "severity": "warning"
  },
  "startsAt": "2019-06-21T09:12:20.867457637Z",
  "endsAt": "2019-06-21T09:17:20.867457637Z",
  "status": {
    "state": "active",
    "silencedBy": [],
    "inhibitedBy": []
  },
  "receivers": [
    "API"
  ],
  "fingerprint": "dfd5398a6c326632"
},

09:16: http started



In the meantime:

# tail -f /var/log/syslog | grep host1


Jun 21 11:12:31 prom1 alertmanager[46160]: {"aggrGroup":"{}/{instance=~\"^(?:host1)$\"}:{alert=\"http\", instance=\"host1\"}","alerts":"[HTTP Down[dfd5398][active]]","caller":"dispatch.go:430","component":"dispatcher","level":"debug","msg":"flushing","ts":"2019-06-21T09:12:31.006808898Z"}
Jun 21 11:13:31 prom1 alertmanager[46160]: {"aggrGroup":"{}/{instance=~\"^(?:host1)$\"}:{alert=\"http\", instance=\"host1\"}","alerts":"[HTTP Down[dfd5398][active]]","caller":"dispatch.go:430","component":"dispatcher","level":"debug","msg":"flushing","ts":"2019-06-21T09:13:31.007241701Z"}
Jun 21 11:14:31 prom1 alertmanager[46160]: {"aggrGroup":"{}/{instance=~\"^(?:host1)$\"}:{alert=\"http\", instance=\"host1\"}","alerts":"[HTTP Down[dfd5398][active]]","caller":"dispatch.go:430","component":"dispatcher","level":"debug","msg":"flushing","ts":"2019-06-21T09:14:31.008633924Z"}
Jun 21 11:15:31 prom1 alertmanager[46160]: {"aggrGroup":"{}/{instance=~\"^(?:host1)$\"}:{alert=\"http\", instance=\"host1\"}","alerts":"[HTTP Down[dfd5398][active]]","caller":"dispatch.go:430","component":"dispatcher","level":"debug","msg":"flushing","ts":"2019-06-21T09:15:31.009031847Z"}
Jun 21 11:16:31 prom1 alertmanager[46160]: {"aggrGroup":"{}/{instance=~\"^(?:host1)$\"}:{alert=\"http\", instance=\"host1\"}","alerts":"[HTTP Down[dfd5398][resolved]]","caller":"dispatch.go:430","component":"dispatcher","level":"debug","msg":"flushing","ts":"2019-06-21T09:16:31.009444342Z"}
Jun 21 11:18:21 prom1 alertmanager[46160]: {"aggrGroup":"{}/{instance=~\"^(?:host1)$\"}:{alert=\"http\", instance=\"host1\"}","alerts":"[HTTP Down[dfd5398][resolved]]","caller":"dispatch.go:430","component":"dispatcher","level":"debug","msg":"flushing","ts":"2019-06-21T09:18:20.999826613Z"}
Jun 21 11:20:21 prom1 alertmanager[46160]: {"aggrGroup":"{}/{instance=~\"^(?:host1)$\"}:{alert=\"http\", instance=\"host1\"}","alerts":"[HTTP Down[dfd5398][resolved]]","caller":"dispatch.go:430","component":"dispatcher","level":"debug","msg":"flushing","ts":"2019-06-21T09:20:20.998772723Z"}
Jun 21 11:22:21 prom1 alertmanager[46160]: {"aggrGroup":"{}/{instance=~\"^(?:host1)$\"}:{alert=\"http\", instance=\"host1\"}","alerts":"[HTTP Down[dfd5398][resolved]]","caller":"dispatch.go:430","component":"dispatcher","level":"debug","msg":"flushing","ts":"2019-06-21T09:22:21.008138495Z"}
Jun 21 11:24:21 prom1 alertmanager[46160]: {"aggrGroup":"{}/{instance=~\"^(?:host1)$\"}:{alert=\"http\", instance=\"host1\"}","alerts":"[HTTP Down[dfd5398][resolved]]","caller":"dispatch.go:430","component":"dispatcher","level":"debug","msg":"flushing","ts":"2019-06-21T09:24:21.003222682Z"}
Jun 21 11:26:21 prom1 alertmanager[46160]: {"aggrGroup":"{}/{instance=~\"^(?:host1)$\"}:{alert=\"http\", instance=\"host1\"}","alerts":"[HTTP Down[dfd5398][resolved]]","caller":"dispatch.go:430","component":"dispatcher","level":"debug","msg":"flushing","ts":"2019-06-21T09:26:20.99797136Z"}
Jun 21 11:28:21 prom1 alertmanager[46160]: {"aggrGroup":"{}/{instance=~\"^(?:host1)$\"}:{alert=\"http\", instance=\"host1\"}","alerts":"[HTTP Down[dfd5398][resolved]]","caller":"dispatch.go:430","component":"dispatcher","level":"debug","msg":"flushing","ts":"2019-06-21T09:28:21.003219098Z"}
Jun 21 11:30:21 prom1 alertmanager[46160]: {"aggrGroup":"{}/{instance=~\"^(?:host1)$\"}:{alert=\"http\", instance=\"host1\"}","alerts":"[HTTP Down[dfd5398][resolved]]","caller":"dispatch.go:430","component":"dispatcher","level":"debug","msg":"flushing","ts":"2019-06-21T09:30:20.998402894Z"}

And this is what I receive in the webhook:

{
  "receiver": "API",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alert": "http",
        "alertname": "HTTP Down",
        "instance": "host1",
        "severity": "warning"
      },
      "annotations": {
        "summary": "HTTP is down"
      },
      "startsAt": "2019-06-21T09:12:20.867457637Z",
      "endsAt": "0001-01-01T00:00:00Z"
    }
  ],
  "groupLabels": {
    "alert": "http",
    "instance": "host1"
  },
  "commonLabels": {
    "alert": "http",
    "alertname": "HTTP Down",
    "instance": "host1",
    "severity": "warning"
  },
  "commonAnnotations": {
    "summary": "HTTP is down"
  },
  "version": "4",
  "groupKey": "{}/{instance=~\"^(?:host1)$\"}:{alert=\"http\", instance=\"host1\"}"
}
127.0.0.1 - - [21/Jun/2019 09:13:31] "POST / HTTP/1.1" 200 -

The exact same payload (differing only in the access-log timestamp) was received three more times:

127.0.0.1 - - [21/Jun/2019 09:14:31] "POST / HTTP/1.1" 200 -
127.0.0.1 - - [21/Jun/2019 09:15:31] "POST / HTTP/1.1" 200 -
127.0.0.1 - - [21/Jun/2019 09:16:31] "POST / HTTP/1.1" 200 -

----

Looks like `endsAt` gets completely lost somewhere when it is sent to the webhook.

Also, the resolved message is sent six times for some reason, but the webhook never receives it.

While capturing packets I noticed that when the first resolved message is sent, I get an OK packet, but nothing for the remaining five resolve packets:

11:16:31.002369 IP 127.0.0.1.38005 > 127.0.0.1.40414: Flags [P.], seq 1:18, ack 1237, win 1365, options [nop,nop,TS val 757657828 ecr 757657828], length 17
	0x0000:  0000 0304 0006 0000 0000 0000 0000 0800  ................
	0x0010:  4500 0045 3e1b 4000 4006 fe95 7f00 0001  E..E>.@.@.......
	0x0020:  7f00 0001 9475 9dde 9f35 885b 1057 bdbc  .....u...5.[.W..
	0x0030:  8018 0555 fe39 0000 0101 080a 2d28 f0e4  ...U.9......-(..
	0x0040:  2d28 f0e4 4854 5450 2f31 2e30 2032 3030  -(..HTTP/1.0.200
	0x0050:  204f 4b0d 0a                             .OK..

Chris Siebenmann

Jun 21, 2019, 1:02:11 PM6/21/19
to mikayzo, Prometheus Users, cks.prom...@cs.toronto.edu
Two little notes:

[...]
> "alerts": [
> {
> "status": "firing",
[...]
> "startsAt": "2019-06-21T09:12:20.867457637Z",
> "endsAt": "0001-01-01T00:00:00Z"
[...]
> Looks like `EndAt` gets completely lost somewhere when it is sent to
> the webhook.

Zapping endsAt for still-firing alerts is sensible behavior, but to
understand it you need to know that Prometheus normally periodically
re-notifies Alertmanager that every firing alert is still firing. Inside
Prometheus and Alertmanager, the 'endsAt' time for firing alerts is used
as a timeout if Alertmanager stops hearing about the alerts at all; when
Alertmanager reaches the endsAt time (or past it), it considers the
alert resolved. Exposing it to external webhooks and so on would only be
confusing, since there is no guarantee that the alert actually will end
at that time.

(Every time Prometheus re-notifies Alertmanager about an alert, it sets
the endsAt time to be three minutes in the future. This means that
you can deduce the last re-notification time from the currently visible
endsAt.)
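Given that, the last re-notification time can be deduced from the currently visible endsAt like so (a sketch, assuming the three-minute hold described above):

```python
from datetime import datetime, timedelta

# Prometheus (in these versions) re-sets endsAt to ~3 minutes after each
# re-notification, so the last re-notification is endsAt minus that hold.
HOLD = timedelta(minutes=3)

def last_renotify(ends_at: datetime) -> datetime:
    return ends_at - HOLD

# Using an endsAt value like the ones shown earlier in this thread
# (fractional seconds truncated to what fromisoformat accepts):
ends = datetime.fromisoformat("2019-06-21T09:15:20.867457")
print(last_renotify(ends))  # 2019-06-21 09:12:20.867457
```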

endsAt should be set properly if you receive a resolved alert, for
obvious reasons; now everything knows when the alert actually finished.
I assume this works for webhooks; I know that it does in email templates.

> Also, resolved message is being sent for 6 times for some reason, but
> webhook never receives it.

This is expected behavior. When an alert stops firing in Prometheus,
Prometheus doesn't immediately delete it (although it stops showing in
the web UI). Instead the now-inactive alert sticks around for 15 minutes
and keeps being sent to Alertmanager over that time, generally once a
minute (Prometheus's --rules.alert.resend-delay setting, which is also
how often it re-notifies Alertmanager that a firing alert is still firing).

More details than you probably wanted to know about this are here:

https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusAlertsClearingTime

- cks

mikayzo

Jun 25, 2019, 4:35:12 AM6/25/19
to Prometheus Users
I have read the link you shared a couple of times but I am still confused.

Is the webhook supposed to get a Resolved message in the first place or not?

I can see resolved messages in Alertmanager for around 15 minutes, as you've said, but after they stop the webhook still does not receive anything resembling "Resolved". It just stops receiving alerts once Alertmanager notices that there are no more `firing` alerts.

I was thinking of using a webhook to turn services on/off in an application via an API. But in this case, when the alert starts firing and the webhook receives these notifications, the service gets turned off. However, as there are no resolved messages, it stays off forever, which is not what I intend.

Do you have any suggestions on how to approach this? Should there be some additional logic in the API, for example turning services back on after a few minutes if no more `firing` alerts are received? Or is there a better solution?
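For what it's worth, once resolved notifications do arrive, the on/off logic could key off the per-alert status field of the webhook payload. A hypothetical sketch (the state dict and service names are placeholders; a real handler would call your application's API instead):

```python
import json

# Hypothetical service state keyed by instance; stands in for real API calls.
service_state = {}

def handle_notification(body: str) -> None:
    """React to an Alertmanager webhook payload (the version "4" JSON format)."""
    payload = json.loads(body)
    for alert in payload["alerts"]:
        instance = alert["labels"]["instance"]
        if alert["status"] == "firing":
            service_state[instance] = "off"
        elif alert["status"] == "resolved":
            service_state[instance] = "on"

firing = json.dumps({"status": "firing",
                     "alerts": [{"status": "firing",
                                 "labels": {"instance": "host1"}}]})
resolved = json.dumps({"status": "resolved",
                       "alerts": [{"status": "resolved",
                                   "labels": {"instance": "host1"}}]})
handle_notification(firing)
handle_notification(resolved)
print(service_state)  # {'host1': 'on'}
```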

Simon Pasquier

Jun 25, 2019, 10:59:10 AM6/25/19
to mikayzo, Prometheus Users
The "... "msg":"flushing" ..." log message doesn't actually mean that
AlertManager sends a notification, only that it is evaluating the
alert group. It sends a notification only if the group has changed
or repeat_interval has elapsed.
By default, the webhook receiver should send a notification when all
alerts in the group are resolved.
You haven't shared the AlertManager configuration so it is difficult
to say more.

>
> While catching packets I noticed, that when the first resolved message is being sent - i get an OK packet, but nothing for the remaining 5 resolve packets:
>
> 11:16:31.002369 IP 127.0.0.1.38005 > 127.0.0.1.40414: Flags [P.], seq 1:18, ack 1237, win 1365, options [nop,nop,TS val 757657828 ecr 757657828], length 17
>
> 0x0000: 0000 0304 0006 0000 0000 0000 0000 0800 ................
> 0x0010: 4500 0045 3e1b 4000 4006 fe95 7f00 0001 E..E>.@.@.......
> 0x0020: 7f00 0001 9475 9dde 9f35 885b 1057 bdbc .....u...5.[.W..
> 0x0030: 8018 0555 fe39 0000 0101 080a 2d28 f0e4 ...U.9......-(..
> 0x0040: 2d28 f0e4 4854 5450 2f31 2e30 2032 3030 -(..HTTP/1.0.200
> 0x0050: 204f 4b0d 0a .OK..
>

Chris Siebenmann

Jun 28, 2019, 7:56:54 PM6/28/19
to mikayzo, Prometheus Users, cks.prom...@cs.toronto.edu
> I have read the link you shared couple of times but I am still confused.
>
> Is the webhook supposed to get Resolved message in this first place or not?
>
> I can see resolved messages in alertmanager for around 15 minutes as you`ve
> said, but after they stop - webhook still does not receive anything similar
> to "Resolved". It just stops receiving alerts once the alertmanager notices
> that there are no more `firing` alerts.

As far as I can see and find discussions of, webhooks get notified
of Resolved alerts in general; there is no particular exclusion for
them. One example is this article:

https://www.robustperception.io/audio-alerting-with-prometheus

If your webhook is not receiving 'resolved' alerts, I would suspect
something in the specific Alertmanager configuration for it.

- cks