Hi guys,
I run Prometheus in HA mode - 4 nodes, each with Prometheus + Alertmanager.
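For reference, each node's Alertmanager is clustered with the other three roughly like this (a sketch, not my exact invocation; hostnames, ports and paths are placeholders):
############################################################
# Gossip mesh on the default cluster port; each node peers with the rest.
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=prometheus-02:9094 \
  --cluster.peer=prometheus-03:9094 \
  --cluster.peer=prometheus-04:9094
############################################################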
Everything works fine, but quite often an alert fires again even though Alertmanager has already resolved it.
Below are logs from an example event (Chrony_Service_Down) as recorded by Alertmanager:
############################################################################################################
(1) Jan 10 10:48:48 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:48:48.746Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=Chrony_Service_Down[d8c020a][active]
(2) Jan 10 10:49:19 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:49:19.460Z caller=nflog.go:553 level=debug component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\", datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"server.example.com\\\", instance=\\\"server.example.com\\\", job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\" integration:\"opsgenie\" > timestamp:<seconds:1673347759 nanos:262824014 > firing_alerts:10151928354614242630 > expires_at:<seconds:1673779759 nanos:262824014 > "
(3) Jan 10 10:50:48 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:50:48.745Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=Chrony_Service_Down[d8c020a][resolved]
(4) Jan 10 10:51:29 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:51:29.183Z caller=nflog.go:553 level=debug component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\", datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"server.example.com\\\", instance=\\\"server.example.com\\\", job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\" integration:\"opsgenie\" > timestamp:<seconds:1673347888 nanos:897562679 > resolved_alerts:10151928354614242630 > expires_at:<seconds:1673779888 nanos:897562679 > "
(5) Jan 10 10:51:49 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:51:49.745Z caller=nflog.go:553 level=debug component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\", datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"server.example.com\\\", instance=\\\"server.example.com\\\", job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\" integration:\"opsgenie\" > timestamp:<seconds:1673347909 nanos:649205670 > firing_alerts:10151928354614242630 > expires_at:<seconds:1673779909 nanos:649205670 > "
(6) Jan 10 10:51:59 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:51:59.312Z caller=nflog.go:553 level=debug component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\", datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"server.example.com\\\", instance=\\\"server.example.com\\\", job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\" integration:\"opsgenie\" > timestamp:<seconds:1673347919 nanos:137020780 > resolved_alerts:10151928354614242630 > expires_at:<seconds:1673779919 nanos:137020780 > "
(7) Jan 10 10:54:58 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:54:58.744Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=Chrony_Service_Down[d8c020a][resolved]
#############################################################################################################
The interesting one is entry (5) (Jan 10 10:51:49), where Alertmanager fired the alert a second time even though a minute earlier (Jan 10 10:50:48) the alert had already been marked as resolved.
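If it helps with debugging: as far as I know a partitioned gossip mesh is a common cause of duplicate notifications, so here is how the cluster state can be checked on each node (a sketch against the Alertmanager v2 status API; hostnames are placeholders):
############################################################
# Expect status "ready" and a full peer list on every host.
for h in prometheus-01 prometheus-02 prometheus-03 prometheus-04; do
  echo "== $h =="
  curl -s "http://$h:9093/api/v2/status" \
    | jq '.cluster | {status, peers: (.peers | length)}'
done
############################################################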
This behavior generates duplicate alerts in our system, which is quite annoying at our scale.
Worth mentioning:
- For test purposes the underlying metric is scraped by all 4 Prometheus servers (the default), but the alert rule is evaluated by only one Prometheus (see the sketch after this list).
- The event occurs only once, so there is no flapping that might cause the alert to fire again.
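The single-evaluator setup can be confirmed via the Prometheus rules API, e.g. like this (hostnames are placeholders; only one node should print the rule name):
############################################################
# List every node that has Chrony_Service_Down loaded as an alerting rule.
for h in prometheus-01 prometheus-02 prometheus-03 prometheus-04; do
  echo "== $h =="
  curl -s "http://$h:9090/api/v1/rules" \
    | jq -r '.data.groups[].rules[] | select(.name == "Chrony_Service_Down") | .name'
done
############################################################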
Thanks