An alert fires twice even though the event occurred only once


LukaszSz

Jan 13, 2023, 6:08:37 AM
to Prometheus Users
Hi guys,

I have Prometheus in HA mode: 4 nodes, each running Prometheus + Alertmanager.
Everything works fine, but quite often an alert fires again even though the event has already been resolved by Alertmanager.

Below are the logs for an example event (Chrony_Service_Down) recorded by Alertmanager:

############################################################################################################
(1) Jan 10 10:48:48 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:48:48.746Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=Chrony_Service_Down[d8c020a][active]

(2) Jan 10 10:49:19 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:49:19.460Z caller=nflog.go:553 level=debug component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\", datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"server.example.com\\\", instance=\\\"server.example.com\\\", job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\" integration:\"opsgenie\" > timestamp:<seconds:1673347759 nanos:262824014 > firing_alerts:10151928354614242630 > expires_at:<seconds:1673779759 nanos:262824014 > "

(3) Jan 10 10:50:48 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:50:48.745Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=Chrony_Service_Down[d8c020a][resolved]

(4) Jan 10 10:51:29 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:51:29.183Z caller=nflog.go:553 level=debug component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\", datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"server.example.com\\\", instance=\\\"server.example.com\\\", job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\" integration:\"opsgenie\" > timestamp:<seconds:1673347888 nanos:897562679 > resolved_alerts:10151928354614242630 > expires_at:<seconds:1673779888 nanos:897562679 > "

(5) Jan 10 10:51:49 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:51:49.745Z caller=nflog.go:553 level=debug component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\", datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"server.example.com\\\", instance=\\\"server.example.com\\\", job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\" integration:\"opsgenie\" > timestamp:<seconds:1673347909 nanos:649205670 > firing_alerts:10151928354614242630 > expires_at:<seconds:1673779909 nanos:649205670 > "

(6) Jan 10 10:51:59 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:51:59.312Z caller=nflog.go:553 level=debug component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\", datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"server.example.com\\\", instance=\\\"server.example.com\\\", job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\" integration:\"opsgenie\" > timestamp:<seconds:1673347919 nanos:137020780 > resolved_alerts:10151928354614242630 > expires_at:<seconds:1673779919 nanos:137020780 > "

(7) Jan 10 10:54:58 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:54:58.744Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=Chrony_Service_Down[d8c020a][resolved]
#############################################################################################################

The interesting entry is (5) (Jan 10 10:51:49), where Alertmanager fired the alert a second time even though a minute earlier (Jan 10 10:50:48) the alert had been marked as resolved.
This behavior generates duplicate alerts in our system, which is quite annoying at our scale.

Worth mentioning:
- For test purposes the event is scraped by all 4 Prometheus servers (the default), but the alert rule is evaluated by only one Prometheus.
- The event occurs only once, so there is no flapping that might cause the alert to fire again.

Thanks

Brian Candler

Jan 13, 2023, 7:34:52 AM
to Prometheus Users
Are the Alertmanagers clustered?  If so, you should configure Prometheus to deliver the alert to *all* Alertmanagers.

LukaszSz

Jan 13, 2023, 7:55:49 AM
to Prometheus Users
Yes, Brian. As I mentioned in my post, the Alertmanagers are in a cluster, and this event is visible on all 4 of them.
The problem I described is that alerts fire twice, which generates duplicates.

Brian Candler

Jan 13, 2023, 8:02:14 AM
to Prometheus Users
Yes, but have you configured the Prometheus (the one with the alerting rules) to have all four Alertmanagers as its destinations?

LukaszSz

Jan 13, 2023, 9:16:27 AM
to Prometheus Users
Yes. The Prometheus server is configured to communicate with all Alertmanagers (sorry, there are actually 8 Alertmanagers):

alerting:
  alert_relabel_configs:
  - action: labeldrop
    regex: "^prometheus_server$"
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager1:9093
      - alertmanager2:9093
      - alertmanager3:9093
      - alertmanager4:9093
      - alertmanager5:9093
      - alertmanager6:9093
      - alertmanager7:9093
      - alertmanager8:9093 

Brian Candler

Jan 13, 2023, 9:28:57 AM
to Prometheus Users
That's a lot of alertmanagers.  Are they all fully meshed?  (But I'd say 2 or 3 would be better - spread over different regions)
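For reference, "fully meshed" means every node lists all the other nodes as peers at startup via Alertmanager's `--cluster.peer` flag. A minimal sketch for one node of the 8-node cluster (hostnames taken from the config above; the config path and the default cluster port 9094 are assumptions):

```shell
# On alertmanager1: list every other node as a peer.
# Repeat analogously on each of the other 7 nodes.
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager2:9094 \
  --cluster.peer=alertmanager3:9094 \
  --cluster.peer=alertmanager4:9094 \
  --cluster.peer=alertmanager5:9094 \
  --cluster.peer=alertmanager6:9094 \
  --cluster.peer=alertmanager7:9094 \
  --cluster.peer=alertmanager8:9094
```

If any node is missing peers, the notification log won't be gossiped consistently and duplicates become likely.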

LukaszSz

Jan 13, 2023, 9:53:14 AM
to Prometheus Users
Interesting. The Alertmanagers are indeed spread over 3 different regions (2x Asia, 2x USA, 4x Europe).
Maybe there is a latency problem between them, e.g. latency in the gossip messages?

Brian Candler

Jan 15, 2023, 5:58:47 AM
to Prometheus Users
I wouldn't have thought that a few hundred ms of latency would make any difference.

I am however worried about the gossiping.  If this is one monster-sized cluster, then each of the 8 nodes should be communicating with the other 7.

I'd say this is a bad design.  Either:

1. Have a single global alertmanager cluster with 2 nodes - that will give you excellent high availability for your alerting.  (How often do you expect two regions to go offline simultaneously?)  Or 3 nodes if your management absolutely insists on it.  (But this isn't the sort of cluster that needs to maintain a quorum.)

Or:

2. Completely separate the regions.  Have one alertmanager cluster in region A, one cluster in region B, one cluster in region C.  Have the prometheus instances in region A only talking to the alertmanager instances in region A, and so on.  In this case, each region sends its alerts completely independently.

There is little benefit in option (2) unless there are tight restrictions on inter-region communication; it gives you a lot more stuff to manage.  If you need to go this route, then having a frontend like Karma or alerta.io may be helpful.
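A sketch of option (2): each region runs its own small cluster whose nodes peer only with each other, and the local Prometheus servers target only the local nodes. The hostnames (am-a1, am-a2) and paths here are hypothetical:

```shell
# Region A, node am-a1: peer only with the other region-A node.
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.peer=am-a2:9094

# Region A, node am-a2: the mirror image.
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.peer=am-a1:9094

# The Prometheus servers in region A then list only am-a1:9093 and
# am-a2:9093 under alerting.alertmanagers. Regions B and C are set up
# the same way with their own nodes; no gossip crosses region borders.
```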

LukaszSz

Jan 16, 2023, 9:32:12 AM
to Prometheus Users
Hi ,

(1) We would like to avoid such an architecture. In this scenario one region is left without a local Alertmanager, which means we could lose alerts if the connection is lost between that region and the regions where the Alertmanager cluster is configured.

(2) It looks very promising. Currently the one blocking point is the lack of a frontend where we can set silences. I saw your previous posts about Karma. We are going to test that direction.

Our other ideas are:

(3) Reduce the current AM cluster from 8 to 4 nodes (1 AM per region).
(4) If (3) does not help, we want to tweak the gossip settings to improve communication between the AM nodes. Does anyone have experience with gossip tuning and best practices for AM HA?
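Regarding (4): gossip timing in Alertmanager is controlled by a handful of `--cluster.*` flags. A sketch showing the relevant ones at their default values (illustrative, not a recommendation; the config path is an assumption):

```shell
# --cluster.peer-timeout: how long a node waits for a peer to send a
#   notification before sending it itself. Too low a value over
#   high-latency inter-region links is a classic cause of duplicates.
# --cluster.gossip-interval: how often gossip messages are broadcast.
# --cluster.pushpull-interval: interval for full state synchronization.
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.peer-timeout=15s \
  --cluster.gossip-interval=200ms \
  --cluster.pushpull-interval=1m0s
```

If duplicates persist on a WAN mesh, raising `--cluster.peer-timeout` is usually the first knob to try.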

Thanks 

Brian Candler

Jan 16, 2023, 1:19:37 PM
to Prometheus Users
> (1) We would like to avoid such an architecture. In this scenario one region is left without a local Alertmanager, which means we could lose alerts if the connection is lost between that region and the regions where the Alertmanager cluster is configured.

But if you've totally lost connectivity from this region, then even if you try to send a message to PagerDuty or OpsGenie or whatever, won't that fail too?

LukaszSz

Jan 23, 2023, 4:24:45 AM
to Prometheus Users
>But if you've totally lost connectivity from this region, then even if you try to send a message to PagerDuty or OpsGenie or whatever, won't that fail too?
That is true. 

Nevertheless, what I have done so far is reduce the number of nodes in the cluster from 8 to 4 - every region now has one Alertmanager node. After one week, no duplication has been observed. I will keep this config for the next few weeks.