Alert manager in HA forwards multiple alarms for the same alert


Raji Amarnath

May 14, 2020, 3:13:22 AM
to Prometheus Users
Hi, 

We have a scenario where Prometheus and Alert Manager are running in HA mode. 

We have 2 instances of Prometheus running on 2 nodes, and 2 Alertmanagers configured for the Prometheus instances.
When an alert fires, one of the Alertmanagers forwards the notification first. Later, when the alert is resolved, the second Alertmanager fires and resolves the same alert:
Alertmanager 1 =>  raises alert
After some time, when
Alertmanager 1 => clears alert, then
Alertmanager 2 => raises the same alert
Alertmanager 2 => clears alert

Expectation:
We want a single alarm to be raised and cleared for any given alert. We would also like to know why the second Alertmanager sends and then clears the alert after the first Alertmanager has already cleared it.

We enabled debug logging for the Alertmanager, but we didn't find anything related to this. Please let us know where we can find more relevant logs.

Prometheus Version:
2.11.1
Alertmanager Version:
0.16.2

Julius Volz

May 14, 2020, 8:23:05 AM
to Raji Amarnath, Prometheus Users
1) Are your two Alertmanager instances clustered together via --cluster.peer, and do the logs indicate that the clustering is working correctly?

2) Do both Prometheus servers send all their alerts to both of your Alertmanager instances (instead of each Prometheus sending alerts to just one of the Alertmanager instances)?

These two conditions are necessary for things to work properly, though I'm not sure why you'd encounter a situation where one AM sends and clears an alert, and then the other does the same, in serial, with some time in between.
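For point 2, a minimal sketch of what the alerting section of each Prometheus server's prometheus.yml would look like — the ports and the redacted addresses follow the pattern used elsewhere in this thread, so treat them as placeholders:

```yaml
# Identical on both Prometheus servers: every Prometheus sends every
# alert to every Alertmanager, and the Alertmanager cluster deduplicates.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'xxx.aaa.bbb.10:9093'
            - 'xxx.aaa.bbb.11:9093'
```

If each Prometheus only lists its "own" Alertmanager, the cluster's deduplication can still work via gossip, but the two instances see alerts at different times and duplicate notifications become much more likely.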

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/9a809dfd-4eca-464b-87ab-7cc9c6140a75%40googlegroups.com.


--
Julius Volz
PromLabs - promlabs.com

Venkata Bhagavatula

May 15, 2020, 3:38:36 AM
to Prometheus Users
Hi Julius,

I am from the same team as Rajalakshmi.

Both Alertmanager instances are clustered together via --cluster.peer.
Here is how the Alertmanager processes are started:
Alertmanager1:
/bin/alertmanager --config.file=/etc/alertmanager/alertmanager.yml --storage.path=/appdata/cpro/alertmanager --data.retention=120h --alerts.gc-interval=30m --log.level=debug --web.listen-address=0.0.0.0:9093 --cluster.peer-timeout=15s --cluster.pushpull-interval=1m0s --cluster.tcp-timeout=10s --cluster.probe-interval=1s --cluster.settle-timeout=1m0s --cluster.reconnect-interval=10s --cluster.reconnect-timeout=6h0m0s --cluster.listen-address=xxx.aaa.bbb.10:9094 --cluster.peer=xxx.aaa.bbb.11:9094

Alertmanager2:

/bin/alertmanager --config.file=/etc/alertmanager/alertmanager.yml --storage.path=/appdata/cpro/alertmanager --data.retention=120h --alerts.gc-interval=30m --log.level=debug --web.listen-address=0.0.0.0:9093 --cluster.peer-timeout=15s --cluster.pushpull-interval=1m0s --cluster.tcp-timeout=10s --cluster.probe-interval=1s --cluster.settle-timeout=1m0s --cluster.reconnect-interval=10s --cluster.reconnect-timeout=6h0m0s --cluster.listen-address=xxx.aaa.bbb.11:9094 --cluster.peer=xxx.aaa.bbb.10:9094

From the logs, I see that clustering is fine:
level=debug ts=2020-05-13T13:56:03.645392388Z caller=cluster.go:149 component=cluster msg="resolved peers to following addresses" peers=xxx.aaa.bbb.11:9094
level=debug ts=2020-05-13T13:56:03.648236922Z caller=delegate.go:209 component=cluster received=NotifyJoin node=01E87548XXRBCNYA1KZBSQA69Y addr=xxx.aaa.bbb.10:9094
level=debug ts=2020-05-13T13:56:03.64950765Z caller=cluster.go:295 component=cluster memberlist="2020/05/13 15:56:03 [DEBUG] memberlist: Initiating push/pull sync with: xxx.aaa.bbb.11:9094\n"
level=debug ts=2020-05-13T13:56:03.650570599Z caller=delegate.go:209 component=cluster received=NotifyJoin node=01E77XH84J43BF3WXNYM611JBC addr=xxx.aaa.bbb.11:9094
level=debug ts=2020-05-13T13:56:03.650600684Z caller=cluster.go:479 component=cluster msg="peer rejoined" peer=01E77XH84J43BF3WXNYM611JBC
level=debug ts=2020-05-13T13:56:03.650661887Z caller=cluster.go:231 component=cluster msg="joined cluster" peers=1

Following is the alert rule:
- alert: RsyslogServiceInactive
  annotations:
    description: '{{$labels.instance}}: Rsyslog is inactive'
    summary: '{{$labels.instance}}: Major issue detected'
  expr: (node_systemd_unit_state{name="rsyslog.service",state="active"} == 0)
  for: 1m
  labels:
    eventType: '11'
    host: '{{$labels.instance}}'
    id: '3001218'
    key: MOC019
    name: RsyslogServiceInactive
    probableCause: '714'
    severity: MAJOR
    text: Rsyslog is inactive

Following is the alertmanager configuration:
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

For point 2, how can we verify that both servers are sending alerts to the Alertmanagers? When an alert fires, the firing notification is sent only once. After the alert is resolved, a resolved notification is sent, and after some time one more firing-and-resolved sequence arrives.

After we increased group_wait from the default 10s to 30s, the issue no longer occurs. Can you give us some pointers on how the alert notification pipeline works?
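Since the question asks how the notification pipeline works, here is a hedged sketch: the route block from the configuration above, annotated with my reading of the documented semantics (the values are this thread's own; the comments are not authoritative):

```yaml
route:
  group_by: ['alertname']
  # group_wait: how long to buffer the first alerts of a new group
  # before sending the initial notification for that group.
  group_wait: 30s
  # group_interval: minimum time between notifications about new or
  # changed alerts within an already-notified group.
  group_interval: 10s
  # repeat_interval: how long before a still-firing, already-notified
  # group is re-sent.
  repeat_interval: 12h
```

In a cluster, each Alertmanager additionally delays its send by its peer position times --cluster.peer-timeout (15s in the flags above) and skips a notification if the gossiped notification log shows a peer already sent it. One plausible reading of the behavior you saw is that with a 10s group_wait the second instance flushed its group before gossip about the first instance's notification reached it, while 30s left enough headroom — but that is speculation, not a confirmed diagnosis.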

Thanks and regards,
Chalapathi.
