Alert manager in HA forwards multiple alarms for the same alert


Raji Amarnath

May 14, 2020, 3:13:22 AM
to Prometheus Users
Hi, 

We have a scenario where Prometheus and Alert Manager are running in HA mode. 

We have 2 instances of Prometheus running on 2 nodes, and 2 Alertmanagers configured for the Prometheus instances.
When an alert fires, one of the Alertmanagers forwards the notification first. Later, when the alert is resolved, the second Alertmanager fires and resolves the same alert:
Alertmanager 1 =>  raises alert
After some time, when
Alertmanager 1 => clears alert, then
Alertmanager 2 => raises the same alert
Alertmanager 2 => clears alert

Expectation:
We want a single alarm to be raised and cleared for any given alert. We would also like to know why the second Alertmanager sends and then clears the alert after the first Alertmanager has already cleared it.

We enabled debug logging for the Alertmanager, but we didn't find anything related to this. Please let us know where we can find more relevant logs.

Prometheus Version:
2.11.1
Alertmanager Version:
0.16.2

Julius Volz

May 14, 2020, 8:23:05 AM
to Raji Amarnath, Prometheus Users
1) Are your two Alertmanager instances clustered together via --cluster.peer, and do the logs indicate that the clustering is working correctly?

2) Do both Prometheus servers send all their alerts to both of your Alertmanager instances (instead of each Prometheus sending alerts to just one of the Alertmanager instances)?

These two conditions are necessary for things to work properly, though I'm not sure why you'd encounter a situation where one AM sends and clears an alert, and then the other does the same, in serial, with some time in between.
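For point 2, a minimal sketch of what the alerting section of each Prometheus server's prometheus.yml would look like — the ports and the redacted addresses follow the pattern used elsewhere in this thread, so treat them as placeholders:

```yaml
# Identical on both Prometheus servers: every Prometheus sends every
# alert to every Alertmanager, and the Alertmanager cluster deduplicates.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'xxx.aaa.bbb.10:9093'
            - 'xxx.aaa.bbb.11:9093'
```

If each Prometheus only lists its "own" Alertmanager, the cluster's deduplication can still work via gossip, but the two instances see alerts at different times and duplicate notifications become much more likely.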

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/9a809dfd-4eca-464b-87ab-7cc9c6140a75%40googlegroups.com.


--
Julius Volz
PromLabs - promlabs.com

Venkata Bhagavatula

May 15, 2020, 3:38:36 AM
to Prometheus Users
Hi Julius,

I am from the same team as Rajalakshmi.

Both Alertmanager instances are clustered together via --cluster.peer.
Here is how the Alertmanager processes are started:
Alertmanager1:
/bin/alertmanager --config.file=/etc/alertmanager/alertmanager.yml --storage.path=/appdata/cpro/alertmanager --data.retention=120h --alerts.gc-interval=30m --log.level=debug --web.listen-address=0.0.0.0:9093 --cluster.peer-timeout=15s --cluster.pushpull-interval=1m0s --cluster.tcp-timeout=10s --cluster.probe-interval=1s --cluster.settle-timeout=1m0s --cluster.reconnect-interval=10s --cluster.reconnect-timeout=6h0m0s --cluster.listen-address=xxx.aaa.bbb.10:9094 --cluster.peer=xxx.aaa.bbb.11:9094

Alertmanager2:

/bin/alertmanager --config.file=/etc/alertmanager/alertmanager.yml --storage.path=/appdata/cpro/alertmanager --data.retention=120h --alerts.gc-interval=30m --log.level=debug --web.listen-address=0.0.0.0:9093 --cluster.peer-timeout=15s --cluster.pushpull-interval=1m0s --cluster.tcp-timeout=10s --cluster.probe-interval=1s --cluster.settle-timeout=1m0s --cluster.reconnect-interval=10s --cluster.reconnect-timeout=6h0m0s --cluster.listen-address=xxx.aaa.bbb.11:9094 --cluster.peer=xxx.aaa.bbb.10:9094

From the logs, I see that clustering is fine:
level=debug ts=2020-05-13T13:56:03.645392388Z caller=cluster.go:149 component=cluster msg="resolved peers to following addresses" peers=xxx.aaa.bbb.11:9094
level=debug ts=2020-05-13T13:56:03.648236922Z caller=delegate.go:209 component=cluster received=NotifyJoin node=01E87548XXRBCNYA1KZBSQA69Y addr=xxx.aaa.bbb.10:9094
level=debug ts=2020-05-13T13:56:03.64950765Z caller=cluster.go:295 component=cluster memberlist="2020/05/13 15:56:03 [DEBUG] memberlist: Initiating push/pull sync with: xxx.aaa.bbb.11:9094\n"
level=debug ts=2020-05-13T13:56:03.650570599Z caller=delegate.go:209 component=cluster received=NotifyJoin node=01E77XH84J43BF3WXNYM611JBC addr=xxx.aaa.bbb.11:9094
level=debug ts=2020-05-13T13:56:03.650600684Z caller=cluster.go:479 component=cluster msg="peer rejoined" peer=01E77XH84J43BF3WXNYM611JBC
level=debug ts=2020-05-13T13:56:03.650661887Z caller=cluster.go:231 component=cluster msg="joined cluster" peers=1

Following is the alert rule:
- alert: RsyslogServiceInactive
  annotations:
    description: '{{$labels.instance}}: Rsyslog is inactive'
    summary: '{{$labels.instance}}: Major issue detected'
  expr: (node_systemd_unit_state{name="rsyslog.service",state="active"} == 0)
  for: 1m
  labels:
    eventType: '11'
    host: '{{$labels.instance}}'
    id: '3001218'
    key: MOC019
    name: RsyslogServiceInactive
    probableCause: '714'
    severity: MAJOR
    text: Rsyslog is inactive

Following is the alertmanager configuration:
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

For point 2, how can we verify that both servers are sending alerts to the Alertmanagers? When an alert fires, the firing notification is sent only once. After the alert is resolved, a resolved notification is sent, and after some time one more firing-and-resolved sequence arrives.

After we increased group_wait from the default 10s to 30s, the issue no longer occurs. Can you give us some pointers on how the alert notification pipeline works?
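Since the question asks how the notification pipeline works, here is a hedged sketch: the route block from the configuration above, annotated with my reading of the documented semantics (the values are this thread's own; the comments are not authoritative):

```yaml
route:
  group_by: ['alertname']
  # group_wait: how long to buffer the first alerts of a new group
  # before sending the initial notification for that group.
  group_wait: 30s
  # group_interval: minimum time between notifications about new or
  # changed alerts within an already-notified group.
  group_interval: 10s
  # repeat_interval: how long before a still-firing, already-notified
  # group is re-sent.
  repeat_interval: 12h
```

In a cluster, each Alertmanager additionally delays its send by its peer position times --cluster.peer-timeout (15s in the flags above) and skips a notification if the gossiped notification log shows a peer already sent it. One plausible reading of the behavior you saw is that with a 10s group_wait the second instance flushed its group before gossip about the first instance's notification reached it, while 30s left enough headroom — but that is speculation, not a confirmed diagnosis.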

Thanks and regards,
Chalapathi.
