Hi Julius,
I am on the same team as Rajalakshmi.
Both alertmanager instances are clustered together via --cluster.peer.
This is how the Alertmanager processes are started:
Alertmanager1:
/bin/alertmanager --config.file=/etc/alertmanager/alertmanager.yml --storage.path=/appdata/cpro/alertmanager --data.retention=120h --alerts.gc-interval=30m --log.level=debug --web.listen-address=0.0.0.0:9093 --cluster.peer-timeout=15s --cluster.pushpull-interval=1m0s --cluster.tcp-timeout=10s --cluster.probe-interval=1s --cluster.settle-timeout=1m0s --cluster.reconnect-interval=10s --cluster.reconnect-timeout=6h0m0s --cluster.listen-address=xxx.aaa.bbb.10:9094 --cluster.peer=xxx.aaa.bbb.11:9094
Alertmanager2:
/bin/alertmanager --config.file=/etc/alertmanager/alertmanager.yml --storage.path=/appdata/cpro/alertmanager --data.retention=120h --alerts.gc-interval=30m --log.level=debug --web.listen-address=0.0.0.0:9093 --cluster.peer-timeout=15s --cluster.pushpull-interval=1m0s --cluster.tcp-timeout=10s --cluster.probe-interval=1s --cluster.settle-timeout=1m0s --cluster.reconnect-interval=10s --cluster.reconnect-timeout=6h0m0s --cluster.listen-address=xxx.aaa.bbb.11:9094 --cluster.peer=xxx.aaa.bbb.10:9094

From the logs, I see that clustering is fine:

level=debug ts=2020-05-13T13:56:03.645392388Z caller=cluster.go:149 component=cluster msg="resolved peers to following addresses" peers=xxx.aaa.bbb.11:9094
level=debug ts=2020-05-13T13:56:03.648236922Z caller=delegate.go:209 component=cluster received=NotifyJoin node=01E87548XXRBCNYA1KZBSQA69Y addr=xxx.aaa.bbb.10:9094
level=debug ts=2020-05-13T13:56:03.64950765Z caller=cluster.go:295 component=cluster memberlist="2020/05/13 15:56:03 [DEBUG] memberlist: Initiating push/pull sync with: xxx.aaa.bbb.11:9094\n"
level=debug ts=2020-05-13T13:56:03.650570599Z caller=delegate.go:209 component=cluster received=NotifyJoin node=01E77XH84J43BF3WXNYM611JBC addr=xxx.aaa.bbb.11:9094
level=debug ts=2020-05-13T13:56:03.650600684Z caller=cluster.go:479 component=cluster msg="peer rejoined" peer=01E77XH84J43BF3WXNYM611JBC
level=debug ts=2020-05-13T13:56:03.650661887Z caller=cluster.go:231 component=cluster msg="joined cluster" peers=1
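In case it is useful, we believe cluster membership can also be confirmed by querying GET /api/v2/status on each instance and checking the peers list. A minimal sketch of that check, parsing a sample response (the sample below is assumed from the v2 API shape, not copied from our servers):

```python
import json

# Hypothetical sample of the "cluster" section returned by
# GET http://<alertmanager>:9093/api/v2/status (shape assumed
# from the v2 API; values are placeholders, not real output).
sample = """
{
  "cluster": {
    "status": "ready",
    "peers": [
      {"name": "01E87548XXRBCNYA1KZBSQA69Y", "address": "xxx.aaa.bbb.10:9094"},
      {"name": "01E77XH84J43BF3WXNYM611JBC", "address": "xxx.aaa.bbb.11:9094"}
    ]
  }
}
"""

status = json.loads(sample)
peers = status["cluster"]["peers"]

# A healthy two-node cluster should list both peers on each instance.
print(len(peers))  # → 2
```

We would run this against the JSON fetched from both instances and expect each to report two peers.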
Following is the alert rule:
- alert: RsyslogServiceInactive
  annotations:
    description: '{{$labels.instance}}: Rsyslog is inactive'
    summary: '{{$labels.instance}}: Major issue detected'
  expr: (node_systemd_unit_state{name="rsyslog.service",state="active"} == 0)
  for: 1m
  labels:
    eventType: 11
    host: '{{$labels.instance}}'
    id: 3001218
    key: MOC019
    name: RsyslogServiceInactive
    probableCause: 714
    severity: MAJOR
    text: Rsyslog is inactive
Following is the alertmanager configuration:
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']
For point 2, how can we verify that both servers are sending alerts to Alertmanager? When an alert is raised, the firing notification is sent only once. After the alert is resolved, a resolved notification is sent, and then after some time one more firing-and-resolved sequence arrives.
After we increased group_wait from our earlier value of 10s to 30s, the issue no longer occurs. Can you give us some pointers on how the alert notification pipeline works?
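For context, our current (possibly wrong) mental model of the timing knobs is: group_wait delays the first notification for a new alert group, group_interval throttles notifications about changes to an existing group, and repeat_interval re-sends unchanged groups. A rough sketch with the values from our config, purely illustrative:

```python
# Our (possibly wrong) understanding of the notification timing,
# using the values from our configuration above; illustrative only.
group_wait = 30              # seconds: first notification waits this long after the group forms
group_interval = 10          # seconds: later changes to the group wait at least this long
repeat_interval = 12 * 3600  # seconds: unchanged groups are re-sent after this

# Under this model, the firing notification would go out ~30s after the
# alert enters a new group, and a resolve arriving right after that
# would be sent on the next group_interval tick.
first_firing_sent = group_wait
earliest_resolve_sent = group_wait + group_interval
print(first_firing_sent, earliest_resolve_sent)  # → 30 40
```

Please correct us if this model is not how the pipeline actually behaves.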
Thanks and regards,
Chalapathi.