Can anyone tell me which Alertmanager parameter represents the wait times for AM0 and AM1 in the screenshot below? Are these built in simply because the Alertmanagers form a cluster?
I have two Prometheus servers and two Alertmanager servers. My alerts are not being deduplicated. I can only guess that these "wait" times are not set correctly.
I get one alert in Slack from each Alertmanager. I know this because I have --web.external-url set to the IP address of each Alertmanager; clicking an alert in Slack takes me to the GUI of the Alertmanager that sent it.
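In case it matters, both Prometheus servers point at both Alertmanagers. The alerting block in each prometheus.yml looks roughly like this (reproduced from memory, so treat it as a sketch):

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - '999.888.777.146:9093'
            - '999.888.777.148:9093'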
I have searched the web and cannot find a solution. The problems other people have posted do not seem to apply.
My peer positions do not change – the values are stable.
My alertmanager_cluster_members count is stable – there is no flapping.
My alertmanager_cluster_failed_peers value is 0.
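Concretely, these are the cluster metrics I have been watching in the Prometheus expression browser, with the values I see noted alongside:

alertmanager_cluster_members        # 2 on both instances, never flaps
alertmanager_cluster_failed_peers   # 0 on both instances
alertmanager_peer_position          # 0 on one instance, 1 on the other, never changes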
A graph of the alerts:
The corresponding alerts in Slack:
OCI LogFire Alert Notifications APP [12:40 PM]
[FIRING:1] InstanceDown (999.888.777.148:9100 node_exporter critical)
Endpoint on 999.888.777.148:9100 is down
Server: 999.888.777.148:9100
Reported by Job: node_exporter
Exporter down (instance 999.888.777.148:9100)
Prometheus exporter down
VALUE = 0
LABELS: map[__name__:up instance:999.888.777.148:9100 job:node_exporter]
[FIRING:1] ExporterDown (999.888.777.148:9100 node_exporter warning)
[FIRING:1] InstanceDown (999.888.777.148:9100 node_exporter critical)
Endpoint on 999.888.777.148:9100 is down
Server: 999.888.777.148:9100
Reported by Job: node_exporter
Exporter down (instance 999.888.777.148:9100)
Prometheus exporter down
VALUE = 0
LABELS: map[__name__:up instance:999.888.777.148:9100 job:node_exporter]
[RESOLVED] InstanceDown (999.888.777.148:9100 node_exporter critical)
[RESOLVED] ExporterDown (999.888.777.148:9100 node_exporter warning)
Sometimes the resolve notifications are deduplicated and sometimes I get four of them. In this case, it worked.
The route section in both alertmanager.yml files is identical:
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
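The receiver named 'web.hook' feeds the app that posts into Slack. For completeness, it is roughly along these lines (shown as a plain webhook_config with a placeholder URL, not my real endpoint):

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'   # placeholder – the real URL points at the app that posts to Slack
        send_resolved: true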
If I create a silence in one Alertmanager, it does show up in the other. That tells me there is a connection between the two Alertmanagers.
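(If it helps, the same check can be done from the command line with amtool – add a silence against one instance, then query it back from the other:)

amtool silence add alertname=InstanceDown --comment="dedup test" --alertmanager.url=http://999.888.777.146:9093
amtool silence query --alertmanager.url=http://999.888.777.148:9093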
From what I know and see on the status page, the cluster looks good.
Status from 999.888.777.146
  Peers:
    Name:    01DCJWMTSV9JENF0HYAAS41ES2
    Address: 999.888.777.148:9094
    Name:    01DCJX4PM2JQCJ3N469MYWB6FD
    Address: 999.888.777.146:9094

Status from 999.888.777.148
  Peers:
    Name:    01DCJWMTSV9JENF0HYAAS41ES2
    Address: 999.888.777.148:9094
    Name:    01DCJX4PM2JQCJ3N469MYWB6FD
    Address: 999.888.777.146:9094
Element                                                                           Value
alertmanager_peer_position{instance="999.888.777.146:9093",job="alertmanager"}    1
alertmanager_peer_position{instance="999.888.777.148:9093",job="alertmanager"}    0