Alertmanager High Availability Gossiping and Duplicate Alerts


kekr...@gmail.com

Jun 7, 2019, 11:45:16 AM
to Prometheus Users

I was wondering if anyone can tell me what Alertmanager parameter represents the wait times for AM0 and AM1 in the screenshot below? Are these built in based on the fact that you have a cluster?



I have two Prometheus servers and two Alertmanager servers.  My alerts are not getting dedup’d.  I can only guess that these “wait” times are not set correctly.
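Both Alertmanagers clearly receive the alerts, since each one produces its own notification. The recommended HA setup is for every Prometheus to send to every Alertmanager; a minimal sketch of that alerting block in prometheus.yml, assuming static targets on the default port 9093 and reusing the addresses from later in this post:

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - '999.888.777.146:9093'
                - '999.888.777.148:9093'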


I get one alert in Slack from each Alertmanager.  I know this because I have --web.external-url set to the IP address of each Alertmanager; clicking an alert in Slack takes me to the GUI of the Alertmanager that sent it.


I have searched the web and cannot find a solution.  The problems other people have posted do not seem to apply. 


My peer position does not change – I have stable values.


My alertmanager_cluster_members are stable – I have no flopping


My alertmanager_cluster_failed_peers value is 0.


A graph of the alerts:


 

The corresponding alerts in Slack:


OCI LogFire Alert Notifications APP [12:40 PM]


[FIRING:1] InstanceDown (999.888.777.148:9100 node_exporter critical)

 

Endpoint on 999.888.777.148:9100 is down

Server:  999.888.777.148:9100

Reported by Job:  node_exporter


Exporter down (instance 999.888.777.148:9100)

Prometheus exporter down

VALUE = 0

LABELS: map[__name__:up instance:999.888.777.148:9100 job:node_exporter]


[FIRING:1] ExporterDown (999.888.777.148:9100 node_exporter warning)


[FIRING:1] InstanceDown (999.888.777.148:9100 node_exporter critical)


Endpoint on 999.888.777.148:9100 is down

Server:  999.888.777.148:9100

Reported by Job:  node_exporter


Exporter down (instance 999.888.777.148:9100)

Prometheus exporter down

VALUE = 0

LABELS: map[__name__:up instance:999.888.777.148:9100 job:node_exporter]


[RESOLVED] InstanceDown (999.888.777.148:9100 node_exporter critical)


[RESOLVED] ExporterDown (999.888.777.148:9100 node_exporter warning)


Sometimes the resolve gets deduplicated and sometimes I get four messages.  In this case, it worked.


The route in both alertmanager.yml files is identical:


route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'


If I silence an alert in one Alertmanager, the silence does appear in the other.  That tells me the two Alertmanagers are communicating.

 

From what I know and see on the status page, the cluster looks good.

 

Status from 999.888.777.146

Peers:

 

    Name: 01DCJWMTSV9JENF0HYAAS41ES2

    Address: 999.888.777.148:9094

    Name: 01DCJX4PM2JQCJ3N469MYWB6FD

    Address: 999.888.777.146:9094

 

Status from 999.888.777.148

Peers:

 

    Name: 01DCJWMTSV9JENF0HYAAS41ES2

    Address: 999.888.777.148:9094

    Name: 01DCJX4PM2JQCJ3N469MYWB6FD

    Address: 999.888.777.146:9094

 

Element                                                                                                                               Value

alertmanager_peer_position{instance="999.888.777.146:9093",job="alertmanager"}        1

alertmanager_peer_position{instance="999.888.777.148:9093",job="alertmanager"}        0

Simon Pasquier

Jun 11, 2019, 5:35:59 AM
to kekr...@gmail.com, Prometheus Users
The wait time between peers before sending notifications is controlled by the --cluster.peer-timeout flag, which has a default value of 15s.
The group_interval of 10s probably explains the duplicates. You need to either decrease --cluster.peer-timeout or increase group_interval.
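Put differently, each peer delays its notification by roughly its peer position multiplied by --cluster.peer-timeout: peer 0 sends immediately, peer 1 waits about 15s with the default. If group_interval is only 10s, the second peer's group re-evaluates and notifies before the gossip about the first peer's notification reaches it, hence the duplicates. One possible adjustment (a sketch only, keeping the rest of the route from the original post):

    # alertmanager started with the default peer timeout of 15s:
    alertmanager --config.file=alertmanager.yml --cluster.peer-timeout=15s

    # alertmanager.yml -- group_interval raised above the peer timeout:
    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 30s      # > cluster.peer-timeout, so the gossiped notification log arrives first
      repeat_interval: 1h
      receiver: 'web.hook'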


Kevin Kruppa

Jun 18, 2019, 4:39:01 PM
to Prometheus Users
Thank you for your reply, Simon.

I tried your suggestion but still got duplicates.

Here is what I did -

Test 1
--cluster.peer-timeout left at the default value of 15s
group_interval = 15s

Duplicate Firing messages / single Resolve message

Test 2
--cluster.peer-timeout left at the default value of 15s
group_interval = 25s

Duplicate Firing messages / single Resolve message

Test 3
--cluster.peer-timeout = 5s
group_interval = 15s

Duplicate Firing and Resolve messages.

I did notice that in tests 1 and 2 it took longer for the duplicate firing messages to show up.  In test 3, they appeared at pretty much the same time.







Kevin Kruppa

Jun 18, 2019, 4:52:09 PM
to Prometheus Users
After I posted my reply, I tried one more test:

--cluster.peer-timeout left at the default value of 15s
group_interval = 45s

Waiting 45 seconds seemed to do the trick.  There was only one Firing message and one Resolved message.
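For reference, the combination that deduplicated cleanly in this last test looks roughly like this (the command line is a sketch of how the flags might be passed; only group_interval changed from the original route):

    # Alertmanager command line (peer-timeout left at its 15s default):
    alertmanager --config.file=alertmanager.yml \
      --cluster.peer=999.888.777.146:9094 \
      --cluster.peer=999.888.777.148:9094

    # alertmanager.yml:
    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 45s      # well above peer-timeout x peer position
      repeat_interval: 1h
      receiver: 'web.hook'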

Thank you, Simon, for your help.

Ben Kochie

Jun 19, 2019, 9:18:20 AM
to Simon Pasquier, kekr...@gmail.com, Prometheus Users
Maybe we should log a warning if the interval is too short for gossip.
