Alertmanager issues with high number of notifications

Martin Chodúr

Sep 13, 2018, 2:01:11 AM

to Prometheus Users

Hi, can I ask how many notifications Alertmanager can reasonably accept from Prometheus while still keeping up and syncing with the other AM instances in the cluster?
I hit an issue when a lot of alerts fired (4k per 5m) and stayed that way for 20m, with an evaluation interval of 20s.

This generated a fairly large amount of notifications to Alertmanager. I have 2 AMs in HA; one of them restarted and then started sending `full_sync` messages to the other, and it kept trying to reconnect, with every reconnect failing.

With debug logging I see:
```
level=warn ts=2018-09-12T11:38:25.038242103Z caller=cluster.go:219 component=cluster msg="failed to join cluster" err="1 error occurred:\n\n* Failed to join 10.249.0.57: read tcp 10.64.37.98:41516->10.249.0.57:32453: i/o timeout"
level=warn ts=2018-09-12T11:38:25.038286673Z caller=main.go:265 msg="unable to join gossip mesh" err="1 error occurred:\n\n* Failed to join 10.249.0.57: read tcp 10.64.37.98:41516->10.249.0.57:32453: i/o timeout"
level=debug ts=2018-09-12T11:05:53.996286988Z caller=channel.go:111 component=cluster msg="failed to send reliable" key=nfl node=01CPQQ4RWN9B30CF9APWWEP84R err="dial tcp 10.99.41.178:6783: i/o timeout"
caller=cluster.go:287 component=cluster memberlist="2018/09/12 18:46:48 [ERR] memberlist: Push/Pull with 01C...GWV failed: dial tcp 10.99.41.103:6783: i/o timeout\n"
```
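To see which peers the timeouts point at, log lines like the ones above can be grouped by remote address. The following is a minimal sketch, not part of Alertmanager: the helper names `parse_logfmt` and `timeout_peers` are hypothetical, and it assumes the lines are in the key=value (logfmt) shape shown in the excerpt.

```python
import re
from collections import Counter

# key=value pairs; a value is either a double-quoted string (backslash
# escapes allowed) or a run of non-space characters.
PAIR_RE = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')

# Remote address after "dial tcp " or after "->" in a "read tcp" error.
PEER_RE = re.compile(r'(?:dial tcp |->)(\d+\.\d+\.\d+\.\d+:\d+)')


def parse_logfmt(line):
    """Split one logfmt line into a dict of fields (quotes stripped)."""
    fields = {}
    for key, value in PAIR_RE.findall(line):
        if value.startswith('"'):
            value = value[1:-1]
        fields[key] = value
    return fields


def timeout_peers(lines):
    """Count i/o timeouts per remote address (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        if "i/o timeout" not in line:
            continue
        fields = parse_logfmt(line)
        # The error text lives in "err" or, for memberlist lines, "memberlist".
        text = fields.get("err") or fields.get("memberlist") or ""
        match = PEER_RE.search(text)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Calling `timeout_peers(open("alertmanager.log"))` (the file name is an assumption) would show whether the failures concentrate on one address or port, for example the gossip port versus a NAT-mapped one.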

I tried increasing the cluster TCP timeouts, peer intervals and so on, but nothing helped. I ran tcptraceroute between the two containers and everything looked OK as well. The problem was resolved only after I cleared the `nflog`. I suspect the nflog got too big and the cluster failed to sync it?

Has anyone encountered similar behaviour?

Simon Pasquier

Sep 17, 2018, 9:43:22 AM

to Martin Chodúr, Prometheus Users

Maybe the folks at SoundCloud have more insight. You're right that when Alertmanager connects to the cluster, it receives nflog data from the other peers, but unless that data is very large, it shouldn't be a big problem. How large is the file in your case?
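One way to answer the size question is to look at the notification log snapshot on disk. This is a minimal sketch, assuming the snapshot is a file named `nflog` inside the directory passed as Alertmanager's storage path; the function names are hypothetical. The snapshot is written periodically, so it may lag behind the in-memory state.

```python
import os


def nflog_size_bytes(storage_path):
    """Return the size of the nflog snapshot in bytes, or None if absent.

    Assumes the snapshot is stored as <storage_path>/nflog.
    """
    path = os.path.join(storage_path, "nflog")
    try:
        return os.path.getsize(path)
    except FileNotFoundError:
        return None


def human_size(num_bytes):
    """Format a byte count using decimal units (1 kB = 1000 B)."""
    for unit in ("B", "kB", "MB", "GB"):
        if num_bytes < 1000 or unit == "GB":
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1000
```

For example, `human_size(nflog_size_bytes("/alertmanager"))` would report the size for a data directory at `/alertmanager` (the path is an assumption; use whatever the instance is started with).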

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscribe@googlegroups.com.
To post to this group, send email to prometheus-users@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/908860ed-9584-4c4e-a36f-4dbd05e5c27d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Martin Chodúr

Sep 18, 2018, 7:15:02 AM

to Prometheus Users

I hope Prometheus 2.4.0 with its alert throttling will help. I also have 120h retention on AM, which I could try to lower.
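A rough back-of-the-envelope estimate shows how much throttling could reduce the load. This sketch rests on several assumptions: that "4k" means about 4000 alerts firing at the same time, that without throttling each firing alert is sent to Alertmanager on every evaluation, and that throttling limits resends to once per minute per alert (an assumed value). It counts alert sends from Prometheus to Alertmanager, not nflog entries.

```python
import math


def alert_sends(firing_alerts, duration_s, send_every_s):
    """Upper bound on alert sends: each firing alert is sent once per period."""
    cycles = math.ceil(duration_s / send_every_s)
    return firing_alerts * cycles


# Figures from the thread: 20 minutes of firing, 20s evaluation interval.
# The 4000 simultaneous alerts and the 60s resend period are assumptions.
unthrottled = alert_sends(4000, 20 * 60, 20)  # 4000 * 60 cycles = 240000
throttled = alert_sends(4000, 20 * 60, 60)    # 4000 * 20 cycles = 80000

print(unthrottled, throttled, unthrottled / throttled)
```

Under these assumptions throttling cuts the number of sends by a factor of three, which lowers the load on Alertmanager's API but does not by itself bound the size of the nflog.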

The nflog size is quite big sometimes (graph not preserved).

Number of alerts (graph not preserved).

Simon Pasquier

Sep 18, 2018, 7:56:41 AM

to Martin Chodúr, Prometheus Users

100kB doesn't sound like a lot to me, and I would assume that transferring this amount of data over TCP shouldn't be a problem.

Did you have a look at the `alertmanager_cluster_*` and `alertmanager_oversize_gossip_*` metrics?
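The metrics mentioned above can be pulled out of the `/metrics` output with a small filter. This is a minimal sketch with hypothetical function names; it assumes Alertmanager serves metrics at `http://localhost:9093/metrics`, and it matches on the shorter prefix `alertmanager_oversize` so that both `oversize_` and `oversized_` spellings are caught (an assumption about how the metrics are named in a given version).

```python
import urllib.request

# Prefixes to keep; the second is deliberately short (see assumption above).
PREFIXES = ("alertmanager_cluster_", "alertmanager_oversize")


def filter_metrics(exposition_text, prefixes=PREFIXES):
    """Return sample lines whose metric name starts with one of the prefixes."""
    selected = []
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        if line.startswith(prefixes):
            selected.append(line)
    return selected


def fetch_metrics(url="http://localhost:9093/metrics", timeout=5):
    """Fetch the raw text exposition from an Alertmanager instance."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Running `print("\n".join(filter_metrics(fetch_metrics())))` against each instance and comparing the output would show whether the peers see each other and whether large gossip messages are being dropped or failing.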
