This generated a fairly large amount of notifications to Alertmanager. I have 2 AMs in HA; one of them restarted and then started sending `full_sync` messages to the other, repeatedly trying to reconnect, and those reconnects kept failing.
With debug logging I see:
```
level=warn ts=2018-09-12T11:38:25.038242103Z caller=cluster.go:219 component=cluster msg="failed to join cluster" err="1 error occurred:\n\n* Failed to join 10.249.0.57: read tcp 10.64.37.98:41516->10.249.0.57:32453: i/o timeout"
level=warn ts=2018-09-12T11:38:25.038286673Z caller=main.go:265 msg="unable to join gossip mesh" err="1 error occurred:\n\n* Failed to join 10.249.0.57: read tcp 10.64.37.98:41516->10.249.0.57:32453: i/o timeout"
level=debug ts=2018-09-12T11:05:53.996286988Z caller=channel.go:111 component=cluster msg="failed to send reliable" key=nfl node=01CPQQ4RWN9B30CF9APWWEP84R err="dial tcp 10.99.41.178:6783: i/o timeout"
caller=cluster.go:287 component=cluster memberlist="2018/09/12 18:46:48 [ERR] memberlist: Push/Pull with 01C...GWV failed: dial tcp 10.99.41.103:6783: i/o timeout\n"
```
I tried increasing the timeouts for cluster TCP, peer intervals and so on, but nothing helped. I also ran tcptraceroute between the two containers and everything looked fine. The problem was resolved only after I cleared the `nflog`. I suspect the nflog got too big and the cluster failed to sync it?
Has anyone encountered similar behaviour?
--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscribe@googlegroups.com.
To post to this group, send email to prometheus-users@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/908860ed-9584-4c4e-a36f-4dbd05e5c27d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Maybe the folks at SoundCloud have more insights. You're right that when Alertmanager connects to the cluster, it receives nflog data from the other peers, but unless the file is very large, that shouldn't be a big problem. How large is the file in your case?
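In case it helps with answering that, here is a minimal sketch for reporting the size of the notification log. It assumes Alertmanager keeps it in a file named `nflog` directly under its storage directory, and that the storage directory is `data`; both are assumptions about your deployment, so adjust the path as needed. The helper names `nflog_size_bytes` and `human_readable` are made up for this example.

```python
import os


def nflog_size_bytes(storage_path: str = "data") -> int:
    """Return the size in bytes of <storage_path>/nflog, or -1 if it is missing.

    The file name "nflog" and the default directory "data" are assumptions
    about the Alertmanager storage layout; change them to match your setup.
    """
    path = os.path.join(storage_path, "nflog")
    try:
        return os.path.getsize(path)
    except FileNotFoundError:
        return -1


def human_readable(num_bytes: int) -> str:
    """Format a byte count using binary units, capped at GiB."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024 or unit == "GiB":
            return f"{size:.1f} {unit}"
        size /= 1024


if __name__ == "__main__":
    size = nflog_size_bytes("data")
    if size < 0:
        print("no nflog file found under the given storage path")
    else:
        print(f"nflog size: {human_readable(size)}")
```

Running this on each peer before and after the alert storm would show whether the file actually grew by a meaningful amount.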
On Thu, Sep 13, 2018 at 8:01 AM, Martin Chodúr <m.ch...@seznam.cz> wrote:
Hi, can I ask how many notifications is a reasonable amount for Alertmanager to accept from Prometheus, such that it can withstand the load and still sync with the other AM instances in the cluster?
I experienced an issue when a lot of alerts fired (4k per 5m) and stayed that way for 20m, with an evaluation interval of 20s.