Messages are dropping because too many are queued in AlertManager


shivakumar sajjan

Jan 6, 2022, 1:14:41 AM
to Prometheus Users
Hi,

I have a single-instance Alertmanager cluster, and I see the warning below in the Alertmanager container logs:

level=warn ts=2021-11-03T08:50:44.528Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4125 limit=4096

Alertmanager version information:

Branch: HEAD 
BuildDate: 20190708-14:31:49 
BuildUser: root@868685ed3ed0 
GoVersion: go1.12.6 
Revision: 1ace0f76b7101cccc149d7298022df36039858ca 
Version: 0.18.0

AlertManager metrics

# HELP alertmanager_cluster_members Number indicating current number of members in cluster.
# TYPE alertmanager_cluster_members gauge
alertmanager_cluster_members 1
# HELP alertmanager_cluster_messages_pruned_total Total number of cluster messages pruned.
# TYPE alertmanager_cluster_messages_pruned_total counter
alertmanager_cluster_messages_pruned_total 23020
# HELP alertmanager_cluster_messages_queued Number of cluster messages which are queued.
# TYPE alertmanager_cluster_messages_queued gauge
alertmanager_cluster_messages_queued 4125
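
For reference, the metrics above come straight from Alertmanager's own metrics endpoint (a quick check, assuming the default port 9093):

curl -s http://127.0.0.1:9093/metrics | grep alertmanager_cluster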

I am new to alerting. Could you please answer the questions below?

  • Why are messages queuing up? Because of this, Alertmanager is not sending alert resolved notifications to the webhook instance.

  • What is the solution to the above issue?

  • How do we see those queued messages in AlertManager?

  • Do we lose alerts when messages are dropped because too many are queued?

  • Why are messages queued even though there is logic to prune messages at a regular interval (every 15 minutes)?

  • Do we lose alerts when Alertmanager prunes messages at that regular interval?


Thanks,

Shiva


Matthias Rampke

Jan 6, 2022, 4:15:45 PM
to shivakumar sajjan, Prometheus Users
What is your webhook receiver? Are any of the resolve messages getting through? Are the requests succeeding?

I think Alertmanager will retry failed webhooks, though I'm not sure for how long. That would keep them in the queue, leading to what you observe in Alertmanager.
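
One quick way to check whether the webhook requests are succeeding is Alertmanager's own notification counters (a sketch, assuming the default port 9093):

# a rising alertmanager_notifications_failed_total{integration="webhook"} means the webhook requests are failing
curl -s http://127.0.0.1:9093/metrics | grep -E 'alertmanager_notifications(_failed)?_total'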

/MR


shivakumar sajjan

Jan 7, 2022, 12:27:31 AM
to Matthias Rampke, Prometheus Users
Hi Matthias,

Thanks for responding to my questions.

It is a service where I added an API endpoint that Alertmanager posts alert information (firing/resolved) to whenever alerts are triggered.
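
For reference, the endpoint accepts the documented Alertmanager webhook payload; a hand test against it looks roughly like this (host, port, and path are placeholders for my service):

curl -v -X POST http://my-webhook-service:8080/alerts \
  -H 'Content-Type: application/json' \
  -d '{
    "version": "4",
    "groupKey": "{}:{alertname=\"TestAlert\"}",
    "status": "resolved",
    "receiver": "webhook",
    "groupLabels": {"alertname": "TestAlert"},
    "commonLabels": {"alertname": "TestAlert", "severity": "info"},
    "commonAnnotations": {},
    "externalURL": "http://127.0.0.1:9093",
    "alerts": [{
      "status": "resolved",
      "labels": {"alertname": "TestAlert", "severity": "info"},
      "annotations": {},
      "startsAt": "2022-01-07T00:00:00Z",
      "endsAt": "2022-01-07T00:05:00Z",
      "generatorURL": "http://127.0.0.1:9090/graph"
    }]
  }'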

Below are the warnings in the Alertmanager pod logs:

level=warn ts=2022-01-06T20:27:41.726Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4097 limit=4096
level=warn ts=2022-01-06T20:42:41.726Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4121 limit=4096
level=warn ts=2022-01-06T21:27:41.726Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4097 limit=4096
level=warn ts=2022-01-06T21:42:41.726Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4098 limit=4096
level=warn ts=2022-01-06T21:57:41.727Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4098 limit=4096
level=warn ts=2022-01-06T22:42:41.727Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4123 limit=4096
level=warn ts=2022-01-06T22:57:41.727Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4155 limit=4096
level=warn ts=2022-01-06T23:12:41.727Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4100 limit=4096
level=warn ts=2022-01-06T23:27:41.728Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4097 limit=4096
level=warn ts=2022-01-06T23:42:41.728Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4099 limit=4096
level=warn ts=2022-01-06T23:57:41.728Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4097 limit=4096
level=warn ts=2022-01-07T00:27:41.728Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4124 limit=4096
level=warn ts=2022-01-07T00:42:41.729Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4124 limit=4096
level=warn ts=2022-01-07T00:57:41.729Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4097 limit=4096
level=warn ts=2022-01-07T01:42:41.729Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4099 limit=4096
level=warn ts=2022-01-07T01:57:41.730Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4098 limit=4096
level=warn ts=2022-01-07T02:42:41.730Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4098 limit=4096
level=warn ts=2022-01-07T02:57:41.730Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4155 limit=4096
level=warn ts=2022-01-07T03:12:41.730Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4098 limit=4096
level=warn ts=2022-01-07T03:27:41.731Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4098 limit=4096
level=warn ts=2022-01-07T03:42:41.731Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4099 limit=4096
level=warn ts=2022-01-07T03:57:41.731Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4098 limit=4096
level=warn ts=2022-01-07T04:42:41.732Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4098 limit=4096
level=warn ts=2022-01-07T04:57:41.732Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4097 limit=4096



There are also errors in the Prometheus server pod logs:

level=error ts=2021-09-06T10:11:22.754Z caller=notifier.go:528 component=notifier alertmanager=http://127.0.0.1:9093/api/v1/alerts count=0 msg="Error sending alert" err="Post http://127.0.0.1:9093/api/v1/alerts: context deadline exceeded"
level=error ts=2021-09-07T23:36:27.753Z caller=notifier.go:528 component=notifier alertmanager=http://127.0.0.1:9093/api/v1/alerts count=0 msg="Error sending alert" err="Post http://127.0.0.1:9093/api/v1/alerts: context deadline exceeded"
level=error ts=2021-09-07T23:36:52.755Z caller=notifier.go:528 component=notifier alertmanager=http://127.0.0.1:9093/api/v1/alerts count=0 msg="Error sending alert" err="Post http://10.64.87.17:9093/api/v1/alerts: dial tcp 127.0.0.1:9093: i/o timeout"
level=error ts=2021-09-07T23:37:02.756Z caller=notifier.go:528 component=notifier alertmanager=http://127.0.0.1:9093/api/v1/alerts count=64 msg="Error sending alert" err="Post http://127.0.0.1:9093/api/v1/alerts: context deadline exceeded"
level=error ts=2021-09-07T23:37:12.757Z caller=notifier.go:528 component=notifier alertmanager=http://127.0.0.1:9093/api/v1/alerts count=11 msg="Error sending alert" err="Post http://127.0.0.1:9093/api/v1/alerts: context deadline exceeded"
level=error ts=2021-09-07T23:37:27.755Z caller=notifier.go:528 component=notifier alertmanager=http://127.0.0.1:9093/api/v1/alerts count=0 msg="Error sending alert" err="Post http://127.0.0.1:9093/api/v1/alerts: context deadline exceeded"
level=error ts=2021-09-07T23:37:42.754Z caller=notifier.go:528 component=notifier alertmanager=http://127.0.0.1:9093/api/v1/alerts count=0 msg="Error sending alert" err="Post http://127.0.0.1:9093/api/v1/alerts: context deadline exceeded"
level=error ts=2021-09-07T23:37:56.967Z caller=notifier.go:528 component=notifier alertmanager=http://127.0.0.1:9093/api/v1/alerts count=2 msg="Error sending alert" err="Post http://127.0.0.1:9093/api/v1/alerts: context deadline exceeded"
level=error ts=2021-09-07T23:38:06.968Z caller=notifier.go:528 component=notifier alertmanager=http://127.0.0.1:9093/api/v1/alerts count=18 msg="Error sending alert" err="Post http://127.0.0.1:9093/api/v1/alerts: context deadline exceeded"


May I know what could be the cause?

Thanks,
Shiva

Matthias Rampke

Jan 13, 2022, 4:01:12 PM
to shivakumar sajjan, Prometheus Users
From these logs, it's not clear. Try increasing the log level (--log.level=debug) on Alertmanager and Prometheus.
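
A sketch of what the debug flag looks like on the command line (config paths are placeholders; in a Kubernetes setup these go into the container args):

alertmanager --config.file=/etc/alertmanager/alertmanager.yml --log.level=debug
prometheus --config.file=/etc/prometheus/prometheus.yml --log.level=debug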

We do not know enough about your setup and the receiving service to solve this for you. You will have to systematically troubleshoot every part of the chain.

It seems that there are multiple issues at once: Alertmanager is falling behind on sending notifications, and Prometheus is timing out sending alerts to Alertmanager. Make sure the node is not overloaded, make sure the webhook receiver is working correctly and quickly, and that Alertmanager can reach it (send a webhook by hand using curl from the Alertmanager host).
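
For the Prometheus-to-Alertmanager hop specifically, you can also post a test alert by hand to the v1 API that shows up in your Prometheus logs (a sketch; the label values are placeholders):

curl -v -X POST http://127.0.0.1:9093/api/v1/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "HandSentTestAlert", "severity": "info"}}]'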

I hope this gives you some pointers to find out more yourself!

/MR

shivakumar sajjan

Jan 14, 2022, 2:27:46 AM
to Matthias Rampke, Prometheus Users
Thanks, Matthias. Sure, I will troubleshoot each component and get back to you if there are any issues.



Thanks,
Shiva
