Atom counter hit max and restarted nodes, looking for metric to monitor

JK

Oct 12, 2020, 3:48:04 PM
to rabbitmq-users
Hello RabbitMQ users,

We recently had an issue with our cluster where each node restarted itself. The crash dump reported that each node hit the max number of atoms:

> rabbitmq-server: no more index entries in atom_tab (max=5000000)

I know we can get the memory usage of the atoms, and indeed I set up monitoring to alert if there is significant change in atom memory; but none of the tools seem to show the actual number of atoms, just how much memory they use (which looks to be negligible). We want to be able to alert on that in case it gets up to 5M atoms again. Is this possible or can it be possible to add it in future releases?

Thanks,
-J

Luke Bakken

Oct 12, 2020, 6:03:29 PM
to rabbitmq-users
Hello,

This is an extremely unusual issue. Please start by telling us what version of Erlang and RabbitMQ you're using, and on what operating system. Do you have high connection or channel churn? Are you using quorum queues?

Thanks -
Luke

JK

Oct 13, 2020, 1:31:44 PM
to rabbitmq-users
Hello Luke,

We're running 3.7.24 on this cluster but will be upgrading to 3.8.x later this month or early next month. Erlang is 22.3, and all of it is running on CentOS 7.6.1810. There is definitely a lot of connection churn. A lot. Like a lot, a lot. We've done many things to alleviate that churn with connection pooling and are still working to reduce it, but it has been a problem since long before I started at this company and will not be resolved any time soon. No quorum queues, just classic mirrored queues.

Thanks,
-J

Diana Corbacho

Oct 14, 2020, 5:35:07 AM
to rabbitmq-users
Hello J,

Do you have any plugins enabled? Could you share with us the output of `rabbitmqctl report`?

Thanks

Diana Corbacho

Oct 14, 2020, 6:44:21 AM
to rabbitmq-users
Also, do you perform any monitoring using the CLI or HTTP API?

Diana Corbacho

Oct 14, 2020, 7:25:51 AM
to rabbitmq-users
Do you still have the crash dump? It can be inspected with the crashdump viewer [1], which will show you the atom table. We might be able to figure out where those atoms come from with that.

JK

Oct 14, 2020, 3:32:49 PM
to rabbitmq-users
Hello Diana,

We monitor via CLI commands and extract the information we need for various purposes. Nothing uses HTTP except a weekly broker definition backup script. We use the following plugins:

[root@rabbit2-a ~]# rabbitmq-plugins list | grep -i '\[e'
[E*] rabbitmq_auth_backend_cache       3.7.24
[E*] rabbitmq_auth_backend_ldap        3.7.24
[E*] rabbitmq_management               3.7.24
[e*] rabbitmq_management_agent         3.7.24
[E*] rabbitmq_top                      3.7.24
[e*] rabbitmq_web_dispatch             3.7.24

The output of report is quite large and contains too much sensitive info in queue names so I'd rather not share that over the mail list.

I ran cdv and I think you are on the right track asking about our CLI monitoring, since it looks like the table filled up with all the temporary CLI client names. We have a health check script that uses rabbitmqctl very frequently (5s intervals). We can't really reduce that frequency; it took months of tuning to get that value in our prod setup. If the only answer is to reduce the frequency of our health checks then I might be able to revisit that, but it will be tight and still very frequent. We might be able to get away with 10s.

I'm not going to get into why we need it that responsive, but I would like to point out that we've used that same health check in 3.6.16 up until this year and never ran into this problem with atoms. We did have other issues with what looked to be memory leaks in binaries that would block publishes until the node was rebooted, even with very little queue depth. We did uptime tests, but ended up doing regular quarterly maintenance that would involve rebooting each node, so we never got to see nodes running for more than about 3 months. We upgraded to 3.7.24 back in April and that seems to have resolved the leaky binaries. Since the upgrade we've had more frequent maintenance windows, so we never had a chance to see nodes running as long as they were before, but things settled down so we were back to our quarterly schedule. This last uptime was only 70 days before atoms filled and crashed every node one after the other, and the frequency of health checks has been constant for years now. Before the upgrade it was >90 days uptime. I was never concerned about atoms before though, since their footprint was/is negligible when tracking down that previous memory issue with binaries.

That said, I don't expect nor want a 3.7.x fix. Our next maintenance on this cluster will be the 3.8.x upgrade I've been planning all year :) I am still interested in a way to monitor atom counts though. We want to be able to head this off and plan accordingly if atom counts get out of hand again.

Thanks,
-J
[Attachment: Screen Shot 2020-10-14 at 11.35.08 AM.png]

Luke Bakken

Oct 14, 2020, 5:14:22 PM
to rabbitmq-users
Hello JK -

You should never have to worry about the atom table. What CLI command(s) are you running on the 5 second interval? It is very important to know that.

For what it's worth, almost any health check that runs that frequently is likely to be detrimental.

Please compress your erl_crash.dump file and use the "reply privately to author" feature to send it to the RabbitMQ team.

Thanks,
Luke

Luke Bakken

Oct 14, 2020, 5:15:39 PM
to rabbitmq-users
If you'd like to monitor the atom table, here's a good article on how to do so:

JK

Oct 14, 2020, 7:09:57 PM
to rabbitmq-users
Hi Luke,

I just checked our standby cluster in DR that is running 3.8.7 and it also shows growing atom usage. Other than using a newer RabbitMQ, it's the same setup, just with next to no traffic currently going through it, so it's most definitely the health checker script causing atoms to fill up.

We scrape the output of rabbitmqctl cluster_status for the health check. At the time I set up that cluster it was the most responsive of the methods I found in the docs, though I'll likely revisit that in the near future now that we've got rabbitmq-diagnostics available. We have one other scraper that runs once a minute and also uses rabbitmqctl. It does a health check similar to the other one, but its main purpose is scraping metrics on a handful of critical queues that we're most interested in. Both those scripts were created for 3.6 and I haven't had a chance to update them yet. The latest scraper I had just set up (and have now disabled after seeing the cause of all the atoms filling up) uses rabbitmq-diagnostics memory_breakdown, but only once every two minutes.

Thanks for that link on monitoring atoms, that's exactly what I'm looking for: 
[root@rmq-test1-vm ~]# rabbitmqctl eval 'erlang:system_info(atom_count).'
176741
Like I mentioned above, I've turned off our mem breakdown scraper so I'll probably leave that alone for the time being to not add to the issue :P At least I know it's there if we need to check up on it.
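
A minimal sketch of how that one-liner could feed an alert, assuming the integer is the last line of the `rabbitmqctl eval` output as in the transcript above. `erlang:system_info(atom_limit)` is the companion call for the configured maximum. The function names and the linear projection are illustrative assumptions, not part of any RabbitMQ tool. Note that on releases affected by the issue discussed in this thread, every `rabbitmqctl` call adds to the atom table itself, so a check like this should run rarely.

```python
import subprocess

# Matches the "max=5000000" reported in the crash message in this thread.
ATOM_LIMIT_DEFAULT = 5_000_000


def parse_eval_int(output: str) -> int:
    """Return the last integer-only line of `rabbitmqctl eval` output."""
    for line in reversed(output.strip().splitlines()):
        if line.strip().isdigit():
            return int(line.strip())
    raise ValueError(f"no integer found in: {output!r}")


def read_system_info(item: str) -> int:
    """Query erlang:system_info(Item) on the local node, e.g. atom_count."""
    result = subprocess.run(
        ["rabbitmqctl", "eval", f"erlang:system_info({item})."],
        capture_output=True, text=True, check=True,
    )
    return parse_eval_int(result.stdout)


def atom_usage(count: int, limit: int = ATOM_LIMIT_DEFAULT) -> float:
    """Fraction of the atom table in use, between 0.0 and 1.0."""
    return count / limit


def days_until_full(count_then: int, count_now: int, days_between: float,
                    limit: int = ATOM_LIMIT_DEFAULT):
    """Linear projection from two samples; None if the table is not growing."""
    growth_per_day = (count_now - count_then) / days_between
    if growth_per_day <= 0:
        return None
    return (limit - count_now) / growth_per_day
```

For example, `atom_usage(read_system_info("atom_count"), read_system_info("atom_limit"))` could be compared against a threshold such as 0.8 once a day.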

Regarding sending the crash dump, I tried to reply to author but it says I don't have permission in this mailing list.

Thanks,
-J

Luke Bakken

Oct 14, 2020, 7:30:22 PM
to rabbitmq-users
Hello,

Thanks for all the information. I don't think we need the crash dump right now since we know what commands you're running. We can see if we can fix the atom table issue. Basically, when you start up a rabbitmqctl (or rabbitmq-diagnostics) command it generates a unique node name that is getting stored in the RabbitMQ server atom table due to how Erlang distribution works.
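
A toy model of the mechanism described above, with a Python set standing in for the server's atom table. The node name format and the host name are hypothetical placeholders, and the model assumes one new entry per invocation, which may not match what a real node records.

```python
import uuid


def invocations_per_day(interval_seconds: int) -> int:
    """How many times a check fires per day at a fixed interval."""
    return 86_400 // interval_seconds


def run_cli_invocations(atom_table: set, invocations: int,
                        host: str = "rabbit2-a") -> int:
    """Add one unique (hypothetical) CLI node name per invocation.

    Entries are only ever added, never removed, which mirrors the fact
    that atoms are not garbage collected.
    """
    for _ in range(invocations):
        node_name = f"rabbitmqcli-{uuid.uuid4().hex}@{host}"
        atom_table.add(node_name)
    return len(atom_table)
```

At a 5 second interval, `invocations_per_day(5)` gives 17280 new names per day from a single check, and the table size in this model grows by exactly that amount.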

You might consider switching to the HTTP API or Prometheus for getting these stats as it would bypass this issue - https://rawcdn.githack.com/rabbitmq/rabbitmq-management/v3.8.9/priv/www/api/index.html
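
A minimal sketch of polling the HTTP API instead of the CLI, using only the Python standard library. It assumes the management plugin is listening on its default port 15672, that `GET /api/nodes` returns a JSON list of objects with `name` and `running` fields, and that the credentials shown are placeholders to replace.

```python
import base64
import json
import urllib.request


def build_request(base_url: str, path: str, user: str,
                  password: str) -> urllib.request.Request:
    """Build a GET request with HTTP basic auth for the management API."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{base_url.rstrip('/')}{path}",
        headers={"Authorization": f"Basic {token}"},
    )


def fetch_json(request: urllib.request.Request, timeout: float = 5.0):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def stopped_nodes(nodes: list) -> list:
    """Names of nodes whose `running` field is missing or false."""
    return [n.get("name", "?") for n in nodes if not n.get("running", False)]
```

A health check could then call `stopped_nodes(fetch_json(build_request("http://localhost:15672", "/api/nodes", "monitor", "secret")))` and alert on a non-empty result. Because this goes over HTTP, no CLI node name is created.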

Thanks -
Luke

Michael Klishin

Oct 15, 2020, 4:40:30 PM
to rabbitmq-users
JK,

We have seen this problem in the past, circa 2015. IIRC there was no easy solution as CLI tools have to use a unique node name which is an atom.
`rabbitmq-diagnostics` will not be any different in this regard compared to other tools.

As much as we would like `rabbitmq-diagnostics` and friends to be useful in interactive node state exploration, they were not designed to be used for continuous monitoring
(even though --formatter=json support for many commands might give some this idea). You should consider using Prometheus [1] and Grafana for monitoring.
This option is superior in pretty much every way and a lot has improved since 3.6 in this area [2].

CLI tool output does not change often but it can, intentionally or not. Monitoring API endpoints are significantly less likely to change.
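
As a rough illustration of reading an atom metric from a Prometheus-format endpoint, here is a small parser for the text exposition format. The metric names `erlang_vm_atom_count` and `erlang_vm_atom_limit` are assumptions; check the actual scrape output of your node for the exact names. The parser also assumes samples carry no trailing timestamp.

```python
def parse_metric(exposition_text: str, metric_name: str):
    """Return the first sample value for metric_name, or None if absent.

    Comment lines (# HELP, # TYPE) and blank lines are skipped. Labels,
    if present, are ignored when matching the metric name.
    """
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name == metric_name:
            return float(value)
    return None


# Hypothetical scrape excerpt, for illustration only.
SAMPLE = """\
# TYPE erlang_vm_atom_count gauge
erlang_vm_atom_count 176741
# TYPE erlang_vm_atom_limit gauge
erlang_vm_atom_limit 5000000
"""
```

In a real setup Prometheus would scrape and store these values itself, and the alert would be an expression over the two series rather than a script.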

Michael Klishin

Oct 15, 2020, 4:43:21 PM
to rabbitmq-users
On a somewhat related note, calling any monitoring-oriented endpoint or command every two seconds seems really excessive to me. We
recommend at least every 30 seconds [1] because in practice monitoring data is displayed with a one minute precision or so.

Michael Klishin

Oct 15, 2020, 4:52:37 PM
to rabbitmq-users
This reminds me: back in 2015-2016, Mirantis contributed [1]. This was before the current generation of CLI tools
and perhaps this idea was lost. The context was exactly the same — using CLI tools for frequent monitoring — and
the best solution was to limit the pool of CLI tool names we try to a narrow-ish range [0, 100] of suffixes.

Without using unique names, two concurrent invocations of CLI tools would fail, which in case of monitoring would be very annoying
and produce false positives.
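
The bounded-pool idea can be sketched as follows. This is a toy model, not the actual CLI code: the name format and host are hypothetical, and the `in_use` set stands in for whatever mechanism detects that a name is already taken by a concurrent invocation.

```python
import random

# Suffixes 0..100 inclusive, following the [0, 100] range mentioned above.
POOL_SIZE = 101


def pick_cli_node_name(in_use: set, host: str = "rabbit2-a",
                       rng=random) -> str:
    """Try suffixes in random order and return the first free name.

    Because every name comes from a fixed pool, the server can never
    see more than POOL_SIZE distinct CLI node names per host.
    """
    suffixes = list(range(POOL_SIZE))
    rng.shuffle(suffixes)
    for suffix in suffixes:
        name = f"rabbitmqcli-{suffix}@{host}"
        if name not in in_use:
            return name
    raise RuntimeError("all CLI node names are currently in use")
```

Trying several names rather than a single fixed one is what keeps two concurrent invocations from colliding, while the fixed pool keeps the number of distinct atoms bounded.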

We will see if [1] has to be re-applied to the CLI tools we have today. However, I'd still question the use case given that both Prometheus/Grafana
support and even the management plugin that ships with RabbitMQ are much better options that sidestep the problem entirely.

JK

Oct 16, 2020, 7:07:27 PM
to rabbitmq-users
Hi Michael,

Thanks for all that background information. That makes sense; it sounds like we just happened to be lucky not to run into that limit before. I'll look into porting as much of our stuff as possible over to rabbitmqadmin, which I assume is safe since it uses HTTP instead. Is there any plan to move that into the package itself along with the CLI tools, rather than having to download it from a running instance or externally? I've managed to cobble together some horrific Puppet configs for grabbing it after RabbitMQ is running, so it's currently a non-issue, but I'm a minimalist and would love to remove that :)

I played around with the Grafana that ships with RabbitMQ on a test install before, back whenever you all released that feature. I can't remember why I skipped it in favor of our current setup, but I'll likely revisit that too. We do use Prometheus for scraping memory and queue metrics; the CLI stuff was built for our much older OpenNMS setup before I joined our team. Overall it just sounds like I need to do some housekeeping on our end.

Anyways, thanks for all the help! I think I can take it from here.

Cheers,
-J

Michael Klishin

Oct 20, 2020, 12:17:12 AM
to rabbitmq-users
`rabbitmqadmin` is a standalone zero-dependency Python tool that comes with a plugin. It is more likely to become 100% "untangled" from the plugin
than to become a part of RabbitMQ core because it requires the HTTP API, which is optional (Prometheus users may choose to not even enable the management plugin, for example).
It can be downloaded from GitHub [1] or any LAN resource you'd prefer, not just a running RabbitMQ node.

In some cases, rabbitmq-diagnostics with --formatter=json provides access to the same data the HTTP API does.

We'd prefer folks to use "real" monitoring tools (Prometheus-format based or not) rather than `rabbitmqadmin`, but it's not going away, of course,
and indeed it won't have the same atom table exhaustion issue that regular CLI tools do.

Michael Klishin

Oct 20, 2020, 12:18:35 AM
to rabbitmq-users
We are QA'ing [1] which should go out in 3.8.10. The solution is fairly trivial: limit the set of possible CLI node names.

Michael Klishin

Oct 20, 2020, 6:01:01 AM
to rabbitmq-users
[1] has been tested and merged, will ship in `3.8.10`.

KEN KLAUS

Oct 20, 2020, 6:02:30 AM
to rabbitm...@googlegroups.com
When will 3.8.10 get released? Thanks.

Luke Bakken

Oct 20, 2020, 7:59:08 AM
to rabbitmq-users
Hi Ken,

The RabbitMQ team does not make release ETA promises. There will be release candidates available that will be announced on this mailing list. Please test one of them in your environment.

Thanks,
Luke

JK

Oct 20, 2020, 8:37:49 PM
to rabbitmq-users
Wow! That was super quick. Thanks for the info about rabbitmqadmin. I look forward to 3.8.10!

Thanks,
-J
