Hello Diana,
We monitor via CLI commands and extract the information we need for various purposes; nothing uses HTTP except a weekly broker definition backup script (a rough sketch follows the plugin list below). We use the following plugins:
[root@rabbit2-a ~]# rabbitmq-plugins list | grep -i '\[e'
[E*] rabbitmq_auth_backend_cache 3.7.24
[E*] rabbitmq_auth_backend_ldap 3.7.24
[E*] rabbitmq_management 3.7.24
[e*] rabbitmq_management_agent 3.7.24
[E*] rabbitmq_top 3.7.24
[e*] rabbitmq_web_dispatch 3.7.24
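For what it's worth, that backup script is essentially just a definitions export over the management plugin's HTTP API, along these lines (the credentials and backup path here are placeholders, not our real values):

#!/bin/sh
# Weekly cron job: export broker definitions (vhosts, users, queues,
# exchanges, bindings, policies) via the management plugin's HTTP API.
curl -fsS -u backup_user:backup_pass \
  http://rabbit2-a:15672/api/definitions \
  -o /var/backups/rabbitmq/definitions-$(date +%Y%m%d).json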
The output of report is quite large and contains too much sensitive info in queue names, so I'd rather not share it over the mailing list.
I ran cdv and I think you are on the right track asking about our CLI monitoring, since it looks like the atom table filled up with all the temporary CLI client node names. We have a health check script that runs rabbitmqctl very frequently (5s intervals). We can't really reduce that frequency; it took months of tuning to arrive at that value in our prod setup. If the only answer is to reduce the frequency of our health checks, then I might be able to revisit that, but it will be tight and still very frequent. We might be able to get away with 10s.
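To give a sense of the shape of it (this is a stripped-down sketch, not our actual script), the loop is basically:

#!/bin/sh
# Simplified sketch of the health check loop. Each rabbitmqctl invocation
# starts a short-lived CLI Erlang node with a unique node name, and each of
# those names ends up as a new atom in the broker's atom table.
while true; do
  if ! rabbitmqctl -q status > /dev/null 2>&1; then
    logger -t rabbitmq-healthcheck "node unhealthy"
  fi
  sleep 5
done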
I'm not going to get into why we need it that responsive, but I would like to point out that we used that same health check on 3.6.16 up until this year and never ran into this problem with atoms. We did have other issues there, with what looked to be memory leaks in binaries that would block publishes until the node was rebooted, even with very little queue depth. We did uptime tests, but ended up doing regular quarterly maintenance that involved rebooting each node, so we never got to see nodes running for more than about 3 months.

We upgraded to 3.7.24 back in April, and that seems to have resolved the leaky binaries. Since the upgrade we've had more frequent maintenance windows, so we never had a chance to see nodes running as long as they did before, but things settled down and we were back to our quarterly schedule. This last uptime was only 70 days before the atom table filled and crashed every node one after the other, and the frequency of health checks has been constant for years now. Before the upgrade it was >90 days of uptime. I was never concerned about atoms before, though; their footprint was/is negligible, which I saw firsthand while tracking down that previous memory issue with binaries.
That said, I don't expect or want a 3.7.x fix. Our next maintenance on this cluster will be the 3.8.x upgrade I've been planning all year :) I am still interested in a way to monitor atom counts, though. We want to be able to head this off and plan accordingly if atom counts get out of hand again.
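In the meantime, this seems to do the job for polling (erlang:system_info/1 has supported atom_count and atom_limit since OTP 20, and the default limit is 1,048,576):

# Current and maximum atom table sizes on the broker's own VM.
rabbitmqctl eval '{erlang:system_info(atom_count), erlang:system_info(atom_limit)}.'

That's easy enough to wrap in cron and alert on the ratio well before the count gets anywhere near the limit.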
Thanks,
-J