Rabbit eventually unresponsive. Status reports node down but processes are still running.

Henry Chan

May 15, 2017, 6:03:30 PM
to rabbitmq-users
Environment:
RabbitMQ 3.6.9
Erlang 18.3

Consumers:
Celery 3.1.25

Overview
The problem that I'm running into is that RabbitMQ still appears to be running but `rabbitmqctl status` reports the node as down.

Status of node 'rabbit@rabbit-recipients-loadtest-terraform-0983da6d2dcfd7b9f' ...
Error: unable to connect to node 'rabbit@rabbit-recipients-loadtest-terraform-0983da6d2dcfd7b9f': nodedown

DIAGNOSTICS
===========

attempted to contact: ['rabbit@rabbit-recipients-loadtest-terraform-0983da6d2dcfd7b9f']

rabbit@rabbit-recipients-loadtest-terraform-0983da6d2dcfd7b9f:
  * connected to epmd (port 4369) on rabbit-recipients-loadtest-terraform-0983da6d2dcfd7b9f
  * epmd reports node 'rabbit' running on port 25672
  * TCP connection succeeded but Erlang distribution failed
  * suggestion: hostname mismatch?
  * suggestion: is the cookie set correctly?
  * suggestion: is the Erlang distribution using TLS?

current node details:
- node name: 'rabbitmq-cli-38@rabbit-recipients-loadtest-terraform-0983da6d2dcfd7b9f'
- home dir: /var/lib/rabbitmq
- cookie hash: /dMRsB1K+iEInKSxqM7X/A==


The installation of Rabbit that I have is pretty fresh (in terms of setup/config, not much has been done). To produce this state, I start Rabbit, bring up all my workers, and that's it. I've noticed that the issue happens when I bring up a large number of workers at once. Initially I thought this was an issue with max open file descriptors, but I've since raised the limit to 65536 and I still see the issue.

The weird thing is that if I do nothing with my workers and just restart Rabbit, sometimes I'm good to go: I can start pushing to the queue and my consumers will do all the things. The problem seems to happen as I bring up (and, weirdly enough, once when I brought down) all my workers.

Things I've checked and other notable things:
Checking htop, I see that the processes are all still running and that barely any of the system's resources are in use during this time. The box is doing nothing.
Processes are still listening on both the AMQP port, 5672, and the management port, 15672, but are unresponsive.
Rabbit is not clustered in this instance.
Nothing stood out in either the regular or the SASL logs. No errors.
I had `rabbitmqctl status` running on cron and reporting status every 30 seconds, but there were no trends of note.

Any idea what might be causing this?

Michael Klishin

May 15, 2017, 6:33:48 PM
to rabbitm...@googlegroups.com
This output does not necessarily mean that the node is down: `rabbitmqctl` may simply have failed
to authenticate with the node. If you read the details, there are hints provided:

>   * connected to epmd (port 4369) on rabbit-recipients-loadtest-terraform-0983da6d2dcfd7b9f

>   * epmd reports node 'rabbit' running on port 25672

>   * TCP connection succeeded but Erlang distribution failed

The last line is key here.

See "How Nodes and CLI Tools Authenticate to Each Other" on http://rabbitmq.com/clustering.html and the
archives of this list. After correcting the cookie for your effective user, see the server logs and memory usage breakdown.

It is impossible to suggest anything else with the amount of information provided.

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Henry Chan

May 15, 2017, 10:44:38 PM
to rabbitmq-users
Is there information somewhere that I could get for you that could help me troubleshoot this?

My instance of rabbit isn't clustered. Also, it works through ports 5672 and 15672 up until I bring up a large number of workers. Before rabbit becomes unresponsive, I see consumers registered with it in the management UI, and work, if there is any, is pulled off the queue. What could cause rabbit to stop responding on those particular ports?

Michael Klishin

May 16, 2017, 4:34:03 AM
to rabbitm...@googlegroups.com, Henry Chan
Let's start with being more specific than "not respond".

You can monitor all kinds of things from memory usage breakdown to the number of file handles
used by the node to the number of TCP connections and their state
while increasing the number of "workers" gradually and see what can be correlated.

All of that information is easily available via RabbitMQ HTTP API, rabbitmqctl and/or logs.
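
As one way to collect those numbers, here is a minimal Python sketch that polls the management HTTP API's `/api/nodes` endpoint and prints file descriptor, socket, memory, and Erlang process usage per node. The base URL, the `guest`/`guest` credentials, and the 30-second interval are assumptions for illustration, and the helper names are hypothetical.

```python
import base64
import json
import time
import urllib.request


def fetch_nodes(base_url="http://localhost:15672", user="guest",
                password="guest", timeout=10):
    """Fetch node stats from the management HTTP API (GET /api/nodes)."""
    request = urllib.request.Request(base_url + "/api/nodes")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def summarize_node(node):
    """Keep only the counters worth correlating with the number of workers."""
    keys = ("fd_used", "fd_total", "sockets_used", "sockets_total",
            "mem_used", "mem_limit", "proc_used", "proc_total", "run_queue")
    return {key: node.get(key) for key in keys}


def poll(interval=30):
    """Print one summary line per node every `interval` seconds."""
    while True:
        for node in fetch_nodes():
            print(time.strftime("%H:%M:%S"), node.get("name"),
                  summarize_node(node))
        time.sleep(interval)
```

Calling `poll()` while adding workers gradually gives a time series to correlate against. If the management listener itself stops answering, `fetch_nodes` raises after the timeout, which is a data point in its own right.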

Michael Klishin

May 16, 2017, 4:47:54 AM
to rabbitm...@googlegroups.com, Henry Chan
Also, how many is "many"? Contrary to popular belief, TCP connections aren't free,
for example. I doubt you have enough of them to consume a really large amount of RAM
in a development environment, but you certainly can run out of file descriptors,
as the default on most commonly used OSes is very low (1024).

When a node is out of file descriptors it cannot accept connections or open files,
which means it cannot do much in general.




Henry Chan

May 16, 2017, 8:08:42 AM
to rabbitmq-users, chan.h...@gmail.com
By "not responsive" I mean that I can make a connection to the host and port, but it hangs. When curling the destination host and port with verbose output, curl shows that a connection is made, but it doesn't go any further than that. When navigating to the management UI on port 15672, the page just hangs and waits the whole time. I also see this behavior when starting up a new worker: it hangs when attempting to connect to the broker.
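
The "connects but hangs" symptom can be told apart from a refused connection with a small probe. The sketch below opens a TCP connection, sends the 8-byte AMQP 0-9-1 protocol header, and waits for any reply. The function name, the returned labels, and the 5-second timeout are hypothetical choices, not part of any RabbitMQ tool.

```python
import socket

# Protocol header that an AMQP 0-9-1 client sends right after connecting.
AMQP_0_9_1_HEADER = b"AMQP\x00\x00\x09\x01"


def probe_amqp(host, port=5672, timeout=5.0):
    """Classify the listener as 'connect-timeout', 'refused', 'silent',
    'closed', or 'responded'."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        return "connect-timeout"
    except OSError:
        return "refused"
    with sock:
        sock.settimeout(timeout)
        try:
            sock.sendall(AMQP_0_9_1_HEADER)
            data = sock.recv(8)
        except socket.timeout:
            # TCP handshake succeeded, but the broker never answered.
            return "silent"
        except OSError:
            return "closed"
    return "responded" if data else "closed"
```

A result of `silent` matches the behavior described here: the operating system accepts the connection, but the broker never sends its first frame.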

By many, I mean anywhere from 300 to 1000. I know that rabbit can handle way more than that because we're running 3.3.0 in production, but right now I'm load testing a datastore and at the same time checking out 3.6.9 for production use. Before the symptoms occur, workers can connect, I'm able to deliver to rabbit, and the workers are able to consume.

Initially I thought it was file descriptors (and eventually it would've been), so I have the limit set to 65536. I then saw the issue only intermittently when going up to 300 workers or so, but with 640 I'm seeing it all the time. One weird thing is that when rabbit gets into this state, a restart seems to put it back in a good state, and it stays there as far as I can tell, so this behavior seems to happen mostly when starting up all my workers.

Not sure if it matters, but Celery has some weird behavior on startup: it appears to use double the connections initially and then settles down, I believe.

When rabbit is exhibiting these symptoms, it also appears that no resources are being used. I had thought that maybe it peaked on something and then died down, so I set a script to run every 30 seconds, writing the output of `rabbitmqctl status` to a file to see if anything was topping out, but everything looks normal. I'm nowhere near topping out on anything.
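
A periodic status capture like the one described can be sketched in Python as follows. Unlike plain cron output, it records how long each call took and applies an explicit timeout, so a hung `rabbitmqctl status` leaves a visible trace in the log. The log path, the 20-second timeout, and the function names are assumptions for illustration.

```python
import subprocess
import time


def capture_status(command=("rabbitmqctl", "status"), timeout=20):
    """Run the command once; return (elapsed_seconds, combined output text)."""
    started = time.monotonic()
    try:
        result = subprocess.run(list(command), stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT,
                                universal_newlines=True, timeout=timeout)
        text = result.stdout
    except subprocess.TimeoutExpired:
        text = "TIMEOUT after {} seconds\n".format(timeout)
    return time.monotonic() - started, text


def log_forever(path="rabbit-status.log", interval=30):
    """Append a timestamped status capture to `path` every `interval` seconds."""
    while True:
        elapsed, text = capture_status()
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with open(path, "a") as log:
            log.write("=== {} (took {:.1f}s)\n{}\n".format(stamp, elapsed, text))
        time.sleep(interval)
```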


Michael Klishin

May 16, 2017, 9:05:46 AM
to rabbitm...@googlegroups.com, Henry Chan
So your clients cannot connect. File descriptor monitoring and a Wireshark/tcpdump dump
are essential when investigating these. If there's an intermediary (a load balancer, a proxy) involved,
the same has to be done for that service (they also can have their own limits).

Like I said, the default on Linux and macOS is a mere 1024, and that's for file handles
and sockets *combined*. It's an OS-level limit that you can adjust; please see the docs.
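
Short of a full tcpdump capture, the state of TCP connections on the AMQP port can be summarized on Linux by reading `/proc/net/tcp` and `/proc/net/tcp6`. This is a minimal sketch under that assumption; the function names are hypothetical, and the default port 5672 is the AMQP port mentioned in this thread.

```python
from collections import Counter

# Kernel state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(proc_net_text, local_port=5672):
    """Count TCP states for sockets whose local port equals `local_port`."""
    counts = Counter()
    for line in proc_net_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4 or ":" not in fields[1]:
            continue
        port_hex = fields[1].rsplit(":", 1)[1]
        if int(port_hex, 16) == local_port:
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts


def count_states_on_host(local_port=5672):
    """Sum the counts over the IPv4 and IPv6 tables, skipping unreadable ones."""
    total = Counter()
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as handle:
                total += count_states(handle.read(), local_port)
        except OSError:
            pass
    return total
```

Running `count_states_on_host()` on the broker host at intervals while workers start up shows whether the number of connections in a particular state grows as the node stops responding.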



Henry Chan

May 16, 2017, 10:09:36 AM
to rabbitmq-users, chan.h...@gmail.com
I'll poke around some more and monitor network traffic and file descriptors and see if I can find anything.

I had previously set the file descriptor limit for the service from 1024 to 65536 and I was still seeing this issue. For some transparency, the box is using systemd and I have a file here `/etc/systemd/system/rabbitmq-server.service.d/limits.conf` with the following contents:

[Service]
LimitNOFILE=65536


I'm assuming it did the trick because the rabbit management UI reported the file descriptor limit as 65536.
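
Another way to confirm that a systemd override took effect is to read the limit the kernel actually applies to the running process from `/proc/<pid>/limits` (Linux only). The helper names below are hypothetical, and reading another user's `/proc` entry may require running as root or as the rabbitmq user.

```python
import re


def parse_max_open_files(limits_text):
    """Return (soft, hard) for 'Max open files' from /proc/<pid>/limits text."""
    match = re.search(r"^Max open files\s+(\S+)\s+(\S+)", limits_text,
                      re.MULTILINE)
    if match is None:
        raise ValueError("no 'Max open files' line found")
    # Values are integers or the literal word "unlimited".
    return tuple(int(v) if v.isdigit() else v for v in match.groups())


def max_open_files_for_pid(pid):
    """Read the effective open-file limits of a running process."""
    with open("/proc/{}/limits".format(pid)) as handle:
        return parse_max_open_files(handle.read())
```

The pid can be taken from the `{pid,...}` entry that `rabbitmqctl status` prints while the node is still responsive.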

I also had a `rabbitmqctl status` running every 30 seconds and had logged this output within 30 seconds of rabbit becoming unresponsive, not sure if it'll be helpful:

 
[{pid,21022},
 {running_applications,
     [{rabbitmq_management,"RabbitMQ Management Console","3.6.9"},
      {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.6.9"},
      {rabbitmq_management_agent,"RabbitMQ Management Agent","3.6.9"},
      {rabbit,"RabbitMQ","3.6.9"},
      {amqp_client,"RabbitMQ AMQP Client","3.6.9"},
      {rabbit_common,
          "Modules shared by rabbitmq-server and rabbitmq-erlang-client",
          "3.6.9"},
      {compiler,"ERTS  CXC 138 10","6.0.3"},
      {cowboy,"Small, fast, modular HTTP server.","1.0.4"},
      {ranch,"Socket acceptor pool for TCP protocols.","1.3.0"},
      {syntax_tools,"Syntax tools","1.7"},
      {xmerl,"XML parser","1.3.10"},
      {os_mon,"CPO  CXC 138 46","2.4"},
      {ssl,"Erlang/OTP SSL application","7.3"},
      {public_key,"Public key infrastructure","1.1.1"},
      {cowlib,"Support library for manipulating Web protocols.","1.0.2"},
      {crypto,"CRYPTO","3.6.3"},
      {inets,"INETS  CXC 138 49","6.2"},
      {asn1,"The Erlang ASN1 compiler version 4.0.2","4.0.2"},
      {mnesia,"MNESIA  CXC 138 12","4.13.3"},
      {sasl,"SASL  CXC 138 11","2.7"},
      {stdlib,"ERTS  CXC 138 10","2.8"},
      {kernel,"ERTS  CXC 138 10","4.2"}]},
 {os,{unix,linux}},
 {erlang_version,
     "Erlang/OTP 18 [erts-7.3] [source] [64-bit] [smp:8:8] [async-threads:128] [kernel-poll:true]\n"},
 {memory,
     [{total,569057616},
      {connection_readers,19892824},
      {connection_writers,4254904},
      {connection_channels,77150904},
      {connection_other,48268552},
      {queue_procs,19604120},
      {queue_slave_procs,0},
      {plugins,24002912},
      {other_proc,27588360},
      {mnesia,816032},
      {metrics,5414120},
      {mgmt_db,11644328},
      {msg_index,414720},
      {other_ets,9574584},
      {binary,279442856},
      {code,27545003},
      {atom,1000601},
      {other_system,17854204}]},
 {alarms,[]},
 {listeners,[{clustering,25672,"::"},{amqp,5672,"::"},{http,15672,"::"}]},
 {vm_memory_high_watermark,0.4},
 {vm_memory_limit,6307482828},
 {disk_free_limit,50000000},
 {disk_free,5892489216},
 {file_descriptors,
     [{total_limit,65436},
      {total_used,835},
      {sockets_limit,58890},
      {sockets_used,833}]},
 {processes,[{limit,1048576},{used,10319}]},
 {run_queue,0},
 {uptime,8107},
 {kernel,{net_ticktime,60}}]
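
As a quick check on the figures in this status output, the short Python sketch below computes how much of each limit was in use. The numbers are copied from the output; the grouping and names are only for illustration.

```python
# (used, limit) pairs copied from the status output above.
status = {
    "file descriptors": (835, 65436),
    "sockets": (833, 58890),
    "Erlang processes": (10319, 1048576),
    "memory (total vs vm_memory_limit)": (569057616, 6307482828),
}


def percent_used(used, limit):
    """Share of a limit that is in use, as a percentage."""
    return 100.0 * used / limit


for name, (used, limit) in status.items():
    print("{:<36} {:>6.2f}% used".format(name, percent_used(used, limit)))
```

Every ratio comes out below 10%, which agrees with the observation that nothing was close to a limit within 30 seconds of the node becoming unresponsive.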

Also, according to netstat, rabbit is still listening on the clustering, management, and AMQP ports. I'm not sure that this question is appropriate for this forum, but when a node hits the file descriptor limit, is the expected behavior that it just sits there and does nothing, without shutting anything down? I'll do some testing as well to verify whether the behavior is similar, which may give me some insight into which direction to head.
 

Michael Klishin

May 16, 2017, 10:14:10 AM
to rabbitm...@googlegroups.com, Henry Chan
Then there is no way around tcpdump and full logs from all nodes (including proxies and such)
to understand why client connections are not accepted.

Henry Chan

May 16, 2017, 1:37:12 PM
to rabbitmq-users, chan.h...@gmail.com
I started testing with a fresh box to see if it was an issue with the image we created, and decided to try different versions of Erlang and then RabbitMQ. Using Erlang 19.3 instead of 18.3 on Ubuntu Xenial looks to solve the issue.

ga...@wordfence.com

May 23, 2017, 1:39:35 PM
to rabbitmq-users, chan.h...@gmail.com
We recently experienced an identical issue to Henry's reported in this thread. We were using RabbitMQ 3.6.9 from the official repos (https://www.rabbitmq.com/install-debian.html) with Erlang 18.3.4-1 provided by Ubuntu Xenial (current at time of writing). We found that under even modest load (> 24 consumers, ~20 items in queue, rates below 1.0) Rabbit would become completely unresponsive on any/all interfaces. Upgrading to Erlang 19.3 from Erlang Solutions completely resolved the problem for us also. Thanks very much to Henry for pointing us in the right direction.
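
For anyone checking their own node against these reports, the OTP major release can be read from the `erlang_version` entry of `rabbitmqctl status`. The sketch below parses that string; the threshold of 19 reflects only what was reported in this thread (19.3 worked, 18.3 did not) and is an assumption, not an official compatibility statement. The function names are hypothetical.

```python
import re


def otp_major(erlang_version):
    """Extract the OTP major release from 'Erlang/OTP 18 [erts-7.3] ...'."""
    match = re.search(r"Erlang/OTP (\d+)", erlang_version)
    if match is None:
        raise ValueError("unrecognized erlang_version string")
    return int(match.group(1))


def predates_reported_fix(erlang_version, fixed_major=19):
    """True if the node runs an OTP release older than the one reported to help.

    fixed_major=19 is based solely on the reports in this thread.
    """
    return otp_major(erlang_version) < fixed_major
```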

Sean Nolan

Jul 14, 2017, 5:51:13 PM
to rabbitmq-users, chan.h...@gmail.com
+1 solved the problem for us too