RabbitMQ consumers not able to connect

cla...@fautsch.com

Sep 29, 2014, 7:27:00 AM

to rabbitm...@googlegroups.com

Hi all,

In our production system we use a set of RabbitMQ servers deployed on AWS cloud servers. They are distributed across different geographic regions.

Last Friday we had a weird problem for which we cannot find the root cause: 4 (out of 5) RabbitMQ servers in the Asian region suddenly stopped accepting connections from consumers. We tried AMQP consumers from various sites, including ones running on localhost, but none could connect and consume any messages.

Using the rabbitmqadmin.py script and its "get queue=..." command, we were finally able to consume the messages from the queue. Once the queues were empty and we restarted the RabbitMQ servers, everything was fine again.

There were no connection problems between the hosts running the RabbitMQ servers and the hosts running the consumers. Even a telnet to port 5672 worked perfectly fine.

The weird part is that it stopped working on all 4 RabbitMQ servers at around the same time.

The RabbitMQ log shows the following:

=ERROR REPORT==== 26-Sep-2014::22:10:27 ===
    application: mochiweb
    "Accept failed error"
    "{error,enfile}"

=ERROR REPORT==== 26-Sep-2014::22:10:27 ===
{mochiweb_socket_server,295,{acceptor_error,{error,accept_failed}}}

=ERROR REPORT==== 26-Sep-2014::22:10:28 ===
** Generic server <0.5962.2319> terminating
** Last message in was {inet_async,#Port<0.14742>,45504,{error,enfile}}
** When Server state == {state,{rabbit_networking,start_client,[]},
                               #Port<0.14742>,45504}
** Reason for termination ==
** {accept_failed,enfile}

The SASL log around the same timeframe shows:

=SUPERVISOR REPORT==== 26-Sep-2014::22:06:53 ===
     Supervisor: {<0.32388.2318>,
                                           amqp_channel_sup_sup}
     Context:    shutdown_error
     Reason:     shutdown
     Offender:   [{nb_children,1},
                  {name,channel_sup},
                  {mfargs,
                      {amqp_channel_sup,start_link,[direct,<0.32387.2318>]}},
                  {restart_type,temporary},
                  {shutdown,brutal_kill},
                  {child_type,supervisor}]

=CRASH REPORT==== 26-Sep-2014::22:07:26 ===
crasher:
    initial call: mochiweb_acceptor:init/3
    pid: <0.1448.2319>
    registered_name: []
    exception exit: {error,accept_failed}
      in function mochiweb_acceptor:init/3
    ancestors: [rabbit_web_dispatch_sup_15672,rabbit_web_dispatch_sup,
                  <0.150.0>]
    messages: []
    links: [<0.303.0>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 377
    stack_size: 24
    reductions: 229
neighbours:

Thanks in advance for any ideas,

Regards
Claire

Simon MacMullen

Sep 29, 2014, 7:41:27 AM

to cla...@fautsch.com, rabbitm...@googlegroups.com

On 29/09/14 12:27, cla...@fautsch.com wrote:
> Hi all,

Hi.

> Last Friday, we had a weird problem to which we cannot find the root
> cause: 4 (out of 5) RabbitMQ servers in the Asian region, all of a
> sudden did not accept connections from consumers anymore. We tried AMQP
> consumers from various sites, including running on localhost, but none
> could connect and consume any messages.

Because the connection was refused? The AMQP handshake failed? Something
else?

> Using rabbitmqadmin.py script, and the "get queue=..." command, we could
> finally consume the messages from the queue. Once the queues were empty
> and we restarted the RabbitMQ servers, everything was fine again.
>
> There were no connection problems between the hosts running the RabbitMQ
> servers, and the hosts running the consumers. Even a telnet to the 5672
> port worked perfectly fine.
>
> The weird part is, that it stopped working on all 4 RabbitMQ servers
> around the same time.
>
> The RabbitMQ log shows the following
>
> =ERROR REPORT==== 26-Sep-2014::22:10:27 ===
> application: mochiweb
> "Accept failed error"
> "{error,enfile}"

So that shows mochiweb (i.e. HTTP) failing to accept a connection due to
ENFILE. The usual meaning of that is that a system-wide FD limit was
reached.

I would assume that the AMQP acceptor also ran into the same limit,
although your posted log doesn't show that.

Note that this is not the usual per-process limit; RabbitMQ monitors
that and will start refusing to accept connections before hitting it.
The system-wide limit is a bit more mysterious though; it's very rare to
be able to hit it. Have you configured a very high per-process FD limit?
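
To see how close a Linux host is to the system-wide limit, one option is to read /proc/sys/fs/file-nr, which holds three numbers: allocated file handles, allocated-but-unused handles, and the system-wide maximum. The following is only a minimal Python sketch under that assumption (Linux with /proc mounted); the function names are made up for illustration.

```python
from pathlib import Path


def parse_file_nr(text):
    """Parse the three whitespace-separated fields of /proc/sys/fs/file-nr."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return {"allocated": allocated, "unused": unused, "max": maximum}


def system_fd_usage(path="/proc/sys/fs/file-nr"):
    """Read and parse the system-wide file handle counters (Linux only)."""
    return parse_file_nr(Path(path).read_text())


if __name__ == "__main__":
    # Only attempt the read where the proc file actually exists.
    if Path("/proc/sys/fs/file-nr").exists():
        usage = system_fd_usage()
        in_use = usage["allocated"] - usage["unused"]
        print(f"file handles in use: {in_use} of {usage['max']}")
    else:
        print("/proc/sys/fs/file-nr not available on this host")
```

Sampling this value regularly on every broker host would show whether the system-wide table really fills up at the time ENFILE appears in the log.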

If you can post full logs there might be some more I could say.

Cheers, Simon

Michael Klishin

Sep 29, 2014, 7:50:53 AM

to cla...@fautsch.com, rabbitm...@googlegroups.com

On 29 September 2014 at 15:27:07, cla...@fautsch.com (cla...@fautsch.com) wrote:
> =ERROR REPORT==== 26-Sep-2014::22:10:27 ===
> application: mochiweb
> "Accept failed error"
> "{error,enfile}"
>
> =ERROR REPORT==== 26-Sep-2014::22:10:27 ===
> {mochiweb_socket_server,295,{acceptor_error,{error,accept_failed}}}

The RabbitMQ process ran out of available file descriptors and can't accept
any more connections.

For AMQP 0-9-1, STOMP and MQTT this condition is monitored, but WebSTOMP
includes a 3rd-party HTTP server, which is not covered.

http://erlang.org/pipermail/erlang-questions/2012-January/063573.html

See http://docs.basho.com/riak/1.2.1/cookbooks/Open-Files-Limit/#Linux
for how to adjust the limit system-wide.
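
Before adjusting anything, it can help to confirm which per-process limits are actually in effect. Below is a minimal Python sketch, assuming a Unix-like host. Note that it reports the limits of the Python process itself, which are not necessarily those of the RabbitMQ process; it is meant for a shell started the same way as the broker. The helper names are illustrative only.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def format_limit(value):
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print("soft:", format_limit(soft), "hard:", format_limit(hard))
```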
--
MK

Staff Software Engineer, Pivotal/RabbitMQ

cla...@fautsch.com

Sep 30, 2014, 2:41:43 AM

to rabbitm...@googlegroups.com, cla...@fautsch.com

Hi Simon and Michael,

Thanks for your feedback. Please find attached the logs for the given timeframe.

For each of the impacted hosts, we have a system-wide FD limit of around 750k, a hard limit for RabbitMQ of 4096 and a soft limit for RabbitMQ of 1024. Except for some monitoring processes, those servers are mainly reserved for our RabbitMQ servers.

Thanks to our monitoring, we were able to check the FDs used during the last days on those servers. They are more or less stable, always well below 1500.

On the remaining (non-impacted) servers, settings and FD curves are similar.
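
For anyone wanting to cross-check figures like these against a running process, here is a minimal Python sketch, assuming Linux with /proc mounted. It reads the "Max open files" row of /proc/<pid>/limits and counts the entries in /proc/<pid>/fd. The function names are hypothetical, and the pid has to be supplied by the caller (the example below simply uses the script's own pid).

```python
import os
from pathlib import Path


def parse_max_open_files(limits_text):
    """Extract (soft, hard) from the 'Max open files' row of /proc/<pid>/limits.

    Each value is an int, or None where the kernel reports 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Row layout: Max open files <soft> <hard> files
            soft, hard = line.split()[3:5]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max open files' row found")


def count_open_fds(pid):
    """Count entries in /proc/<pid>/fd (requires permission to read it)."""
    return sum(1 for _ in Path(f"/proc/{pid}/fd").iterdir())


if __name__ == "__main__":
    pid = os.getpid()  # replace with the broker's pid when checking RabbitMQ
    limits_file = Path(f"/proc/{pid}/limits")
    if limits_file.exists():
        soft, hard = parse_max_open_files(limits_file.read_text())
        print(f"pid {pid}: {count_open_fds(pid)} fds open, soft={soft}, hard={hard}")
```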

Cheers,
Claire

logs.tar.gz

cla...@fautsch.com

Oct 28, 2014, 4:58:36 AM

to rabbitm...@googlegroups.com, cla...@fautsch.com

Hi

Unfortunately, we had exactly the same problem again last week, so I wanted to ask whether anyone has additional ideas based on the provided logs.

From the general monitoring of our systems we do not see a problem with too many file descriptors.

It happened (again) more or less at the same time on three RabbitMQ servers distributed in 2 AWS datacenters.

Connections from publishers, consumers and shovels are affected (the consumers usually tend to drop out first, but that may be due to our general setup and network latency).

Thanks in advance

Claire

Simon MacMullen

Oct 28, 2014, 5:54:54 AM

to cla...@fautsch.com, rabbitm...@googlegroups.com

On 28/10/2014 08:58, cla...@fautsch.com wrote:
> As unfortunately we had exactly the same problem again last week, I
> just wanted to ask in the round if someone has any additional ideas,
> based on the provided logs.
>
> From the general monitoring of our systems we do not see a problem with
> too many file descriptors.

It's not obvious what else ENFILE might mean. That's an error that comes
up from the kernel and is passed through Erlang and RabbitMQ untouched.
It means you hit a system file descriptor limit. Honestly.
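
For reference, the per-process and system-wide conditions surface as two different errno values, EMFILE and ENFILE. A small Python sketch of how a client or test harness might tell them apart; the function name and the wording of the messages are only illustrative:

```python
import errno
import os


def describe_fd_error(err):
    """Classify an OSError from accept()/open() by which FD limit was hit."""
    if err.errno == errno.EMFILE:
        return "per-process limit reached (EMFILE)"
    if err.errno == errno.ENFILE:
        return "system-wide file table full (ENFILE)"
    return "unrelated error: " + errno.errorcode.get(err.errno, "unknown")


if __name__ == "__main__":
    # Construct the errors by hand instead of actually exhausting descriptors.
    for code in (errno.EMFILE, errno.ENFILE):
        print(describe_fd_error(OSError(code, os.strerror(code))))
```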

Cheers, Simon

cla...@fautsch.com

Oct 28, 2014, 9:54:29 AM

to rabbitm...@googlegroups.com, cla...@fautsch.com

Thanks a lot for your feedback anyway, Simon!

Regards

Claire
