Hi all,
In our production system we use a set of RabbitMQ servers running on AWS, distributed across several geographic regions.
Last Friday we hit a weird problem whose root cause we have not been able to find: 4 of the 5 RabbitMQ servers in the Asian region suddenly stopped accepting connections from consumers. We tried AMQP consumers from various sites, including one running on localhost, but none of them could connect and consume any messages.
Using the rabbitmqadmin.py script and its "get queue=..." command, we were eventually able to consume the messages from the queues. Once the queues were empty and we had restarted the RabbitMQ servers, everything was fine again.
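rabbitmqadmin.py is a client of the management HTTP API (the listener on port 15672 that also appears in the logs below). As a rough sketch of what a "get" boils down to, the snippet below only builds the POST request to /api/queues/{vhost}/{queue}/get; it does not send it. The host, queue name and guest/guest credentials are placeholders, and the "ackmode" field is an assumption tied to RabbitMQ 3.7 and later (older releases used a boolean "requeue" field instead), so check the API docs for your version.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_get_request(host, vhost, queue, count=10,
                      user="guest", password="guest", port=15672):
    """Build (but do not send) a management HTTP API request that fetches
    up to `count` messages from `queue` and acknowledges them, which
    removes them from the queue."""
    # vhost and queue are path segments, so the default vhost "/" must be
    # percent-encoded as %2F.
    path = "/api/queues/{}/{}/get".format(
        urllib.parse.quote(vhost, safe=""),
        urllib.parse.quote(queue, safe=""),
    )
    body = {
        "count": count,
        # Assumption: field name used by RabbitMQ 3.7+; older releases
        # expected "requeue": false instead.
        "ackmode": "ack_requeue_false",
        "encoding": "auto",
    }
    request = urllib.request.Request(
        "http://{}:{}{}".format(host, port, path),
        data=json.dumps(body).encode("utf-8"),
        method="POST",
    )
    # HTTP basic auth with placeholder credentials.
    credentials = "{}:{}".format(user, password).encode("utf-8")
    request.add_header("Authorization",
                       "Basic " + base64.b64encode(credentials).decode("ascii"))
    request.add_header("Content-Type", "application/json")
    return request
```

Passing the returned request to urllib.request.urlopen sends it; the response body is a JSON list of the fetched messages.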
There were no connectivity problems between the hosts running the RabbitMQ servers and the hosts running the consumers; even telnet to port 5672 worked perfectly fine.
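A successful telnet may only show that the kernel completed the TCP handshake: connections can sit in the listen backlog even when the application never manages to accept them. A minimal probe that goes one step further sends the AMQP 0-9-1 protocol header and checks whether the broker answers with a frame (normally Connection.Start, a method frame of type 1). This is a sketch, not a replacement for a real client; the function names are illustrative, and a refused connection simply raises ConnectionRefusedError.

```python
import socket

# Protocol header an AMQP 0-9-1 client sends first on a new connection.
AMQP_091_HEADER = b"AMQP\x00\x00\x09\x01"


def classify_reply(data):
    """Interpret the first bytes the broker sent back after the header."""
    if not data:
        return "no reply"           # TCP connected, but nothing came back
    if data.startswith(b"AMQP"):
        return "protocol mismatch"  # broker answered with its own header
    if data[0] == 1:
        return "method frame"       # frame type 1, expected Connection.Start
    return "unexpected reply"


def probe(host, port=5672, timeout=5.0):
    """Connect over TCP, send the AMQP header and classify the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(AMQP_091_HEADER)
        try:
            data = sock.recv(7)     # the 7-byte frame header is enough here
        except socket.timeout:
            data = b""
    return classify_reply(data)
```

On an affected server, probe("localhost") returning "no reply" while telnet still connects would point at the broker failing to accept, rather than at the network.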
The weird part is that it stopped working on all 4 RabbitMQ servers at around the same time.
The RabbitMQ log shows the following:

=ERROR REPORT==== 26-Sep-2014::22:10:27 ===
application: mochiweb
"Accept failed error"
"{error,enfile}"

=ERROR REPORT==== 26-Sep-2014::22:10:27 ===
{mochiweb_socket_server,295,{acceptor_error,{error,accept_failed}}}

=ERROR REPORT==== 26-Sep-2014::22:10:28 ===
** Generic server <0.5962.2319> terminating
** Last message in was {inet_async,#Port<0.14742>,45504,{error,enfile}}
** When Server state == {state,{rabbit_networking,start_client,[]},
                               #Port<0.14742>,45504}
** Reason for termination ==
** {accept_failed,enfile}

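In Erlang, {error,enfile} is the POSIX ENFILE error ("file table overflow"), which for accept() usually means the system-wide limit on open files had been reached; the per-process limit would show up as emfile instead. Assuming the hosts run Linux, a minimal sketch for checking how close a machine is to that limit reads /proc/sys/fs/file-nr. The function names are illustrative.

```python
def parse_file_nr(text):
    """Parse the contents of /proc/sys/fs/file-nr.

    The file holds three numbers: allocated file handles, allocated but
    unused handles (always 0 since Linux 2.6) and the system-wide maximum
    (the same value as fs.file-max). Returns (handles_in_use, maximum).
    """
    allocated, unused, maximum = (int(field) for field in text.split())
    return allocated - unused, maximum


def read_file_handle_usage(path="/proc/sys/fs/file-nr"):
    """Read the current system-wide file handle usage (Linux only)."""
    with open(path) as f:
        return parse_file_nr(f.read())
```

Logging read_file_handle_usage() periodically on each broker host would show whether handle usage was climbing towards the maximum before the failure.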
The SASL log from around the same time shows:

=SUPERVISOR REPORT==== 26-Sep-2014::22:06:53 ===
     Supervisor: {<0.32388.2318>,
                  amqp_channel_sup_sup}
     Context:    shutdown_error
     Reason:     shutdown
     Offender:   [{nb_children,1},
                  {name,channel_sup},
                  {mfargs,
                      {amqp_channel_sup,start_link,[direct,<0.32387.2318>]}},
                  {restart_type,temporary},
                  {shutdown,brutal_kill},
                  {child_type,supervisor}]

=CRASH REPORT==== 26-Sep-2014::22:07:26 ===
  crasher:
    initial call: mochiweb_acceptor:init/3
    pid: <0.1448.2319>
    registered_name: []
    exception exit: {error,accept_failed}
      in function  mochiweb_acceptor:init/3
    ancestors: [rabbit_web_dispatch_sup_15672,rabbit_web_dispatch_sup,
                  <0.150.0>]
    messages: []
    links: [<0.303.0>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 377
    stack_size: 24
    reductions: 229
  neighbours:

Thanks in advance for any ideas,
Regards
Claire