[rabbitmq-discuss] RabbitMQ hangs, does not accept connections

Dmitri Minaev

Dec 12, 2011, 11:24:55 AM
to rabbitmq...@lists.rabbitmq.com
Hello,

We have been using RabbitMQ for about a year now. From time to time I
upgraded it and switched from one server to another. The last such
transition took place about a month ago: I installed a new RabbitMQ
(2.7) on a new server and our web application was reconfigured. Quite
soon we faced new problems. After some days of stable work, clients
could no longer connect to RabbitMQ. I could run rabbitmqctl, list
queues and kill connections, but the server refused attempts to
connect. That is, the TCP socket was available and telnet could connect
to port 5672, but the AMQP connection could not be established. There
was nothing unusual in the logs. vm_memory_high_watermark is set to 0.7
and there is still plenty of free memory.

After a couple of such failures I tried to downgrade to 2.6.1, but the
problem remained. The last time I disabled IPv6, but today we hit the
same trouble again.

I think I must have done something wrong when setting up the
environment, but what could that be?

OS: Ubuntu 10.04 LTS.
16GB RAM.
RabbitMQ 2.6.1
Erlang R13B03 (erts-5.7.4) (package erlang-nox from Ubuntu repository)
Client: php-amqplib

--
With best regards,
Dmitri Minaev
_______________________________________________
rabbitmq-discuss mailing list
rabbitmq...@lists.rabbitmq.com
https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss

Simon MacMullen

Dec 13, 2011, 5:42:31 AM
to rabbitmq...@lists.rabbitmq.com
Hmm. I can't really say anything from your description - can you post
the logs somewhere? It's possible that your definition of "nothing
unusual in the logs" differs from mine.

And when you say that "the server refused attempts to connect", what
exactly do you mean. You say that a TCP connection *could* be
established - so does your client hang during AMQP handshaking?
Disconnect? Something else?

Cheers, Simon


--
Simon MacMullen
RabbitMQ, VMware

Alvaro Videla

Dec 13, 2011, 5:44:44 AM
to Simon MacMullen, rabbitmq...@lists.rabbitmq.com
You could also try running some of the demos from the official Java client to make sure it is not a client problem.



Dmitri Minaev

Dec 13, 2011, 9:26:16 AM
to Simon MacMullen, rabbitmq...@lists.rabbitmq.com
Thank you for the reply. Yes, a TCP connection could be established,
but not an AMQP one. We generally use the PHP library, but I also
tested RabbitMQ using Python amqplib. In both cases, the client cannot
get a connection.

Besides the common information messages (starting/closing TCP
connection), there is only one type of message in the log files:

=WARNING REPORT==== 13-Dec-2011::16:56:51 ===
exception on TCP connection <0.14474.173> from x.x.x.x:xxx
connection_closed_abruptly

But then again, these messages also appear during normal operation,
which is why I don't think they're relevant.

--

With best regards,
Dmitri Minaev

Dmitri Minaev

Dec 22, 2011, 1:32:45 AM
to Simon MacMullen, rabbitmq...@lists.rabbitmq.com
Now, I have a hanging Rabbit available for the autopsy.

Running processes (ps ax|grep rabbit):

-------------
29699 ? Ss 0:00 sh -c
RABBITMQ_PID_FILE=/var/run/rabbitmq/pid /usr/sbin/rabbitmq-server >
/var/log/rabbitmq/startup_log 2>
/var/log/rabbitmq/startup_err
29702 ? S 0:00 /bin/sh /usr/sbin/rabbitmq-server
29708 ? S 0:00 su rabbitmq -s /bin/sh -c
/usr/lib/rabbitmq/bin/rabbitmq-server
29710 ? S 0:00 sh -c /usr/lib/rabbitmq/bin/rabbitmq-server
29711 ? Sl 4715:59 /usr/lib/erlang/erts-5.7.4/bin/beam.smp -W
w -K true -A30 -P 1048576 -- -root /usr/lib/erlang -progname erl --
-home /var/lib/rabbitmq -- -noshell -noinput -sname rabbit@dbx
-setcookie riak -boot
/var/lib/rabbitmq/mnesia/rabbit@dbx-plugins-expand/rabbit -config
/etc/rabbitmq/rabbitmq -kernel inet_default_connect_options
[{nodelay,true}] -rabbit tcp_listeners [{"0.0.0.0",5672}] -sasl
errlog_type error -kernel error_logger
{file,"/var/log/rabbitmq/rab...@dbx.log"} -sasl sasl_error_logger
{file,"/var/log/rabbitmq/rab...@dbx-sasl.log"} -os_mon start_cpu_sup
true -os_mon start_disksup false -os_mon start_memsup false -mnesia
dir "/var/lib/rabbitmq/mnesia/rabbit@dbx"
-------------

Network sockets are available:
$ sudo netstat -tunlp|grep beam
tcp 0 0 0.0.0.0:5672 0.0.0.0:* LISTEN 29711/beam.smp
tcp 0 0 0.0.0.0:60040 0.0.0.0:* LISTEN 29711/beam.smp

$ cat /etc/rabbitmq/rabbitmq.config
[{rabbit, [{vm_memory_high_watermark, 0.7}]},
{rabbit, [{tcp_listeners, [{"0.0.0.0", 5672}]}]}].

$ cat /etc/rabbitmq/rabbitmq-env.conf
RABBITMQ_NODE_IP_ADDRESS=0.0.0.0

strace -p 29711 shows that the process is waiting in select():
select(0, NULL, NULL, NULL, NULL


Last lines in rab...@dbx.log:
---------------------------
=WARNING REPORT==== 22-Dec-2011::09:55:44 ===
exception on TCP connection <0.367.0> from x.x.x.26:43157
connection_closed_abruptly

=INFO REPORT==== 22-Dec-2011::09:55:44 ===
closing TCP connection <0.367.0> from x.x.x.26:43157

=WARNING REPORT==== 22-Dec-2011::09:55:44 ===
exception on TCP connection <0.379.0> from x.x.x.26:43160
connection_closed_abruptly

=INFO REPORT==== 22-Dec-2011::09:55:44 ===
closing TCP connection <0.379.0> from x.x.x.26:43160

=WARNING REPORT==== 22-Dec-2011::09:55:44 ===
exception on TCP connection <0.335.0> from x.x.x.26:43154
connection_closed_abruptly

=INFO REPORT==== 22-Dec-2011::09:55:44 ===
closing TCP connection <0.335.0> from x.x.x.26:43154

=WARNING REPORT==== 22-Dec-2011::09:55:44 ===
exception on TCP connection <0.467.0> from x.x.x.26:43166
connection_closed_abruptly

=INFO REPORT==== 22-Dec-2011::09:55:44 ===
closing TCP connection <0.467.0> from x.x.x.26:43166
---------------------------

PHP clients cannot connect to RabbitMQ. When I run my test Python
script which uses amqplib.client_0_8, it hangs on
amqp.Connection(host, "guest", "guest", ssl=False)

strace shows the following:

connect(3, {sa_family=AF_INET, sin_port=htons(5672),
sin_addr=inet_addr("127.0.0.1")}, 16) = 0
fcntl(3, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR) = 0
sendto(3, "AMQP\1\1\t\1", 8, 0, NULL, 0) = 8
brk(0x1461000) = 0x1461000
recvfrom(3,

Now, I try to connect to the RabbitMQ node using 'erl':
$ erl -sname 'rabbit@dbx'
{error_logger,{{2011,12,22},{10,26,33}},"Protocol: ~p: register error:
~p~n",["inet_tcp",{{badmatch,{error,duplicate_name}},[{inet_tcp_dist,listen,1},{net_kernel,start_protos,4},{net_kernel,start_protos,3},{net_kernel,init_node,2},{net_kernel,init,1},{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}]}
{error_logger,{{2011,12,22},{10,26,33}},crash_report,[[{initial_call,{net_kernel,init,['Argument__1']}},{pid,<0.21.0>},{registered_name,[]},{error_info,{exit,{error,badarg},[{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}},{ancestors,[net_sup,kernel_sup,<0.9.0>]},{messages,[]},{links,[#Port<0.68>,<0.18.0>]},{dictionary,[{longnames,false}]},{trap_exit,true},{status,running},{heap_size,377},{stack_size,24},{reductions,442}],[]]}
{error_logger,{{2011,12,22},{10,26,33}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pid,undefined},{name,net_kernel},{mfa,{net_kernel,start_link,[['rabbit@dbx',shortnames]]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]}
{error_logger,{{2011,12,22},{10,26,33}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,shutdown},{offender,[{pid,undefined},{name,net_sup},{mfa,{erl_distribution,start_link,[]}},{restart_type,permanent},{shutdown,infinity},{child_type,supervisor}]}]}
{error_logger,{{2011,12,22},{10,26,33}},std_info,[{application,kernel},{exited,{shutdown,{kernel,start,[normal,[]]}}},{type,permanent}]}
{"Kernel pid terminated",application_controller,"{application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}}"}

Crash dump was written to: erl_crash.dump
Kernel pid terminated (application_controller)
({application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}})

Is there any other information that might be useful?
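
A note on reproducing the client-side symptom: the strace above shows connect() succeeding, the 8-byte protocol header being sent, and recvfrom() then blocking. A minimal Python sketch of a probe that separates "TCP fails" from "TCP works but nothing answers" could look like the following. The function name, the result strings and the default timeout are assumptions made for illustration; only the header bytes are taken from the sendto() call in the trace.

```python
import socket

# Protocol header bytes exactly as they appear in the strace output
# ("AMQP\1\1\t\1", where \t is byte 0x09).
AMQP_HEADER = b"AMQP\x01\x01\x09\x01"


def probe_amqp(host, port, timeout=5.0):
    """Return 'tcp_failed', 'no_amqp_reply', 'closed' or 'amqp_reply'.

    These result strings are hypothetical labels, not part of any library.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Connection refused, unreachable host, or connect timeout.
        return "tcp_failed"
    with sock:
        sock.settimeout(timeout)
        try:
            sock.sendall(AMQP_HEADER)
            data = sock.recv(4096)
        except socket.timeout:
            # TCP handshake completed, but the peer never answered.
            return "no_amqp_reply"
        except OSError:
            return "closed"
        # An empty read means the peer closed without sending anything.
        return "amqp_reply" if data else "closed"


if __name__ == "__main__":
    print(probe_amqp("127.0.0.1", 5672))
```

Against a listener whose backlog completes the TCP handshake but whose application never accepts the connection, this probe should report "no_amqp_reply", which is the behaviour the Python test script ran into.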

Alvaro Videla

Dec 22, 2011, 1:55:28 AM
to Dmitri Minaev, rabbitmq...@lists.rabbitmq.com
Hi,

A small note,

When connecting to a remote Erlang node, in this case the rabbit node, you have to choose a different node name.

For example:

erl -sname foo

Once you are on the Erlang REPL then you can try to remotely connect to the rabbit node using net_adm:ping

-Alvaro.

Sent from my iFad

Dmitri Minaev

Dec 22, 2011, 2:05:21 AM
to Alvaro Videla, rabbitmq...@lists.rabbitmq.com
Oh...

$ erl -sname qwer
Erlang R13B03 (erts-5.7.4) [source] [64-bit] [smp:4:4] [rq:4]
[async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.7.4 (abort with ^G)
(qwer@dbx)1> net_adm:names().
{ok,[{"rabbit",60040},{"qwer",58043}]}
(qwer@dbx)2> net_adm:ping(rabbit).
pang

Alvaro Videla

Dec 22, 2011, 3:35:59 AM
to Dmitri Minaev, rabbitmq...@lists.rabbitmq.com
Yes, you might get a pang but make sure the user that started the erl command has the same .erlang.cookie as the user running the RabbitMQ process.

Sent form my Nokia 1100

Dmitri Minaev

Dec 22, 2011, 4:16:29 AM
to Alvaro Videla, rabbitmq...@lists.rabbitmq.com
Actually, I get the same results whatever cookie I set.

Matthias Radestock

Dec 22, 2011, 4:43:08 AM
to Dmitri Minaev, rabbitmq...@lists.rabbitmq.com
Dmitri,

On 22/12/11 06:32, Dmitri Minaev wrote:
> Now, I have a hanging Rabbit available for the autopsy.

Please send us the output of 'rabbitmqctl report'.

Matthias.

Dmitri Minaev

Dec 22, 2011, 5:17:24 AM
to Matthias Radestock, rabbitmq...@lists.rabbitmq.com
Here it is, in the attached file.

Fifteen minutes ago another RabbitMQ (this time v. 2.7) on another
server also refused to accept connections. I downgraded it to the last
version that worked (if I remember correctly, it is 2.1) and started
again. Let's see if it helps.


--

report.2011-12-22

Matthias Radestock

Dec 22, 2011, 5:25:20 AM
to Dmitri Minaev, rabbitmq...@lists.rabbitmq.com
Dmitri,

On 22/12/11 10:17, Dmitri Minaev wrote:
> Here it is, in the attached file.

All looks fine. What memory and file descriptor limits get reported in
the rabbit log?

Can you get a connection established when connecting a client from the
same machine the broker is running on?

Dmitri Minaev

Dec 22, 2011, 5:43:34 AM
to Matthias Radestock, rabbitmq...@lists.rabbitmq.com
During the last startup it was

=INFO REPORT==== 13-Dec-2011::17:16:53 ===
Limiting to approx 924 file handles (829 sockets)

=INFO REPORT==== 13-Dec-2011::17:16:53 ===
Memory limit set to 11252MB.

No, I cannot connect to Rabbit even from the same server. As I said
before, TCP connection to 127.0.0.1 is established, but AMQP
connection is not established.


--

With best regards,
Dmitri Minaev
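
The startup figures quoted in the message above can be sanity-checked with a little arithmetic. This is only a sketch: it assumes the logged memory limit is the configured watermark multiplied by total RAM, and it makes no claim about how the broker derives the socket figure from the file handle figure. The helper names are hypothetical.

```python
# Figures quoted in this thread.
WATERMARK = 0.7           # vm_memory_high_watermark from rabbitmq.config
MEMORY_LIMIT_MB = 11252   # "Memory limit set to 11252MB."
FILE_HANDLES = 924        # "Limiting to approx 924 file handles"
SOCKETS = 829             # "(829 sockets)"


def implied_total_ram_mb(limit_mb, watermark):
    """Total RAM implied by the logged limit, assuming limit = watermark * RAM."""
    return limit_mb / watermark


def socket_share(sockets, file_handles):
    """Fraction of the file handle budget that is available for sockets."""
    return sockets / file_handles


print(round(implied_total_ram_mb(MEMORY_LIMIT_MB, WATERMARK)))  # 16074
print(round(socket_share(SOCKETS, FILE_HANDLES), 3))            # 0.897
```

The implied total of roughly 16074 MB is consistent with the 16GB of RAM stated in the first message, and 924 is exactly 100 below the `ulimit -n` value of 1024 reported later in the thread.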

Matthias Radestock

Dec 22, 2011, 5:50:15 AM
to Dmitri Minaev, rabbitmq...@lists.rabbitmq.com
Dmitri,

On 22/12/11 10:43, Dmitri Minaev wrote:
> No, I cannot connect to Rabbit even from the same server. As I said
> before, TCP connection to 127.0.0.1 is established, but AMQP
> connection is not established.

On the broker machine, try

$ telnet localhost 5672

and type in 'AMQPxxxx<return>'

That should result in an output of

AMQP Connection closed by foreign host.

and a message in the rabbit.log like this:

=ERROR REPORT==== 22-Dec-2011::10:47:09 ===
exception on TCP connection <0.767.0> from [::1]:48915
{bad_version,120,120,120,120}


Do you get the same?

Matthias.
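
For readers wondering about the numbers in the expected log line: the four 120s are simply the byte values of the 'x' characters typed after 'AMQP'. A small Python sketch (the helper name is hypothetical) that splits an 8-byte header the same way:

```python
def parse_protocol_header(data):
    """Split a header into (protocol name, tuple of four version bytes)."""
    if len(data) < 8:
        raise ValueError("need at least 8 bytes")
    # Indexing or iterating over bytes in Python 3 yields integers.
    return data[:4], tuple(data[4:8])


# What telnet sends when 'AMQPxxxx<return>' is typed.
print(parse_protocol_header(b"AMQPxxxx\r\n"))           # (b'AMQP', (120, 120, 120, 120))
# The header amqplib sent in the strace output earlier in the thread.
print(parse_protocol_header(b"AMQP\x01\x01\x09\x01"))   # (b'AMQP', (1, 1, 9, 1))
```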

Dmitri Minaev

Dec 22, 2011, 5:51:30 AM
to Matthias Radestock, rabbitmq...@lists.rabbitmq.com
Here's also a graph of the number of messages in the queues (see
attached file). We have the impression that this situation is caused by
relatively high load. At least, RabbitMQ worked well from 13.12 until
today, 22.12, when the message producers were started.
rabbitmq_list_queues-day.png

Dmitri Minaev

Dec 22, 2011, 6:02:20 AM
to Matthias Radestock, rabbitmq...@lists.rabbitmq.com
No, I get no response from the server and nothing appears in the log.

But some new messages did appear after Rabbit had stopped accepting
connections. They are mostly in the same vein:

=INFO REPORT==== 22-Dec-2011::11:03:55 ===
closing TCP connection <0.431.0> from 212.24.56.22:49702

=WARNING REPORT==== 22-Dec-2011::11:03:55 ===
exception on TCP connection <0.327.0> from 212.24.56.22:49699
connection_closed_abruptly

=INFO REPORT==== 22-Dec-2011::11:03:55 ===
closing TCP connection <0.327.0> from 212.24.56.22:49699

=WARNING REPORT==== 22-Dec-2011::11:03:55 ===
exception on TCP connection <0.976.0> from 212.24.56.22:49720
connection_closed_abruptly

=INFO REPORT==== 22-Dec-2011::11:03:55 ===
closing TCP connection <0.976.0> from 212.24.56.22:49720

=WARNING REPORT==== 22-Dec-2011::11:03:55 ===
exception on TCP connection <0.1007.0> from 212.24.56.22:49721
connection_closed_abruptly

=INFO REPORT==== 22-Dec-2011::11:03:55 ===
closing TCP connection <0.1007.0> from 212.24.56.22:49721

=WARNING REPORT==== 22-Dec-2011::11:03:55 ===
exception on TCP connection <0.1019.0> from 212.24.56.22:49722
connection_closed_abruptly

=INFO REPORT==== 22-Dec-2011::11:03:55 ===
closing TCP connection <0.1019.0> from 212.24.56.22:49722

By the way, here's another graph, number of connections to Rabbit.
Hope it helps.


--

rabbitmq_connections-day.png

Matthias Radestock

Dec 22, 2011, 6:09:24 AM
to Dmitri Minaev, rabbitmq...@lists.rabbitmq.com
On 22/12/11 11:02, Dmitri Minaev wrote:
> No, I get no response from the server and nothing appears in the log.

This is all very mysterious.

Is the rabbit server process busy, cpu-wise?

Dmitri Minaev

Dec 22, 2011, 6:30:29 AM
to Matthias Radestock, rabbitmq...@lists.rabbitmq.com
No, it is simply waiting. CPU is idle, RAM is free, swap is unused, IO
is very moderate.

Is there anything I could do using -remsh?


--

With best regards,
Dmitri Minaev

Matthias Radestock

Dec 22, 2011, 6:36:10 AM
to Dmitri Minaev, rabbitmq...@lists.rabbitmq.com
Dmitri,

On 22/12/11 11:30, Dmitri Minaev wrote:
> No, it is simply waiting. CPU is idle, RAM is free, swap is unused, IO
> is very moderate.

I suspect this is a problem with Erlang, somehow causing the tcp/ip
sub-system to get into a weird state.

I see you are running R13B03, which is two years old. I suggest you
replace that with Ericsson's latest - R15B - and also run the latest
rabbit (2.7.1).

Regards,

Matthias.

Dmitri Minaev

Dec 22, 2011, 6:41:17 AM
to Matthias Radestock, rabbitmq...@lists.rabbitmq.com
Sounds reasonable :). Ok, I will stop the hanging instance now and
upgrade Erlang to the latest version. The current one just happened to
be the version in the Ubuntu repository.

Thank you, Matthias, Alvaro, Simon and everyone else!


--

With best regards,
Dmitri Minaev

Dmitri Minaev

Dec 29, 2011, 4:27:35 AM
to Matthias Radestock, rabbitmq...@lists.rabbitmq.com
Hello,

I am sorry, but upgrade to Erlang R15B and RabbitMQ 2.7.1 did not
help. Today I saw the same picture: RabbitMQ does not accept new
connections, while everything else seems to be working :(

Matthias Radestock

Dec 29, 2011, 4:35:12 AM
to Dmitri Minaev, rabbitmq...@lists.rabbitmq.com
Dmitri,

On 29/12/11 09:27, Dmitri Minaev wrote:
> I am sorry, but upgrade to Erlang R15B and RabbitMQ 2.7.1 did not
> help. Today I saw the same picture: RabbitMQ does not accept new
> connections, while everything else seems to be working :(

That's unfortunate. Though our investigation will be easier now that you
are running the latest version.

Is there any way you could give us access to the broken rabbit?

Also, while in the hung state, what does

$ scripts/rabbitmqctl eval 'file_handle_cache:info().'

return?

Regards,

Matthias.

Dmitri Minaev

Dec 30, 2011, 3:27:50 AM
to Matthias Radestock, rabbitmq...@lists.rabbitmq.com
Matthias,

I would be most grateful if you could have a look at our server. I
will send you more info in an off-list message. Thank you.

Here's the output of file_handle_cache:info():

$ sudo /usr/local/rabbitmq/sbin/rabbitmqctl eval 'file_handle_cache:info().'
[{obtain_count,51},{obtain_limit,829}]
...done.

--

With best regards,
Dmitri Minaev
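
To put the `file_handle_cache:info()` output above in perspective, the two counters can be parsed and compared. This is a minimal Python sketch; the function names are hypothetical and the parser handles only the flat list of `{atom,integer}` pairs shown in the message, not general Erlang terms.

```python
import re


def parse_fhc_info(text):
    """Parse '[{obtain_count,51},{obtain_limit,829}]' into a dict of ints."""
    return {key: int(value) for key, value in re.findall(r"\{(\w+),(\d+)\}", text)}


def obtain_utilisation(info):
    """Fraction of the obtain limit that the broker believes is in use."""
    return info["obtain_count"] / info["obtain_limit"]


info = parse_fhc_info("[{obtain_count,51},{obtain_limit,829}]")
print(info)                                  # {'obtain_count': 51, 'obtain_limit': 829}
print(f"{obtain_utilisation(info):.1%}")     # 6.2%
```

By the broker's own bookkeeping, then, only about 6% of the socket budget was in use while it was hung. That number describes what the broker believes, not necessarily what the operating system sees.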

Matthias Radestock

Dec 30, 2011, 7:11:30 AM
to Dmitri Minaev, rabbitmq...@lists.rabbitmq.com
Dmitri,

On 30/12/11 08:27, Dmitri Minaev wrote:
> I would be most grateful if you could have a look at our server.

I have done this now.

Rabbit indeed wasn't accepting connections - it wasn't refusing them
either, i.e. it was behaving as if 'accept' hadn't been called.

The Erlang process tasked with accepting AMQP connections was alive and
well. It was simply sitting there waiting for the tcp subsystem to
notify it of new connections. Alas that never happened.

So either the acceptor process forgot to tell Erlang's tcp stack to be
notified of new connections, or Erlang's tcp stack forgot that it was
supposed to tell the acceptor process...

...and it looks like there is a path in the tcp_acceptor.erl that would
trigger the former. When the tcp stack notifies the acceptor of an error
other than 'closed', the acceptor carries on but does not invoke
prim_inet:async_accept/2 to be notified of the next connection attempt.

I will file a bug for this. Should be easy to fix, though we cannot be
certain that this is definitely the problem.

Obviously if this was happening frequently we would have heard about the
issue a long time ago - the code in question hasn't changed for >3
years. So there must be some rare circumstances triggering this.


I got the acceptor process to issue another async_accept, so rabbit is
happy for the moment. But no doubt the problem will re-occur.


Regards,

Matthias.
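
The failure mode described above, where an acceptor that sees an unexpected error never asks for the next connection, can be illustrated outside Erlang. The following is a Python analogy only, not the code in tcp_acceptor.erl; `accept_loop`, its parameters and the choice of EBADF as the "listener closed" signal are assumptions made for the sketch.

```python
import errno


def accept_loop(listener, handle, max_events):
    """Process up to max_events accept attempts, surviving transient errors.

    Returns the symbolic names of the errors that were seen.
    """
    failures = []
    for _ in range(max_events):
        try:
            conn, addr = listener.accept()
        except OSError as exc:
            if exc.errno == errno.EBADF:
                # The listening socket itself is gone: stop accepting.
                break
            # Any other error (for example ENFILE) is recorded, and then
            # the loop goes round again. Skipping this "go round again"
            # step is what leaves a listener that never accepts.
            failures.append(errno.errorcode.get(exc.errno, str(exc.errno)))
            continue
        handle(conn, addr)
    return failures
```

The important line is the `continue`: after recording the error, the loop issues another accept instead of silently stopping. With a blocking socket a real server would probably also pause briefly before retrying, to avoid spinning while descriptors are exhausted.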

Matthias Radestock

Dec 30, 2011, 9:02:37 AM
to Dmitri Minaev, rabbitmq...@lists.rabbitmq.com
Dmitri,

On 30/12/11 12:11, Matthias Radestock wrote:
> When the tcp stack notifies the acceptor of an error
> other than 'closed', the acceptor carries on but does not invoke
> prim_inet:async_accept/2 to be notified of the next connection attempt.
>
> I will file a bug for this. Should be easy to fix, though we cannot be
> certain that this is definitely the problem.

Here's a proposed fix:

http://hg.rabbitmq.com/rabbitmq-server/rev/ca0392ca0fc1

I am attaching a tcp_acceptor.beam with that fix, compiled for R15, that
you can drop in place of the existing file. I'd be interested to know
a) whether that solves the problem for you, and b) what error gets
logged. Watch out for something like

=ERROR REPORT==== 30-Dec-2011::13:45:01 ===
failed to accept TCP connection on [::]:5672: some_error

in the logs.


Regards,

Matthias.

tcp_acceptor.beam

Dmitri Minaev

Dec 30, 2011, 10:34:38 AM
to Matthias Radestock, rabbitmq...@lists.rabbitmq.com
Thank you very much. I will install it today or tomorrow, but our
system was switched over to another installation. Besides, I do not
expect intensive load till after 10 January, the end of Russian New
Year vacations.

But I am still curious about the fact that until about two months ago
our experience with RabbitMQ was very good. We had version 2.1 or 2.2
then and it worked fine. The problems started when I moved Rabbit to
another server and upgraded it. Either the bug was introduced between
versions 2.2 and 2.6.1, or it is related to some changes in the server
environment. Is it possible to find out whether versions 2.1-2.2 were
also influenced by the problem?

--

With best regards,
Dmitri Minaev

Matthias Radestock

Dec 30, 2011, 10:44:46 AM
to Dmitri Minaev, rabbitmq...@lists.rabbitmq.com
Dmitri,

On 30/12/11 15:34, Dmitri Minaev wrote:
> But I am still curious about the fact that until about two months ago
> our experience with RabbitMQ was very good. We had version 2.1 or 2.2
> then and it worked fine. The problems started when I moved Rabbit to
> another server and upgraded it. Either the bug was introduced between
> versions 2.2 and 2.6.1, or it is related to some changes in the server
> environment. Is it possible to find out whether versions 2.1-2.2 were
> also influenced by the problem?

As I said, the problem I identified has been around for more than three
years.

The likely trigger is some obscure condition in the network stack. So it
is quite conceivable that the move to a different server made this
happen with a much higher probability than before. For example, it looks
like you are now using IPv6 - was that the case before the move too?


Matthias.

Dmitri Minaev

Dec 30, 2011, 10:54:06 AM
to Matthias Radestock, rabbitmq...@lists.rabbitmq.com
We did not switch to IPv6 intentionally. I didn't even notice this
until some time ago, trying to find the cause of those troubles. I do
not think, though, that this is relevant, because the problem
persisted even when Rabbit was configured to work with v4 only.

I have installed that file and restarted Rabbit. Also, I asked the
developers whether they can stress-test the server. I will let you
know as soon as possible.

Thanks!

--

With best regards,
Dmitri Minaev

Dmitri Minaev

Mar 6, 2012, 1:51:22 AM
to Matthias Radestock, rabbitmq...@lists.rabbitmq.com
Dear friends,

Finally, I can say that the attempt to solve the problem with the
modified tcp_acceptor has failed. For a couple of months Rabbit worked
well, even under moderate load (up to 8-9 million messages), but today
it has failed again with the same symptoms. Let me remind you of the
situation.

RabbitMQ v.2.7.1 working under Erlang R15B on Ubuntu Linux 10.04,
suddenly stops accepting AMQP connections. TCP connections are being
accepted, but no response follows. rabbitmqctl works.

The nonoperating RabbitMQ server is now at my disposal for autopsy.

Simon MacMullen

Mar 6, 2012, 11:54:57 AM
to Dmitri Minaev, rabbitmq...@lists.rabbitmq.com, Matthias Radestock
On 06/03/12 06:51, Dmitri Minaev wrote:
> Dear friends,

Hi.

> Finally, I can say that the attempt to solve the problem with the
> modified tcp_acceptor has failed. For a couple of months Rabbit worked
> well, even under moderate load (up to 8-9 million messages), but today
> it has failed again with the same symptoms.

Damn.

Did any error along the lines of "failed to accept TCP connection..."
appear in the logs?

> The nonoperating RabbitMQ server is now at my disposal for autopsy.

If I were able to look at this tomorrow that would be great.

Cheers, Simon

--
Simon MacMullen
RabbitMQ, VMware

Dmitri Minaev

Mar 7, 2012, 3:57:19 AM
to Simon MacMullen, rabbitmq...@lists.rabbitmq.com, Matthias Radestock
Yes, there's quite a lot of those 'failed to accept' messages.
Actually, this is the second time when Rabbit stops since installation
of that tcp_acceptor patch, and every time those messages appear in
large number.


--

With best regards,
Dmitri Minaev

Simon MacMullen

Mar 7, 2012, 10:22:37 AM
to Dmitri Minaev, rabbitmq...@lists.rabbitmq.com, Matthias Radestock
On 07/03/12 08:57, Dmitri Minaev wrote:
> Yes, there's quite a lot of those 'failed to accept' messages.
> Actually, this is the second time when Rabbit stops since installation
> of that tcp_acceptor patch, and every time those messages appear in
> large number.

So there are a few things going on here. The primary issue is that RabbitMQ
is running out of file descriptors due to too many connections being
opened, and for various reasons the internal accounting that is supposed
to prevent this from causing harm is getting out of sync with reality.
There are already improvements that will be coming in the next release
that will improve this situation and we have some ideas for how to fix
it altogether.

But in the meantime you should look at increasing the number of file
descriptors that are available to RabbitMQ.

Cheers, Simon

--
Simon MacMullen
RabbitMQ, VMware
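
Before raising the limit, it can help to compare what the operating system reports with what the broker believes. The sketch below is Linux-only and reads the /proc filesystem; the function names are hypothetical, and each function returns None where /proc is not available or not readable. In practice the pid would be that of the beam.smp process and the script would need sufficient privileges to read its /proc entries.

```python
import os


def count_open_fds(pid):
    """Number of descriptors the kernel reports as open for pid, or None."""
    try:
        return len(os.listdir(f"/proc/{pid}/fd"))
    except OSError:
        return None


def soft_nofile_limit(pid):
    """Soft 'Max open files' limit of pid from /proc/<pid>/limits, or None."""
    try:
        with open(f"/proc/{pid}/limits") as fh:
            for line in fh:
                if line.startswith("Max open files"):
                    # Columns: Max open files <soft> <hard> <units>
                    value = line.split()[3]
                    return int(value) if value.isdigit() else None
    except OSError:
        pass
    return None


def system_file_nr():
    """(allocated, unused, maximum) from /proc/sys/fs/file-nr, or None."""
    try:
        with open("/proc/sys/fs/file-nr") as fh:
            return tuple(int(field) for field in fh.read().split())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    pid = os.getpid()  # replace with the broker's pid when diagnosing
    print("open fds:", count_open_fds(pid))
    print("soft limit:", soft_nofile_limit(pid))
    print("system-wide:", system_file_nr())
```

The per-process count is the figure to set against the per-process limit; the file-nr triple shows how close the whole system is to its own maximum.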

Dmitri Minaev

Mar 9, 2012, 2:57:54 AM
to Simon MacMullen, rabbitmq...@lists.rabbitmq.com, Matthias Radestock
On 7 March 2012 19:22, Simon MacMullen <si...@rabbitmq.com> wrote:
> On 07/03/12 08:57, Dmitri Minaev wrote:
>>
>> Yes, there's quite a lot of those 'failed to accept' messages.
>> Actually, this is the second time when Rabbit stops since installation
>> of that tcp_acceptor patch, and every time those messages appear in
>> large number.
>
>
> So there are a few things going on here. The primary issue is that RabbitMQ is
> running out of file descriptors due to too many connections being opened,
> and for various reasons the internal accounting that is supposed to prevent
> this from causing harm is getting out of sync with reality. There are
> already improvements that will be coming in the next release that will
> improve this situation and we have some ideas for how to fix it altogether.
>
> But in the meantime you should look at increasing the number of file
> descriptors that are available to RabbitMQ.

Oh... Thanks. I thought, that if TCP connection is accepted, number of
file descriptors should have no effect on the further events, since
the socket already exists. Even now I can connect to port 5672 on that
server, but AMQP does not respond.

If I understand correctly, the number of file descriptors used by
RabbitMQ in normal situation is roughly equal to the number of
`rabbitmqctl list_connections` + some constant (~30)? In case of that
hanging server, the number of AMQP connections never was close to the
FD limit (ulimit -n is 1024, fs.file-max = 1605698). The graph
reflecting the number of open AMQP connections is attached to this
message.

rabbitmq_connections-month.png

Simon MacMullen

Mar 9, 2012, 7:37:10 AM
to Dmitri Minaev, rabbitmq...@lists.rabbitmq.com, Matthias Radestock
On 09/03/12 07:57, Dmitri Minaev wrote:
> Oh... Thanks. I thought that once a TCP connection is accepted, the number of
> file descriptors should have no effect on further events, since
> the socket already exists. Even now I can connect to port 5672 on that
> server, but AMQP does not respond.

I think at this point it hasn't actually allocated the FD, so it can't
communicate.
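
The state described here (TCP connects, AMQP stays silent) can be told apart from a refused connection or a protocol mismatch with a small probe. What follows is a minimal Python sketch, not part of RabbitMQ or any client library: `classify_reply` and `probe_amqp` are hypothetical helper names, and it assumes the broker speaks AMQP 0-9-1 on the given port. It sends the 8-byte protocol header and looks at the first bytes that come back.

```python
import socket

# AMQP 0-9-1 protocol header: the literal "AMQP" followed by 0, 0, 9, 1.
AMQP_HEADER = b"AMQP\x00\x00\x09\x01"


def classify_reply(data: bytes) -> str:
    """Interpret the first bytes a broker sends back after the header."""
    if not data:
        return "closed"             # peer closed without sending anything
    if data.startswith(b"AMQP"):
        return "version-mismatch"   # broker replied with its own protocol header
    if data[0] == 1:
        return "handshake-started"  # frame type 1 is a method frame (Connection.Start)
    return "unexpected"


def probe_amqp(host: str, port: int = 5672, timeout: float = 5.0) -> str:
    """Open a TCP connection, send the header, and report what happens next."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(AMQP_HEADER)
            try:
                return classify_reply(sock.recv(8))
            except socket.timeout:
                # TCP is up but nothing answers: the symptom in this thread.
                return "tcp-ok-amqp-silent"
    except OSError as exc:
        # Refused, unreachable, reset, or timed out while connecting.
        return f"socket-error: {exc}"
```

A result of `tcp-ok-amqp-silent` from `probe_amqp("broker.example")` (the hostname is a placeholder) would match the hang reported here, while `handshake-started` would indicate a healthy listener.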

> If I understand correctly, the number of file descriptors used by
> RabbitMQ in a normal situation is roughly equal to the number of
> connections listed by `rabbitmqctl list_connections` plus some constant
> (~30)? In the case of that hanging server, the number of AMQP connections
> was never close to the FD limit (ulimit -n is 1024, fs.file-max = 1605698).
> The graph showing the number of open AMQP connections is attached to this
> message.

It's not really a constant, but to a first approximation, yes.

Ultimately, the error I saw being passed up from the OS was ENFILE -
that's pretty unambiguous :)

It's possible that if you're churning connections then "closed"
connections in FIN_WAIT2 could account for the majority of the used FDs.
In 2.8.0 we'll set SO_LINGER to 0 to prevent this.
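
To check whether lingering half-closed sockets are eating descriptors, the per-state socket counts for the broker port can be read from the kernel. This is a minimal sketch, assuming Linux and IPv4 (it parses `/proc/net/tcp`); `count_states` and `read_states` are illustrative names, and 5672 is the port mentioned earlier in this thread.

```python
from collections import Counter

# State codes used in /proc/net/tcp (Linux kernel TCP state numbering).
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(proc_net_tcp: str, local_port: int = 5672) -> Counter:
    """Count sockets per TCP state whose local port equals local_port."""
    counts = Counter()
    for line in proc_net_tcp.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 4:
            continue
        port_hex = fields[1].rsplit(":", 1)[1]  # local address is HEXIP:HEXPORT
        if int(port_hex, 16) == local_port:
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts


def read_states(local_port: int = 5672) -> Counter:
    """Read the IPv4 socket table; /proc/net/tcp6 uses the same layout for IPv6."""
    with open("/proc/net/tcp") as fh:
        return count_states(fh.read(), local_port)
```

A large `FIN_WAIT2` count next to a small `ESTABLISHED` count would be consistent with the churn explanation above.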

Dmitri Minaev

Mar 26, 2012, 1:45:45 AM
to Simon MacMullen, rabbitmq...@lists.rabbitmq.com, Matthias Radestock
On 9 March 2012 16:37, Simon MacMullen <si...@rabbitmq.com> wrote:
> On 09/03/12 07:57, Dmitri Minaev wrote:

>> If I understand correctly, the number of file descriptors used by
>> RabbitMQ in a normal situation is roughly equal to the number of
>> connections listed by `rabbitmqctl list_connections` plus some constant
>> (~30)? In the case of that hanging server, the number of AMQP connections
>> was never close to the FD limit (ulimit -n is 1024, fs.file-max = 1605698).
>> The graph showing the number of open AMQP connections is attached to this
>> message.
>
>
> It's not really a constant, but to a first approximation, yes.
>
> Ultimately, the error I saw being passed up from the OS was ENFILE - that's
> pretty unambiguous :)
>
> It's possible that if you're churning connections then "closed" connections
> in FIN_WAIT2 could account for the majority of the used FDs. In 2.8.0 we'll
> set SO_LINGER to 0 to prevent this.

I am still not sure about the role of file descriptors in this
event. Last Friday, our Rabbit died again. Until the very last moment,
the number of open file descriptors reported by the management
plugin was 143 out of a total of 32765 available. That was
RabbitMQ v2.7.0. The same day I upgraded to 2.8.1. I will
report on its behaviour later.
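
The figure from the management plugin can be cross-checked against what the kernel sees for the Erlang VM process. This is a rough sketch, assuming Linux, a known broker PID, and permission to read `/proc/<pid>/fd` (same user or root); `classify_fd_target` and `count_fds` are hypothetical names.

```python
import os
from collections import Counter


def classify_fd_target(target: str) -> str:
    """Map a /proc/<pid>/fd symlink target to a coarse category."""
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "pipe"
    if target.startswith("anon_inode:"):
        return "anon_inode"  # for example, epoll instances
    return "file"


def count_fds(pid: int) -> Counter:
    """Count a process's open descriptors by category (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except FileNotFoundError:
            continue  # descriptor was closed while we were iterating
        counts[classify_fd_target(target)] += 1
    return counts
```

Comparing `sum(count_fds(pid).values())` with the plugin's number shows whether the broker's internal accounting and the OS agree.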

--
With best regards,
Dmitri Minaev


Victoriya Shintekova

Nov 1, 2012, 9:40:56 AM
to rabbitmq...@googlegroups.com, Simon MacMullen, rabbitmq...@lists.rabbitmq.com, Matthias Radestock, min...@gmail.com
Good day! 
I have the same problem with rabbit. Did you solve it?

On Monday, March 26, 2012 at 9:45:45 AM UTC+4, Dmitri Minaev wrote:

maulik thaker

Jan 22, 2014, 8:29:10 AM
to rabbitmq...@googlegroups.com, Simon MacMullen, rabbitmq...@lists.rabbitmq.com, Matthias Radestock, min...@gmail.com, trid...@gmail.com

Hi Dmitri Minaev,

I am facing the same problem with RabbitMQ 3.2.2. I have gone through this thread and already checked that memory and the number of file descriptors are not causing the problem. If you have solved the problem, please share the solution.

Thanks,
Maulik

maulik thaker

Jan 28, 2014, 10:14:52 PM
to Dimitri Minaev, rabbitmq...@googlegroups.com, Simon MacMullen, rabbitmq...@lists.rabbitmq.com, Matthias Radestock, trid...@gmail.com
Hello Dimitri Minaev,

Thanks for the reply.

We finally found the issue: there was a connection leak in another module that caused the problem in our module.

Thanks,
Maulik


On Fri, Jan 24, 2014 at 6:36 PM, Dimitri Minaev <min...@gmail.com> wrote:
Hello, Maulik,

I believe your problem is a different one, because the bug has been gone since we upgraded to 2.8.7. The final message from the support service was:

> The Erlang/OTP team believe that the problem was due to a bug in their code. A small change introduced in RabbitMQ 2.8.7 had the coincidental fortunate effect of bypassing the bug, so the problem should not occur in 2.8.7 or later versions of RabbitMQ.

Good luck!




--
With best regards,
Dimitri Minaev

pavel.f...@corp.flirchi.com

Feb 3, 2015, 6:31:30 AM
to rabbitmq...@googlegroups.com, min...@gmail.com, si...@rabbitmq.com, rabbitmq...@lists.rabbitmq.com, matt...@rabbitmq.com, trid...@gmail.com
Hello,

We are running RabbitMQ 3.4.3 on Debian and hit the same issue with connections yesterday, after 2.5M messages in queues.

Is there any way to solve it?

Thanks,
Pavel.

On Wednesday, January 29, 2014 at 5:14:52 AM UTC+2, maulik thaker wrote:

Joe Oliveiro

Feb 19, 2015, 2:23:06 AM
to rabbitmq...@googlegroups.com, rabbitmq...@lists.rabbitmq.com, min...@gmail.com
Got the same issue over here as well.

On Monday, December 12, 2011 at 5:24:55 PM UTC+1, Dmitri Minaev wrote:

Joseph Sikorski

Apr 8, 2015, 11:19:30 AM
to rabbitmq...@googlegroups.com, rabbitmq...@lists.rabbitmq.com, min...@gmail.com
We ran into the same issue (running on Windows nodes). We ended up upgrading RabbitMQ from 3.4.4 to 3.5.1 and Erlang from 17.1 to 17.5 (erts-6.1 to 6.4). It's been a few days since we upgraded and there have been no issues.

I looked through the Erlang release notes and there appears to be a bug in 17.1 that may have contributed to the issue. The bug was listed as resolved in 17.3.

per.gunn...@gmail.com

Aug 8, 2017, 1:08:09 PM
to rabbitmq-discuss, rabbitmq...@lists.rabbitmq.com, si...@rabbitmq.com
Alas, we also sporadically experience this problem. Erlang 19.3, RabbitMQ 3.6.8.

Thought we had nailed it last year, but it reared its head again this summer.

Any ideas, anyone?


Per Gunnar Hansø
Sentinel Software AS

jstr...@gmail.com

Sep 20, 2018, 5:15:21 AM
to rabbitmq-discuss
For what it's worth, we also encountered similar problems and it appeared to have been a file descriptor issue - the host in question happened to have a measly limit of 1024. We managed to consistently crash it in a few seconds when bombarding an application with a stress test to find the performance limits of that particular host.

With Ubuntu 16.04 and systemd, you might try adding a similar line to rabbitmq-server.service, in the [Service] section:

LimitNOFILE=300000

Then run systemctl daemon-reload and restart the service. 300000 is probably overkill but it certainly fixed the problem. rabbitmqctl status should show the new limit.

RabbitMQ documentation recommends using at least 65536 for production environments.
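
Besides `rabbitmqctl status`, the limit that actually applies to the running broker can be read from `/proc/<pid>/limits`. This is a small sketch, assuming Linux and a known broker PID; `parse_max_open_files`, `max_open_files`, and `limit_is_adequate` are made-up helper names, and the 65536 threshold is simply the figure quoted above.

```python
RECOMMENDED_MIN = 65536  # production figure quoted in this thread


def parse_max_open_files(limits_text: str):
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits.

    None stands for 'unlimited'.
    """
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            soft, hard = line[len(label):].split()[:2]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max open files' row found")


def max_open_files(pid: int):
    """Read the limits of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as fh:
        return parse_max_open_files(fh.read())


def limit_is_adequate(soft, minimum: int = RECOMMENDED_MIN) -> bool:
    """True if the soft limit is unlimited or at least the given minimum."""
    return soft is None or soft >= minimum
```

If the soft limit still reads 1024 after a restart, the unit override was probably not applied to the service that is actually running.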

The reason for the low limit on that host was a Puppet module that didn't correctly configure the descriptor limits. A closer look shows that we just didn't deploy the systemd module, which is listed as a dependency. That dependency should provide the fact the Puppet module relies on to decide whether to add the value to the service unit.

Hope this helps.

jstr...@gmail.com

Sep 21, 2018, 3:06:12 AM
to rabbitmq-discuss
On closer inspection, the reason for the module failing to set the limits seems to be that it's using the wrong service name -  "rabbitmq" and not "rabbitmq-server" - and thus the override config is never applied to the actual service. This is just a Puppet detail though.

rakesh...@paytm.com

Apr 20, 2019, 3:42:50 AM
to rabbitmq-discuss
Increasing the file descriptor limit helped.