We have been using RabbitMQ for about a year now. From time to time I
upgraded it and moved it from one server to another. The last such
transition took place about a month ago: I installed a new RabbitMQ
(2.7) on a new server and reconfigured our web application. Quite soon
we ran into new problems. After some days of stable operation, clients
could no longer connect to RabbitMQ. I could run rabbitmqctl, list
queues and kill connections, but the server refused attempts to connect.
That is, the TCP socket was available and telnet could connect to port
5672, but the AMQP connection could not be established. There was
nothing unusual in the logs. vm_memory_high_watermark is set to 0.7 and
there is still plenty of free memory.
After a couple of such failures I tried to downgrade to 2.6.1, but the
problem remained. The last time I disabled IPv6, but today we hit the
same trouble again.
I think I must have done something wrong when setting up the
environment, but what could that be?
OS: Ubuntu 10.04 LTS.
16GB RAM.
RabbitMQ 2.6.1
Erlang R13B03 (erts-5.7.4) (package erlang-nox from Ubuntu repository)
Client: php-amqplib
--
With best regards,
Dmitri Minaev
_______________________________________________
rabbitmq-discuss mailing list
rabbitmq...@lists.rabbitmq.com
https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
And when you say that "the server refused attempts to connect", what
exactly do you mean? You say that a TCP connection *could* be
established - so does your client hang during AMQP handshaking?
Disconnect? Something else?
Cheers, Simon
On 12/12/11 16:24, Dmitri Minaev wrote:
> Hello,
> [...]
--
Simon MacMullen
RabbitMQ, VMware
Besides the usual informational messages (starting/closing TCP
connection), there is only one type of message in the log files:
=WARNING REPORT==== 13-Dec-2011::16:56:51 ===
exception on TCP connection <0.14474.173> from x.x.x.x:xxx
connection_closed_abruptly
But then again, these messages also appear during normal
operation, which is why I don't think they're relevant.
--
With best regards,
Dmitri Minaev
Running processes (ps ax|grep rabbit):
-------------
29699 ? Ss 0:00 sh -c
RABBITMQ_PID_FILE=/var/run/rabbitmq/pid /usr/sbin/rabbitmq-server >
/var/log/rabbitmq/startup_log 2>
/var/log/rabbitmq/startup_err
29702 ? S 0:00 /bin/sh /usr/sbin/rabbitmq-server
29708 ? S 0:00 su rabbitmq -s /bin/sh -c
/usr/lib/rabbitmq/bin/rabbitmq-server
29710 ? S 0:00 sh -c /usr/lib/rabbitmq/bin/rabbitmq-server
29711 ? Sl 4715:59 /usr/lib/erlang/erts-5.7.4/bin/beam.smp -W
w -K true -A30 -P 1048576 -- -root /usr/lib/erlang -progname erl --
-home /var/lib/rabbitmq -- -noshell -noinput -sname rabbit@dbx
-setcookie riak -boot
/var/lib/rabbitmq/mnesia/rabbit@dbx-plugins-expand/rabbit -config
/etc/rabbitmq/rabbitmq -kernel inet_default_connect_options
[{nodelay,true}] -rabbit tcp_listeners [{"0.0.0.0",5672}] -sasl
errlog_type error -kernel error_logger
{file,"/var/log/rabbitmq/rab...@dbx.log"} -sasl sasl_error_logger
{file,"/var/log/rabbitmq/rab...@dbx-sasl.log"} -os_mon start_cpu_sup
true -os_mon start_disksup false -os_mon start_memsup false -mnesia
dir "/var/lib/rabbitmq/mnesia/rabbit@dbx"
-------------
Network sockets are available:
$ sudo netstat -tunlp|grep beam
tcp 0 0 0.0.0.0:5672 0.0.0.0:* LISTEN 29711/beam.smp
tcp 0 0 0.0.0.0:60040 0.0.0.0:* LISTEN 29711/beam.smp
$ cat /etc/rabbitmq/rabbitmq.config
[{rabbit, [{vm_memory_high_watermark, 0.7}]},
{rabbit, [{tcp_listeners, [{"0.0.0.0", 5672}]}]}].
$ cat /etc/rabbitmq/rabbitmq-env.conf
RABBITMQ_NODE_IP_ADDRESS=0.0.0.0
strace -p 29711 shows that the process is waiting in select():
select(0, NULL, NULL, NULL, NULL
Last lines in rab...@dbx.log:
---------------------------
=WARNING REPORT==== 22-Dec-2011::09:55:44 ===
exception on TCP connection <0.367.0> from x.x.x.26:43157
connection_closed_abruptly
=INFO REPORT==== 22-Dec-2011::09:55:44 ===
closing TCP connection <0.367.0> from x.x.x.26:43157
=WARNING REPORT==== 22-Dec-2011::09:55:44 ===
exception on TCP connection <0.379.0> from x.x.x.26:43160
connection_closed_abruptly
=INFO REPORT==== 22-Dec-2011::09:55:44 ===
closing TCP connection <0.379.0> from x.x.x.26:43160
=WARNING REPORT==== 22-Dec-2011::09:55:44 ===
exception on TCP connection <0.335.0> from x.x.x.26:43154
connection_closed_abruptly
=INFO REPORT==== 22-Dec-2011::09:55:44 ===
closing TCP connection <0.335.0> from x.x.x.26:43154
=WARNING REPORT==== 22-Dec-2011::09:55:44 ===
exception on TCP connection <0.467.0> from x.x.x.26:43166
connection_closed_abruptly
=INFO REPORT==== 22-Dec-2011::09:55:44 ===
closing TCP connection <0.467.0> from x.x.x.26:43166
---------------------------
PHP clients cannot connect to RabbitMQ. When I run my test Python
script which uses amqplib.client_0_8, it hangs on
amqp.Connection(host, "guest", "guest", ssl=False)
strace shows the following:
connect(3, {sa_family=AF_INET, sin_port=htons(5672),
sin_addr=inet_addr("127.0.0.1")}, 16) = 0
fcntl(3, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR) = 0
sendto(3, "AMQP\1\1\t\1", 8, 0, NULL, 0) = 8
brk(0x1461000) = 0x1461000
recvfrom(3,
Now, I try to connect to the RabbitMQ node using 'erl':
$ erl -sname 'rabbit@dbx'
{error_logger,{{2011,12,22},{10,26,33}},"Protocol: ~p: register error:
~p~n",["inet_tcp",{{badmatch,{error,duplicate_name}},[{inet_tcp_dist,listen,1},{net_kernel,start_protos,4},{net_kernel,start_protos,3},{net_kernel,init_node,2},{net_kernel,init,1},{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}]}
{error_logger,{{2011,12,22},{10,26,33}},crash_report,[[{initial_call,{net_kernel,init,['Argument__1']}},{pid,<0.21.0>},{registered_name,[]},{error_info,{exit,{error,badarg},[{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}},{ancestors,[net_sup,kernel_sup,<0.9.0>]},{messages,[]},{links,[#Port<0.68>,<0.18.0>]},{dictionary,[{longnames,false}]},{trap_exit,true},{status,running},{heap_size,377},{stack_size,24},{reductions,442}],[]]}
{error_logger,{{2011,12,22},{10,26,33}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pid,undefined},{name,net_kernel},{mfa,{net_kernel,start_link,[['rabbit@dbx',shortnames]]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]}
{error_logger,{{2011,12,22},{10,26,33}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,shutdown},{offender,[{pid,undefined},{name,net_sup},{mfa,{erl_distribution,start_link,[]}},{restart_type,permanent},{shutdown,infinity},{child_type,supervisor}]}]}
{error_logger,{{2011,12,22},{10,26,33}},std_info,[{application,kernel},{exited,{shutdown,{kernel,start,[normal,[]]}}},{type,permanent}]}
{"Kernel pid terminated",application_controller,"{application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}}"}
Crash dump was written to: erl_crash.dump
Kernel pid terminated (application_controller)
({application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}})
Is there any other information that might be useful?
A small note,
When connecting to a remote Erlang node, in this case the rabbit node, you have to choose a different node name.
For example:
erl -sname foo
Once you are in the Erlang shell, you can try to connect to the rabbit node remotely using net_adm:ping.
-Alvaro.
Sent from my iFad
$ erl -sname qwer
Erlang R13B03 (erts-5.7.4) [source] [64-bit] [smp:4:4] [rq:4]
[async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.7.4 (abort with ^G)
(qwer@dbx)1> net_adm:names().
{ok,[{"rabbit",60040},{"qwer",58043}]}
(qwer@dbx)2> net_adm:ping(rabbit).
pang
Sent from my Nokia 1100
On 22/12/11 06:32, Dmitri Minaev wrote:
> Now, I have a hanging Rabbit available for the autopsy.
Please send us the output of 'rabbitmqctl report'.
Matthias.
Fifteen minutes ago another RabbitMQ (this time v. 2.7) on another
server also refused to accept connections. I downgraded it to the last
version that worked (if I remember correctly, it is 2.1) and started
again. Let's see if it helps.
--
On 22/12/11 10:17, Dmitri Minaev wrote:
> Here it is, in the attached file.
All looks fine. What memory and file descriptor limits get reported in
the rabbit log?
Can you get a connection established when connecting a client from the
same machine the broker is running on?
=INFO REPORT==== 13-Dec-2011::17:16:53 ===
Limiting to approx 924 file handles (829 sockets)
=INFO REPORT==== 13-Dec-2011::17:16:53 ===
Memory limit set to 11252MB.
No, I cannot connect to Rabbit even from the same server. As I said
before, TCP connection to 127.0.0.1 is established, but AMQP
connection is not established.
--
With best regards,
Dmitri Minaev
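A note on where the figures in the log excerpt above probably come from. This is a minimal sketch, not the broker's code: the reserve of 100 descriptors and the "90% minus 2" socket share are constants recalled from RabbitMQ's file_handle_cache module of that era and are assumptions here, and the function names are made up for illustration. With a `ulimit -n` of 1024 they reproduce the reported 924 file handles and 829 sockets, and a limit of 11252MB at a watermark of 0.7 is consistent with a machine that has about 16GB of RAM.

```python
# Assumed constants: recalled from RabbitMQ's file_handle_cache of that era,
# not confirmed anywhere in this thread. Treat them as illustrative.
RESERVED_FOR_OTHERS = 100  # descriptors held back for non-cache use


def file_handle_limit(ulimit_n):
    """Descriptors the broker budgets for itself out of `ulimit -n`."""
    return ulimit_n - RESERVED_FOR_OTHERS


def socket_limit(handle_limit):
    """Share of the budget available to sockets (about 90%, minus 2)."""
    return int(handle_limit * 0.9 - 2)


def implied_total_ram_mb(memory_limit_mb, watermark):
    """Total RAM implied by a memory limit and vm_memory_high_watermark."""
    return memory_limit_mb / watermark


handles = file_handle_limit(1024)
print(handles, socket_limit(handles))            # 924 829
print(round(implied_total_ram_mb(11252, 0.7)))   # 16074
```

The practical point is that the socket budget is noticeably smaller than the raw descriptor limit.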
On 22/12/11 10:43, Dmitri Minaev wrote:
> No, I cannot connect to Rabbit even from the same server. As I said
> before, TCP connection to 127.0.0.1 is established, but AMQP
> connection is not established.
On the broker machine, try
$ telnet localhost 5672
and type in 'AMQPxxxx<return>'
That should result in an output of
AMQP Connection closed by foreign host.
and a message in the rabbit.log like this:
=ERROR REPORT==== 22-Dec-2011::10:47:09 ===
exception on TCP connection <0.767.0> from [::1]:48915
{bad_version,120,120,120,120}
Do you get the same?
Matthias.
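The telnet check above can also be scripted, which makes it easier to tell a broker that rejects the bad header from one that accepts the TCP connection and then stays silent. The following is a rough sketch only: `probe_amqp` and its return labels are hypothetical names, and the 5 second timeout is an arbitrary choice.

```python
import socket


def probe_amqp(host, port, header=b"AMQPxxxx", timeout=5.0):
    """Send an 8-byte protocol header and classify how the broker reacts.

    Returns "header" if the peer answers with bytes starting with b"AMQP"
    (the healthy reaction to a bad version), "closed" if it hangs up
    without answering, "other" for any other reply, and "hung" if the TCP
    connection is established but nothing comes back before the timeout.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(header)
        try:
            reply = sock.recv(8)
        except socket.timeout:
            return "hung"      # connected, but the broker never answered
        except ConnectionError:
            return "closed"    # connection reset by the peer
    if not reply:
        return "closed"        # orderly close without any reply
    return "header" if reply.startswith(b"AMQP") else "other"
```

Calling `probe_amqp("localhost", 5672)` on the broker machine should give "header" on a healthy broker. A result of "hung" would match the symptom described in this thread.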
But some new messages did appear after Rabbit had stopped accepting
connections. They are mostly in the same vein:
=INFO REPORT==== 22-Dec-2011::11:03:55 ===
closing TCP connection <0.431.0> from 212.24.56.22:49702
=WARNING REPORT==== 22-Dec-2011::11:03:55 ===
exception on TCP connection <0.327.0> from 212.24.56.22:49699
connection_closed_abruptly
=INFO REPORT==== 22-Dec-2011::11:03:55 ===
closing TCP connection <0.327.0> from 212.24.56.22:49699
=WARNING REPORT==== 22-Dec-2011::11:03:55 ===
exception on TCP connection <0.976.0> from 212.24.56.22:49720
connection_closed_abruptly
=INFO REPORT==== 22-Dec-2011::11:03:55 ===
closing TCP connection <0.976.0> from 212.24.56.22:49720
=WARNING REPORT==== 22-Dec-2011::11:03:55 ===
exception on TCP connection <0.1007.0> from 212.24.56.22:49721
connection_closed_abruptly
=INFO REPORT==== 22-Dec-2011::11:03:55 ===
closing TCP connection <0.1007.0> from 212.24.56.22:49721
=WARNING REPORT==== 22-Dec-2011::11:03:55 ===
exception on TCP connection <0.1019.0> from 212.24.56.22:49722
connection_closed_abruptly
=INFO REPORT==== 22-Dec-2011::11:03:55 ===
closing TCP connection <0.1019.0> from 212.24.56.22:49722
By the way, here's another graph, number of connections to Rabbit.
Hope it helps.
--
This is all very mysterious.
Is the rabbit server process busy, cpu-wise?
Is there anything I could do using -remsh?
On 22 December 2011 15:09, Matthias Radestock <matt...@rabbitmq.com> wrote:
> On 22/12/11 11:02, Dmitri Minaev wrote:
>>
>> No, I get no response from the server and nothing appears in the log.
--
With best regards,
Dmitri Minaev
On 22/12/11 11:30, Dmitri Minaev wrote:
> No, it is simply waiting. CPU is idle, RAM is free, swap is unused, IO
> is very moderate.
I suspect this is a problem with Erlang, somehow causing the tcp/ip
sub-system to get into a weird state.
I see you are running R13B03, which is two years old. I suggest you
replace that with Ericsson's latest - R15B - and also run the latest
rabbit (2.7.1).
Regards,
Matthias.
Thank you, Matthias, Alvaro, Simon and everyone else!
--
With best regards,
Dmitri Minaev
I am sorry, but upgrade to Erlang R15B and RabbitMQ 2.7.1 did not
help. Today I saw the same picture: RabbitMQ does not accept new
connections, while everything else seems to be working :(
On 29/12/11 09:27, Dmitri Minaev wrote:
> I am sorry, but upgrade to Erlang R15B and RabbitMQ 2.7.1 did not
> help. Today I saw the same picture: RabbitMQ does not accept new
> connections, while everything else seems to be working :(
That's unfortunate. Though our investigation will be easier now that you
are running the latest version.
Is there any way you could give us access to the broken rabbit?
Also, while in the hung state, what does
$ scripts/rabbitmqctl eval 'file_handle_cache:info().'
return?
Regards,
Matthias.
I would be most grateful if you could have a look at our server. I
will send you more info in an off-list message. Thank you.
Here's the output of file_handle_cache:info():
$ sudo /usr/local/rabbitmq/sbin/rabbitmqctl eval 'file_handle_cache:info().'
[{obtain_count,51},{obtain_limit,829}]
...done.
--
With best regards,
Dmitri Minaev
On 30/12/11 08:27, Dmitri Minaev wrote:
> I would be most grateful if you could have a look at our server.
I have done this now.
Rabbit indeed wasn't accepting connections - it wasn't refusing them
either, i.e. it was behaving as if 'accept' hadn't been called.
The Erlang process tasked with accepting AMQP connections was alive and
well. It was simply sitting there waiting for the tcp subsystem to
notify it of new connections. Alas that never happened.
So either the acceptor process forgot to tell Erlang's tcp stack to be
notified of new connections, or Erlang's tcp stack forgot that it was
supposed to tell the acceptor process...
...and it looks like there is a path in the tcp_acceptor.erl that would
trigger the former. When the tcp stack notifies the acceptor of an error
other than 'closed', the acceptor carries on but does not invoke
prim_inet:async_accept/2 to be notified of the next connection attempt.
I will file a bug for this. Should be easy to fix, though we cannot be
certain that this is definitely the problem.
Obviously if this was happening frequently we would have heard about the
issue a long time ago - the code in question hasn't changed for >3
years. So there must be some rare circumstances triggering this.
I got the acceptor process to issue another async_accept, so rabbit is
happy for the moment. But no doubt the problem will re-occur.
Regards,
Matthias.
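To make the suspected failure mode concrete, here is a toy model of a one-shot asynchronous accept loop. It is a hypothetical illustration in Python, not RabbitMQ's tcp_acceptor and not Erlang's prim_inet. The `armed` flag stands in for an outstanding async accept request, and `some_error` stands in for any error other than 'closed'.

```python
class ToyAcceptor:
    """Toy model of a one-shot asynchronous accept loop.

    Hypothetical illustration only; this is not RabbitMQ's tcp_acceptor.
    """

    def __init__(self, rearm_on_error):
        self.rearm_on_error = rearm_on_error
        self.armed = True      # one accept request is outstanding
        self.accepted = []     # connections handed to the application
        self.errors = []       # errors reported by the simulated stack

    def notify(self, kind, payload=None):
        """Deliver one event from the simulated TCP stack.

        Returns False if no request was outstanding, i.e. the event is lost.
        """
        if not self.armed:
            return False
        self.armed = False                     # one request, one notification
        if kind == "connection":
            self.accepted.append(payload)
            self.armed = True                  # ask for the next connection
        elif kind == "closed":
            pass                               # listening socket gone; stop
        else:
            self.errors.append(kind)
            self.armed = self.rearm_on_error   # the step the buggy path skips
        return True


events = [("connection", "c1"), ("some_error", None), ("connection", "c2")]

buggy = ToyAcceptor(rearm_on_error=False)
fixed = ToyAcceptor(rearm_on_error=True)
for kind, payload in events:
    buggy.notify(kind, payload)
    fixed.notify(kind, payload)

print(buggy.accepted)  # ['c1']
print(fixed.accepted)  # ['c1', 'c2']
```

In the buggy variant a single unexpected error is enough to leave the acceptor alive but never notified again, which matches a process that is "simply sitting there waiting".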
On 30/12/11 12:11, Matthias Radestock wrote:
> When the tcp stack notifies the acceptor of an error
> other than 'closed', the acceptor carries on but does not invoke
> prim_inet:async_accept/2 to be notified of the next connection attempt.
>
> I will file a bug for this. Should be easy to fix, though we cannot be
> certain that this is definitely the problem.
Here's a proposed fix:
http://hg.rabbitmq.com/rabbitmq-server/rev/ca0392ca0fc1
I am attaching a tcp_acceptor.beam with that fix, compiled for R15, that
you can drop in place of the existing file. I'd be interested in a) whether
that solves the problem for you, and b) what error gets logged - watch out
for something like
=ERROR REPORT==== 30-Dec-2011::13:45:01 ===
failed to accept TCP connection on [::]:5672: some_error
in the logs.
Regards,
Matthias.
But I am still curious about the fact that until about two months ago
our experience with RabbitMQ was very good. We had version 2.1 or 2.2
then and it worked fine. The problems started when I moved Rabbit to
another server and upgraded it. Either the bug was introduced between
versions 2.2 and 2.6.1, or it is related to some changes in the server
environment. Is it possible to find out whether versions 2.1-2.2 were
also influenced by the problem?
--
With best regards,
Dmitri Minaev
On 30/12/11 15:34, Dmitri Minaev wrote:
> But I am still curious about the fact that until about two months ago
> our experience with RabbitMQ was very good. We had version 2.1 or 2.2
> then and it worked fine. The problems started when I moved Rabbit to
> another server and upgraded it. Either the bug was introduced between
> versions 2.2 and 2.6.1, or it is related to some changes in the server
> environment. Is it possible to find out whether versions 2.1-2.2 were
> also influenced by the problem?
As I said, the problem I identified has been around for more than three
years.
The likely trigger is some obscure condition in the network stack. So it
is quite conceivable that the move to a different server made this
happen with a much higher probability than before. For example, it looks
like you are now using IPv6 - was that the case before the move too?
Matthias.
I have installed that file and restarted Rabbit. Also, I asked the
developers whether they can stress-test the server. I will let you
know as soon as possible.
Thanks!
--
With best regards,
Dmitri Minaev
Finally, I can say that the attempt to solve the problem with the
modified tcp_acceptor has failed. For a couple of months Rabbit worked
well, even under moderate load (up to 8-9 million messages), but today
it has failed again with the same symptoms. Let me remind you of the
situation.
RabbitMQ v.2.7.1, running under Erlang R15B on Ubuntu Linux 10.04,
suddenly stops accepting AMQP connections. TCP connections are still
accepted, but no response follows. rabbitmqctl works.
The non-operating RabbitMQ server is now at my disposal for autopsy.
Hi.
> Finally, I can say that the attempt to solve the problem with the
> modified tcp_acceptor has failed. For a couple of months Rabbit worked
> well, even under moderate load (up to 8-9 million messages), but today
> it has failed again with the same symptoms.
Damn.
Did any error along the lines of "failed to accept TCP connection..."
appear in the logs?
> The nonoperating RabbitMQ server is now at my disposable for autopsy.
If I were able to look at this tomorrow that would be great.
Cheers, Simon
--
Simon MacMullen
RabbitMQ, VMware
On 6 March 2012 20:54, Simon MacMullen <si...@rabbitmq.com> wrote:
> On 06/03/12 06:51, Dmitri Minaev wrote:
>>
>> Dear friends,
--
With best regards,
Dmitri Minaev
So there are a few things going on here. The primary issue is that RabbitMQ
is running out of file descriptors because too many connections are being
opened, and for various reasons the internal accounting that is supposed
to prevent this from causing harm is getting out of sync with reality.
Improvements coming in the next release should help with this, and we
have some ideas for how to fix it altogether.
But in the meantime you should look at increasing the number of file
descriptors that are available to RabbitMQ.
Cheers, Simon
--
Simon MacMullen
RabbitMQ, VMware
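Before raising the limit, it can help to see how close the broker process actually is to it. The sketch below assumes Linux with /proc mounted and enough privileges to read the broker's /proc entries (the same user or root). The function names are hypothetical.

```python
import os


def parse_max_open_files(limits_text):
    """Extract the soft 'Max open files' limit from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: Max open files <soft> <hard> files
            return int(line.split()[3])
    raise ValueError("no 'Max open files' line found")


def fd_headroom(pid):
    """Return (open descriptors, soft limit) for a Linux process."""
    with open(f"/proc/{pid}/limits") as fh:
        soft = parse_max_open_files(fh.read())
    used = len(os.listdir(f"/proc/{pid}/fd"))
    return used, soft
```

For the broker in this thread that would be `fd_headroom(29711)`, using the beam.smp pid from the earlier ps output. Sampling it periodically would show whether descriptor usage really approaches the limit before a hang.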
Oh... Thanks. I thought that once a TCP connection is accepted, the number
of file descriptors should have no effect on what follows, since
the socket already exists. Even now I can connect to port 5672 on that
server, but AMQP does not respond.
If I understand correctly, the number of file descriptors used by
RabbitMQ in normal situation is roughly equal to the number of
`rabbitmqctl list_connections` + some constant (~30)? In case of that
hanging server, the number of AMQP connections never was close to the
FD limit (ulimit -n is 1024, fs.file-max = 1605698). The graph
reflecting the number of open AMQP connections is attached to this
message.
I think at this point it hasn't actually allocated the FD, so it can't
communicate.
> If I understand correctly, the number of file descriptors used by
> RabbitMQ in normal situation is roughly equal to the number of
> `rabbitmqctl list_connections` + some constant (~30)? In case of that
> hanging server, the number of AMQP connections never was close to the
> FD limit (ulimit -n is 1024, fs.file-max = 1605698). The graph
> reflecting the number of open AMQP connections is attached to this
> message.
It's not really a constant, but to a first approximation, yes.
Ultimately, the error I saw being passed up from the OS was ENFILE -
that's pretty unambiguous :)
It's possible that if you're churning connections then "closed"
connections in FIN_WAIT2 could account for the majority of the used FDs.
In 2.8.0 we'll set SO_LINGER to 0 to prevent this.
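One way to check whether half-closed connections are piling up on the AMQP port is to count sockets per TCP state. This is a sketch under assumptions: it parses text in the format of Linux's /proc/net/tcp (IPv6 sockets are listed separately in /proc/net/tcp6), the state codes follow the kernel's TCP state numbering, and `count_states` is a hypothetical helper name.

```python
from collections import Counter

# Hex state codes as used in /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(proc_net_tcp, local_port):
    """Count sockets bound to local_port, grouped by TCP state name."""
    counts = Counter()
    for line in proc_net_tcp.splitlines()[1:]:      # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        port_hex = fields[1].rsplit(":", 1)[1]      # local address is ADDR:PORT
        if int(port_hex, 16) == local_port:
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts
```

On the broker machine this could be used as `count_states(open("/proc/net/tcp").read(), 5672)`. A large FIN_WAIT2 count compared with ESTABLISHED would support the connection churn explanation.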
I am still not sure about the role of the file descriptors in this
event. Last Friday, our Rabbit died again. Until the very last moment
the number of open file descriptors as reported by the management
plugin was 143 out of a total of 32765 available. That was
RabbitMQ v.2.7.0. On the same day I upgraded to 2.8.1. I will
report on its behaviour later.
--
With best regards,
Dmitri Minaev
Good luck!
Hello, Maulik,
I believe your problem is a different one, because the bug has been gone since we upgraded to 2.8.7. The final message from the support service was:
> The Erlang/OTP team believe that the problem was due to a bug in their code. A small change introduced in RabbitMQ 2.8.7 had the coincidental fortunate effect of bypassing the bug, so the problem should not occur in 2.8.7 or later versions of RabbitMQ.
--
With best regards,
Dimitri Minaev