Upgraded to RabbitMQ 3.5.6 and continue to get erl_crash.dump and rabbitmq is unresponsive

1,018 views
Skip to first unread message

vitaly numenta

unread,
Oct 18, 2015, 7:41:17 AM10/18/15
to rabbitmq-users
Rabbitmq became completely unresponsive on my Mac OS X Yosemite machine today. I was running 3.5.1 at the time. `rabbitmqctl stop` hangs. Browser can't connect to the rabbitmq Web GUI. The log appears to get stuck on "

msg_store_persistent: rebuilding indices from scratch". So, I thought I'd upgrade to 3.5.6 (via homebrew). The upgrade went without any problems. However, rabbitmq is still "stuck". I rebooted the Mac and restarted rabbitmq, but the problem just won't go away.


Please help. The crash and log files are on google drive: 


Many thanks,

Vitaly



erl_crash.dump
rabbit@localhost.log

vitaly numenta

unread,
Oct 18, 2015, 7:48:58 AM10/18/15
to rabbitmq-users
The erl_crash.dump and rab...@localhost.log are attached with my prior post. The sasl log is on google drive at https://drive.google.com/file/d/0B826DlZRam76OTBrZ3V3NmNKcW8/view?usp=sharing

Alvaro Videla

unread,
Oct 18, 2015, 7:58:27 AM10/18/15
to rabbitm...@googlegroups.com
The sasl-log shows that RabbitMQ is not able to obtain information on how much disk space is available on your machine.

What's the output of this command on your machine:

/bin/df -kP /path/to/mnesia/dir

Also, is the user running rabbitmq able to un said command against that folder?

On Sun, Oct 18, 2015 at 1:48 PM, vitaly numenta <vitaly.kru...@gmail.com> wrote:
The erl_crash.dump and rab...@localhost.log are attached with my prior post. The sasl log is on google drive at https://drive.google.com/file/d/0B826DlZRam76OTBrZ3V3NmNKcW8/view?usp=sharing

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Michael Klishin

unread,
Oct 18, 2015, 8:08:35 AM10/18/15
to rabbitm...@googlegroups.com, vitaly numenta
On 18 Oct 2015 at 14:41:24, vitaly numenta (vitaly.kru...@gmail.com) wrote:
> However, rabbitmq is still "stuck". I rebooted the Mac and restarted
> rabbitmq, but the problem just won't go away.

Check hostname resolution: does your machine’s hostname resolve to 127.0.0.1, via DNS or /etc/hosts?

On OS X it is very common to see similar issues (servers or clients “get stuck”) when VPN is in effect
because it changes the resolver.

Another possible cause for broker not finishing boot is various security mechanisms preventing
it from binding to ports, although that should much more quickly than hostname lookups .
--
MK

Staff Software Engineer, Pivotal/RabbitMQ

vitaly numenta

unread,
Oct 18, 2015, 4:15:19 PM10/18/15
to rabbitmq-users

Hi Alvaro, Michael, et al,



here are the results:


1. This problem is happening with the VPN OFF. I am aware of issues when running with VPN, as experienced personally and discussed on numerous threads here. RabbtMQ would slow down severely with VPN ON, but then would recover once the VPN was turned OFF. Every now and then, it wouldn't recover after turning VPN OFF, and I would have to kill it, and then restart rabbit to get back to normal. This time is different: RabbitMQ seems "stuck" all the time regardless of VPN ON/OFF and rebooting the Mac OS X 10.10.5 doesn't help at all.


2. This all has been running fine for a long time with 3.5.1, except the pesky erl_crash.dump discussed in the thread https://groups.google.com/forum/#!searchin/rabbitmq-users/erl_crash.dump%7Csort:date/rabbitmq-users/3oYTb9feQr8/iSBiRGy8tZ4J. The problem in the current thread started suddenly this night, after successfully running tests and code for much of the preceding day. Now, every time I execute `rabbitmq-server -detached`, erl_crash.dump shows up in the current directory. Per referenced thread, the regular erl_crash dumps were going to be addressed. I installed 3.5.6 in hopes that it would help recover from the stuck state (before posting this thread), but it didn't help.  However, I can't tell whether the erl_crash dump in the other thread is related to the one in this thread.


3.

$ /bin/df -kP /usr/local/var/lib/rabbitmq/mnesia/

Filesystem 1024-blocks      Used Available Capacity  Mounted on

/dev/disk1   487385236 154144276 332984960    32%    /


4.
 I start rabbitmq via `rabbitmq-server -detached` from my own login user account - the same account from which I ran the above `/bin/df ...` command.

5. I believe that my hostname resolves via /etc/hosts or /private/etc/hosts, which both have the exact same content on my machine:
=====
$ cat /etc/hosts
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting.  Do not change this entry.
##
127.0.0.1 localhost vkruglikovs-MacBook-Pro.local
255.255.255.255 broadcasthost
::1             localhost vkruglikovs-MacBook-Pro.local
=====

$ hostname 

vkruglikovs-MacBook-Pro.local


This shows that the pure DNS resolver doesn't see it:

$ host `hostname`

Host vkruglikovs-MacBook-Pro.local not found: 3(NXDOMAIN)


But the "official" system resolver on Mac OS X sees it just fine (aided by the hosts file, I guess):

$ dscacheutil -q host -a name `hostname`

name: localhost

alias: vkruglikovs-macbook-pro.local 

ipv6_address: ::1


name: localhost

alias: vkruglikovs-macbook-pro.local 

ip_address: 127.0.0.1


And ping works, too:

$ ping `hostname`

PING localhost (127.0.0.1): 56 data bytes

64 bytes from 127.0.0.1: icmp_seq=0 ttl=64 time=0.065 ms


And in Python:

In [7]: socket.gethostname()

Out[7]: 'vkruglikovs-MacBook-Pro.local'

In [8]: socket.gethostbyname('vkruglikovs-MacBook-Pro.local')

Out[8]: '127.0.0.1'


vitaly numenta

unread,
Oct 18, 2015, 4:39:35 PM10/18/15
to rabbitmq-users
When I start rabbitmq without `-detached`, the console output gets stuck here:

$ rabbitmq-server 

              RabbitMQ 3.5.6. Copyright (C) 2007-2015 Pivotal Software, Inc.
  ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
  ##  ##
  ##########  Logs: /usr/local/var/log/rabbitmq/rab...@localhost.log
  ######  ##        /usr/local/var/log/rabbitmq/rab...@localhost-sasl.log
  ##########
              Starting broker...

Michael Klishin

unread,
Oct 18, 2015, 4:54:07 PM10/18/15
to rabbitm...@googlegroups.com, vitaly numenta
Vitaly,

This is not a particularly productive way of investigating what blocks
a process. Tracing with dtrace (or strace, or similar)  is.

Michael Klishin

unread,
Oct 18, 2015, 4:56:01 PM10/18/15
to rabbitm...@googlegroups.com, vitaly numenta
 On 18 October 2015 at 23:39:39, vitaly numenta (vitaly.kru...@gmail.com) wrote:
> 2. This all has been running fine for a long time with 3.5.1, except
> the pesky erl_crash.dump discussed in the thread https://groups.google.com/forum/#!searchin/rabbitmq-users/erl_crash.dump%7Csort:date/rabbitmq-users/3oYTb9feQr8/iSBiRGy8tZ4J.
> The problem in the current thread started suddenly this night,
> after successfully running tests and code for much of the preceding
> day. Now, every time I execute `rabbitmq-server -detached`,
> erl_crash.dump shows up in the current directory. Per referenced
> thread, the regular erl_crash dumps were going to be addressed.
> I installed 3.5.6 in hopes that it would help recover from the
> stuck state (before posting this thread), but it didn't help.
> However, I can't tell whether the erl_crash dump in the other
> thread is related to the one in this thread.

The issue in the linked thread is in no way related to RabbitMQ server being blocked
on start. Dumps there come from rabbitmqctl, not rabbitmq-server.

Have you tried copying RabbitMQ's database directory and deleting it at its original location
to see if it can start as a virgin node?

Alvaro Videla

unread,
Oct 18, 2015, 5:01:32 PM10/18/15
to rabbitm...@googlegroups.com
You seem to have two different errors on sasl logs:

1) the disk monitor is not able to tell how much disk space is available, crashes, and then restarts, and crashes and so on.

2) The rabbit.erl file (which is the file that starts the broker and all its subcomponets) fails here: https://github.com/rabbitmq/rabbitmq-server/blob/rabbitmq_v3_5_6/src/rabbit.erl#L943 which begs the question if you by chance didn't end up with a garbled update at some point? I ask that because your error logs say:

... omitted lines ...
  exception exit: {{badmatch,{error,enoent}},
                     [{filename,absname,1,[{file,"filename.erl"},{line,67}]},
                      {rabbit,config_files,0,
                          [{file,"src/rabbit.erl"},{line,863}]},
... omitted lines ...

line 863 on rabbit.erl on the 3.5.6 tag doesn't call the config_files/0 function, nor the filename:absname/1 function. That happens on line 943: https://github.com/rabbitmq/rabbitmq-server/blob/stable/src/rabbit.erl#L943, but on the other hand, on version 3.5.1 the call to file:absname/1 was indeed on line 863 https://github.com/rabbitmq/rabbitmq-server/blob/rabbitmq_v3_5_1/src/rabbit.erl#L863

So perhaps make sure you have a clean install/upgrade? 

BTW, the file:absname seems to fail to find the rabbitmq config files.



--

vitaly numenta

unread,
Oct 18, 2015, 5:13:56 PM10/18/15
to rabbitmq-users
Being concerned about bad sw, I uninstalled rabbitmq and erlang, and then re-installed rabbitmq. (`brew install rabbitmq`). Rebooted the machine. The problem didn't go away. I subsequently got rid of the mnesia repository/logs (didn't care about contents on this development machine) and restarted rabbitmq. After this, rabbitmq was able to complete the startup sequence and appears to be functional.

Could it be that the log files became corrupted and rabbitmq couldn't deal with that in a graceful way? Makes me wish I kept a copy of the log files for debugging. I will save them if I run into this again.

Michael, thanks for the dtrace suggestion. I will try it next time things go awry.

Thank you,
Vitaly

vitaly numenta

unread,
Nov 5, 2015, 9:56:09 PM11/5/15
to rabbitmq-users
The problem is back, now entirely on 3.5.6. If you recall, last time rabbitmq got into this predicament, deleting the mnesia database files "fixed" the problem. Of course, this is not a desirable solution. I don't recall ever having this problem with 3.4.x, so it may be new to the 3.5.x series.

Here is how things got to this point today:

Running on Mac OS X Yosemite 10.10.5

I had both rabbitmq and VPN running for many hours today. I wasn't using rabbitmq for most of the day. Then, I tried running a test, and it was timing out while trying to connect to rabbitmq.

At this point, trying to load http://localhost:15672/ displays "This webpage is not available".

I waited for a while, then ran `rabbitmqctl stop`. It completed this time, and didn't seem to complain.

I disconnected from VPN.

I restarted rabbitmq via `rabbitmq-server -detatched`, but connecting to rabbitmq continued to timet out over and over.

I ran `rabbitmqctl stop`, but this time it hung (didn't return). After a while, I exited it via CTL-C.

vkruglikov@vkruglikovs-MacBook-Pro-2:~$ ps aux | grep -i rabbit

root            19212   1.9  0.6  2672648  97024 s004  S     6:00PM   0:34.58 /usr/local/Cellar/rabbitmq/3.5.6/erts-6.4/bin/../../erts-6.4/bin/beam.smp -W w -A 64 -P 1048576 -K true -B i -- -root /usr/local/Cellar/rabbitmq/3.5.6/erts-6.4/bin/../.. -progname erl -- -home /Users/vkruglikov -- -pa /usr/local/Cellar/rabbitmq/3.5.6/ebin -noshell -noinput -s rabbit boot -sname rabbit@localhost -boot /usr/local/Cellar/rabbitmq/3.5.6/releases/3.5.6/start_sasl -kernel inet_default_connect_options [{nodelay,true}] -rabbit tcp_listeners [{"127.0.0.1",5672}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit error_logger {file,"/usr/local/var/log/rabbitmq/rab...@localhost.log"} -rabbit sasl_error_logger {file,"/usr/local/var/log/rabbitmq/rab...@localhost-sasl.log"} -rabbit enabled_plugins_file "/usr/local/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir "/usr/local/Cellar/rabbitmq/3.5.6/plugins" -rabbit plugins_expand_dir "/usr/local/var/lib/rabbitmq/mnesia/rabbit@localhost-plugins-expand" -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/usr/local/var/lib/rabbitmq/mnesia/rabbit@localhost" -kernel inet_dist_listen_min 25672 -kernel inet_dist_listen_max 25672

root            19209   0.0  0.0  2467612    388   ??  S     6:00PM   0:00.04 /usr/local/Cellar/rabbitmq/3.5.6/erts-6.4/bin/../../erts-6.4/bin/epmd -daemon



I killed both rabbit processes via `kill PID`, and restarted `rabbitmq-server -detached` again. Ended up in the same unresponsive state.


LOGS:

dtrace output of running `sudo dtruss rabbitmq-server` after killing both rabbit processes: https://gist.github.com/vitaly-krugl/0e069eecf19db31bd24c


dtrace output for /usr/local/Cellar/rabbitmq/3.5.6/erts-6.4/bin/../../erts-6.4/bin/beam.smp log: https://gist.github.com/vitaly-krugl/739ab8b4e146c31741e1


dtrace output for /usr/local/Cellar/rabbitmq/3.5.6/erts-6.4/bin/../../erts-6.4/bin/epmd -daemon: https://gist.github.com/vitaly-krugl/fc63a0d4d336f7ebc7d7



vitaly numenta

unread,
Nov 5, 2015, 10:06:36 PM11/5/15
to rabbitmq-users
Got this crash dump on subsequent attempt to restart rabbitmq via `rabbitmq-server -detached`: https://gist.github.com/vitaly-krugl/6f091d1c84e078f4303a. This restart was performed after killing both rabbit processes (using kill, since `rabbitmqctl stop` just hang).


ps shows that only one rabbit process remains running after attempted restart:

vkruglikov@vkruglikovs-MacBook-Pro-2:~$ ps aux | grep -i rabbit

vkruglikov      20213   0.0  0.0  2432772    680 s003  S+    6:59PM   0:00.00 grep -i rabbit

vkruglikov      20165   0.0  0.0  2459420    372   ??  S     6:57PM   0:00.00 /usr/local/Cellar/rabbitmq/3.5.6/erts-6.4/bin/../../erts-6.4/bin/epmd -daemon

vkruglikov@vkruglikovs-MacBook-Pro-2:~$



vitaly numenta

unread,
Nov 5, 2015, 10:10:44 PM11/5/15
to rabbitmq-users
Just as before, after renaming /usr/local/var/lib/rabbitmq/mnesia, I was able to restart rabbitmq and start using it again successfully. I saved the old mnesia directory. Please let me know if you need to see any parts of it.

Michael Klishin

unread,
Nov 5, 2015, 10:58:53 PM11/5/15
to rabbitm...@googlegroups.com
SASL log is full of nxdomain errors which is a domain resolution failure. This may be a red herring but is specific to your environment (VPN affects hostname resolution)

Plus the node tries to rebuild indices from scratch.

How much data was there on disk at the time of restart?
--

Michael Klishin

unread,
Nov 5, 2015, 10:59:29 PM11/5/15
to rabbitm...@googlegroups.com
All of it.

vitaly numenta

unread,
Nov 6, 2015, 1:21:37 AM11/6/15
to rabbitmq-users

There was very little in mnesia at the time that rabbitmq was unresponsive:


vkruglikov@vkruglikovs-MacBook-Pro-2:/usr/local/var/lib/rabbitmq (master)$ du -sh UNRESPONSIVE-MNESIA-OLD/

4.1M UNRESPONSIVE-MNESIA-OLD/

vkruglikov@vkruglikovs-MacBook-Pro-2:/usr/local/var/lib/rabbitmq (master)$ ls UNRESPONSIVE-MNESIA-OLD/

rabbit@localhost                rabbit@localhost-plugins-expand rab...@localhost.pid

vkruglikov@vkruglikovs-MacBook-Pro-2:/usr/local/var/lib/rabbitmq (master)$ 



And there is a ton of free disk space on the system:

$ df -h /

Filesystem   Size   Used  Avail Capacity  iused    ifree %iused  Mounted on

/dev/disk1  465Gi  153Gi  312Gi    33% 40125538 81720769   33%   /


Michael Klishin

unread,
Nov 6, 2015, 1:30:45 AM11/6/15
to rabbitm...@googlegroups.com, vitaly numenta
On 6 November 2015 at 09:21:40, vitaly numenta (vitaly.kru...@gmail.com) wrote:
> There was very little in mnesia at the time that rabbitmq was unresponsive:
 >
> vkruglikov@vkruglikovs-MacBook-Pro-2:/usr/local/var/lib/rabbitmq
> (master)$ du -sh UNRESPONSIVE-MNESIA-OLD/
>
>
> 4.1M UNRESPONSIVE-MNESIA-OLD/

Can you please archive it and send it to us (off list, if necessary)?

vitaly numenta

unread,
Nov 6, 2015, 1:37:45 AM11/6/15
to rabbitmq-users, vitaly.kru...@gmail.com

Alvaro Videla

unread,
Nov 6, 2015, 4:41:18 AM11/6/15
to rabbitm...@googlegroups.com, vitaly numenta
You have the error {error,nxdomain} which might be caused by hostname resolution. This could be related to using the VPN, or it could be related to a lack of file descriptors. Erlang starts a new connection in order to see if it can resolve the hostname, so if you don't have enough file descriptors, then it will fail with the nxdomain error.

About the VPN, if the VPN changes the machine name, and then you switch the VPN off, and the machine name goes back to normal, then it's quite possible that it won't start.

On Fri, Nov 6, 2015 at 7:37 AM, vitaly numenta <vitaly.kru...@gmail.com> wrote:

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.

Michael Klishin

unread,
Nov 6, 2015, 5:06:54 AM11/6/15
to rabbitm...@googlegroups.com, Alvaro Videla, vitaly numenta
Álvaro,

Can we try restoring from the database archive
provided by Vitaly? I think I've seen at least one
similar report.

In particular, rebuilding indices that takes forever
on a small data set doesn't sound right.

MK

Alvaro Videla

unread,
Nov 6, 2015, 6:35:04 AM11/6/15
to rabbitm...@googlegroups.com
I can reproduce the issue, but I need to get some things ready before diving into this.

vitaly numenta

unread,
Nov 6, 2015, 11:55:08 AM11/6/15
to rabbitmq-users, vitaly.kru...@gmail.com
Alvaro,

once it gets into this state, the unresponsive state persists even after stopping VPN; and even after rebooting the machine and not starting VPN at all.

Alvaro Videla

unread,
Nov 9, 2015, 8:39:54 PM11/9/15
to rabbitm...@googlegroups.com, vitaly numenta
Hi,

After some debugging, this is what I've found:

The server is blocked during startup because the index is trying to recover, but it fails since it can't open the index journal due to lack of file descriptors.

What's your ulimit -n value?

In my Mac that value is 256 by default, but once I bumped it to 1024, then I could start RabbitMQ using the Mnesia folder you shared above.

I'll try to dig deeper into this so we are able to report the right error message back, and most probably make the broker start fail, instead of just hang.

Regards,

Alvaro

vitaly numenta

unread,
Nov 9, 2015, 8:46:54 PM11/9/15
to rabbitmq-users, vitaly.kru...@gmail.com
Hi Alvaro,

 +10 for reporting/failing and not hanging.

My `ulimit -n` is 256 as well. I will restore the amnesia files and try both with the current as well as bumped up limits, and post my results here.

vitaly numenta

unread,
Nov 9, 2015, 8:54:50 PM11/9/15
to rabbitmq-users, vitaly.kru...@gmail.com
Alvaro, what is the appropriate way to set and persist the desired `ulimit -n` setting on Mac OS X? I did some googling, and there appears to be be confusion about how to do this on the Mac.

Thx.

Alvaro Videla

unread,
Nov 9, 2015, 9:11:32 PM11/9/15
to rabbitm...@googlegroups.com, vitaly numenta

vitaly numenta

unread,
Nov 10, 2015, 1:42:23 AM11/10/15
to rabbitmq-users, vitaly.kru...@gmail.com
Hi Alvaro, I have good news and bad news - when I set ulimit to 1024 via rabbitmq-env.conf and restart RabbitMQ with the "unresponsive" mnesia database, it appears to start up okay, with the exception of these two WARNINGS:

=WARNING REPORT==== 9-Nov-2015::22:18:24 ===

msg_store_persistent: rebuilding indices from scratch


=WARNING REPORT==== 9-Nov-2015::22:18:24 ===

Mnesia(rabbit@localhost): ** WARNING ** Mnesia is overloaded: {dump_log,write_threshold}



However, if I gracefully stop RabbitMQ (`rabbitmqctl stop`), then set `ulimit -n` back to the default (256) and restart RabbitMq (`rabbitmq-server -detached`), it *hangs* again.


My expectation was that once Mnesia indices are rebuilt, RabbitMQ would be able to work fine with the default `ulimit -n` of 256 after that.


The logs with `ulimit -n`=1024 and subsequent graceful restart with `ulimit -n`=256 are below


Starting with "UNRESPONSIVE" mnesia database and `ulimit -n`=1024 ==> NO HANG

=INFO REPORT==== 9-Nov-2015::22:17:58 ===

Halting Erlang VM


=INFO REPORT==== 9-Nov-2015::22:18:23 ===

Starting RabbitMQ 3.5.6 on Erlang 17.5

Copyright (C) 2007-2015 Pivotal Software, Inc.

Licensed under the MPL.  See http://www.rabbitmq.com/


=INFO REPORT==== 9-Nov-2015::22:18:23 ===

node           : rabbit@localhost

home dir       : /Users/vkruglikov

config file(s) : /usr/local/etc/rabbitmq/rabbitmq.config (not found)

cookie hash    : weyu9iv/l8eIOTGaIPFcYw==

log            : /usr/local/var/log/rabbitmq/rab...@localhost.log

sasl log       : /usr/local/var/log/rabbitmq/rab...@localhost-sasl.log

database dir   : /usr/local/var/lib/rabbitmq/mnesia/rabbit@localhost


=INFO REPORT==== 9-Nov-2015::22:18:23 ===

Memory limit set to 6006MB of 15015MB total.


=INFO REPORT==== 9-Nov-2015::22:18:23 ===

Disk free limit set to 50MB


=INFO REPORT==== 9-Nov-2015::22:18:23 ===

Limiting to approx 924 file handles (829 sockets)


=INFO REPORT==== 9-Nov-2015::22:18:23 ===

FHC read buffering:  ON

FHC write buffering: ON


=INFO REPORT==== 9-Nov-2015::22:18:23 ===

Priority queues enabled, real BQ is rabbit_variable_queue


=INFO REPORT==== 9-Nov-2015::22:18:23 ===

Management plugin: using rates mode 'basic'

 

=INFO REPORT==== 9-Nov-2015::22:18:24 ===

msg_store_transient: using rabbit_msg_store_ets_index to provide index


=INFO REPORT==== 9-Nov-2015::22:18:24 ===

msg_store_persistent: using rabbit_msg_store_ets_index to provide index


=WARNING REPORT==== 9-Nov-2015::22:18:24 ===

msg_store_persistent: rebuilding indices from scratch


=WARNING REPORT==== 9-Nov-2015::22:18:24 ===

Mnesia(rabbit@localhost): ** WARNING ** Mnesia is overloaded: {dump_log,write_threshold}


=INFO REPORT==== 9-Nov-2015::22:18:24 ===

started TCP Listener on 127.0.0.1:5672


=INFO REPORT==== 9-Nov-2015::22:18:24 ===

rabbit_stomp: default user 'guest' enabled


=INFO REPORT==== 9-Nov-2015::22:18:24 ===

started STOMP TCP Listener on [::]:61613


=INFO REPORT==== 9-Nov-2015::22:18:24 ===

started MQTT TCP Listener on [::]:1883


=INFO REPORT==== 9-Nov-2015::22:18:24 ===

Management plugin started. Port: 15672


=INFO REPORT==== 9-Nov-2015::22:18:24 ===

Statistics database started.


=INFO REPORT==== 9-Nov-2015::22:18:24 ===

Server startup complete; 10 plugins started.




Restarting gracefully after above, but with `ulimit -n` = 256 ==> HANG AGAIN

=INFO REPORT==== 9-Nov-2015::22:32:13 ===

Halting Erlang VM


=INFO REPORT==== 9-Nov-2015::22:32:16 ===

Starting RabbitMQ 3.5.6 on Erlang 17.5

Copyright (C) 2007-2015 Pivotal Software, Inc.

Licensed under the MPL.  See http://www.rabbitmq.com/


=INFO REPORT==== 9-Nov-2015::22:32:16 ===

node           : rabbit@localhost

home dir       : /Users/vkruglikov

config file(s) : /usr/local/etc/rabbitmq/rabbitmq.config (not found)

cookie hash    : weyu9iv/l8eIOTGaIPFcYw==

log            : /usr/local/var/log/rabbitmq/rab...@localhost.log

sasl log       : /usr/local/var/log/rabbitmq/rab...@localhost-sasl.log

database dir   : /usr/local/var/lib/rabbitmq/mnesia/rabbit@localhost


=INFO REPORT==== 9-Nov-2015::22:32:16 ===

Memory limit set to 6023MB of 15059MB total.


=INFO REPORT==== 9-Nov-2015::22:32:16 ===

Disk free limit set to 50MB 


=INFO REPORT==== 9-Nov-2015::22:32:16 ===

Limiting to approx 156 file handles (138 sockets)


=INFO REPORT==== 9-Nov-2015::22:32:16 ===

FHC read buffering:  ON

FHC write buffering: ON


=INFO REPORT==== 9-Nov-2015::22:32:16 ===

Priority queues enabled, real BQ is rabbit_variable_queue


=INFO REPORT==== 9-Nov-2015::22:32:16 ===

Management plugin: using rates mode 'basic' 


=INFO REPORT==== 9-Nov-2015::22:32:16 ===

msg_store_transient: using rabbit_msg_store_ets_index to provide index


=INFO REPORT==== 9-Nov-2015::22:32:16 ===

msg_store_persistent: using rabbit_msg_store_ets_index to provide index


=WARNING REPORT==== 9-Nov-2015::22:32:16 ===

Mnesia(rabbit@localhost): ** WARNING ** Mnesia is overloaded: {dump_log,write_threshold}


=WARNING REPORT==== 9-Nov-2015::22:32:16 ===

Mnesia(rabbit@localhost): ** WARNING ** Mnesia is overloaded: {dump_log,write_threshold}


=WARNING REPORT==== 9-Nov-2015::22:32:16 ===

Mnesia(rabbit@localhost): ** WARNING ** Mnesia is overloaded: {dump_log,write_threshold}


=WARNING REPORT==== 9-Nov-2015::22:32:16 ===

Mnesia(rabbit@localhost): ** WARNING ** Mnesia is overloaded: {dump_log,write_threshold}

vitaly numenta

unread,
Nov 10, 2015, 1:51:30 AM11/10/15
to rabbitmq-users, vitaly.kru...@gmail.com
After rebuilding the indices on this database with `ulimit -n` = 1024, the number of file descriptors remains at over 600 *without any AMQP connections* besides the Web GUI management plugin. It's also over 600 on subsequent graceful restarts. However, when running on a recently-created dataset that doesn't exhibit this issue, the WebGUI displays the number of file descriptors as a little over 80.

Alvaro Videla

unread,
Nov 10, 2015, 1:53:19 AM11/10/15
to rabbitmq-users, vitaly.kru...@gmail.com
The problem is not related to index rebuild, but to RabbitMQ running out of file descriptors. Each queue in the system needs one FD for the queue index journal. So what you see is expected behavior so to speak.
--

Alvaro Videla

unread,
Nov 10, 2015, 1:54:54 AM11/10/15
to rabbitmq-users, vitaly.kru...@gmail.com
i.e.: the problem is about the number of queues in the system vs. the number of available FDs.

vitaly numenta

unread,
Nov 10, 2015, 2:11:36 AM11/10/15
to rabbitmq-users, vitaly.kru...@gmail.com
Alvaro, thank you for the clarification. I recall reading somewhere in the docs that RabbitMQ maintains a cache of open file descriptors, but is able to close them as needed in order to make resources available for other operations. Did I get it wrong? If this many queue (I had almost 600) requires so many file descriptors (far above 256 that my system allowed originally), then how was RabbitMQ even able to create so many queue in the first place without running out of file descriptors?


Alvaro Videla

unread,
Nov 10, 2015, 2:20:06 AM11/10/15
to rabbitmq-users, vitaly.kru...@gmail.com
The queue index journal FD is not created right away during queue creation.
On Tue, Nov 10, 2015 at 8:11 AM vitaly numenta <vitaly.kru...@gmail.com> wrote:
Alvaro, thank you for the clarification. I recall reading somewhere in the docs that RabbitMQ maintains a cache of open file descriptors, but is able to close them as needed in order to make resources available for other operations. Did I get it wrong? If this many queue (I had almost 600) requires so many file descriptors (far above 256 that my system allowed originally), then how was RabbitMQ even able to create so many queue in the first place without running out of file descriptors?


vitaly numenta

unread,
Nov 10, 2015, 2:25:49 AM11/10/15
to rabbitmq-users, vitaly.kru...@gmail.com
Great, thank you! Indeed, after deleting a bunch of queues, I was able to restart RabbitMQ successfully with `ulimit -n` = 256.

It would be awesome if RabbitMQ could log something really informative and fail, instead of hanging in this scenario.

Many thanks!

Vitaly

Michael Klishin

unread,
Nov 10, 2015, 2:44:58 AM11/10/15
to rabbitm...@googlegroups.com, vitaly.kru...@gmail.com
What if RabbitMQ needs more than 256 file handles (which includes sockets) overall?

Sorry but this expectation is unreasonable.
No developer environment should have such low limits: it is dirt cheap to bump that to 100K and forget about this issue forever.

ulimit defaults in modern OS X and Linux
versions are absolutely insane, in particular for running data services. Use them at your own peril, in development environments included.

Michael Klishin

unread,
Nov 10, 2015, 2:47:01 AM11/10/15
to rabbitm...@googlegroups.com, vitaly.kru...@gmail.com
Logging is a good idea if we can do it.
I'm much less convinced about nodes stopping.

vitaly numenta

unread,
Nov 10, 2015, 2:50:52 AM11/10/15
to rabbitmq-users, vitaly.kru...@gmail.com
Hi Michael, thanks for chiming in. I only mentioned 256 as confirmation that things work as expected, and there aren't any other problems. I get it now. I bumped up the limit to a much higher number. What caught me off-guard is that I've been running RabbitMQ for about 4 years now with lots of queues and have never run into this issue until recently.

Best regards,
Vitaly

Michael Klishin

unread,
Nov 10, 2015, 2:51:45 AM11/10/15
to rabbitm...@googlegroups.com, vitaly.kru...@gmail.com
Having a cache does not mean you would never need more than 256 file descriptors per process.

Please bump up the number to 100-500K in all environments you have and move on.

Not having enough file descriptors is not a problem data services such as RabbitMQ should be solving for developers and ops.
PostgreSQL, Riak, Cassandra docs all talk about bumping kernel limits. So does RabbitMQ's. And it is not a coincidence.

On 10 nov 2015, at 10:11, vitaly numenta <vitaly.kru...@gmail.com> wrote:

Alvaro, thank you for the clarification. I recall reading somewhere in the docs that RabbitMQ maintains a cache of open file descriptors, but is able to close them as needed in order to make resources available for other operations. Did I get it wrong? If this many queue (I had almost 600) requires so many file descriptors (far above 256 that my system allowed originally), then how was RabbitMQ even able to create so many queue in the first place without running out of file descriptors?


vitaly numenta

unread,
Nov 10, 2015, 2:53:10 AM11/10/15
to rabbitmq-users, vitaly.kru...@gmail.com
Yes, an informative log message would be awesome all by itself.

vitaly numenta

unread,
Nov 10, 2015, 2:59:01 AM11/10/15
to rabbitmq-users, vitaly.kru...@gmail.com
Thanks Michael. I get it. I don't expect RabbitMQ to optimize for low file descriptor limits. You guys are doing an amazing job. I am personally always amazed, pleased, and surprised that you and Alvaro can be so responsive on this list.

Michael Klishin

unread,
Nov 10, 2015, 3:18:36 AM11/10/15
to rabbitm...@googlegroups.com, vitaly numenta
On 10 November 2015 at 10:53:20, vitaly numenta (vitaly.kru...@gmail.com) wrote:
> Yes, an informative log message would be awesome all by itself.

Issue filed:
https://github.com/rabbitmq/rabbitmq-server/issues/413
Reply all
Reply to author
Forward
0 new messages