Stopping and starting RabbitMQ Server breaks it on SUSE Linux Enterprise Server 11 SP1


Matti Linnanvuori

Feb 4, 2015, 6:01:25 AM
to rabbitm...@googlegroups.com
Stopping and starting RabbitMQ Server breaks it on SUSE Linux Enterprise Server 11 SP1. Both versions 3.4.3-1.suse and 3.1.1-1.suse.noarch show the same behaviour with a certain timing. I am using a Supervisor script to restart RabbitMQ Server.
rabbitmqctl status then shows

'Error: unable to connect to node 'rabbit@linux-vot7': nodedown'

 
/etc/init.d/rabbitmq-server start stalls waiting for /var/run/rabbitmq/pid that is missing.
rabbitmq 3171 0.0 0.0 10556 660 ? S 06:52 0:00 /usr/lib64/erlang/erts-6.3/bin/epmd -daemon

root 6233 0.0 0.0 11148 1336 ? S 06:58 0:00 _ /bin/bash /opt/services/rabbitmq-runner.sh
root 6235 0.0 0.0 11152 1472 ? S 06:58 0:00 | _ /bin/sh /etc/init.d/rabbitmq-server start
root 6319 0.0 0.0 11148 1404 ? S 06:58 0:00 | _ /bin/sh /usr/sbin/rabbitmqctl wait /var/run/rabbitmq/pid
root 6330 0.0 0.0 48336 1384 ? S 06:58 0:00 | _ su rabbitmq -s /bin/sh -c /usr/lib/rabbitmq/bin/rabbitmqctl "wait" "/var/run/rabbitmq/pid"
rabbitmq 6331 0.0 0.0 41184 14016 ? Sl 06:58 0:03 | _ /usr/lib64/erlang/erts-6.3/bin/beam -- -root /usr/lib64/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.4.3/sbin/../ebin -noshell -noinput -hidden -boot start_clean -sasl errlog_type error -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@linux-vot7" -s rabbit_control_main -nodename rabbit@linux-vot7 -extra wait /var/run/rabbitmq/pid

Michael Klishin

Feb 4, 2015, 6:03:37 AM
to Matti Linnanvuori, rabbitm...@googlegroups.com
On 4 February 2015 at 14:01:27, Matti Linnanvuori (matt...@gmail.com) wrote:
> 'Error: unable to connect to node 'rabbit@linux-vot7': nodedown'

Does linux-vot7 resolve to 127.0.0.1? 
--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Matti Linnanvuori

Feb 4, 2015, 6:08:06 AM
to rabbitm...@googlegroups.com, matt...@gmail.com
No, linux-vot7 does not resolve to 127.0.0.1 but to another IP address.

Michael Klishin

Feb 4, 2015, 6:24:16 AM
to Matti Linnanvuori, rabbitm...@googlegroups.com
On 4 February 2015 at 14:08:07, Matti Linnanvuori (matt...@gmail.com) wrote:
> No, linux-vot7 does not resolve to 127.0.0.1 but to another
> IP address.

It should, for rabbitmqctl (and other command line tools that ship with RabbitMQ) to work.

Michael Klishin

Feb 4, 2015, 6:32:29 AM
to Matti Linnanvuori, rabbitm...@googlegroups.com
On 4 February 2015 at 14:01:27, Matti Linnanvuori (matt...@gmail.com) wrote:
> 'Error: unable to connect to node 'rabbit@linux-vot7': nodedown'

Is there more output? rabbitmqctl produces more output and, in recent versions, prints some diagnostic
information. Please see if there's more context that wasn't posted.

Michael Klishin

Feb 4, 2015, 6:33:29 AM
to Matti Linnanvuori, rabbitm...@googlegroups.com
On 4 February 2015 at 14:08:07, Matti Linnanvuori (matt...@gmail.com) wrote:
> It should for rabbitmqctl (and other command line tools that
> ship with RabbitMQ ) to work.

…or the hostname should resolve to the local machine.
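
One quick way to see what the host name resolves to is sketched below. The helper names is_loopback and check_host are made up for illustration, getent and awk are assumed to be installed, and the loopback test is a plain string match rather than a full address parser.

```shell
#!/bin/sh

# Classify an IP address as loopback or not (string check only).
is_loopback() {
    case "$1" in
        127.*|::1) return 0 ;;
        *)         return 1 ;;
    esac
}

# Report how a host name resolves. getent consults /etc/hosts and DNS
# according to nsswitch.conf; only the first address is examined.
check_host() {
    addr=$(getent hosts "$1" | awk 'NR == 1 { print $1 }')
    if [ -z "$addr" ]; then
        echo "$1: does not resolve"
    elif is_loopback "$addr"; then
        echo "$1: resolves to loopback ($addr)"
    else
        echo "$1: resolves to $addr; confirm this address belongs to the local machine"
    fi
}

# Usage (linux-vot7 is the host name from this thread):
# check_host linux-vot7
```

If the name resolves to an address that is not configured on the machine, the command line tools will not be able to reach the node.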

Matti Linnanvuori

Feb 5, 2015, 2:04:44 AM
to rabbitm...@googlegroups.com, matt...@gmail.com

Here is more output from rabbitmqctl status:

Status of node 'rabbit@linux-vot7' ...
Error: unable to connect to node 'rabbit@linux-vot7': nodedown

DIAGNOSTICS
===========

nodes in question: ['rabbit@linux-vot7']

hosts, their running nodes and ports:
- linux-vot7: [{rabbitmqctl3187,53178},{rabbitmqctl4704,45355}]

current node details:
- node name: 'rabbitmqctl4704@linux-vot7'
- home dir: /var/lib/rabbitmq
- cookie hash: wamEAVLZEVQ4zqN7X2Mdhw==

Michael Klishin

Feb 5, 2015, 2:13:11 AM
to Matti Linnanvuori, rabbitm...@googlegroups.com
 On 5 February 2015 at 10:04:45, Matti Linnanvuori (matt...@gmail.com) wrote:
> >
> hosts, their running nodes and ports:
>
>
> - linux-vot7: [{rabbitmqctl3187,53178},{rabbitmqctl4704,45355}]

This suggests that RabbitMQ server is indeed not running.

Please post RabbitMQ logs. 

Matti Linnanvuori

Feb 5, 2015, 2:29:26 AM
to rabbitm...@googlegroups.com, matt...@gmail.com
RabbitMQ logs:

rab...@linux-vot7.log:

=INFO REPORT==== 5-Feb-2015::06:51:27 ===
closing AMQP connection <0.289.0> (127.0.0.1:59640 -> 127.0.0.1:5672)

=INFO REPORT==== 5-Feb-2015::06:51:27 ===
closing AMQP connection <0.313.0> (127.0.0.1:59643 -> 127.0.0.1:5672)

=ERROR REPORT==== 5-Feb-2015::06:51:27 ===
AMQP connection <0.321.0> (running), channel 1 - error:
{amqp_error,channel_error,"expected 'channel.open'",'channel.close'}

=INFO REPORT==== 5-Feb-2015::06:51:27 ===
closing AMQP connection <0.321.0> (127.0.0.1:59644 -> 127.0.0.1:5672)

=WARNING REPORT==== 5-Feb-2015::06:51:30 ===
closing AMQP connection <0.330.0> (127.0.0.1:59645 -> 127.0.0.1:5672):
connection_closed_abruptly

=INFO REPORT==== 5-Feb-2015::06:51:31 ===
closing AMQP connection <0.295.0> (127.0.0.1:59642 -> 127.0.0.1:5672)

=INFO REPORT==== 5-Feb-2015::06:51:31 ===
closing AMQP connection <0.292.0> (127.0.0.1:59641 -> 127.0.0.1:5672)

=INFO REPORT==== 5-Feb-2015::06:51:31 ===
Stopping RabbitMQ

=INFO REPORT==== 5-Feb-2015::06:51:31 ===
stopped TCP Listener on 0.0.0.0:5672

=INFO REPORT==== 5-Feb-2015::06:51:31 ===
Halting Erlang VM

shutdown_log:

Stopping and halting node 'rabbit@linux-vot7' ...
...done.


startup_log, startup_err and shutdown_err are empty. Starting RabbitMQ Server stalls and nothing is written to the logs after that.

Michael Klishin

Feb 5, 2015, 2:32:46 AM
to Matti Linnanvuori, rabbitm...@googlegroups.com
 On 5 February 2015 at 10:29:27, Matti Linnanvuori (matt...@gmail.com) wrote:
> rab...@linux-vot7.log:

So, there is nothing related to startup in the log.

Is SELinux or similar enabled? We commonly see SELinux blocking port binding permissions on Red Hat derivatives.

Not sure if systemd is used on SUSE 11 SP1 but we've recently started seeing systemd-specific problems, too:
https://groups.google.com/forum/#!searchin/rabbitmq-users/systemd

Matti Linnanvuori

Feb 5, 2015, 3:07:01 AM
to rabbitm...@googlegroups.com, matt...@gmail.com
I don't think SELinux or systemd is enabled. SLES 11 SP1 does not have systemd.

Michael Klishin

Feb 5, 2015, 4:01:00 AM
to Matti Linnanvuori, rabbitm...@googlegroups.com
 On 5 February 2015 at 11:07:03, Matti Linnanvuori (matt...@gmail.com) wrote:
> I don't think SELinux and systemd are enabled. SLES 11 SP1 does
> not have systemd.

OK, anything in syslog or any other log SLES has?

If you stop the service and try starting rabbitmq in the foreground (with `rabbitmq-server`), does that work?

You can temporarily override its internal database and log location (RABBITMQ_BASE) to point to your
user's $HOME/rabbitmq or other directory owned by you:
http://www.rabbitmq.com/relocate.html.

Matti Linnanvuori

Feb 5, 2015, 4:28:29 AM
to rabbitm...@googlegroups.com, matt...@gmail.com
There is nothing special in /var/log/messages:

Feb  5 07:18:38 linux-vot7 su: (to rabbitmq) root on none
Feb  5 07:18:38 linux-vot7 su: (to rabbitmq) root on none

Trying to start rabbitmq in the foreground (with `service rabbitmq-server start`) does not work; it stalls after printing
Starting rabbitmq-server:
Pressing Control-C eventually makes it print
SUCCESS
rabbitmq-server
but rabbitmqctl status still shows the same nodedown message.

Jean-Sébastien Pédron

Feb 5, 2015, 4:37:33 AM
to rabbitm...@googlegroups.com
On 05.02.2015 10:28, Matti Linnanvuori wrote:
> Trying starting rabbitmq in foreground (with `service rabbitmq-server
> start`) does not work but stalls after printing

Could you please run:
sh -x /etc/init.d/rabbitmq-server start

And post the output?

--
Jean-Sébastien Pédron
Pivotal / RabbitMQ

Matti Linnanvuori

Feb 5, 2015, 6:42:37 AM
to rabbitm...@googlegroups.com, jean-se...@rabbitmq.com


On Thursday, 5 February 2015 at 11:37:33 UTC+2, Jean-Sébastien Pédron wrote:
> Could you please run:
>     sh -x /etc/init.d/rabbitmq-server start
>
> And post the output?

+ PATH=/sbin:/usr/sbin:/bin:/usr/bin
+ NAME=rabbitmq-server
+ DAEMON=/usr/sbin/rabbitmq-server
+ CONTROL=/usr/sbin/rabbitmqctl
+ DESC=rabbitmq-server
+ USER=rabbitmq
+ ROTATE_SUFFIX=
+ INIT_LOG_DIR=/var/log/rabbitmq
+ PID_FILE=/var/run/rabbitmq/pid
+ START_PROG=startproc
+ LOCK_FILE=/var/lock/subsys/rabbitmq-server
+ test -x /usr/sbin/rabbitmq-server
+ test -x /usr/sbin/rabbitmqctl
+ RETVAL=0
+ set -e
+ '[' -f /etc/default/rabbitmq-server ']'
+ case "$1" in
+ echo -n 'Starting rabbitmq-server: '
Starting rabbitmq-server: + start_rabbitmq
+ status_rabbitmq quiet
+ set +e
+ '[' quiet '!=' quiet ']'
+ /usr/sbin/rabbitmqctl status
+ '[' 2 '!=' 0 ']'
+ RETVAL=3
+ set -e
+ '[' 3 = 0 ']'
+ RETVAL=0
+ ensure_pid_dir
++ dirname /var/run/rabbitmq/pid
+ PID_DIR=/var/run/rabbitmq
+ '[' '!' -d /var/run/rabbitmq ']'
+ set +e
+ /usr/sbin/rabbitmqctl wait /var/run/rabbitmq/pid
+ RABBITMQ_PID_FILE=/var/run/rabbitmq/pid
+ startproc /usr/sbin/rabbitmq-server

Matti Linnanvuori

Feb 5, 2015, 6:53:50 AM
to rabbitm...@googlegroups.com, jean-se...@rabbitmq.com


On Thursday, 5 February 2015 at 13:42:37 UTC+2, Matti Linnanvuori wrote:
> On Thursday, 5 February 2015 at 11:37:33 UTC+2, Jean-Sébastien Pédron wrote:
>> Could you please run:
>>     sh -x /etc/init.d/rabbitmq-server start
>>
>> And post the output?

After pressing Ctrl-C, I additionally get the following:

+ RETVAL=0
+ set -e
+ case "$RETVAL" in
+ echo SUCCESS
SUCCESS
+ '[' -n /var/lock/subsys/rabbitmq-server ']'
+ touch /var/lock/subsys/rabbitmq-server
+ echo rabbitmq-server.

Matti Linnanvuori

Feb 6, 2015, 3:24:25 AM
to rabbitm...@googlegroups.com
Killing the beam process with the kill command makes RabbitMQ Server start; rabbitmqctl status confirms that. `rabbitmqctl wait /var/run/rabbitmq/pid` stalled while the file did not exist, and killing beam ends the stall.

Jean-Sébastien Pédron

Feb 12, 2015, 11:30:40 AM
to rabbitm...@googlegroups.com
On 05.02.2015 12:42, Matti Linnanvuori wrote:
> On Thursday, 5 February 2015 at 11:37:33 UTC+2, Jean-Sébastien Pédron wrote:
>
> Could you please run:
> sh -x /etc/init.d/rabbitmq-server start
>
> And post the output?
>
> + startproc /usr/sbin/rabbitmq-server

Hi!

Sorry for not getting back to you in a week.

Could you please run:
sh -x /usr/sbin/rabbitmq-server

Matti Linnanvuori

Feb 13, 2015, 9:05:18 AM
to rabbitm...@googlegroups.com, jean-se...@rabbitmq.com
On Thursday, 12 February 2015 at 18:30:40 UTC+2, Jean-Sébastien Pédron wrote:
> Could you please run:
>     sh -x /usr/sbin/rabbitmq-server

+ cd /var/lib/rabbitmq
++ basename /usr/sbin/rabbitmq-server
+ SCRIPT=rabbitmq-server
++ id -u
++ id -u rabbitmq
+ '[' 0 = 111 -a rabbitmq-server = rabbitmq-server ']'
++ id -u
++ id -u rabbitmq
+ '[' 0 = 111 -o rabbitmq-server = rabbitmq-plugins ']'
++ id -u
+ '[' 0 = 0 ']'
+ su rabbitmq -s /bin/sh -c '/usr/lib/rabbitmq/bin/rabbitmq-server '

              RabbitMQ 3.1.1. Copyright (C) 2007-2013 VMware, Inc.
  ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
  ##  ##
  ##########  Logs: /var/log/rabbitmq/rab...@linux-vot7.log
  ######  ##        /var/log/rabbitmq/rab...@linux-vot7-sasl.log
  ##########
              Starting broker... completed with 6 plugins.

Jean-Sébastien Pédron

Feb 13, 2015, 10:05:52 AM
to rabbitm...@googlegroups.com
On 13.02.2015 15:05, Matti Linnanvuori wrote:
> On Thursday, 12 February 2015 at 18:30:40 UTC+2, Jean-Sébastien Pédron wrote:
>
> Could you please run:
> sh -x /usr/sbin/rabbitmq-server
>
> (...)
> Starting broker... completed with 6 plugins.

At this stage, do you get your shell prompt back (you shouldn't)?

Matti Linnanvuori

Feb 17, 2015, 2:35:32 AM
to rabbitm...@googlegroups.com, jean-se...@rabbitmq.com


On Friday, 13 February 2015 at 17:05:52 UTC+2, Jean-Sébastien Pédron wrote:
> On 13.02.2015 15:05, Matti Linnanvuori wrote:
>> On Thursday, 12 February 2015 at 18:30:40 UTC+2, Jean-Sébastien Pédron wrote:
>>
>>     Could you please run:
>>         sh -x /usr/sbin/rabbitmq-server
>>
>>  (...)
>>               Starting broker... completed with 6 plugins.
>
> At this stage, do you get your shell prompt back (you shouldn't)?

I don't get the shell prompt back.

Tuure Laurinolli

Feb 20, 2015, 2:41:50 AM
to rabbitm...@googlegroups.com
There is a timing problem here, which is why the problem is only evident with "restart" usage, and not with separate "stop" and "start". When supervisord asks rabbitmq-runner.sh to stop with SIGUSR1, rabbitmq-runner.sh runs "/etc/init.d/rabbitmq-server stop", waits for it to exit, and then exits itself to signal supervisord that it's done. Supervisord then attempts to restart rabbitmq-server by executing rabbitmq-runner.sh, which runs "/etc/init.d/rabbitmq-server start" to start the service. The init script starts the service by running "startproc /usr/sbin/rabbitmq-server" and then uses "rabbitmqctl wait" to wait for the Erlang-level signal that the server has started.

This is where things go wrong. Startproc decides that "/usr/sbin/rabbitmq-server" is already running and doesn't attempt to start it. This is because there is no pidfile at /var/run/rabbitmq-server.pid and there is already a process with basename "rabbitmq-server" running (the /etc/init.d/rabbitmq-server script). Because rabbitmq isn't *actually* running, "rabbitmqctl wait" never exits, leaving rabbitmq-runner.sh stuck as well.

Possible solutions:
1) startproc -f if it doesn't matter that /usr/sbin/rabbitmq-server is run multiple times
2) rename /usr/sbin/rabbitmq-server to /usr/sbin/start-rabbitmq-server

Tuure Laurinolli

Feb 20, 2015, 2:49:46 AM
to rabbitm...@googlegroups.com
Actually, thinking about this again, this can't be the actual reason, since "/etc/init.d/rabbitmq-server start" sometimes works. The point still remains that in the failing case "startproc" thinks that there is already a copy of the process running.

Michael Klishin

Feb 20, 2015, 2:55:00 AM
to rabbitm...@googlegroups.com, Tuure Laurinolli
 On 20 February 2015 at 10:41:51, Tuure Laurinolli (tuure.la...@portalify.com) wrote:
> 1) startproc -f if it doesn't matter that /usr/sbin/rabbitmq-server
> is run multiple times

Multiple instances won't run unless you explicitly configure them to use different ports and database
directories (well, sharing the latter is a recipe for disaster).

Tuure Laurinolli

Feb 20, 2015, 6:37:38 AM
to rabbitm...@googlegroups.com
Found the root cause.

There is a race condition between invoking startproc /usr/sbin/rabbitmq-server from /etc/init.d/rabbitmq-server in the background and any following command invocations:

If you look at https://gist.github.com/tazle/c5a969b252be46bdfd2d you can see that startproc finds another process called "rabbitmq-server" and refuses to start a new /usr/sbin/rabbitmq-server instance. Note the process id of that process: 15829. Looking at the process listing later, we can see that 15829 is not an instance of rabbitmq-server, but rather rabbitmqctl. This is possible because the rabbitmqctl process is started by the /etc/init.d/rabbitmq-server init script, which *is* called rabbitmq-server. When executing rabbitmqctl (or any other non-builtin), the shell first forks and then execs the requested program. Between these two steps, the new process (15829 in this case) is still a copy of the old process, which in this case is called rabbitmq-server, and that is what makes startproc refuse to start the server.
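
The fork-then-exec window can be observed directly on Linux. The sketch below is only an illustration: forked_copy_name is a made-up helper, it relies on /proc/<pid>/comm, and it makes no claim about how startproc itself matches process names.

```shell
#!/bin/sh
# Linux-only sketch: relies on /proc/<pid>/comm.

# Print the kernel's process name for a forked copy of this shell
# that has not exec'd anything yet.
forked_copy_name() {
    (
        # Common idiom for finding the PID of the current subshell:
        # the helper sh replaces the command-substitution child and
        # reports its parent, which is this subshell.
        self=$(exec sh -c 'echo "$PPID"')
        # read is a builtin, so no exec happens before we look.
        read -r name < "/proc/$self/comm"
        printf '%s\n' "$name"
    )
}

echo "parent shell: $(cat /proc/$$/comm)"
echo "forked copy:  $(forked_copy_name)"
```

Both lines should show the same name, because a forked child keeps its parent's name until it execs something else. Any tool that identifies processes by name can therefore mistake the short-lived child for the parent script.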

Solutions to this include:
1) Adding -f to the startproc invocation to force process start. Michael in his answer seemed to say that this wouldn't be problematic
2) Using something else to start the server process
3) Renaming one of the scripts currently called rabbitmq-server

Another issue is that the startup script currently gets stuck at "rabbitmqctl wait" if starting the server fails for *any* reason. The wait should probably have a timeout.
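
One possible shape for such a timeout is to bound only the wait for the pid file to appear and leave the wait for the broker itself unbounded. This is a minimal sketch, not part of the RabbitMQ init script: wait_for_file is a hypothetical helper, the 30-second limit is arbitrary, and it assumes a stale pid file from a previous run has already been removed.

```shell
#!/bin/sh

# Wait until a file exists, giving up after the given number of seconds.
# Returns 0 if the file is there, 1 on timeout.
wait_for_file() {
    file=$1
    remaining=$2
    while [ ! -e "$file" ]; do
        if [ "$remaining" -le 0 ]; then
            return 1
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Usage in an init script (path taken from the trace in this thread):
# if wait_for_file /var/run/rabbitmq/pid 30; then
#     rabbitmqctl wait /var/run/rabbitmq/pid
# else
#     echo "pid file never appeared; server was probably not started" >&2
# fi
```

With this split, a server that was never launched is reported after the limit, while a server that is running but slow to boot is still given as long as it needs.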

Michael Klishin

Feb 20, 2015, 6:40:57 AM
to rabbitm...@googlegroups.com, Tuure Laurinolli
 On 20 February 2015 at 14:37:39, Tuure Laurinolli (tuure.la...@portalify.com) wrote:
> 1) Adding -f to the startproc invocation to force process start. 

I'm actually saying that this *would* be problematic if we have 2 processes attempt to start.
The latter attempt is going to fail.

> Michael in his answer seemed to say that this wouldn't be problematic
> 2) Using something else to start the server process
> 3) Renaming one of the scripts currently called rabbitmq-server

This sounds like a better approach to me. We can't rename RabbitMQ's rabbitmq-server
but we can make changes to the RPM package.

Simon MacMullen

Feb 20, 2015, 6:44:29 AM
to Tuure Laurinolli, rabbitm...@googlegroups.com
On 20/02/15 11:37, Tuure Laurinolli wrote:
> Another issue is that the startup script currently gets stuck at
> "rabbitmqctl wait" if starting the server fails for *any* reason. The
> wait should probably have a timeout.

Back in the day it used to have a timeout. I spent considerable effort
removing that - because any timeout we set would have people running a
slow enough server to exceed it.

Note that the semantics of "rabbitmqctl wait" are "return when the
server is up and running" - in the event of an uncontrolled shutdown
with a lot of persistent data it can take the server quite a while to
check everything for consistency.

However, "rabbitmqctl wait" should certainly exit immediately if the
process identified by the pid file is no longer running. Is the pid file
somehow getting the wrong pid in it?

Cheers, Simon
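
The check Simon describes, whether the process named in the pid file is still alive, can be approximated from the shell. This is only a sketch with a hypothetical helper name (pid_file_alive), not how rabbitmqctl implements it:

```shell
#!/bin/sh

# Return 0 if the pid file exists, holds a number, and that process
# can be signalled; return 1 otherwise.
# Note: kill -0 also fails for a live process owned by another user,
# so run this as root or as the process owner.
pid_file_alive() {
    [ -r "$1" ] || return 1
    pid=$(cat "$1")
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;   # empty or not purely numeric
    esac
    kill -0 "$pid" 2>/dev/null
}

# Usage (path taken from this thread):
# pid_file_alive /var/run/rabbitmq/pid || echo "no live server process"
```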

Tuure Laurinolli

Feb 20, 2015, 6:46:52 AM
to Michael Klishin, rabbitm...@googlegroups.com

On 20 Feb 2015, at 13:40 , Michael Klishin <mkli...@pivotal.io> wrote:

> On 20 February 2015 at 14:37:39, Tuure Laurinolli (tuure.la...@portalify.com) wrote:
>> 1) Adding -f to the startproc invocation to force process start.
>
> I'm actually saying that this *would* be problematic if we have 2 processes attempt to start.
> The latter attempt is going to fail.

Fail as in the startup script is going to say “Failed to start” or fail as in something catastrophic happens?

>
>> Michael in his answer seemed to say that this wouldn't be problematic
>> 2) Using something else to start the server process
>> 3) Renaming one of the scripts currently called rabbitmq-server
>
> This sounds like a better approach to me. We can't rename RabbitMQ's rabbitmq-server
> but we can make changes to the RPM package.

Renaming the init script would be problematic as well.

Tuure Laurinolli

Feb 20, 2015, 6:49:47 AM
to Simon MacMullen, rabbitm...@googlegroups.com
The PID file shouldn’t exist. I haven’t verified this, though. The new process certainly never gets far enough to write the pid file, since startproc never even attempts to execute rabbitmq-server.

Michael Klishin

Feb 20, 2015, 6:53:02 AM
to Simon MacMullen, Tuure Laurinolli, rabbitm...@googlegroups.com
 On 20 February 2015 at 14:49:47, Tuure Laurinolli (tuure.la...@portalify.com) wrote:
> Fail as in the startup script is going to say “Failed to start”
> or fail as in something catastrophic happens?

The former. RabbitMQ server exits if it cannot bind to a port. Nothing catastrophic should
happen. 

Tuure Laurinolli

Feb 20, 2015, 6:58:36 AM
to Michael Klishin, Simon MacMullen, rabbitm...@googlegroups.com

On 20 Feb 2015, at 13:52 , Michael Klishin <mkli...@pivotal.io> wrote:

> On 20 February 2015 at 14:49:47, Tuure Laurinolli (tuure.la...@portalify.com) wrote:
>> Fail as in the startup script is going to say “Failed to start”
>> or fail as in something catastrophic happens?
>
> The former. RabbitMQ server exits if it cannot bind to a port. Nothing catastrophic should
> happen.

That seems perfectly OK to me. There is only a small window in which multiple startproc -f invocations can happen in the first place, since the startup script first checks rabbitmq status, and in that case it should probably be expected that one of them fails to start the server :)
