rabbitmq management plugin hangs when rebooting one of the cluster nodes

Raul Kaubi

Feb 1, 2017, 11:21:47 AM

to rabbitmq-users

Hi

I installed RabbitMQ 3.6.6 and Erlang 19.2 on CentOS 7 today. It is a 2-node cluster.

I have run into a little problem. When I shut down one of my servers (to simulate a server crash or similar failure), the management plugin no longer responds.

I get the following error: Error: could not connect to server since .......

I couldn't tell whether RabbitMQ still works on the node that didn't crash (that is, whether only the management plugin hangs or everything does).

When I shut down one of the nodes (even the stats node, it doesn't matter which one) with a specific command, for example rabbitmqctl stop_app, everything is fine.

Also, does anybody know why the aliveness-test queue sometimes appears and sometimes is gone?

Raul

Michael Klishin

Feb 1, 2017, 11:35:47 AM

to rabbitm...@googlegroups.com

The UI always tries to contact a single node. Use a load balancer in front of the cluster and point your browser at it.

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Michael Klishin

Feb 1, 2017, 11:50:48 AM

to rabbitm...@googlegroups.com

…this, of course, assumes every node has the management plugin enabled.

So, not particularly different from most Web apps or HTTP APIs.
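
As a rough illustration of the load balancer advice above, the sketch below does by hand what a load balancer health check would do: it tries the management port on each node in turn and returns the first one that accepts a TCP connection within a short timeout. The function name `first_reachable` and the host names in the usage note are assumptions made up for illustration; 15672 is the default management port. This is not a replacement for a real load balancer.

```python
import socket


def first_reachable(nodes, timeout=2.0):
    """Return the first (host, port) pair that accepts a TCP connection.

    `nodes` is a list of (host, port) tuples. Returns None if no node
    answers within `timeout` seconds each.
    """
    for host, port in nodes:
        try:
            # create_connection raises OSError on refusal, DNS failure or timeout
            with socket.create_connection((host, port), timeout=timeout):
                return (host, port)
        except OSError:
            continue  # node is down or unreachable, try the next one
    return None
```

It could be called as `first_reachable([("node1.domain", 15672), ("node2.domain", 15672)])`, with the host names replaced by your own nodes.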

--

MK

Staff Software Engineer, Pivotal/RabbitMQ

Raul Kaubi

Feb 1, 2017, 11:55:38 AM

to rabbitm...@googlegroups.com

Just to make it clear: while the simulated crash happens on node1, my web browser is pointing to node2.domain:15672.

So the web UI is available on either of the nodes.

I understand that in this scenario node1.domain:15672 can't answer.

Raul

You received this message because you are subscribed to a topic in the Google Groups "rabbitmq-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/rabbitmq-users/U-8Em6Xwkzo/unsubscribe.
To unsubscribe from this group and all its topics, send an email to rabbitmq-user...@googlegroups.com.

Michael Klishin

Feb 1, 2017, 12:39:55 PM

to rabbitm...@googlegroups.com

"could not connect to server since …" in the UI means that the node your browser is pointed at cannot be reached, for whatever reason. A shutdown of a node that does not host the stats DB cannot affect HTTP API queries to other nodes that have the plugin enabled.

See server logs and developer console in your browser for more data. It is impossible to suggest more with the amount of information provided.

Raul Kaubi

Feb 1, 2017, 2:01:20 PM

to rabbitmq-users

Hi

Sorry, you are correct: the "could not connect to server since …" message comes from the node that had the simulated crash, so this is OK.

Here is some additional information:

[root@node2 ~]# tail -1000f /var/log/rabbitmq/rabbit@node2.log

.......

.........

=INFO REPORT==== 1-Feb-2017::20:25:49 ===

LDAP DECISION: login for raulk2: ok <<<<<<---- Last message from node2

<<<<<<---- At this point I ran "reboot" command from node1

[root@node1 ~]# reboot <<<<--Imitating crash

******* No log messages from 20:25:50 to 20:26:10

=ERROR REPORT==== 1-Feb-2017::20:26:10 ===

** Node rabbit@node1 not responding **

** Removing (timedout) connection **

=INFO REPORT==== 1-Feb-2017::20:26:10 ===

rabbit on node rabbit@node1 down

=INFO REPORT==== 1-Feb-2017::20:26:10 ===

node rabbit@node1 down: net_tick_timeout

........ <<<<--- From this point onwards, management plugin can be reached from node2.

........ Until this point the web UI did not respond at all; even the Chrome page (still showing RabbitMQ) crashed, with a combo box

......

Raul

Michael Klishin

Feb 1, 2017, 3:08:45 PM

to rabbitm...@googlegroups.com

net_tick_timeout is the interval after which a node is considered to be down by its peers:

http://www.rabbitmq.com/nettick.html

So this delay makes sense.

Raul Kaubi

Feb 2, 2017, 2:50:43 AM

to rabbitmq-users

So it is OK that the whole cluster is unresponsive for that period?

By the way, the page you linked says that the default net_ticktime is 60 seconds.

Does this mean that the whole cluster would be unresponsive for 60 seconds?

From what I can see, the timeout is shorter: in the logs that I pasted, the delay was only about 20 seconds.

I don't have this variable set in my RabbitMQ config.

Raul

Michael Klishin

Feb 2, 2017, 3:06:27 AM

to rabbitm...@googlegroups.com

How did you arrive at the conclusion that the entire cluster is unresponsive? I find that very hard to believe.

What is unresponsive is the HTTP API, because

* Up to 3.6.6, all nodes contact a single node that stores all stats

* Starting with (unreleased) 3.6.7, it will fan out to all nodes to collect and aggregate their stats

* Our JavaScript UI uses synchronous requests and even starting with 3.6.7, this makes things appear "entirely unresponsive" where in reality what is unresponsive is the API it uses and the UI itself.

If one of the nodes is unresponsive, it often cannot be detected immediately: welcome to the world of distributed systems. Using timeouts to detect unresponsive peers is what every other data service, protocols such as TCP, and other systems have been using for decades.

Net tick is in effect and the general problem still exists regardless of whether you configure the value or not. You can reduce it to, say, 15 seconds. The lower you go, the higher the probability of false positives (so values < 10 are probably not a good idea).

This is explained in http://www.rabbitmq.com/nettick.html and http://rabbitmq.com/heartbeats.html.

Sometimes detecting an unresponsive peer takes less than net_ticktime (this depends on a multitude of factors, from what exactly happened to the peer to your kernel's TCP stack configuration to what runtime and tool timeouts are, such as net_ticktime) but it would be unwise to assume that.
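
To make the numbers concrete: according to the Erlang kernel documentation, a non-responding node is detected within a window of roughly net_ticktime ± 25%. The sketch below computes that window (the function name `detection_window` is made up for illustration). It covers only the net tick mechanism; as noted above, other factors can make detection faster.

```python
def detection_window(net_ticktime):
    """Return (min_seconds, max_seconds) for net tick based detection.

    Based on the Erlang kernel docs: MinT = T - T/4, MaxT = T + T/4.
    """
    quarter = net_ticktime / 4
    return (net_ticktime - quarter, net_ticktime + quarter)


# Compare the default (60 seconds) with the reduced value suggested above.
for ticktime in (60, 15):
    low, high = detection_window(ticktime)
    print(f"net_ticktime={ticktime}s -> detection window {low}s to {high}s")
```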

Michael Klishin

Feb 2, 2017, 3:12:30 AM

to rabbitm...@googlegroups.com

So a very dumbed down version of what's going on: if a node that hosts the stats DB is down, peers will take up to ~ net_ticktime to detect it. In that time frame, all HTTP API requests will be routed to that node, and wait until they time out.

We discussed a separate timeout for the management plugin but that would only make things *more* confusing, so we continue to rely on (and users can tweak) net_ticktime.

The JavaScript UI we have is not particularly responsive when the API it uses is slow or has to wait for a response that will never arrive. That was discussed by our team several days ago in the context of 3.6.7 and the conclusion is that even with a completely modern async JavaScript UI and WebSocket connections instead of HTTP 1.1 for management plugin API requests, nodes in a cluster would still wait for a response for up to net_ticktime to detect peer unavailability, and during that time the HTTP API cannot return a response.

This is true for any kind of inter-node communication and has had an answer (configurable net_ticktime, which is an Erlang runtime setting) from day 1.
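
Because HTTP API requests can block until the failed peer is detected, a script that polls the API may want its own client-side timeout so that it does not hang along with the API. A minimal sketch, assuming the management plugin is reachable at the given base URL and using the `/api/overview` endpoint; the function name, credentials and timeout value are placeholders, and the `opener` parameter exists only so the function can be exercised without a live node:

```python
import base64
import json
import urllib.error
import urllib.request


def fetch_overview(base_url, user, password, timeout=5.0,
                   opener=urllib.request.urlopen):
    """GET <base_url>/api/overview; return parsed JSON, or None on failure."""
    request = urllib.request.Request(base_url.rstrip("/") + "/api/overview")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", "Basic " + token)
    try:
        # `timeout` bounds how long this client waits, independent of net_ticktime
        with opener(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # unreachable node, timed-out request, or a body that is not JSON
        return None
```

A `None` result here only says that this request did not complete in time; it does not by itself mean that the node or the cluster is down.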

Raul Kaubi

Feb 2, 2017, 4:17:55 AM

to rabbitm...@googlegroups.com

Hi

> How did you arrive at the conclusion that the entire cluster is unresponsive?

I tested it just yesterday evening. The steps were as follows:

1. Rebooted one of the RabbitMQ node servers (node1) to simulate a crash.

1.1 At the same time I performed some actions in my .NET application, which publishes messages to a RabbitMQ queue (the .NET application publishes to node2).

2. The .NET application took about 20-30 seconds to process the request (it showed a loading state in the meantime). Just minutes before the simulated crash, the same simple request took only 1-2 seconds to complete.

Raul

Michael Klishin

Feb 2, 2017, 4:20:47 AM

to rabbitm...@googlegroups.com, Raul Kaubi

When nodes go down, queue masters move and consumers have to be re-registered
(by RabbitMQ). Again, anything that involves inter-node traffic with the down node
will block and time out for up to ~ net_ticktime seconds.

But the claim that the entire cluster is unresponsive is still incorrect.

Michael Klishin

Feb 2, 2017, 4:39:30 AM

to Raul Kaubi, rabbitm...@googlegroups.com

Hi Raul,

Please always CC rabbitmq-users. I do not offer any kind of 1-on-1 support.

There are no master nodes in RabbitMQ. Queues have masters, all nodes are equal.

The node that hosts the stats DB is not special in any other way.

Nodes are expected to fail or become unavailable. You cannot avoid that entirely in practice.

Please understand what the underlying problem is because it is very fundamental to distributed systems (and yes, your apps, if they use RabbitMQ, are by definition distributed): detection of peer unavailability is done by a timeout in just about every messaging broker, client library, distributed database, or even most fundamental protocol such as TCP.

RabbitMQ is no different. The timeout is configurable. Any communication with a failed node will block for up to roughly {timeout} seconds. Your applications and development team must be aware of it and assume it will happen. The same problem applies to client connections, where the setting is called differently in different protocols but in AMQP 0-9-1 and STOMP it is "heartbeats", documented at http://www.rabbitmq.com/heartbeats.html.
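
The timeout-based detection described above can be sketched in a few lines. The class below is a toy failure detector, not RabbitMQ's or any client library's implementation: a peer is considered down once nothing has been heard from it for `timeout` seconds. The class and method names are invented for illustration, and the clock is injectable so the behaviour can be checked without waiting.

```python
import time


class TimeoutFailureDetector:
    """Toy detector: a peer is 'down' after `timeout` seconds of silence."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_seen = clock()

    def record_activity(self):
        # Call whenever a heartbeat or any other traffic arrives from the peer.
        self.last_seen = self.clock()

    def peer_is_down(self):
        # Before the timeout expires, a dead peer and a slow one look the same.
        return self.clock() - self.last_seen > self.timeout
```

A lower `timeout` detects failures sooner but is more likely to flag a peer that is merely slow, which is the same trade-off that applies to low net tick values.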

There is no way to avoid this behaviour. What you can control is the timeout value.

Yes, our team believes it's acceptable: it's the best our industry has come up with in a few decades.

There were milestone 3.6.7 releases announced on this list. Please search list archives.

On Thu, Feb 2, 2017 at 12:28 PM, Raul Kaubi <raul...@gmail.com> wrote:

I deliberately rebooted the node that wasn't the master (it did not contain statistics).
Node1 <-- rebooted this node
Node2 (this node contains the management statistics database)

So the queue master should not move?

By the way, any ETA on 3.6.7?

Raul

Michael Klishin

Feb 2, 2017, 4:44:42 AM

to Raul Kaubi, rabbitm...@googlegroups.com

Raul Kaubi

Feb 2, 2017, 8:40:18 AM

to rabbitmq-users, raul...@gmail.com

Hi

Sorry for replying to you directly, my bad..

And thank you for posting quick replies. I got my answers.

Raul
