Filesystem freeze on one node causes RabbitMQ cluster to become unresponsive.


Sergey Arlashin

Sep 17, 2019, 6:38:18 AM
to rabbitmq-users
We have a cluster of 3 RabbitMQ nodes with HA queues enabled. 

Recently we had a problem with the underlying filesystem on one of the nodes, and this caused the RabbitMQ instances on the two other nodes to become unresponsive: publishing and consuming simply stopped. The whole cluster became unresponsive due to the failure of only one node.

When we powered off the faulty node, the remaining two nodes started working normally again.


Some details about our setup:

OS: CentOS 7.x with the latest updates.
The RabbitMQ version in production is rabbitmq-server-3.7.8-1.el7.noarch, but we also checked the latest version, rabbitmq-server-3.7.17-1.el7.noarch, in our dev environment and saw the same behaviour.

node names: 
amqp-cl1-node1 (172.31.32.186)
amqp-cl1-node2 (172.31.32.103)
amqp-cl1-node3 (172.31.32.108)


All nodes have the following configuration:

 /etc/rabbitmq/rabbitmq.config:

[
  {rabbit, [
    {cluster_nodes, {['rabbit@amqp-cl1-node1', 'rabbit@amqp-cl1-node2', 'rabbit@amqp-cl1-node3'], disc}},
    {cluster_partition_handling, pause_minority},
    {tcp_listen_options, [
         {backlog,       512},
         {nodelay,       true},
         {linger,        {true, 0}},
         {exit_on_close, false}
    ]},
    {default_user, <<"guest">>},
    {default_pass, <<"guest">>}
  ]},
  {kernel, [
  ]},
  {rabbitmq_management, [
    {listener, [
      {port, 15672}
    ]}
  ]}
].
% EOF



We managed to reproduce the issue in our test environment. Here are the steps:

1. We use the `fsfreeze` tool on amqp-cl1-node1 (this node had the majority of queue masters):

root@amqp-cl1-node1: ~ # fsfreeze --freeze /

After several seconds clients are no longer able to publish or consume anything.
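
(To thaw the test node afterwards, the same util-linux tool can undo the freeze:)

root@amqp-cl1-node1: ~ # fsfreeze --unfreeze /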

2. We check cluster_status on the other nodes:

root@amqp-cl1-node2: ~ #  rabbitmqctl cluster_status
Cluster status of node rabbit@amqp-cl1-node2 ...
[{nodes,[{disc,['rabbit@amqp-cl1-node1',
                'rabbit@amqp-cl1-node2',
                'rabbit@amqp-cl1-node3']}]},
 {running_nodes,['rabbit@amqp-cl1-node3',
                 'rabbit@amqp-cl1-node1',
                 'rabbit@amqp-cl1-node2']},
 {cluster_name,<<"rabbit@amqp-cl1-node1">>},
 {partitions,[]},
 {alarms,[{'rabbit@amqp-cl1-node3',[]},
          {'rabbit@amqp-cl1-node1',[]},
          {'rabbit@amqp-cl1-node2',[]}]}]


All 3 nodes (including amqp-cl1-node1 with the faulty fs) remain in the `running` state the whole time.

3. We are not even able to execute

root@amqp-cl1-node2: ~ # rabbitmqctl list_queues -p smpp name messages pid slave_pids

on any of the remaining nodes; the command just hangs.
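
(If I remember correctly, rabbitmqctl takes a global timeout option, `-t`, in seconds, so the command should at least return instead of hanging indefinitely; a sketch:)

root@amqp-cl1-node2: ~ # rabbitmqctl -t 10 list_queues -p smpp name messages pid slave_pids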

4. We get the following records in the RabbitMQ logs on amqp-cl1-node2:

2019-09-17 06:48:21.137 [info] <0.23872.0> connection <0.23872.0> (172.31.9.79:58350 -> 172.31.32.103:5672): user 'smpp' authenticated and granted access to vhost 'smpp'
2019-09-17 06:48:48.631 [error] <0.23187.0> Supervisor {<0.23187.0>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(2, <0.23177.0>, <0.23188.0>, <0.23177.0>, <<"172.31.9.79:58291 -> 172.31.32.103:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"smpp">>,[],[{rabbit_auth_backend_internal,none}]}, <<"smpp">>, [{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,...},...], <0.23178.0>, <0.23189.0>) at <0.23190.0> exit with reason killed in context shutdown_error
2019-09-17 06:49:00.130 [warning] <0.23553.0> closing AMQP connection <0.23553.0> (172.31.9.79:58317 -> 172.31.32.103:5672, vhost: 'smpp', user: 'smpp'):
client unexpectedly closed TCP connection
2019-09-17 06:50:10.131 [error] <0.23563.0> Supervisor {<0.23563.0>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(2, <0.23553.0>, <0.23564.0>, <0.23553.0>, <<"172.31.9.79:58317 -> 172.31.32.103:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"smpp">>,[],[{rabbit_auth_backend_internal,none}]}, <<"smpp">>, [{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,...},...], <0.23554.0>, <0.23565.0>) at <0.23566.0> exit with reason killed in context shutdown_error
2019-09-17 06:50:21.140 [warning] <0.23872.0> closing AMQP connection <0.23872.0> (172.31.9.79:58350 -> 172.31.32.103:5672, vhost: 'smpp', user: 'smpp'):
client unexpectedly closed TCP connection
2019-09-17 06:51:31.141 [error] <0.23882.0> Supervisor {<0.23882.0>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(2, <0.23872.0>, <0.23883.0>, <0.23872.0>, <<"172.31.9.79:58350 -> 172.31.32.103:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"smpp">>,[],[{rabbit_auth_backend_internal,none}]}, <<"smpp">>, [{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,...},...], <0.23873.0>, <0.23884.0>) at <0.23885.0> exit with reason killed in context shutdown_error


and on amqp-cl1-node3:

2019-09-17 06:44:16.647 [info] <0.18748.0> accepting AMQP connection <0.18748.0> (172.31.9.79:58258 -> 172.31.32.108:5672)
2019-09-17 06:44:16.649 [info] <0.18748.0> connection <0.18748.0> (172.31.9.79:58258 -> 172.31.32.108:5672): user 'smpp' authenticated and granted access to vhost 'smpp'
2019-09-17 06:46:16.652 [warning] <0.18748.0> closing AMQP connection <0.18748.0> (172.31.9.79:58258 -> 172.31.32.108:5672, vhost: 'smpp', user: 'smpp'):
client unexpectedly closed TCP connection
2019-09-17 06:47:26.653 [error] <0.18758.0> Supervisor {<0.18758.0>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(2, <0.18748.0>, <0.18759.0>, <0.18748.0>, <<"172.31.9.79:58258 -> 172.31.32.108:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"smpp">>,[],[{rabbit_auth_backend_internal,none}]}, <<"smpp">>, [{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,...},...], <0.18749.0>, <0.18760.0>) at <0.18761.0> exit with reason killed in context shutdown_error



5. We see the following records in crash.log on the 2 remaining nodes:

2019-09-17 06:48:48 =SUPERVISOR REPORT====
     Supervisor: {<0.23187.0>,rabbit_channel_sup}
     Context:    shutdown_error
     Reason:     killed
     Offender:   [{pid,<0.23190.0>},{name,channel},{mfargs,{rabbit_channel,start_link,[2,<0.23177.0>,<0.23188.0>,<0.23177.0>,<<"172.31.9.79:58291 -> 172.31.32.103:5672">>,rabbit_framing_amqp_0_9_1,{user,<<"smpp">>,[],[{rabbit_auth_backend_internal,none}]},<<"smpp">>,[{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,bool,true},{<<"consumer_cancel_notify">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"authentication_failure_close">>,bool,true}],<0.23178.0>,<0.23189.0>]}},{restart_type,intrinsic},{shutdown,70000},{child_type,worker}]

2019-09-17 06:50:10 =SUPERVISOR REPORT====
     Supervisor: {<0.23563.0>,rabbit_channel_sup}
     Context:    shutdown_error
     Reason:     killed
     Offender:   [{pid,<0.23566.0>},{name,channel},{mfargs,{rabbit_channel,start_link,[2,<0.23553.0>,<0.23564.0>,<0.23553.0>,<<"172.31.9.79:58317 -> 172.31.32.103:5672">>,rabbit_framing_amqp_0_9_1,{user,<<"smpp">>,[],[{rabbit_auth_backend_internal,none}]},<<"smpp">>,[{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,bool,true},{<<"consumer_cancel_notify">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"authentication_failure_close">>,bool,true}],<0.23554.0>,<0.23565.0>]}},{restart_type,intrinsic},{shutdown,70000},{child_type,worker}]

2019-09-17 06:51:31 =SUPERVISOR REPORT====
     Supervisor: {<0.23882.0>,rabbit_channel_sup}
     Context:    shutdown_error
     Reason:     killed
     Offender:   [{pid,<0.23885.0>},{name,channel},{mfargs,{rabbit_channel,start_link,[2,<0.23872.0>,<0.23883.0>,<0.23872.0>,<<"172.31.9.79:58350 -> 172.31.32.103:5672">>,rabbit_framing_amqp_0_9_1,{user,<<"smpp">>,[],[{rabbit_auth_backend_internal,none}]},<<"smpp">>,[{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,bool,true},{<<"consumer_cancel_notify">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"authentication_failure_close">>,bool,true}],<0.23873.0>,<0.23884.0>]}},{restart_type,intrinsic},{shutdown,70000},{child_type,worker}]



6. After we switch off the faulty node (amqp-cl1-node1), cluster_status starts reporting only rabbit@amqp-cl1-node3 and rabbit@amqp-cl1-node2 as running.







Sergey Arlashin

Sep 17, 2019, 6:42:51 AM
to rabbitmq-users
So the question is: how can we prevent issues like this from happening?

Thanks! 

Regards,
Sergey

Michael Klishin

Sep 17, 2019, 1:19:59 PM
to rabbitmq-users
Nodes in a cluster do have shared state, so the unavailability of one node will affect others; that's the definition of
a distributed system, according to some.

Three specific areas where nodes share state, or effectively share state, are:

 * Schema database operations. For example, a new binding is replicated to all nodes in a transaction.
 * Queue operations. All operations on a single queue go through its master replica [1].
 * Channel/queue process interactions. Since a channel can be on a different node from the queue master replica it is publishing to,
   the channel may have to wait for acknowledgements to arrive. This is done asynchronously in most areas; where it is not, the timeout
   is the standard channel operation timeout, which is 15 seconds by default (see the config sketch after this list).
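
For reference, a minimal sketch of raising that timeout in the classic config format shown earlier in the thread (assuming the `channel_operation_timeout` key, which is specified in milliseconds; 15000 corresponds to the 15 second default):

[
  {rabbit, [
    %% channel operation timeout in milliseconds; raising it from the
    %% 15000 ms default gives slow replicas more time to respond
    {channel_operation_timeout, 30000}
  ]}
].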

The point of this introduction is to outline that the possible realm of failures here is broad.
Without knowing what your applications do, we cannot suggest which of the "areas" above is at play here.

With queues, only a subset of nodes is involved, so, assuming a reasonably even distribution of queue masters,
only a subset of queues and thus clients will be affected.

Schema operations always and unconditionally involve all nodes. If some of those nodes cannot perform
I/O operations, transactions will eventually time out or fail, but that can take a while (minutes).

All other node state (local message store, indices, connections, channels) is node-local.

Michael Klishin

Sep 17, 2019, 1:26:18 PM
to rabbitmq-users
The channel and connection exceptions/errors make it seem like a channel operation did not complete within the time
that *a client* was willing to wait. So the client closed its TCP connection [1], presumably because of an operation timeout or a heartbeat timeout [2]
(that's my best guess; we have zero information about the client), and that resulted in:

 * Connection process termination
 * Forced channel shutdown

Most clients provide a way of increasing operation and heartbeat timeouts. Unless the problematic node has a chance to continue
its I/O operations, there isn't much that RabbitMQ or client libraries can do: they will hit a timeout eventually.
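
As an illustration, the heartbeat value the server suggests to clients during connection negotiation can also be raised in rabbitmq.config; a sketch, assuming the classic `heartbeat` key (in seconds). Most clients additionally expose their own heartbeat and operation timeout settings on the connection:

[
  {rabbit, [
    %% heartbeat interval (in seconds) suggested to clients at connection
    %% negotiation time; clients may negotiate a different value
    {heartbeat, 120}
  ]}
].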

Whether subsequent operations would succeed depends on what exactly the client attempts to do.
Trying the same thing is quite likely to end the same way, again, unless the node has a chance to perform its I/O operations.


Michael Klishin

Sep 17, 2019, 1:33:02 PM
to rabbitmq-users
Lastly, it's worth focusing a bit on this part:

> All 3 nodes (including amqp-cl1-node1 with faulty fs) remain in `running` state during the whole time.

Cluster membership and a node's state (running, etc.) are not determined by whether the node can perform its I/O operations.

Unless a node fails to respond to net ticks or closes its inter-node connection [1], it would be considered to be up.
Unless a node is explicitly removed from the cluster, it would be considered a cluster member (one that could be down).
Both are pretty typical approaches for data services. Note that neither relies on a node's ability to perform local I/O.
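
For reference, the net tick interval is controlled by the Erlang VM's `net_ticktime` kernel parameter (in seconds, 60 by default); a minimal sketch in the classic config format:

[
  {kernel, [
    %% inter-node tick interval in seconds (default 60); a peer that stays
    %% silent for roughly this long is considered down
    {net_ticktime, 60}
  ]}
].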

Some systems (Cassandra is one example I know reasonably well) make nodes voluntarily shut down when they detect catastrophic
local disk failures. We have plans to do the same for RabbitMQ, but today a node can in theory stay up per the above definition while
failing to perform any local I/O. In practice it will likely fail after a timeout, or a series of timeouts and unhandled exceptions, but that can take
a while (minutes).

The definition of a "catastrophic disk failure" is not entirely clear or objective and cannot be monitored well by RabbitMQ. That's one of the
reasons why we don't think the Cassandra-like shutdown behavior is a no-brainer.
