Filesystem freeze on one node causes RabbitMQ cluster to become unresponsive.


Sergey Arlashin

Sep 17, 2019, 6:38:18 AM
to rabbitmq-users
We have a cluster of 3 RabbitMQ nodes with HA queues enabled. 

Recently we had a problem with the underlying filesystem on one of the nodes, and this caused the RabbitMQ instances on the two other nodes to become unresponsive: publishing and consuming simply stopped. The whole cluster became unresponsive due to the failure of only one node.

When we powered off the faulty node, the remaining two nodes started working normally again.


Some details about our setup:

OS: CentOS 7.x with the latest updates.
The RabbitMQ version in production is rabbitmq-server-3.7.8-1.el7.noarch, but we also checked the latest version, rabbitmq-server-3.7.17-1.el7.noarch, in our dev environment and saw the same behaviour.

node names: 
amqp-cl1-node1 (172.31.32.186)
amqp-cl1-node2 (172.31.32.103)
amqp-cl1-node3 (172.31.32.108)


All nodes have the following configuration:

 /etc/rabbitmq/rabbitmq.config:

[
  {rabbit, [
    {cluster_nodes, {['rabbit@amqp-cl1-node1', 'rabbit@amqp-cl1-node2', 'rabbit@amqp-cl1-node3'], disc}},
    {cluster_partition_handling, pause_minority},
    {tcp_listen_options, [
         {backlog,       512},
         {nodelay,       true},
         {linger,        {true, 0}},
         {exit_on_close, false}
    ]},
    {default_user, <<"guest">>},
    {default_pass, <<"guest">>}
  ]},
  {kernel, [
  ]},
  {rabbitmq_management, [
    {listener, [
      {port, 15672}
    ]}
  ]}
].
% EOF



We managed to reproduce the issue in our test environment. Here are the steps:

1. We use the `fsfreeze` tool on amqp-cl1-node1 (this node had the majority of queue masters):

root@amqp-cl1-node1: ~ # fsfreeze --freeze /

After several seconds clients are no longer able to publish or consume anything.
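
(To thaw the test node afterwards, the same util-linux tool can undo the freeze:)

root@amqp-cl1-node1: ~ # fsfreeze --unfreeze /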

2. We check cluster_status on the other nodes:

root@amqp-cl1-node2: ~ #  rabbitmqctl cluster_status
Cluster status of node rabbit@amqp-cl1-node2 ...
[{nodes,[{disc,['rabbit@amqp-cl1-node1',
                'rabbit@amqp-cl1-node2',
                'rabbit@amqp-cl1-node3']}]},
 {running_nodes,['rabbit@amqp-cl1-node3',
                 'rabbit@amqp-cl1-node1',
                 'rabbit@amqp-cl1-node2']},
 {cluster_name,<<"rabbit@amqp-cl1-node1">>},
 {partitions,[]},
 {alarms,[{'rabbit@amqp-cl1-node3',[]},
          {'rabbit@amqp-cl1-node1',[]},
          {'rabbit@amqp-cl1-node2',[]}]}]


All 3 nodes (including amqp-cl1-node1 with the faulty fs) remain in the `running` state the whole time.

3. We are not even able to execute

root@amqp-cl1-node2: ~ # rabbitmqctl list_queues -p smpp name messages pid slave_pids

on any of the remaining nodes; the command just hangs.
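
(If I remember correctly, rabbitmqctl takes a global timeout option, `-t`, in seconds, so the command should at least return instead of hanging indefinitely; a sketch:)

root@amqp-cl1-node2: ~ # rabbitmqctl -t 10 list_queues -p smpp name messages pid slave_pids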

4. We get the following records in the RabbitMQ logs on amqp-cl1-node2:

2019-09-17 06:48:21.137 [info] <0.23872.0> connection <0.23872.0> (172.31.9.79:58350 -> 172.31.32.103:5672): user 'smpp' authenticated and granted access to vhost 'smpp'
2019-09-17 06:48:48.631 [error] <0.23187.0> Supervisor {<0.23187.0>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(2, <0.23177.0>, <0.23188.0>, <0.23177.0>, <<"172.31.9.79:58291 -> 172.31.32.103:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"smpp">>,[],[{rabbit_auth_backend_internal,none}]}, <<"smpp">>, [{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,...},...], <0.23178.0>, <0.23189.0>) at <0.23190.0> exit with reason killed in context shutdown_error
2019-09-17 06:49:00.130 [warning] <0.23553.0> closing AMQP connection <0.23553.0> (172.31.9.79:58317 -> 172.31.32.103:5672, vhost: 'smpp', user: 'smpp'):
client unexpectedly closed TCP connection
2019-09-17 06:50:10.131 [error] <0.23563.0> Supervisor {<0.23563.0>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(2, <0.23553.0>, <0.23564.0>, <0.23553.0>, <<"172.31.9.79:58317 -> 172.31.32.103:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"smpp">>,[],[{rabbit_auth_backend_internal,none}]}, <<"smpp">>, [{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,...},...], <0.23554.0>, <0.23565.0>) at <0.23566.0> exit with reason killed in context shutdown_error
2019-09-17 06:50:21.140 [warning] <0.23872.0> closing AMQP connection <0.23872.0> (172.31.9.79:58350 -> 172.31.32.103:5672, vhost: 'smpp', user: 'smpp'):
client unexpectedly closed TCP connection
2019-09-17 06:51:31.141 [error] <0.23882.0> Supervisor {<0.23882.0>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(2, <0.23872.0>, <0.23883.0>, <0.23872.0>, <<"172.31.9.79:58350 -> 172.31.32.103:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"smpp">>,[],[{rabbit_auth_backend_internal,none}]}, <<"smpp">>, [{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,...},...], <0.23873.0>, <0.23884.0>) at <0.23885.0> exit with reason killed in context shutdown_error


and on amqp-cl1-node3:

2019-09-17 06:44:16.647 [info] <0.18748.0> accepting AMQP connection <0.18748.0> (172.31.9.79:58258 -> 172.31.32.108:5672)
2019-09-17 06:44:16.649 [info] <0.18748.0> connection <0.18748.0> (172.31.9.79:58258 -> 172.31.32.108:5672): user 'smpp' authenticated and granted access to vhost 'smpp'
2019-09-17 06:46:16.652 [warning] <0.18748.0> closing AMQP connection <0.18748.0> (172.31.9.79:58258 -> 172.31.32.108:5672, vhost: 'smpp', user: 'smpp'):
client unexpectedly closed TCP connection
2019-09-17 06:47:26.653 [error] <0.18758.0> Supervisor {<0.18758.0>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(2, <0.18748.0>, <0.18759.0>, <0.18748.0>, <<"172.31.9.79:58258 -> 172.31.32.108:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"smpp">>,[],[{rabbit_auth_backend_internal,none}]}, <<"smpp">>, [{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,...},...], <0.18749.0>, <0.18760.0>) at <0.18761.0> exit with reason killed in context shutdown_error



5. We see the following records in crash.log on the 2 remaining nodes:

2019-09-17 06:48:48 =SUPERVISOR REPORT====
     Supervisor: {<0.23187.0>,rabbit_channel_sup}
     Context:    shutdown_error
     Reason:     killed
     Offender:   [{pid,<0.23190.0>},{name,channel},{mfargs,{rabbit_channel,start_link,[2,<0.23177.0>,<0.23188.0>,<0.23177.0>,<<"172.31.9.79:58291 -> 172.31.32.103:5672">>,rabbit_framing_amqp_0_9_1,{user,<<"smpp">>,[],[{rabbit_auth_backend_internal,none}]},<<"smpp">>,[{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,bool,true},{<<"consumer_cancel_notify">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"authentication_failure_close">>,bool,true}],<0.23178.0>,<0.23189.0>]}},{restart_type,intrinsic},{shutdown,70000},{child_type,worker}]

2019-09-17 06:50:10 =SUPERVISOR REPORT====
     Supervisor: {<0.23563.0>,rabbit_channel_sup}
     Context:    shutdown_error
     Reason:     killed
     Offender:   [{pid,<0.23566.0>},{name,channel},{mfargs,{rabbit_channel,start_link,[2,<0.23553.0>,<0.23564.0>,<0.23553.0>,<<"172.31.9.79:58317 -> 172.31.32.103:5672">>,rabbit_framing_amqp_0_9_1,{user,<<"smpp">>,[],[{rabbit_auth_backend_internal,none}]},<<"smpp">>,[{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,bool,true},{<<"consumer_cancel_notify">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"authentication_failure_close">>,bool,true}],<0.23554.0>,<0.23565.0>]}},{restart_type,intrinsic},{shutdown,70000},{child_type,worker}]

2019-09-17 06:51:31 =SUPERVISOR REPORT====
     Supervisor: {<0.23882.0>,rabbit_channel_sup}
     Context:    shutdown_error
     Reason:     killed
     Offender:   [{pid,<0.23885.0>},{name,channel},{mfargs,{rabbit_channel,start_link,[2,<0.23872.0>,<0.23883.0>,<0.23872.0>,<<"172.31.9.79:58350 -> 172.31.32.103:5672">>,rabbit_framing_amqp_0_9_1,{user,<<"smpp">>,[],[{rabbit_auth_backend_internal,none}]},<<"smpp">>,[{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,bool,true},{<<"consumer_cancel_notify">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"authentication_failure_close">>,bool,true}],<0.23873.0>,<0.23884.0>]}},{restart_type,intrinsic},{shutdown,70000},{child_type,worker}]



6. After we switch off the faulty node (amqp-cl1-node1), cluster_status starts reporting only rabbit@amqp-cl1-node3 and rabbit@amqp-cl1-node2 as running.







Sergey Arlashin

Sep 17, 2019, 6:42:51 AM
to rabbitmq-users
So the question is: how can we prevent issues like this from happening?

Thanks! 

Regards,
Sergey

Michael Klishin

Sep 17, 2019, 1:19:59 PM
to rabbitmq-users
Nodes in a cluster do have shared state, so the unavailability of one node will affect others; that's the definition of
a distributed system, according to some.

Three specific areas where nodes share state, or effectively share state, are:

 * Schema database operations. For example, a new binding is replicated to all nodes in a transaction.
 * Queue operations. All operations on a single queue go through its master replica [1].
 * Channel/queue process interactions. Since a channel can be on a different node from the queue master replica it is publishing to,
   the channel may have to wait for acknowledgements to arrive. This is done asynchronously in most areas; where it is not, the timeout
   is the standard channel operation timeout, which is 15 seconds by default (see the config sketch after this list).
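
For reference, a minimal sketch of raising that timeout in the classic config format shown earlier in the thread (assuming the `channel_operation_timeout` key, which is specified in milliseconds; 15000 corresponds to the 15 second default):

[
  {rabbit, [
    %% channel operation timeout in milliseconds; raising it from the
    %% 15000 ms default gives slow replicas more time to respond
    {channel_operation_timeout, 30000}
  ]}
].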

The point of this introduction is to outline that the possible realm of failures here is broad.
Without knowing what your applications do, we cannot suggest which of the "areas" above is at play here.

With queues, only a subset of nodes is involved, so, assuming a reasonably even distribution of queue masters,
only a subset of queues and thus clients will be affected.

Schema operations always and unconditionally involve all nodes. If some of those nodes cannot perform
I/O operations, transactions will eventually time out or fail, but that can take a while (minutes).

All other node state (local message store, indices, connections, channels) is node-local.

Michael Klishin

Sep 17, 2019, 1:26:18 PM
to rabbitmq-users
The channel and connection exceptions/errors make it seem like a channel operation did not complete within the time
that *a client* was willing to wait. So the client closed its TCP connection [1], presumably because of an operation timeout or a heartbeat timeout [2]
(that's my best guess; we have zero information about the client), and that resulted in:

 * Connection process termination
 * Forced channel shutdown

Most clients provide a way of increasing operation and heartbeat timeouts. Unless the problematic node has a chance to continue
its I/O operations, there isn't much that RabbitMQ or client libraries can do: they will hit a timeout eventually.
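
As an illustration, the heartbeat value the server suggests to clients during connection negotiation can also be raised in rabbitmq.config; a sketch, assuming the classic `heartbeat` key (in seconds). Most clients additionally expose their own heartbeat and operation timeout settings on the connection:

[
  {rabbit, [
    %% heartbeat interval (in seconds) suggested to clients at connection
    %% negotiation time; clients may negotiate a different value
    {heartbeat, 120}
  ]}
].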

Whether subsequent operations would succeed depends on what exactly the client attempts to do.
Trying the same thing is quite likely to end the same way, again, unless the node has a chance to perform its I/O operations.


Michael Klishin

Sep 17, 2019, 1:33:02 PM
to rabbitmq-users
Lastly, it's worth focusing a bit on this part:

> All 3 nodes (including amqp-cl1-node1 with faulty fs) remain in `running` state during the whole time.

Cluster membership and a node's state (running, etc.) are not determined by whether the node can perform its I/O operations.

Unless a node fails to respond to net ticks or closes its inter-node connection [1], it would be considered to be up.
Unless a node is explicitly removed from the cluster, it would be considered a cluster member (one that could be down).
Both are pretty typical approaches for data services. Note that neither relies on a node's ability to perform local I/O.
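
For reference, the net tick interval is controlled by the Erlang VM's `net_ticktime` kernel parameter (in seconds, 60 by default); a minimal sketch in the classic config format:

[
  {kernel, [
    %% inter-node tick interval in seconds (default 60); a peer that stays
    %% silent for roughly this long is considered down
    {net_ticktime, 60}
  ]}
].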

Some systems (Cassandra is one example I know reasonably well) make nodes voluntarily shut down when they detect catastrophic
local disk failures. We have plans to do the same for RabbitMQ, but today a node can in theory stay up per the above definition while
failing to perform any local I/O. In practice it will likely fail after a timeout, or a series of timeouts and unhandled exceptions, but that can take
a while (minutes).

The definition of a "catastrophic disk failure" is not entirely clear or objective and cannot be monitored well by RabbitMQ. That's one of the
reasons why we don't think the Cassandra-like shutdown behavior is a no-brainer.
