Recently we had a problem with the underlying filesystem on one of the nodes, and this caused the RabbitMQ instances on the two other nodes to become unresponsive: publishing and consuming simply stopped. In effect, the whole cluster became unresponsive due to the failure of a single node.
When we powered off the faulty node, the remaining two nodes started to work normally again.
Some details about our setup:
OS: CentOS 7.x with latest updates.
RabbitMQ version in production is rabbitmq-server-3.7.8-1.el7.noarch, but we also checked the latest version, rabbitmq-server-3.7.17-1.el7.noarch, in our dev environment: same behaviour.
Node names:
amqp-cl1-node1 (172.31.32.186)
amqp-cl1-node2 (172.31.32.103)
amqp-cl1-node3 (172.31.32.108)
All nodes have the following configuration:
/etc/rabbitmq/rabbitmq.config:
[
{rabbit, [
{cluster_nodes, {['rabbit@amqp-cl1-node1', 'rabbit@amqp-cl1-node2', 'rabbit@amqp-cl1-node3'], disc}},
{cluster_partition_handling, pause_minority},
{tcp_listen_options, [
{backlog, 512},
{nodelay, true},
{linger, {true, 0}},
{exit_on_close, false}
]},
{default_user, <<"guest">>},
{default_pass, <<"guest">>}
]},
{kernel, []},
{rabbitmq_management, [
{listener, [
{port, 15672}
]}
]}
].
% EOF
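Note that our kernel section is empty, so the cluster runs with the default Erlang net_ticktime of 60 seconds. As a purely hypothetical sketch (we have not tested this, and we suspect it would not help here anyway, because a node with a frozen filesystem apparently still answers distribution ticks over the network), inter-node failure detection could be made more aggressive like this:

{kernel, [
    {net_ticktime, 30}  %% default is 60; lower values detect unresponsive peers sooner, at the cost of extra tick traffic
]},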
We managed to reproduce the issue in our test environment. Here are the steps:
1. We run the `fsfreeze` tool on amqp-cl1-node1 (this node hosted the majority of queue masters):
root@amqp-cl1-node1: ~ # fsfreeze --freeze /
After several seconds, clients are no longer able to publish or consume anything.
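(To undo the freeze after the test, fsfreeze provides the matching counterpart:)

root@amqp-cl1-node1: ~ # fsfreeze --unfreeze /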
2. We check cluster_status on the other nodes:
root@amqp-cl1-node2: ~ # rabbitmqctl cluster_status
Cluster status of node rabbit@amqp-cl1-node2 ...
[{nodes,[{disc,['rabbit@amqp-cl1-node1',
'rabbit@amqp-cl1-node2',
'rabbit@amqp-cl1-node3']}]},
{running_nodes,['rabbit@amqp-cl1-node3',
'rabbit@amqp-cl1-node1',
'rabbit@amqp-cl1-node2']},
{cluster_name,<<"rabbit@amqp-cl1-node1">>},
{partitions,[]},
{alarms,[{'rabbit@amqp-cl1-node3',[]},
{'rabbit@amqp-cl1-node1',[]},
{'rabbit@amqp-cl1-node2',[]}]}]
All 3 nodes (including amqp-cl1-node1 with the faulty fs) remain in the `running` state the whole time.
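This is consistent with how the failure looks from the outside: the frozen node's Erlang VM is still alive and its network stack still works, only disk I/O blocks, so the node keeps answering distribution ticks. As a quick (untested) way to confirm this from a healthy node, one could ping the frozen node at the Erlang level via rabbitmqctl eval:

root@amqp-cl1-node2: ~ # rabbitmqctl eval "net_adm:ping('rabbit@amqp-cl1-node1')."

We would expect this to return pong even while the node's filesystem is frozen, which matches cluster_status still listing it under running_nodes.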
3. We are not even able to execute
root@amqp-cl1-node2: ~ # rabbitmqctl list_queues -p smpp name messages pid slave_pids
on any of the remaining nodes; the command just freezes.
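To keep such diagnostics from hanging indefinitely during an incident, they can be wrapped in the coreutils timeout utility (the 10-second budget below is an arbitrary choice of ours):

root@amqp-cl1-node2: ~ # timeout 10 rabbitmqctl list_queues -p smpp name messages pid slave_pids

timeout exits with status 124 when it has to kill the wrapped command, which makes a stalled broker easy to detect from scripts.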
4. We get the following records in the RabbitMQ log on amqp-cl1-node2:
2019-09-17 06:48:21.137 [info] <0.23872.0> connection <0.23872.0> (172.31.9.79:58350 -> 172.31.32.103:5672): user 'smpp' authenticated and granted access to vhost 'smpp'
2019-09-17 06:48:48.631 [error] <0.23187.0> Supervisor {<0.23187.0>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(2, <0.23177.0>, <0.23188.0>, <0.23177.0>, <<"172.31.9.79:58291 -> 172.31.32.103:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"smpp">>,[],[{rabbit_auth_backend_internal,none}]}, <<"smpp">>, [{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,...},...], <0.23178.0>, <0.23189.0>) at <0.23190.0> exit with reason killed in context shutdown_error
2019-09-17 06:49:00.130 [warning] <0.23553.0> closing AMQP connection <0.23553.0> (172.31.9.79:58317 -> 172.31.32.103:5672, vhost: 'smpp', user: 'smpp'):
client unexpectedly closed TCP connection
2019-09-17 06:50:10.131 [error] <0.23563.0> Supervisor {<0.23563.0>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(2, <0.23553.0>, <0.23564.0>, <0.23553.0>, <<"172.31.9.79:58317 -> 172.31.32.103:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"smpp">>,[],[{rabbit_auth_backend_internal,none}]}, <<"smpp">>, [{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,...},...], <0.23554.0>, <0.23565.0>) at <0.23566.0> exit with reason killed in context shutdown_error
2019-09-17 06:50:21.140 [warning] <0.23872.0> closing AMQP connection <0.23872.0> (172.31.9.79:58350 -> 172.31.32.103:5672, vhost: 'smpp', user: 'smpp'):
client unexpectedly closed TCP connection
2019-09-17 06:51:31.141 [error] <0.23882.0> Supervisor {<0.23882.0>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(2, <0.23872.0>, <0.23883.0>, <0.23872.0>, <<"172.31.9.79:58350 -> 172.31.32.103:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"smpp">>,[],[{rabbit_auth_backend_internal,none}]}, <<"smpp">>, [{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,...},...], <0.23873.0>, <0.23884.0>) at <0.23885.0> exit with reason killed in context shutdown_error
and on amqp-cl1-node3:
2019-09-17 06:44:16.647 [info] <0.18748.0> accepting AMQP connection <0.18748.0> (172.31.9.79:58258 -> 172.31.32.108:5672)
2019-09-17 06:44:16.649 [info] <0.18748.0> connection <0.18748.0> (172.31.9.79:58258 -> 172.31.32.108:5672): user 'smpp' authenticated and granted access to vhost 'smpp'
2019-09-17 06:46:16.652 [warning] <0.18748.0> closing AMQP connection <0.18748.0> (172.31.9.79:58258 -> 172.31.32.108:5672, vhost: 'smpp', user: 'smpp'):
client unexpectedly closed TCP connection
2019-09-17 06:47:26.653 [error] <0.18758.0> Supervisor {<0.18758.0>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(2, <0.18748.0>, <0.18759.0>, <0.18748.0>, <<"172.31.9.79:58258 -> 172.31.32.108:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"smpp">>,[],[{rabbit_auth_backend_internal,none}]}, <<"smpp">>, [{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,...},...], <0.18749.0>, <0.18760.0>) at <0.18761.0> exit with reason killed in context shutdown_error
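For completeness, these records can be extracted from each node's log with a simple filter (the log path below assumes the default location used by the RPM packages):

root@amqp-cl1-node2: ~ # grep -E "shutdown_error|unexpectedly closed" /var/log/rabbitmq/rabbit@amqp-cl1-node2.log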
5. We see the following records in crash.log on the two remaining nodes. Note that {shutdown,70000} in each supervisor report matches the roughly 70-second gap between a client closing its TCP connection and the corresponding shutdown_error: the channels apparently fail to stop within their 70000 ms shutdown timeout and are killed.
2019-09-17 06:48:48 =SUPERVISOR REPORT====
Supervisor: {<0.23187.0>,rabbit_channel_sup}
Context: shutdown_error
Reason: killed
Offender: [{pid,<0.23190.0>},{name,channel},{mfargs,{rabbit_channel,start_link,[2,<0.23177.0>,<0.23188.0>,<0.23177.0>,<<"172.31.9.79:58291 -> 172.31.32.103:5672">>,rabbit_framing_amqp_0_9_1,{user,<<"smpp">>,[],[{rabbit_auth_backend_internal,none}]},<<"smpp">>,[{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,bool,true},{<<"consumer_cancel_notify">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"authentication_failure_close">>,bool,true}],<0.23178.0>,<0.23189.0>]}},{restart_type,intrinsic},{shutdown,70000},{child_type,worker}]
2019-09-17 06:50:10 =SUPERVISOR REPORT====
Supervisor: {<0.23563.0>,rabbit_channel_sup}
Context: shutdown_error
Reason: killed
Offender: [{pid,<0.23566.0>},{name,channel},{mfargs,{rabbit_channel,start_link,[2,<0.23553.0>,<0.23564.0>,<0.23553.0>,<<"172.31.9.79:58317 -> 172.31.32.103:5672">>,rabbit_framing_amqp_0_9_1,{user,<<"smpp">>,[],[{rabbit_auth_backend_internal,none}]},<<"smpp">>,[{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,bool,true},{<<"consumer_cancel_notify">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"authentication_failure_close">>,bool,true}],<0.23554.0>,<0.23565.0>]}},{restart_type,intrinsic},{shutdown,70000},{child_type,worker}]
2019-09-17 06:51:31 =SUPERVISOR REPORT====
Supervisor: {<0.23882.0>,rabbit_channel_sup}
Context: shutdown_error
Reason: killed
Offender: [{pid,<0.23885.0>},{name,channel},{mfargs,{rabbit_channel,start_link,[2,<0.23872.0>,<0.23883.0>,<0.23872.0>,<<"172.31.9.79:58350 -> 172.31.32.103:5672">>,rabbit_framing_amqp_0_9_1,{user,<<"smpp">>,[],[{rabbit_auth_backend_internal,none}]},<<"smpp">>,[{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,bool,true},{<<"consumer_cancel_notify">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"authentication_failure_close">>,bool,true}],<0.23873.0>,<0.23884.0>]}},{restart_type,intrinsic},{shutdown,70000},{child_type,worker}]
6. After we switch off the faulty node (amqp-cl1-node1), cluster_status starts reporting only rabbit@amqp-cl1-node3 and rabbit@amqp-cl1-node2 in the running state, and the cluster resumes normal operation.
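Since cluster_status reports all nodes as running for the entire duration of the incident, any useful liveness check has to exercise more than cluster membership. Below is a minimal sketch of a per-node probe (the script name, node list, and 10-second budget are our own; we have not verified that node_health_check actually blocks on a node with a frozen filesystem, but given that list_queues does, it seems likely):

#!/bin/bash
# check_rabbit.sh: detect a node whose VM is up but whose broker has stalled.
# rabbitmqctl node_health_check (available in 3.7.x) exercises the target node;
# timeout(1) exits with 124 if the check had to be killed, e.g. on a frozen node.
for node in rabbit@amqp-cl1-node1 rabbit@amqp-cl1-node2 rabbit@amqp-cl1-node3; do
    if ! timeout 10 rabbitmqctl -n "$node" node_health_check > /dev/null 2>&1; then
        echo "WARNING: $node failed or timed out on node_health_check"
    fi
done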