Hi All,
I am running a 3-node RabbitMQ 3.8.18 cluster with around 10 quorum queues, handling around 100 msgs/sec, on Kubernetes in GCP.
As part of functional high availability testing of RabbitMQ, we restart the nodes. This works fine for the first 2-4 restarts, but after that the cluster ends up in the situation below.
Logs of the remaining nodes in the cluster when one node is restarted:
2021-10-05 09:20:54.975 [info] <0.587.0> node 'rabbit@rabbitmq-svc-2.-rabbitmq-svc-discovery.svc.cluster.local' down: net_tick_timeout
2021-10-05 09:20:54.975 [info] <0.587.0> node 'rabbit@rabbitmq-svc-2.-rabbitmq-svc-discovery.svc.cluster.local' up
2021-10-05 09:20:55.022 [warning] <0.2262.0> Received a 'DOWN' message from 'rab...@rabbitmq-svc-2.rabbitmq-svc-discovery..svc.cluster.local' but still can communicate with it
2021-10-05 09:20:55.022 [error] <0.587.0> Partial partition detected:
* We saw DOWN from rab...@rabbitmq-svc-2.rabbitmq-svc-discovery.svc.cluster.local
* We can still see rab...@rabbitmq-svc-0.rabbitmq-svc-discovery.svc.cluster.local which can see rab...@rabbitmq-svc-2.rabbitmq-svc-discovery.svc.cluster.local
* pause_minority mode enabled
We will therefore pause until the *entire* cluster recovers
2021-10-05 09:20:55.022 [warning] <0.587.0> Cluster minority/secondary status detected - awaiting recovery
2021-10-05 09:20:55.022 [info] <0.2263.0> RabbitMQ is asked to stop...
2021-10-05 09:20:55.250 [debug] <0.2263.0> Plugins discovery: ignoring accept, not a RabbitMQ plugin
Logs of the restarted pod:
10-05 09:19:42.167 [info] <0.60.0> SIGTERM received - shutting down
2021-10-05 09:19:42.167 [debug] <0.44.0> Running rabbit_prelaunch:shutdown_func() as part of `kernel` shutdown
2021-10-05 09:19:42.167 [debug] <0.44.0> Deleting PID file: /var/lib/rabbitmq/mnesia/rabbit@--rabbitmq-svc-2.--rabbitmq-svc-discovery.-.svc.cluster.local.pid
2021-10-05 09:19:42.172 [debug] <0.863.0> Stopping pg scope rabbitmq_federation_pg_scope
2021-10-05 09:19:42.178 [warning] <0.810.0> HTTP listener registry could not find context rabbitmq_management_tls
2021-10-05 09:19:42.192 [debug] <0.290.0> Change boot state to `stopping`
2021-10-05 09:19:42.193 [info] <0.290.0> Will unregister with peer discovery backend rabbit_peer_discovery_k8s
2021-10-05 09:19:42.193 [info] <0.913.0> stopped TLS (SSL) listener on [::]:5671
2021-10-05 09:19:42.194 [info] <0.893.0> stopped TCP listener on [::]:5672
2021-10-05 09:19:42.196 [error] <0.7661.0> Error on AMQP connection <0.7661.0> (10.72.2.11:46154 -> 10.72.10.11:5672, vhost: '/', user: 'guest', state: running), channel 0:
operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"
2021-10-05 09:19:42.198 [error] <0.3051.0> Supervisor {<0.3051.0>,rabbit_channel_sup_sup} had child channel_sup started with rabbit_channel_sup:start_link() at undefined exit with reason shutdown in context shutdown_error
2021-10-05 09:19:43.082 [debug] <0.681.0> queue 'ISO_8583_1993_A_shutDown_--iso-93-67b86d47c7-bznvj' in vhost '/': Leader node 'rabbit@--rabbitmq-svc-0.--rabbitmq-svc-discovery.-.svc.cluster.local' may be down, setting pre-vote timeout
2021-10-05 09:19:43.082 [debug] <0.696.0> queue 'CORE-A' in vhost '/': Leader node 'rabbit@--rabbitmq-svc-0.--rabbitmq-svc-discovery.-.svc.cluster.local' may be down, setting pre-vote timeout
2021-10-05 09:19:43.082 [debug] <0.711.0> queue 'Core-A_shutDown_--core-67cf6466c8-z8c4k' in vhost '/': Leader node 'rabbit@--rabbitmq-svc-0.--rabbitmq-svc-discovery.-.svc.cluster.local' may be down, setting pre-vote timeout
2021-10-05 09:19:45.197 [error] <0.3012.0> CRASH REPORT Process <0.3012.0> with 0 neighbours exited with reason: channel_termination_timeout in rabbit_reader:wait_for_channel_termination/3 line 769
2021-10-05 09:19:45.198 [error] <0.7738.0> CRASH REPORT Process <0.7738.0> with 0 neighbours exited with reason: channel_termination_timeout in rabbit_reader:wait_for_channel_termination/3 line 769
2021-10-05 09:19:45.198 [error] <0.3010.0> Supervisor {<0.3010.0>,rabbit_connection_sup} had child reader started with rabbit_reader:start_link(<0.3011.0>, {acceptor,{0,0,0,0,0,0,0,0},5672}) at <0.3012.0> exit with reason channel_termination_timeout in context shutdown_error
2021-10-05 09:19:45.198 [error] <0.7661.0> CRASH REPORT Process <0.7661.0> with 0 neighbours exited with reason: channel_termination_timeout in rabbit_reader:wait_for_channel_termination/3 line 769
2021-10-05 09:19:45.198 [error] <0.7736.0> Supervisor {<0.7736.0>,rabbit_connection_sup} had child reader started with rabbit_reader:start_link(<0.7737.0>, {acceptor,{0,0,0,0,0,0,0,0},5672}) at <0.7738.0> exit with reason channel_termination_timeout in context shutdown_error
2021-10-05 09:19:45.198 [error] <0.7658.0> Supervisor {<0.7658.0>,rabbit_connection_sup} had child reader started with rabbit_reader:start_link(<0.7659.0>, {acceptor,{0,0,0,0,0,0,0,0},5672}) at <0.7661.0> exit with reason channel_termination_timeout in context shutdown_error
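To relate the two log excerpts in time, here is a minimal Python sketch that computes the gap between the SIGTERM on the restarted pod and the net_tick_timeout reported by a remaining node. Both timestamps are copied from the logs above. The year on the SIGTERM line is missing in the paste, so 2021 is an assumption, and the helper name seconds_between is only illustrative. The result is approximate, since the two pods' clocks may not be perfectly in sync.

```python
from datetime import datetime

# Timestamps copied from the logs above. The SIGTERM line in the pasted log
# is missing its year; 2021 is assumed from the neighbouring lines.
sigterm_at = "2021-10-05 09:19:42.167"    # restarted pod: SIGTERM received
node_down_at = "2021-10-05 09:20:54.975"  # remaining node: net_tick_timeout

FMT = "%Y-%m-%d %H:%M:%S.%f"


def seconds_between(start: str, end: str) -> float:
    """Return the elapsed time in seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


gap = seconds_between(sigterm_at, node_down_at)
print(f"SIGTERM to net_tick_timeout: {gap:.3f} s")
```

So the remaining node reported the restarted node as down roughly 73 seconds after the restart began, and logged it as up again within the same millisecond.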
Any input on this would be helpful.
Thanks & Regards,
Kolla Avinash.