Dear All,
We are running a 3-node RabbitMQ cluster (StatefulSet) in a Kubernetes environment, deployed via Helm (Bitnami chart).
Running Nodes
rab...@rabbitmq-0.rabbitmq-headless.myorg.svc.cluster.local
rab...@rabbitmq-1.rabbitmq-headless.myorg.svc.cluster.local
rab...@rabbitmq-2.rabbitmq-headless.myorg.svc.cluster.local
Versions
rab...@rabbitmq-0.rabbitmq-headless.myorg.svc.cluster.local: RabbitMQ 3.8.14 on Erlang 23.3.2
rab...@rabbitmq-1.rabbitmq-headless.myorg.svc.cluster.local: RabbitMQ 3.8.14 on Erlang 23.3.2
rab...@rabbitmq-2.rabbitmq-headless.myorg.svc.cluster.local: RabbitMQ 3.8.14 on Erlang 23.3.2
Recently, one of our applications (Node.js) consuming a particular queue was unable to find the queue and started logging the errors below. We noticed these errors had been appearing for the past 5 days, but there was no impact on the application until we deployed a new version of it; the new pods then started getting the following errors and were not able to become healthy (a rough sketch of the consumer-side queue declaration follows the stack trace below).
[ERROR] 2022-12-19T14:06:11.642Z [RabbitManager.js:80] [worker:23] :: RabbitMQ connection was disconnected. Channel closed by server: 404 (NOT-FOUND) with message "NOT_FOUND - home node 'rab...@rabbitmq-2.rabbitmq-headless.myorg.svc.cluster.local' of durable queue 'gateway.deferred' in vhost '/' is down or inaccessible"
[ERROR] 2022-12-19T14:06:11.644Z [tasks.js:307] [worker:23] :: [OperationalError: Operation failed: QueueDeclare; 404 (NOT-FOUND) with message "NOT_FOUND - home node 'rab...@rabbitmq-2.rabbitmq-headless.myorg.svc.cluster.local' of durable queue 'gateway.deferred' in vhost '/' is down or inaccessible"
at reply (/data/myorg/imt-gateway/node_modules/amqplib/lib/channel.js:134:29)
at ConfirmChannel.C.accept (/data/myorg/imt-gateway/node_modules/amqplib/lib/channel.js:417:7)
at Connection.mainAccept [as accept] (/data/myorg/imt-gateway/node_modules/amqplib/lib/connection.js:64:33)
at Socket.go (/data/myorg/imt-gateway/node_modules/amqplib/lib/connection.js:478:48)
at /data/myorg/imt-gateway/node_modules/@myorg/overseer/node_modules/dd-trace/packages/dd-trace/src/scope/base.js:54:19
at Scope._activate (/data/myorg/imt-gateway/node_modules/@myorg/overseer/node_modules/dd-trace/packages/dd-trace/src/scope/async_resource.js:57:14)
at Scope.activate (/data/myorg/imt-gateway/node_modules/@myorg/overseer/node_modules/dd-trace/packages/dd-trace/src/scope/base.js:12:19)
at Socket.bound (/data/myorg/imt-gateway/node_modules/@myorg/overseer/node_modules/dd-trace/packages/dd-trace/src/scope/base.js:53:20)
at Socket.emit (node:events:513:28)
at Socket.emit (node:domain:489:12)
at emitReadable_ (node:internal/streams/readable:578:12)
at processTicksAndRejections (node:internal/process/task_queues:82:21)] {
cause: [Error: Operation failed: QueueDeclare; 404 (NOT-FOUND) with message "NOT_FOUND - home node 'rab...@rabbitmq-2.rabbitmq-headless.myorg.svc.cluster.local' of durable queue 'gateway.deferred' in vhost '/' is down or inaccessible"
at reply (/data/myorg/imt-gateway/node_modules/amqplib/lib/channel.js:134:29)
at ConfirmChannel.C.accept (/data/myorg/imt-gateway/node_modules/amqplib/lib/channel.js:417:7)
at Connection.mainAccept [as accept] (/data/myorg/imt-gateway/node_modules/amqplib/lib/connection.js:64:33)
at Socket.go (/data/myorg/imt-gateway/node_modules/amqplib/lib/connection.js:478:48)
at /data/myorg/imt-gateway/node_modules/@myorg/overseer/node_modules/dd-trace/packages/dd-trace/src/scope/base.js:54:19
at Scope._activate (/data/myorg/imt-gateway/node_modules/@myorg/overseer/node_modules/dd-trace/packages/dd-trace/src/scope/async_resource.js:57:14)
at Scope.activate (/data/myorg/imt-gateway/node_modules/@myorg/overseer/node_modules/dd-trace/packages/dd-trace/src/scope/base.js:12:19)
at Socket.bound (/data/myorg/imt-gateway/node_modules/@myorg/overseer/node_modules/dd-trace/packages/dd-trace/src/scope/base.js:53:20)
at Socket.emit (node:events:513:28)
at Socket.emit (node:domain:489:12)
at emitReadable_ (node:internal/streams/readable:578:12)
at processTicksAndRejections (node:internal/process/task_queues:82:21)] {
code: 404,
classId: 50,
methodId: 10
},
isOperational: true,
code: 404,
classId: 50,
methodId: 10
}
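For context, the call that triggers the queue.declare above (classId 50, methodId 10) lives inside our RabbitManager.js wrapper; conceptually it boils down to something like the amqplib sketch below (the URL, credentials and handler are placeholders/assumptions, not our actual code):
--------------------------------------------------------------------
const amqp = require('amqplib');

async function startConsumer() {
  // Placeholder connection URL; the real one points at the rabbitmq service in the cluster
  const conn = await amqp.connect('amqp://user:pass@rabbitmq.myorg.svc.cluster.local:5672');
  const channel = await conn.createChannel();

  // queue.declare on the durable queue; this is the call that fails with
  // 404 NOT_FOUND while the queue's home node is down or inaccessible
  await channel.assertQueue('gateway.deferred', { durable: true });

  await channel.consume('gateway.deferred', (msg) => {
    if (msg !== null) {
      // ... process the message ...
      channel.ack(msg);
    }
  });
}

startConsumer().catch((err) => console.error(err));
--------------------------------------------------------------------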
RabbitMQ Logs:
----------------
The line below started appearing in the logs of all RabbitMQ nodes 5 days ago, and I believe that matches when the application logs started showing the errors above:
operation queue.declare caused a channel exception not_found: home node 'rab...@rabbitmq-2.rabbitmq-headless.myorg.svc.cluster.local' of durable queue 'gateway.deferred' in vhost '/' is down or inaccessible
--> The queue "gateway.deferred" is a mirrored queue with the ha-2 policy (shown as ha-2-gateway below), which we also confirmed via the CLI:
--------------------------------------------------------------------
$ rabbitmqctl list_queues name policy pid slave_pids|grep deferred
gateway.deferred ha-2-gateway <rab...@rabbitmq-2.rabbitmq-headless.myorg.svc.cluster.local.1661957998.1152.0> [<rab...@rabbitmq-0.rabbitmq-headless.myorg.svc.cluster.local.1663848150.28030.2196>]
--------------------------------------------------------------------
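As far as we understand, ha-2 means "exactly two replicas" (one leader plus one mirror), and the policy was created with something like the command below (the queue-name pattern here is an assumption from memory, not copied from the cluster):
--------------------------------------------------------------------
$ rabbitmqctl set_policy ha-2-gateway "^gateway\." '{"ha-mode":"exactly","ha-params":2}' --apply-to queues
--------------------------------------------------------------------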
The issue was resolved by restarting the rabbitmq-2 node.
RabbitMQ logs during and after the restart:
-----------------------------------------
rabbitmq-2 ->
<0.657.0> Mirrored queue 'gateway.deferred' in vhost '/': Stopping all nodes on master shutdown since no synchronised mirror (replica) is available
<0.656.0> Mirrored queue 'gateway.deferred' in vhost '/': Adding mirror on node 'rab...@rabbitmq-2.rabbitmq-headless.myorg.svc.cluster.local': <0.728.0>
^--> (The line above says a mirror is being added on node rabbitmq-2, but the rabbitmqctl output above shows rabbitmq-2 as the leader; please correct me if my understanding is wrong.)
rabbitmq-0 ->
<0.23961.952> Mirrored queue 'gateway.deferred' in vhost '/': Synchronising: all mirrors already synced
<0.654.0> Mirrored queue 'gateway.deferred' in vhost '/': Synchronising: 0 messages to synchronise
<0.654.0> Mirrored queue 'gateway.deferred' in vhost '/': Synchronising: batch size: 100
I am trying to understand what could have caused these errors, and why the application was not able to find that queue on node rabbitmq-2 when, according to the output, that node was acting as the leader. Secondly, what happened during the restart that resolved the issue?
Please let me know if more details are required.
Regards,
Neeraj