RabbitMQ client loses connection and does not recover


Michael

Sep 8, 2021, 4:02:41 PM
to rabbitmq-users

Hello, 

I am using RabbitMQ 3.8.16 with the Vert.x 3.9.4 RabbitMQ client for my backend application and noticed that the client sometimes loses the connection and fails to recover correctly (automatic recovery is turned on). After the recovery, it stops consuming messages and the durable queues have no consumers.

This happens randomly and with different exceptions. Since I am using the Vert.x RabbitMQ client, I cannot create a separate channel per consumer/publisher: I have something like 10 publishers and 10 consumers on the same node (same connection, same channel), sometimes reaching up to 400 messages per second. Can this cause these kinds of issues?
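The only workaround I can think of is one client per role, since each Vert.x RabbitMQClient manages its own connection and channel, so a channel error on a publisher cannot take down the consumers. A rough sketch of what I mean (Vert.x 3.x API; host and queue names are placeholders, and vertx is assumed to be in scope):

import io.vertx.core.Vertx;
import io.vertx.rabbitmq.RabbitMQClient;
import io.vertx.rabbitmq.RabbitMQOptions;

RabbitMQOptions config = new RabbitMQOptions()
    .setHost("rabbitmq-nodes.rabbitmq")        // placeholder host
    .setAutomaticRecoveryEnabled(true);

// One client (i.e. one connection + one channel) per role,
// instead of all publishers and consumers sharing a single channel.
RabbitMQClient consumerClient = RabbitMQClient.create(vertx, config);
RabbitMQClient publisherClient = RabbitMQClient.create(vertx, config);

consumerClient.start(startRes -> {
    if (startRes.succeeded()) {
        consumerClient.basicConsumer("SERVICE_QUEUE", consRes -> {   // placeholder queue
            if (consRes.succeeded()) {
                consRes.result().handler(msg -> {
                    // handle the delivery on the consumer's own channel
                });
            }
        });
    }
});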

Another potential problem: I have queues representing websocket connections to my backend. These websocket connections are short-lived, so a queue representing a user session is quite often declared and then removed shortly after. Is that a problem for RabbitMQ?
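Concretely, for each session I do roughly the following (a sketch written against the plain Java client API for readability; the session-queue naming is just an example from my code):

import com.rabbitmq.client.Channel;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Called when a websocket session opens (channel is an open com.rabbitmq.client.Channel).
void declareSessionQueue(Channel channel, String sessionId) throws IOException {
    Map<String, Object> args = new HashMap<>();
    args.put("x-queue-type", "quorum");   // session queues are quorum queues
    // Quorum queues must be durable, non-exclusive and non-auto-delete,
    // so the queue has to be deleted explicitly when the session ends.
    channel.queueDeclare("session-" + sessionId, true, false, false, args);
}

// Called when the websocket session closes.
void deleteSessionQueue(Channel channel, String sessionId) throws IOException {
    channel.queueDelete("session-" + sessionId);
}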

In the RabbitMQ log I found the following exceptions, which were related to the connection crash:

The first one:

2021-09-08 08:14:15.321 [error] <0.24867.686> ** Generic server <0.24867.686> terminating
** Last message in was {'$gen_cast',{method,{'basic.publish',0,<<"NOTIFICATION_EXCHANGE">>,<<"BROADCAST_NOTIFICATION_ADDRESS">>,false,false},{content,60,none,<<0,0>>,rabbit_framing_amqp_0_9_1,[<<"{MY_MESSAGE}">>]},flow}}
** When Server state == {ch,{conf,running,rabbit_framing_amqp_0_9_1,1,<0.24846.686>,<0.24862.686>,<0.24846.686>,<<"172.20.24.177:60848 -> 172.20.28.73:5672">>,undefined,{user,<<"backend">>,[],[{rabbit_auth_backend_internal,none}]},<<"backend">>,<<"8a4a4f71-251a-4ab6-85f6-50d594db46b9">>,<0.24850.686>,[{<<"exchange_exchange_bindings">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"authentication_failure_close">>,bool,true},{<<"basic.nack">>,bool,true},{<<"publisher_confirms">>,bool,true},{<<"consumer_cancel_notify">>,bool,true}],none,200,134217728,900000,#{},1000000000},{lstate,<0.24877.686>,false},none,4927,{2,{[{4926,<<"amq.ctag-ckMzQHTLitAIZ86qk0_R0Q">>,1631088855286,{{'backend_SERVICE_QUEUE','rab...@rabbitmq-server-2.rabbitmq-nodes.rabbitmq'},321}}],[{4925,<<"amq.ctag-xDoqkK1VmPptIM9OG3YaPw">>,1631088855286,{{'backend_SERVICE_QUEUE','rab...@rabbitmq-server-2.rabbitmq-nodes.rabbitmq'},321}}]}},#{'backend_9fb541c5-a9ae-4bbc-99a4-ff444316b3f1' => {resource,<<"backend">>,queue,<<"9fb541c5-a9ae-4bbc-99a4-ff444316b3f1">>},'backend_5f188762-0a66-47e3-bfa8-76aa11481cb9' => {resource,<<"backend">>,queue,<<"5f188762-0a66-47e3-bfa8-76aa11481cb9">>},'backend_e4a3cec7-3f6f-4b05-bdf2-ade8a087425d' => {resource,<<"backend">>,queue,<<"e4a3cec7-3f6f-4b05-bdf2-ade8a087425d">>},'backend_e7a9834a-7ddc-4690-a317-4f19433edafd' => {resource,<<"backend">>,queue,<<"e7a9834a-7ddc-4690-a317-4f19433edafd">>},'backend_6fdcc521-4ff1-4721-bf12-cefffe1181eb' => ...,...},...}
** Reason for termination ==
** {{error,noproc},[{rabbit_fifo_client,enqueue,3,[{file,"src/rabbit_fifo_client.erl"},{line,168}]},{rabbit_quorum_queue,deliver,3,[{file,"src/rabbit_quorum_queue.erl"},{line,754}]},{rabbit_amqqueue,'-deliver/3-fun-3-',4,[{file,"src/rabbit_amqqueue.erl"},{line,2239}]},{lists,foldl,3,[{file,"lists.erl"},{line,1267}]},{rabbit_amqqueue,deliver,3,[{file,"src/rabbit_amqqueue.erl"},{line,2236}]},{rabbit_channel,deliver_to_queues,2,[{file,"src/rabbit_channel.erl"},{line,2204}]},{rabbit_channel,handle_method,3,[{file,"src/rabbit_channel.erl"},{line,1375}]},{rabbit_channel,handle_cast,2,[{file,"src/rabbit_channel.erl"},{line,643}]}]}
2021-09-08 08:14:15.321 [error] <0.24867.686> CRASH REPORT Process <0.24867.686> with 0 neighbours exited with reason: {error,noproc} in rabbit_fifo_client:enqueue/3 line 168 in gen_server2:terminate/3 line 1183
2021-09-08 08:14:15.322 [error] <0.24859.686> Supervisor {<0.24859.686>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.24846.686>, <0.24862.686>, <0.24846.686>, <<"172.20.24.177:60848 -> 172.20.28.73:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"backend">>,[],[{rabbit_auth_backend_internal,none}]}, <<"backend">>, [{<<"exchange_exchange_bindings">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"authentica...">>,...},...], <0.24850.686>, <0.24877.686>) at <0.24867.686> exit with reason {error,noproc} in rabbit_fifo_client:enqueue/3 line 168 in context child_terminated
2021-09-08 08:14:15.322 [error] <0.24859.686> Supervisor {<0.24859.686>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.24846.686>, <0.24862.686>, <0.24846.686>, <<"172.20.24.177:60848 -> 172.20.28.73:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"backend">>,[],[{rabbit_auth_backend_internal,none}]}, <<"backend">>, [{<<"exchange_exchange_bindings">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"authentica...">>,...},...], <0.24850.686>, <0.24877.686>) at <0.24867.686> exit with reason reached_max_restart_intensity in context shutdown
2021-09-08 08:14:15.323 [error] <0.24846.686> Error on AMQP connection <0.24846.686> (172.20.24.177:60848 -> 172.20.28.73:5672, vhost: 'backend', state: running), channel 1: {{error,noproc}, [{rabbit_fifo_client,enqueue,3, [{file,"src/rabbit_fifo_client.erl"},{line,168}]}, {rabbit_quorum_queue,deliver,3, [{file,"src/rabbit_quorum_queue.erl"},{line,754}]}, {rabbit_amqqueue,'-deliver/3-fun-3-',4, [{file,"src/rabbit_amqqueue.erl"},{line,2239}]}, {lists,foldl,3,[{file,"lists.erl"},{line,1267}]}, {rabbit_amqqueue,deliver,3,[{file,"src/rabbit_amqqueue.erl"},{line,2236}]}, {rabbit_channel,deliver_to_queues,2, [{file,"src/rabbit_channel.erl"},{line,2204}]}, {rabbit_channel,handle_method,3, [{file,"src/rabbit_channel.erl"},{line,1375}]}, {rabbit_channel,handle_cast,2, [{file,"src/rabbit_channel.erl"},{line,643}]}]}
2021-09-08 08:14:15.323 [warning] <0.24846.686> Non-AMQP exit reason '{{error,noproc},[{rabbit_fifo_client,enqueue,3,[{file,"src/rabbit_fifo_client.erl"},{line,168}]},{rabbit_quorum_queue,deliver,3,[{file,"src/rabbit_quorum_queue.erl"},{line,754}]},{rabbit_amqqueue,'-deliver/3-fun-3-',4,[{file,"src/rabbit_amqqueue.erl"},{line,2239}]},{lists,foldl,3,[{file,"lists.erl"},{line,1267}]},{rabbit_amqqueue,deliver,3,[{file,"src/rabbit_amqqueue.erl"},{line,2236}]},{rabbit_channel,deliver_to_queues,2,[{file,"src/rabbit_channel.erl"},{line,2204}]},{rabbit_channel,handle_method,3,[{file,"src/rabbit_channel.erl"},{line,1375}]},{rabbit_channel,handle_cast,2,[{file,"src/rabbit_channel.erl"},{line,643}]}]}'

Another one that I noticed is:

2021-09-02 14:59:43.260 [error] <0.17646.16> closing AMQP connection <0.17646.16> (172.20.23.201:35420 -> 172.20.36.197:5672): missed heartbeats from client, timeout: 60s
2021-09-02 14:59:43.263 [info] <0.29592.28> Closing all channels from '172.20.23.201:35420 -> 172.20.36.197:5672' because it has been closed

And the last one:

2021-09-08 09:47:14.948 [warning] <0.7610.564> segment_writer: skipping segment as directory /var/lib/rabbitmq/mnesia/rab...@rabbitmq-server-1.rabbitmq-nodes.rabbitmq/quorum/rab...@rabbitmq-server-1.rabbitmq-nodes.rabbitmq/ORCHES3OXCNJJBB4VW does not exist
2021-09-08 09:47:14.948 [warning] <0.7611.564> segment_writer: failed to open segment file /var/lib/rabbitmq/mnesia/rab...@rabbitmq-server-1.rabbitmq-nodes.rabbitmq/quorum/rab...@rabbitmq-server-1.rabbitmq-nodes.rabbitmq/ORCHES0C5F3P5Y6WKO/00000001.segment error: enoent
2021-09-08 09:47:14.948 [warning] <0.199.564> segment_writer: failed to open segment file /var/lib/rabbitmq/mnesia/rab...@rabbitmq-server-1.rabbitmq-nodes.rabbitmq/quorum/rab...@rabbitmq-server-1.rabbitmq-nodes.rabbitmq/ORCHES9RZ4RFC37MDG/00000001.segment error: enoent
2021-09-08 09:47:14.948 [warning] <0.7611.564> segment_writer: skipping segment as directory /var/lib/rabbitmq/mnesia/rab...@rabbitmq-server-1.rabbitmq-nodes.rabbitmq/quorum/rab...@rabbitmq-server-1.rabbitmq-nodes.rabbitmq/ORCHES0C5F3P5Y6WKO does not exist
2021-09-08 09:47:14.948 [warning] <0.199.564> segment_writer: skipping segment as directory /var/lib/rabbitmq/mnesia/rab...@rabbitmq-server-1.rabbitmq-nodes.rabbitmq/quorum/rab...@rabbitmq-server-1.rabbitmq-nodes.rabbitmq/ORCHES9RZ4RFC37MDG does not exist
2021-09-08 09:47:14.950 [warning] <0.7600.564> segment_writer: failed to open segment file /var/lib/rabbitmq/mnesia/rab...@rabbitmq-server-1.rabbitmq-nodes.rabbitmq/quorum/rab...@rabbitmq-server-1.rabbitmq-nodes.rabbitmq/ORCHESAQL00HNNSPK4/00000001.segment error: enoent
2021-09-08 09:47:14.950 [warning] <0.7600.564> segment_writer: skipping segment as directory /var/lib/rabbitmq/mnesia/rab...@rabbitmq-server-1.rabbitmq-nodes.rabbitmq/quorum/rab...@rabbitmq-server-1.rabbitmq-nodes.rabbitmq/ORCHESAQL00HNNSPK4 does not exist

On the client side I noticed:

15:11:30.816 [vert.x-eventloop-thread-0] ERROR Fail to handle message
com.rabbitmq.client.AlreadyClosedException: connection is already closed due to connection error; cause: java.lang.IllegalStateException: Unsolicited delivery - see Channel.setDefaultConsumer to handle this case.
	at com.rabbitmq.client.impl.AMQConnection.ensureIsOpen(AMQConnection.java:172)
	at com.rabbitmq.client.impl.AMQConnection.createChannel(AMQConnection.java:550)
	at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.createChannel(AutorecoveringConnection.java:165)
	at io.vertx.rabbitmq.impl.RabbitMQClientImpl.forChannel(RabbitMQClientImpl.java:476)
	at io.vertx.rabbitmq.impl.RabbitMQClientImpl.basicPublish(RabbitMQClientImpl.java:194)
	at io.vertx.reactivex.rabbitmq.RabbitMQClient.basicPublish(RabbitMQClient.java:363)
	at io.vertx.reactivex.rabbitmq.RabbitMQClient.lambda$rxBasicPublish$9(RabbitMQClient.java:376)
	at io.vertx.reactivex.rabbitmq.RabbitMQClient$$Lambda$699/0x00000000102c38c0.accept(Unknown Source)
	at io.vertx.reactivex.impl.AsyncResultCompletable.subscribeActual(AsyncResultCompletable.java:44)

or

08:14:15.330 [AMQP Connection 10.100.56.15:5672] WARN com.rabbitmq.client.impl.ForgivingExceptionHandler - An unexpected connection driver error occured (Exception message: Connection reset)
08:14:15.332 [AMQP Connection 10.100.56.15:5672] INFO io.vertx.rabbitmq.impl.RabbitMQClientImpl - RabbitMQ connection shutdown! The client will attempt to reconnect automatically
com.rabbitmq.client.ShutdownSignalException: connection error; protocol method: #method<connection.close>(reply-code=541, reply-text=INTERNAL_ERROR, class-id=0, method-id=0)
	at com.rabbitmq.client.impl.AMQConnection.startShutdown(AMQConnection.java:916)
	at com.rabbitmq.client.impl.AMQConnection.shutdown(AMQConnection.java:906)
	at com.rabbitmq.client.impl.AMQConnection.handleConnectionClose(AMQConnection.java:844)
	at com.rabbitmq.client.impl.AMQConnection.processControlCommand(AMQConnection.java:799)
	at com.rabbitmq.client.impl.AMQConnection$1.processAsync(AMQConnection.java:242)
	at com.rabbitmq.client.impl.AMQChannel.handleCompleteInboundCommand(AMQChannel.java:178)
	at com.rabbitmq.client.impl.AMQChannel.handleFrame(AMQChannel.java:111)
	at com.rabbitmq.client.impl.AMQConnection.readFrame(AMQConnection.java:650)
	at com.rabbitmq.client.impl.AMQConnection.access$300(AMQConnection.java:48)
	at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:597)

Since those errors are not clear to me, does anybody have an idea why this is happening?

Thank you for your help

Arnaud Cogoluègnes

Sep 9, 2021, 2:00:01 AM
to rabbitmq-users
I'm not familiar with Vert.x RabbitMQ support. Have you tried to reproduce the client error by e.g. dropping the connection yourself?
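For example, you can force a drop from the broker side and watch how the client reacts (the pid placeholder is whatever list_connections reports for your client):

rabbitmqctl list_connections pid peer_host peer_port client_properties
rabbitmqctl close_connection "<connection pid>" "testing client recovery"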

Michael

Sep 17, 2021, 11:42:09 AM
to rabbitmq-users

Hello,

Yes, I tried to drop the connection manually, but I never reproduced the issue. When I dropped the connection manually, the RabbitMQ client always managed to recover and it worked as expected.

It seems that in my case the problem is caused by some channel errors (never the same one), after which the client does not recover. It actually reconnects to the RabbitMQ server, but it does not re-register the consumers on the existing queues, so messages stay in the queues and are never consumed.

I would like to ask:

-> Is it normal that the Java RabbitMQ client, with automatic recovery enabled, does not recover after a channel error (channel closed)?

-> If yes: I have an application that needs to recover in any case. What is the best way to recover (recreate a new RabbitMQ client? is there any event to listen for if recovery fails?)? See the sketch after this list for what I am trying.

-> I have an application where clients connect over websocket, and each websocket session has a quorum queue so that microservices can send messages to it. Could it be a problem to have many queues (representing many websocket sessions)?

-> Is it a problem for a service expecting high message rates to have only one connection and one channel for all communication (consumers plus publishers with confirms) with RabbitMQ?
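For the second question, this is the kind of hook I am experimenting with (a sketch with the plain Java client; the listener bodies are placeholders and I am not sure it is the recommended approach):

import com.rabbitmq.client.*;

public class RecoveryProbe {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setAutomaticRecoveryEnabled(true);  // recovers connection-level failures
        factory.setTopologyRecoveryEnabled(true);   // re-declares queues/bindings/consumers afterwards
        Connection connection = factory.newConnection();

        // Fired on any connection shutdown, before recovery kicks in.
        connection.addShutdownListener(cause ->
            System.err.println("connection shutdown: " + cause.getMessage()));

        // Fired by the autorecovering connection when recovery starts/completes.
        ((Recoverable) connection).addRecoveryListener(new RecoveryListener() {
            @Override public void handleRecoveryStarted(Recoverable r) {
                System.err.println("recovery started");
            }
            @Override public void handleRecovery(Recoverable r) {
                // A safe point to verify the consumers are back; if they are not,
                // tear the client down and build a fresh one.
                System.err.println("recovery completed");
            }
        });
    }
}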


Thank you for your time
On Thursday, September 9, 2021 at 8:00:01 UTC+2, Arnaud Cogoluègnes wrote: