Unexpected Client Connection Shutdown During Message Reception


Tim Wensky

Dec 13, 2022, 12:50:50 PM
to rabbitmq-users
I am experiencing a strange client connection shutdown while our server is publishing a large message to our client. We are transferring videos encoded as base64 strings, with the video size limited to 100 MB. Since we have to deal with slow connections, receiving the message may take a few minutes. Every single time, after exactly 90 s of message transfer, the connection is unexpectedly shut down. My team and I have tried changing pretty much every configuration value that might cause the problem, without success. Transfers that take less than 90 s work fine and are well tested. Our complete system works as expected except for this issue.
Our setup:
- server: RabbitMQ 3.8.9.
- client: RabbitMQ .NET 6.4.0
The team is experienced in using RabbitMQ, and similar systems with less data have been field-tested for over a year now. Therefore I am not expecting any "normal" configuration or programming issues, but we simply cannot find a reason for this behavior.
Does anyone have an idea? Thanks!

Luke Bakken

Dec 13, 2022, 1:43:16 PM
to rabbitmq-users
Hello,

Just FYI RabbitMQ 3.8.x is completely out of support.

You are probably running up against a socket write timeout. Without a log file (or log messages from the time of the error), I can only guess. Please provide that.

In these cases it's probably best to chunk your messages into sizes that will not hit this limit.
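As a rough illustration of what such chunking could look like, here is a minimal sketch in plain Python with no RabbitMQ client involved. The names `Chunk`, `split_payload`, and `Reassembler` are hypothetical and not part of any RabbitMQ library, and the chunk size is an arbitrary assumption; in practice each chunk would be published as its own message, with the transfer id, index, and total carried in the message headers or body.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    transfer_id: str  # identifies which payload this chunk belongs to
    index: int        # zero-based position of this chunk in the payload
    total: int        # number of chunks in the whole transfer
    data: bytes


def split_payload(payload: bytes, chunk_size: int) -> list[Chunk]:
    """Split a payload into chunks of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    transfer_id = uuid.uuid4().hex
    # Ceiling division; an empty payload still yields one (empty) chunk.
    total = max(1, -(-len(payload) // chunk_size))
    return [
        Chunk(transfer_id, i, total, payload[i * chunk_size:(i + 1) * chunk_size])
        for i in range(total)
    ]


class Reassembler:
    """Collects chunks, possibly out of order, and returns the full payload."""

    def __init__(self) -> None:
        self._pending: dict[str, dict[int, bytes]] = {}

    def add(self, chunk: Chunk) -> bytes | None:
        parts = self._pending.setdefault(chunk.transfer_id, {})
        parts[chunk.index] = chunk.data
        if len(parts) < chunk.total:
            return None  # still waiting for more chunks
        del self._pending[chunk.transfer_id]
        return b"".join(parts[i] for i in range(chunk.total))
```

With chunks small enough to be written well within the socket send timeout, other frames such as heartbeats get a chance to be sent between them.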

Thanks,
Luke

Tim Wensky

Dec 14, 2022, 2:16:31 AM
to rabbitmq-users
Hi Luke,
thanks for the info! We previously tried updating RabbitMQ, but after the update our whole system stopped working. Since we are running a live system with customers behind it, things like that are extremely critical. But we clearly should find out what went wrong and finally update RabbitMQ.
My idea was also to chunk our messages, but I am hesitating since I really want to know what is going wrong (and changing the protocol would be a tough challenge in a running system). In the end, chunking will be my solution if we cannot figure out another one.

For the log file, this is what we captured yesterday (I x-ed out the IPs); the second line is the relevant one, I suppose:
2022-12-13 15:33:30.519276+01:00 [error] <0.4650.0> closing AMQP connection <0.4650.0> (xx.xx.237.196:xxxx -> xx.xx.178.74:xxxxx):
2022-12-13 15:33:30.519276+01:00 [error] <0.4650.0> {inet_error,timeout}
2022-12-13 15:33:30.523731+01:00 [debug] <0.4713.0> Closing all channels from connection 'xx.xx.237.196:xxxx -> xx.xx.178.74:xxxxx' because it has been closed
2022-12-13 15:33:30.523864+01:00 [debug] <0.4685.0> Deleting auto-delete queue '2133FEFE1E1AEBD5_group' in vhost 'xxxx' because all of its consumers (1) were on a channel that was closed
2022-12-13 15:33:30.524032+01:00 [debug] <0.4665.0> Deleting auto-delete queue '2133FEFE1E1AEBD5_manager' in vhost 'xxxx' because all of its consumers (1) were on a channel that was closed
2022-12-13 15:33:30.534750+01:00 [debug] <0.4676.0> Deleting auto-delete queue '2133FEFE1E1AEBD5_controller' in vhost 'xxxx' because all of its consumers (1) were on a channel that was closed

I have also captured the data flow using Wireshark; see the attached image.png. What I can see is that the data is coming in as expected, so a read/write timeout "should" not happen.


image.png

Gregory Green

Dec 14, 2022, 6:23:41 AM
to rabbitm...@googlegroups.com
Hello Tim,

If you decide to rearchitect the design, I would recommend you consider implementing a claim check pattern.

With this approach, you send the location of the file in a claim message instead of the actual video. You can implement a video service to which RabbitMQ consumers present their claim. The video service could allow consumers to stream larger video files.

This approach may be better than splitting the video files, because:

- the claim can remain in RabbitMQ until the file is safely transferred by the consumer
- you can track the unclaimed videos in RabbitMQ
- the claim check can use the dead letter approach for missing or invalid claims
- the two previous points can simplify troubleshooting
- you can scale the claim processing among multiple consumers without having to worry about video segments being processed out of order
- the claim check allows you to process 1 file with 1 message (limiting the number of network calls)
- you can take advantage of protocols that are best suited for streaming larger payloads like videos (e.g. an HTTP-based protocol)
- this may allow for lower latency and increased throughput for file transfers overall
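
The claim check idea can be sketched roughly as follows. This is only an illustration under stated assumptions: `InMemoryStore` is a hypothetical stand-in for whatever video service or object store is used, the field names in the claim message are invented for this example, and the actual publishing and consuming through RabbitMQ is left out.

```python
import json
import uuid


class InMemoryStore:
    """Hypothetical stand-in for a video service or object store."""

    def __init__(self) -> None:
        self._objects = {}

    def put(self, data: bytes) -> str:
        key = uuid.uuid4().hex
        self._objects[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]


def make_claim(store: InMemoryStore, video: bytes, filename: str) -> bytes:
    """Store the video and return a small claim message body to publish."""
    key = store.put(video)
    claim = {"claim_key": key, "filename": filename, "size_bytes": len(video)}
    return json.dumps(claim).encode("utf-8")


def redeem_claim(store: InMemoryStore, message_body: bytes) -> bytes:
    """Consumer side: parse the claim and fetch the video from the store."""
    claim = json.loads(message_body.decode("utf-8"))
    video = store.get(claim["claim_key"])
    if len(video) != claim["size_bytes"]:
        raise ValueError("stored object does not match the claim")
    return video
```

The message that travels through the broker stays around a hundred bytes regardless of the video size, so the large transfer no longer occupies the AMQP connection.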

Hope this helps, 
please excuse any typos 


Luke Bakken

Dec 14, 2022, 10:24:48 AM
to rabbitmq-users
Hello,

Thanks for the info. The Wireshark screen shot doesn't quite match the log entry you have provided.

I'm assuming the time difference is due to you being in the UTC+1 time zone.

Note the difference in IP addresses between the RabbitMQ log and the Wireshark output - is there a load balancer or other network device in use?

Thanks,
Luke

Tim Wensky

Dec 14, 2022, 10:39:02 AM
to rabbitmq-users
Hi Luke,

The time difference is probably because we did not capture the log and the Wireshark trace at the same time. We have tested this many times, and the outcome was always the same.
The IP address difference might be caused by the fact that this test was running on my office PC. Wireshark logged the IP address of my PC on the local network while the server logged the IP of my router; at least I do not have any other explanation. We have since set up a brand-new RabbitMQ (latest version) on a clean PC and tested again, and the results are the same. If there was a load balancer, it would have to be somewhere on the internet route, which is out of my control. We set up the test system as clean as possible.

Luke Bakken

Dec 14, 2022, 11:23:01 AM
to rabbitmq-users
Hello,

Let's take a closer look at the Wireshark image you provided. Note that all of the packets that you show have a source IP starting with 90, and destination starting with 192. I'm assuming that 90.* is your RabbitMQ server, and 192.* is the client application.

In the entire screenshot, there are no TCP ACK packets from the client application to RabbitMQ.

Unless you have configured Wireshark to only show packets flowing in one direction, my guess is that this is the cause of your issue.

I ran RabbitMQ and PerfTest locally, configuring PerfTest to send one message every 10 seconds. Here is what it looks like in Wireshark:

perftest-wireshark.png

Note that data is flowing in both directions.


Tim Wensky

Dec 16, 2022, 1:10:34 AM
to rabbitmq-users
Hi,
since our time is running low, my team has decided to redesign our system and chunk large messages. Unfortunately I will have to freeze this issue for now. Thank you very much for your help!

Luke Bakken

Dec 16, 2022, 9:47:08 AM
to rabbitmq-users
Hi Tim,

That's the best plan to be honest. Even better would be to use a dedicated object store like S3.

I am still planning to try and reproduce this issue, stay tuned. Have a good weekend!

Luke Bakken

Dec 19, 2022, 5:28:21 PM
to rabbitmq-users
Hello again Tim -

For what it's worth, I can reproduce the issue you report with this docker compose project:


I need to add some logging to see what is going on, but this is a good start.

Thanks,
Luke

Luke Bakken

Dec 19, 2022, 6:45:29 PM
to rabbitmq-users
Hi Tim,

If you have time, would you please attach all of your RabbitMQ configuration?

Have you adjusted any heartbeat intervals or timeout values in RabbitMQ or your client application?

Thanks,
Luke

Luke Bakken

Dec 20, 2022, 12:17:24 PM
to rabbitmq-users
Hi Tim,

I'm not sure if I'll hear back from you, but the error occurs because the connection to your consumer is slow enough to block RabbitMQ from sending a heartbeat to the client application (the socket is busy sending 100MiB of data). There is no way to address this without increasing heartbeat and socket send timeouts, which would have other detrimental effects. Your decision to chunk messages is the correct one.

I have opened the following PR to make the cause of the error more apparent:


Future versions of RabbitMQ will log the error as follows:

2022-12-20 08:57:12.614470-08:00 [error] <0.959.0> closing AMQP connection <0.959.0> (127.0.0.1:52052 -> 127.0.0.1:5672):
2022-12-20 08:57:12.614470-08:00 [error] <0.959.0> {inet_error,{heartbeat_send_error,timeout}}


You hit this timeout at 90 seconds because the default heartbeat interval is 60 seconds and the default socket send timeout is 30 seconds. After your consumer connects, RabbitMQ attempts to send a heartbeat at 60 seconds, and that send times out 30 seconds later because the TCP connection is busy sending all of that data.
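
The arithmetic above can be written out as a small sketch. The two default values are taken from the explanation in this message. The throughput figure is only a rough estimate under assumptions of my own: that "100MB" means 100 * 10^6 bytes on the wire and that the whole message has to be delivered before the heartbeat send times out.

```python
HEARTBEAT_INTERVAL_S = 60  # default heartbeat interval, as stated above
SEND_TIMEOUT_S = 30        # default socket send timeout, as stated above


def seconds_until_disconnect(heartbeat_s: float = HEARTBEAT_INTERVAL_S,
                             send_timeout_s: float = SEND_TIMEOUT_S) -> float:
    """Time after which a blocked heartbeat send closes the connection."""
    return heartbeat_s + send_timeout_s


def min_throughput_mbit_s(payload_bytes: int, window_s: float) -> float:
    """Rough throughput needed to deliver the payload within the window."""
    return payload_bytes * 8 / window_s / 1_000_000


window = seconds_until_disconnect()
print(window)  # 90
# Assumed payload size: 100 MB taken as 100 * 10^6 bytes.
print(round(min_throughput_mbit_s(100 * 1000 * 1000, window), 2))  # 8.89
```

Under these assumptions, a consumer link slower than roughly 9 Mbit/s cannot receive a 100 MB message before the connection is closed, which is consistent with transfers shorter than 90 seconds working fine.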

Thanks,
Luke