AMQP Connection Reset : [Malformed packet : length of contained item exceeds the length of containing item]

945 views
Skip to first unread message

abhayverma

unread,
Jul 12, 2020, 8:09:36 AM7/12/20
to rabbitmq-users
OS - Amazon Linux 2
Docker - 9.3.6
Kernel - 4.14.181-140.257.amzn2.x86_64
RabbitMQ 3.8.5
Erlang 23.0.2

Attaching pcap file from the client end where we can observe TCP resets. Out of 100 messages published only 1 message pass through. 

rabbitmqctl environment output - rabbitmq-environment.txt
tcpdump captured after a delay at client end - old.pcap
tcpdump captured in begining at client end - new.pcap

We were trying move our message broker service from VM to K8s and in that process we observed this anomaly. Though the setup is working fine on VM.

Please help us to move in right direction to close this issue.
old.pcap
rabbitmq-environment.txt
new.pcap

Gerhard Lazu

unread,
Jul 13, 2020, 7:26:06 AM7/13/20
to rabbitmq-users
Your K8S network setup differs from your VM network setup. My recommendation would be to understand the delta between the two. If there is an issue at layers 4/5 in the network layer (TCP), it's likely that it will show up in layer 7 as well (AMQP in this case). As RabbitMQ is layer 7, Erlang (the runtime) is layer 6 and this error is at layers 4/5,   there is nothing to "fix" in RabbitMQ or Erlang.

Last time that I checked the simplest RabbitMQ on K8S, everything worked fine: https://github.com/rabbitmq/tgir/tree/S01E05/s01/e05. While this was DKE, I don't expect things to be any different on EKS. I intend to verify in the next episode.

Michael Klishin

unread,
Jul 13, 2020, 7:45:55 AM7/13/20
to rabbitmq-users
Please always share logs from all nodes. All `rabbitmq-diagnostics environment` can tell us here is what ports the node is listening on.

Michael Klishin

unread,
Jul 13, 2020, 2:29:56 PM7/13/20
to rabbitmq-users
In a log file sent privately we see a lot of

> 2020-07-13 19:11:24.203 [debug] <0.10437.0> closing AMQP connection <0.10437.0> ([client IP]:24561 -> [server IP]:5672):
> connection_closed_with_no_data_received

so there are inbound TCP connections but they send no data. I'm afraid the only way forward is to troubleshoot your network and load balancer setup [1]
in the new environment. This list cannot really help with that.


On Sunday, July 12, 2020 at 3:09:36 PM UTC+3 abhayverma wrote:

Michael Klishin

unread,
Jul 13, 2020, 2:32:50 PM7/13/20
to rabbitmq-users
I should have mentioned that these could be TCP load balancer health checks (several products are known to open connections but not send any data on them).
Such connections won't be logged RabbitMQ unless debug logging is enabled.

So it can be that no client connections actually get through and only load balancer TCP health checks do. Or those can be genuine connection attempts from clients.
We cannot tell as we don't know the topology of your network.

Michael Klishin

unread,
Jul 13, 2020, 3:21:48 PM7/13/20
to rabbitmq-users
There's evidence of very short-lived MQTT client connections. It could be an effect of an intermediary or applications naturally used very short-lived connections:

2020-07-13 19:11:23.707 [debug] <0.10421.0> MQTT accepting TCP connection <0.10421.0> ([client IP]:30955 -> [server IP]:1883)
2020-07-13 19:11:23.708 [debug] <0.10421.0> MQTT connection "[client IP]:30955 -> [server IP]:1883" will terminate because peer closed TCP connection
2020-07-13 19:11:23.708 [debug] <0.10421.0> MQTT: about to send will message (if any) on connection "[client IP]:30955 -> [server IP]:1883"

abhayverma

unread,
Jul 14, 2020, 2:03:39 AM7/14/20
to rabbitmq-users
Thanks Michael for analysing the logs. There is tcp health check configured on aws nlb for ports 1883 and 5672. 

From consume_event_stream (rabbitmq-diagnostics-consume_event_stream.txt), it can be seen that client actually does connect over AMQP and sends data by opening channels but those connections are short lived and there are connection.closed events which occurs after an RST packet which can be seen in the tcpdump (out.pcap) shared along with other logs. And malformed packet info is also available in tcpdump.

Sample event from the log file shared -

{"event":"connection.closed","client_properties":["{<<\"product\">>,longstr,<<\"RabbitMQ\">>}","{<<\"copyright\">>,longstr,<<\"Copyright (c) 2007-2016 Pivotal Software, Inc.\">>}","{<<\"capabilities\">>,table,[{<<\"exchange_exchange_bindings\">>,bool,true},{<<\"connection.blocked\">>,bool,true},{<<\"authentication_failure_close\">>,bool,true},{<<\"basic.nack\">>,bool,true},{<<\"publisher_confirms\">>,bool,true},{<<\"consumer_cancel_notify\">>,bool,true}]}","{<<\"information\">>,longstr,<<\"Licensed under the MPL. See http://www.rabbitmq.com/\">>}","{<<\"version\">>,longstr,<<\"3.6.5\">>}","{<<\"platform\">>,longstr,<<\"Java\">>}"],"node":"rabbit@[nodename]","pid":"<rabbit@[nodename].1594647286.5862.0>","name":"[clientIP]:42430 -> [serverIP]:5672","timestamp_in_ms":1594647518376}

abhayverma

unread,
Jul 14, 2020, 5:07:18 PM7/14/20
to rabbitmq-users
Is there any particular scenario/condition where we observe such messages in tcp stream "[Malformed Packet: AMQP: length of contained item exceeds length of containing item]" 

abhayverma

unread,
Jul 19, 2020, 2:50:11 PM7/19/20
to rabbitmq-users
So I drilled down to this particular config in rabbitmq.conf. RabbitMQ service startup was just fine but there was an issue with java client while making amqp connections. The connections were close abruptly and we were seeing "unknown channel number" and "shutdown signal" exception in client logs.
[1]
tcp_listen_options.backlog = 65536 #didn't work
tcp_listen_options.sndbuf = 16384. #didn't work
tcp_listen_options.recbuf = 16384.  #didn't work

After changing the above config to below setting, the clients were able to connect without an issue on amqp protocol.
[2]
tcp_listen_options.backlog = 4096 #worked
tcp_listen_options.sndbuf = 8192.  #worked
tcp_listen_options.recbuf = 8192.   #worked

Also, I tried another combination, where [3]
sysctl params -
net.ipv4.tcp_max_syn_backlog=65536 #didn't work
net.core.netdev_max_backlog=65536 #didn't work
and then rabbitmq conf - tcp_listen_options.backlog = 65536. #didn't work
But there was still an issue at client end - "java.util.concurrent.TimeoutException: null"

I couldn't find a relation in those configs that should be used while tuning tcp options for amqp in rabbitmq config. Can you please help me understand if there is a relation between tcp_listen_options.backlog and (tcp_listen_options.sndbuf/tcp_listen_options.recbuf).

Michael Klishin

unread,
Jul 20, 2020, 5:46:14 AM7/20/20
to rabbitmq-users
There is no relation. One controls the maximum length of the pending inbound TCP connection queue (no relation to RabbitMQ queues).
Two others control TCP connection buffer sizes.

Certain libraries, tools, and even appliances can have implicit assumptions about TCP buffer size being a multiple of, say, 4096 or 8192. This seems to
be the case in your examples. I doubt RabbitMQ Java client has any such assumptions. Unfortunately it can be really difficult to track things down with
multiple hops and intermediaries between the client and the server.
Reply all
Reply to author
Forward
0 new messages