Networking issue with AWS NLB, quorum queues and rabbit 3.10 -> messages redelivered


davide....@gmail.com

Nov 17, 2022, 11:40:43 AM11/17/22
to rabbitmq-users
Hello,
We have recently migrated 2 clusters of 3 nodes each, which had been running version 3.8.3 for almost 3 years, to the new 3.10.10 (we are using Ubuntu 22.04.1 LTS). We have also switched from classic queue mirroring to quorum queues. We are on AWS and we use an NLB in front of the cluster.

Since the migration, with the same application software, we have started experiencing some strange message redeliveries; we never had this problem in the past 3 years with the old versions of RabbitMQ.

What we have noticed in the logs of our processes is that the messages arriving with the redelivered flag were never previously processed by our software, so that led us to investigate the network layer.

We have found this interesting article
https://medium.com/tenable-techblog/lessons-from-aws-nlb-timeouts-5028a8f65dda

After reading it we checked the “Load balancer reset count” graph and found that sometimes 8 connections are closed by the load balancer. As described in the article, they are closed silently by the load balancer, and when the server tries to push a message it receives an RST packet; we are guessing that this is what causes the messages to be redelivered.

We have always used a heartbeat in our software, configured to 50 seconds; we can see that value both in the HTTP management console and on the command line with:
rabbitmqctl list_connections timeout

So the heartbeat seems to be configured correctly. To fix this problem we have tried to:

Set the heartbeat to 30 seconds (see the client-side sketch below)
Configure TCP keepalives (leaving the 30-second heartbeat enabled as well)
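
On the client side the heartbeat is negotiated via the connection factory. A minimal sketch with the Java client we use (the hostname is just a placeholder for our NLB endpoint, and this is simplified compared to our real setup):

import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class HeartbeatSketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("nlb.example.internal"); // placeholder NLB endpoint
        factory.setPort(5674);                   // we use non-standard AMQP ports
        factory.setRequestedHeartbeat(30);       // heartbeat interval in seconds
        try (Connection conn = factory.newConnection()) {
            // ... open channels and consumers as usual ...
        }
    }
}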

For TCP keepalives we have used the values described in the docs:
net.ipv4.tcp_keepalive_time=30
net.ipv4.tcp_keepalive_intvl=10
net.ipv4.tcp_keepalive_probes=4

but the problem is still present: we still regularly see connections reset by the load balancer in the “Load balancer reset count” graph, and we get random messages with the redelivered flag that we cannot find as previously processed in our logs.

To track down this problem we have tried to log the RST packets generated by the load balancer with
tcpdump -n -vv  -s 1500 -i ens5 'dst port 5674 && (tcp[tcpflags] & (tcp-rst) != 0)'

we captured some packets, but they were generated by other cluster nodes, not by the NLB

Any suggestion?
Thank you

Luke Bakken

Nov 17, 2022, 5:08:29 PM11/17/22
to rabbitmq-users
Questions in-line...
 
What we have noticed in the logs of our processes is that the messages arriving with the redelivered flag were never previously processed by our software, so that led us to investigate the network layer.


How long does it usually take for a consumer to process a message in your environment?

What AMQP client library are you using?
 
After reading it we checked the “Load balancer reset count” graph and found that sometimes 8 connections are closed by the load balancer

I'm assuming you mean "AMQP client connections", and not connections between nodes in your cluster (because they shouldn't be going through the ELB).

For TCP keepalives we have used the values described in the docs:
net.ipv4.tcp_keepalive_time=30
net.ipv4.tcp_keepalive_intvl=10
net.ipv4.tcp_keepalive_probes=4

Just to double-check ... you're using ipv4 and not ipv6, correct?

Have you used tcpdump to verify that keepalive packets are being sent?

To track down this problem we have tried to log the RST packets generated by the load balancer with
tcpdump -n -vv  -s 1500 -i ens5 'dst port 5674 && (tcp[tcpflags] & (tcp-rst) != 0)'

we captured some packets, but they were generated by other cluster nodes, not by the NLB

Can you please explain what "other cluster nodes" means?

davide....@gmail.com

Nov 18, 2022, 7:17:27 AM11/18/22
to rabbitmq-users
Hi Luke,
first of all, thank you for your interest in this issue.

Just a bit of background: we have been running RabbitMQ in production for at least 5 years with 2 clusters of 3 nodes each.
Our system is based on a microservice architecture and, at the moment, we process about 1bln messages per year; for each message we generate about 10 RabbitMQ messages, so more or less 10bln RabbitMQ messages per year.
Our services are the same on both clusters; we use 2 different clusters just to separate two kinds of traffic that we manage.

Here are the responses inline:

On Thursday, November 17, 2022 at 23:08:29 UTC+1 luker...@gmail.com wrote:
Questions in-line...
 
What we have noticed in the logs of our processes is that the messages arriving with the redelivered flag were never previously processed by our software, so that led us to investigate the network layer.

Yes, we know about redelivered messages; in the past we got them only when a connection was broken, and usually we got one redelivered message together with an exception in our Java log saying that the connection was broken. Note that our monitoring system raises a warning each time we receive a redelivered message, so over the years we have monitored redelivered messages very closely, and we have only ever seen them when there was a real network problem such as a broken connection.
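
For reference, the check is roughly the following (a minimal sketch of how the flag is read with the Java client, not our actual monitoring code; the queue name and the logging are placeholders):

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.DeliverCallback;

class RedeliveryMonitorSketch {
    static void consume(Channel channel, String queueName) throws java.io.IOException {
        DeliverCallback onDeliver = (consumerTag, delivery) -> {
            String messageId = delivery.getProperties().getMessageId();
            if (delivery.getEnvelope().isRedeliver()) {
                // this is where our monitoring raises a warning for this message ID
                System.err.println("redelivered message: " + messageId);
            }
            // ... normal processing ...
            channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
        };
        channel.basicConsume(queueName, false, onDeliver, consumerTag -> { });
    }
}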
 
 
How long does it usually take for a consumer to process a message in your environment?

We have a microservice architecture; usually the processing of a message takes just milliseconds. We have one service that can take longer because we do HTTP callbacks to our customers with a connection and read timeout of 10 seconds, but the redeliveries we are getting now are not related to that particular service; we get them randomly across all our services.
 

What AMQP client library are you using?

Before, with the old cluster, we were using the Java client 5.7.3. After the migration to RabbitMQ 3.10, since we started to get these random redelivered messages, we upgraded to the Java client 5.16.0 to try to fix the problem, but nothing changed.
 
 
After reading it we checked the “Load balancer reset count” graph and found that sometimes 8 connections are closed by the load balancer

I'm assuming you mean "AMQP client connections", and not connections between nodes in your cluster (because they shouldn't be going through the ELB).

Yes, exactly: the connections between clients and RabbitMQ. The cluster is formed using static FQDNs pointing to the right nodes, without going through the NLB.
 

For TCP keepalives we have used the values described in the docs:
net.ipv4.tcp_keepalive_time=30
net.ipv4.tcp_keepalive_intvl=10
net.ipv4.tcp_keepalive_probes=4

Just to double-check ... you're using ipv4 and not ipv6, correct?

Yes, correct. We do not use the standard port 5672; we use 5673 and 5674 instead, and this has never been a problem in the past.
for example
netstat -ltpn | grep 5674
tcp6 0 0 :::5674 :::* LISTEN 587/beam.smp

RabbitMQ is listening on both IPv4 and IPv6, but we use IPv4 for communication. We could bind only to IPv4, but I imagine it would not fix the problem. Anyway:

tcpdump -n -vv  -s 1500 -i ens5  ip6
tcpdump: listening on ens5, link-type EN10MB (Ethernet), snapshot length 1500 bytes
^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel

no packets are transmitted over IPv6
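
(If we ever decide to bind only to IPv4, I believe it would be something like the following line in rabbitmq.conf, with our port, but we haven't tried it:)

listeners.tcp.1 = 0.0.0.0:5674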
 

Have you used tcpdump to verify that keepalive packets are being sent?

Do you mean AMQP heartbeats or other TCP packets? Heartbeats are sent for sure.
[attachment: heartbeat.png]

To track down this problem we have tried to log the RST packets generated by the load balancer with
tcpdump -n -vv  -s 1500 -i ens5 'dst port 5674 && (tcp[tcpflags] & (tcp-rst) != 0)'

we captured some packets, but they were generated by other cluster nodes, not by the NLB

Can you please explain what "other cluster nodes" means?

This is what we see on the load balancer:
[attachment: load-balancer.png]

This graph on the old NLB, with the same microservices, was always zero. Since our assumption was that for some strange reason the NLB was silently terminating connections and then sending RST packets to RabbitMQ when it tried to send a message, we ran this on each cluster node for some hours:
tcpdump -n -vv  -s 1500 -i ens5 'dst port 5674 && (tcp[tcpflags] & (tcp-rst) != 0)'

On node 172.31.21.87 we captured:
17:25:31.317306 IP (tos 0x0, ttl 64, id 37412, offset 0, flags [DF], proto TCP (6), length 52)
    172.31.21.87.5674 > 172.31.10.87.5674: Flags [R.], cksum 0x7813 (incorrect -> 0xcdda), seq 708467865, ack 109426256, win 490, options [nop,nop,TS val 1123426699 ecr 607860302], length 0

On node 172.31.33.113 we captured:
12:08:02.592811 IP (tos 0x0, ttl 64, id 23563, offset 0, flags [DF], proto TCP (6), length 52)
    172.31.33.113.5674 > 172.31.39.149.5674: Flags [R.], cksum 0xa16b (incorrect -> 0xdff8), seq 3130096743, ack 1714846305, win 490, options [nop,nop,TS val 732890850 ecr 3744701927], length 0

12:42:32.593012 IP (tos 0x0, ttl 64, id 41312, offset 0, flags [DF], proto TCP (6), length 52)
    172.31.33.113.5674 > 172.31.39.149.5674: Flags [R.], cksum 0xa16b (incorrect -> 0x5bc9), seq 2278980070, ack 512178122, win 490, options [nop,nop,TS val 734960851 ecr 3746771930], length 0

12:49:24.059993 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    172.31.68.47.57168 > 172.31.33.113.5674: Flags [R], cksum 0x1f9a (correct), seq 1578204891, win 0, length 0
-> this is the only RST packet logged that came from a client

17:09:52.593841 IP (tos 0x0, ttl 64, id 49475, offset 0, flags [DF], proto TCP (6), length 52)
    172.31.33.113.5674 > 172.31.39.149.5674: Flags [R.], cksum 0xa16b (incorrect -> 0x1049), seq 3795890382, ack 1289830684, win 490, options [nop,nop,TS val 751000851 ecr 3762811954], length 0

Note that on the client side we are not getting any kind of error: no broken connections, no missed heartbeats, no exceptions.

As already said, when we get a redelivered message we log its ID (we have a unique ID for each message), and in the log of the specific microservice that ID has never been logged before, so on the client side everything seems fine.

We started getting these random redeliveries when we migrated to the new cluster; the changes made were:
- new version of RabbitMQ, from 3.8 to 3.10 (new standalone deployment, not an in-place upgrade)
- migration from classic mirrored queues to quorum queues
- new AWS NLB

Since we haven't changed anything on the client, we are not getting any exceptions, and the client does not have a double log entry for the same message ID, I guess the problem is either in RabbitMQ or related to some strange behaviour of the NLB.

Thanks again for your time and for any hint or suggestion.
regards

Luke Bakken

Nov 18, 2022, 10:24:55 AM11/18/22
to rabbitmq-users
Hi Davide -

This is a tricky one!


Have you used tcpdump to verify that keepalive packets are being sent?

Do you mean AMQP heartbeats or other TCP packets? Heartbeats are sent for sure.

I do mean the TCP keepalives and not the AMQP heartbeats, but it's clear there is activity on the connection.

Note that on the client side we are not getting any kind of error: no broken connections, no missed heartbeats, no exceptions.

As already said, when we get a redelivered message we log its ID (we have a unique ID for each message), and in the log of the specific microservice that ID has never been logged before, so on the client side everything seems fine.

We started getting these random redeliveries when we migrated to the new cluster; the changes made were:
- new version of RabbitMQ, from 3.8 to 3.10 (new standalone deployment, not an in-place upgrade)
- migration from classic mirrored queues to quorum queues
- new AWS NLB

Since we haven't changed anything on the client, we are not getting any exceptions, and the client does not have a double log entry for the same message ID, I guess the problem is either in RabbitMQ or related to some strange behaviour of the NLB.

I am not familiar with AWS NLBs. Is there a way you can check the "version" of the old NLB instance vs the new one?

Did you upgrade Erlang as well? I'm assuming you did and I would like to know the versions involved.

Was there an operating system upgrade as well? I'm assuming that's likely.

Is it possible for you to not use an NLB and use HAproxy instead, or no load balancer at all, maybe in a non-production environment where you see this issue (if such an env exists)?

Thanks -
Luke

davide....@gmail.com

Nov 18, 2022, 11:04:42 AM11/18/22
to rabbitmq-users
Hi Luke

This is a tricky one!

yes, you're right!
 

Have you used tcpdump to verify that keepalive packets are being sent?

Do you mean AMQP heartbeats or other TCP packets? Heartbeats are sent for sure.

I do mean the TCP keepalives and not the AMQP heartbeats, but it's clear there is activity on the connection.

I can check, but as you said there is traffic on the socket.
 

I am not familiar with AWS NLBs. Is there a way you can check the "version" of the old NLB instance vs the new one?

As far as I know there isn't a version, but I will investigate.


Did you upgrade Erlang as well? I'm assuming you did and I would like to know the versions involved.
Was there an operating system upgrade as well? I'm assuming that's likely.

Yes, we upgraded Erlang and the whole operating system of the cluster. Before, we were using:
rabbitmq 3.8.3 Erlang 21.3.8.14 on Ubuntu 18.04.4 LTS
rabbitmq 3.8.3 Erlang 21.3.8.15 on Ubuntu 18.04.4 LTS
 
and now we have
RabbitMQ 3.10.10 Erlang 24.2.1 on Ubuntu 22.04.1 LTS
 
Is it possible for you to not use an NLB and use HAproxy instead, or no load balancer at all, maybe in a non-production environment where you see this issue (if such an env exists)?

In the other environments we do not use the clustered version of RabbitMQ, just a single instance with the same version as production, so we do not have the problem there. We could reproduce the same environment as production, but the main problem is the traffic; we would also have to simulate it.
We are getting about 30-40 random redelivered messages per day, and the cluster handles about 30 million messages per day, so it is not so easy to reproduce.

Anyway, there is one last thing we can try: use the old NLB with the new cluster. It will require some time to configure, but at least we can understand whether the problem is related to AWS or to RabbitMQ. We will run the test and let you know.

thanks again
Davide


davide....@gmail.com

Nov 21, 2022, 10:16:23 AM11/21/22
to rabbitmq-users
Hi Luke,
we tried using the old load balancer with the new cluster; unfortunately we still got random redelivered messages, but on the new load balancer the "Load balancer reset count" value is zero, so the two problems are not related, and with this test we now know that the problem is not on the AWS / load balancer side.

Out of curiosity we wanted to understand what was causing the "Load balancer reset count" on the new NLB. We found this feature https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html, which basically gives you detailed logs for each connection going through the load balancer, and we found that the connections reset by the NLB come from one of our monitoring tools that uses the RabbitMQ HTTP management console, not AMQP. We use the same NLB both for AMQP and for the HTTP management console...

So at least we have figured out that the problem is not on the network side (some strange AWS NLB problem) as we initially supposed; that also explains why we didn't see TCP RST packets on the AMQP port.

But the million-dollar question is still the same: why are we getting random redelivered messages without any network problem? Is it possible that some changes were made to RabbitMQ between 3.8 and 3.10 regarding the redelivered flag?

thanks
Davide

Luke Bakken

Nov 23, 2022, 10:06:17 AM11/23/22
to rabbitmq-users
Hi Davide,

This is an interesting issue. How much of a pain would it be to switch back to classic mirrored queues for a while to see if this issue is resolved?

To summarize what changed in your environment -
  • Load balancer type
  • Operating system
  • Erlang version
  • RabbitMQ version
  • Queue type
  • Client library version
Of the above list, the switch in RabbitMQ version and queue type are the most likely culprits at this point.

Thanks,
Luke

Luke Bakken

Nov 23, 2022, 10:23:40 AM11/23/22
to rabbitmq-users
Hey Davide -

I just spoke with the team member who implemented quorum queues. He's wondering if your code ever calls basicCancel on a channel to cancel a consumer.

The other thing I would be interested to see is the RabbitMQ logs from all nodes. Please provide the timestamp of when one of these mystery "redelivered" messages happens so I can correlate it with the logs.

Please archive and compress the log files. You can use the Google Groups Web UI "Reply to author" feature to send them to me directly.

Thanks,
Luke

davide....@gmail.com

Nov 23, 2022, 3:18:05 PM11/23/22
to rabbitmq-users
Hi Luke,
we can test switching back to classic mirrored queues with the new version of RabbitMQ, but we cannot do this test now; we can schedule it for the beginning of January.

From your list I can say that we can exclude:

* Load balancer type (the load balancer is the same type, and anyway the problem also occurred with the old one)
* Client library version (on the first switch we did not change the client version and we immediately got the problem)

I imagine it is "less likely" that the problem is in:
* Operating system
* Erlang version

so what remains is:
* RabbitMQ version
* Queue type

The test you suggest could certainly be useful to understand whether the problem is related to the RabbitMQ version or to the queue type.

Regards
Davide

davide....@gmail.com

Nov 23, 2022, 3:27:32 PM11/23/22
to rabbitmq-users
Hi Luke,
in the code I do not see a single call to basicCancel; I will investigate with the Java developers, but I think we do not use it. We use Spring; we will also investigate whether, for some strange reason, Spring calls it...

our log config is quite basic:
log.file = rabbit.log
log.file.level = info
log.file.rotation.date = $D0
log.file.rotation.count = 20

If you want I can send them to you, but I have checked and there is nothing around the times when the redeliveries happen.

One thing I forgot to mention is that we use this policy on our queues:
dead-letter-strategy: at-least-once
overflow: reject-publish
queue-leader-locator: balanced

and the queues have DLX and DLK configured; I don't know whether this is relevant for this issue or not.
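
For completeness, the policy is applied more or less like this (a sketch; the policy name and queue pattern here are made up, not our real ones):

rabbitmqctl set_policy --apply-to queues qq-settings "^app\." \
  '{"dead-letter-strategy":"at-least-once","overflow":"reject-publish","queue-leader-locator":"balanced"}'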

thank you again
Davide

Karl Nilsson

Nov 23, 2022, 3:34:33 PM11/23/22
to rabbitm...@googlegroups.com
Spring does call basicCancel at various points (IIRC), which isn't a problem in itself. But say you have a consumer on a queue with messages on it, and you cancel it whilst messages are in flight between the queue and the channel: the messages arriving at the channel will be returned to the queue without first being handled by the channel, and will then show as redelivered. A similar thing happens if you close the channel with an active consumer while a message is in flight.
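
Roughly, in Java client terms (a minimal sketch to illustrate the mechanism, not your application; the queue name and prefetch are arbitrary, and connection details are omitted):

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class CancelRedeliverySketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory(); // host/credentials omitted
        try (Connection conn = factory.newConnection();
             Channel channel = conn.createChannel()) {
            channel.basicQos(50); // prefetch, so several deliveries can be in flight
            String tag = channel.basicConsume("some-queue", false,
                (consumerTag, delivery) -> {
                    // ... process ...
                    channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
                },
                consumerTag -> { });
            // If this cancel runs while deliveries are still on their way from the
            // queue to the channel, those messages are returned to the queue and the
            // next consumer receives them with redelivered = true.
            channel.basicCancel(tag);
        }
    }
}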

This is preferable to messages potentially never getting handled by clients and getting "stuck".

I suspect that is what you are seeing. It is possible that classic queues behave somewhat differently here and this never happens.

Cheers
Karl





--
Karl Nilsson

davide....@gmail.com

Nov 27, 2022, 9:28:48 AM11/27/22
to rabbitmq-users
Hi Karl,
thanks for your response. I think you're right and you have found the problem: as you said, Spring calls basicCancel, for example:

- when the process is shutting down

- when it reduces the number of active consumers

We have noticed that during the deployment of new versions of our microservices we see more redelivered messages, so that is related to the first case, while the random redelivered messages, I guess, are generated by the second case. Our traffic has a regular pattern, and in the evening, with less traffic, we usually see more redelivered messages.

So I think you're right and you have found the problem; at this point I guess the different behaviour is caused by the quorum queues.

So now the question is: is this behaviour correct, or was the previous behaviour of the classic queues, which did not flag the messages as redelivered, the correct one?

regards
Davide

davide....@gmail.com

Dec 14, 2022, 2:56:22 AM12/14/22
to rabbitmq-users
Hi all,
any news about this issue?

Thank you
Davide

kjnilsson

Dec 14, 2022, 7:54:34 AM12/14/22
to rabbitmq-users
Apologies, I didn't realise there was an issue. Redelivered messages are part and parcel of at-least-once messaging systems and can occur when consumers are terminated. Quorum queues and classic queues are completely different and independent queue implementations, so there are likely to be some behavioural differences, but as long as the messaging guarantees (at-least-once delivery, for example) are upheld, I'd hesitate to call this an issue. Are redelivered messages causing a problem for you?

davide....@gmail.com

Dec 14, 2022, 8:58:09 AM12/14/22
to rabbitmq-users
Hi,
No, there is no particular issue; I just wanted to know whether this is the correct behaviour or whether the previous behaviour of the classic queues, which did not flag the messages as redelivered, was the correct one. You have now answered my question by saying that the two types of queues are completely different and independent, so it does not make a lot of sense to compare them...
Thanks again for the support

regards
Davide