First of all, thank you for your interest in this issue.
Just a bit of background: we have been running RabbitMQ in production for at least 5 years, with 2 clusters of 3 nodes each.
Our system is based on a microservice architecture and, at the moment, we process about 1 billion messages per year. For each of them we generate about 10 RabbitMQ messages, so roughly 10 billion RabbitMQ messages per year.
Our services are the same on both clusters; we use 2 separate clusters only to keep the two kinds of traffic we manage apart.
On node 172.31.21.87 we captured:
17:25:31.317306 IP (tos 0x0, ttl 64, id 37412, offset 0, flags [DF], proto TCP (6), length 52)
172.31.21.87.5674 > 172.31.10.87.5674: Flags [R.], cksum 0x7813 (incorrect -> 0xcdda), seq 708467865, ack 109426256, win 490, options [nop,nop,TS val 1123426699 ecr 607860302], length 0
On node 172.31.33.113 we captured:
12:08:02.592811 IP (tos 0x0, ttl 64, id 23563, offset 0, flags [DF], proto TCP (6), length 52)
172.31.33.113.5674 > 172.31.39.149.5674: Flags [R.], cksum 0xa16b (incorrect -> 0xdff8), seq 3130096743, ack 1714846305, win 490, options [nop,nop,TS val 732890850 ecr 3744701927], length 0
12:42:32.593012 IP (tos 0x0, ttl 64, id 41312, offset 0, flags [DF], proto TCP (6), length 52)
172.31.33.113.5674 > 172.31.39.149.5674: Flags [R.], cksum 0xa16b (incorrect -> 0x5bc9), seq 2278980070, ack 512178122, win 490, options [nop,nop,TS val 734960851 ecr 3746771930], length 0
12:49:24.059993 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 40)
172.31.68.47.57168 > 172.31.33.113.5674: Flags [R], cksum 0x1f9a (correct), seq 1578204891, win 0, length 0
-> this is the only RST packet logged from a client
17:09:52.593841 IP (tos 0x0, ttl 64, id 49475, offset 0, flags [DF], proto TCP (6), length 52)
172.31.33.113.5674 > 172.31.39.149.5674: Flags [R.], cksum 0xa16b (incorrect -> 0x1049), seq 3795890382, ack 1289830684, win 490, options [nop,nop,TS val 751000851 ecr 3762811954], length 0
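In case it helps anyone go through longer captures, here is a minimal sketch of how these `tcpdump -v` records could be tallied by who sent the RST. The function name `parse_resets`, the `side` labels, and the rule "source port 5674 means the RST left the listener side" are my own assumptions for illustration, not something tcpdump or RabbitMQ provides. The sample is two of the records pasted above.

```python
import re

# Header line of a verbose tcpdump record: "HH:MM:SS.ffffff IP (...)"
HEADER = re.compile(r"^\s*(\d{2}:\d{2}:\d{2}\.\d+) IP")
# Packet line: "src_ip.src_port > dst_ip.dst_port: Flags [..]"
PACKET = re.compile(
    r"^\s*(\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+): Flags \[([^\]]+)\]"
)
LISTENER_PORT = 5674  # port seen in the captures above (assumption: the AMQP listener)


def parse_resets(text):
    """Return one dict per packet whose TCP flags include RST."""
    resets = []
    timestamp = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            timestamp = header.group(1)  # remember it for the packet line that follows
            continue
        packet = PACKET.match(line)
        if packet and "R" in packet.group(5):
            src_ip, src_port, dst_ip, dst_port, flags = packet.groups()
            # Heuristic only: a RST whose source port is the listener port
            # is counted as sent by the broker side, anything else by the peer.
            side = "listener" if int(src_port) == LISTENER_PORT else "peer"
            resets.append({
                "time": timestamp,
                "src": f"{src_ip}:{src_port}",
                "dst": f"{dst_ip}:{dst_port}",
                "flags": flags,
                "side": side,
            })
    return resets


SAMPLE = """\
12:49:24.059993 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    172.31.68.47.57168 > 172.31.33.113.5674: Flags [R], cksum 0x1f9a (correct), seq 1578204891, win 0, length 0
17:09:52.593841 IP (tos 0x0, ttl 64, id 49475, offset 0, flags [DF], proto TCP (6), length 52)
    172.31.33.113.5674 > 172.31.39.149.5674: Flags [R.], cksum 0xa16b (incorrect -> 0x1049), seq 3795890382, ack 1289830684, win 490, options [nop,nop,TS val 751000851 ecr 3762811954], length 0
"""

resets = parse_resets(SAMPLE)
for r in resets:
    print(r["time"], r["side"], r["src"], "->", r["dst"], r["flags"])
```

The timestamps it collects can then be compared against the times at which the redelivered messages were logged.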
Note that on the client we are not getting any kind of error: no broken connection, no missed heartbeat, no exception.
As already said, when we receive a redelivered message we log its ID (every message has a unique ID), and in the log of that specific microservice the ID has never been logged before, so on the client side everything seems OK.
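To make the check we do on the client concrete, this is roughly the logic, written as a simplified sketch rather than our actual service code. `RedeliveryTracker` and the label strings are hypothetical names; `redelivered` stands for the redelivered flag that AMQP 0-9-1 carries on each delivery, and the in-memory set stands in for our log search.

```python
class RedeliveryTracker:
    """Remembers the message IDs this consumer process has already seen.

    Assumption: a single process and an in-memory set, which is only a
    stand-in for searching the service logs.
    """

    def __init__(self):
        self.seen_ids = set()

    def classify(self, message_id, redelivered):
        """Label a delivery from the broker's flag and our own history."""
        already_seen = message_id in self.seen_ids
        self.seen_ids.add(message_id)
        if redelivered and not already_seen:
            # The case described above: the broker says "redelivered",
            # but this consumer never logged the ID before.
            return "redelivered-but-never-seen"
        if redelivered:
            return "redelivered-after-seen"
        if already_seen:
            return "duplicate-without-flag"
        return "first-delivery"


tracker = RedeliveryTracker()
first = tracker.classify("msg-1", redelivered=False)
suspicious = tracker.classify("msg-2", redelivered=True)
print(first, suspicious)
```

All the redeliveries we are seeing fall into the `redelivered-but-never-seen` bucket.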
We started getting these random redeliveries when we migrated to the new cluster. The changes made were:
- new RabbitMQ version, from 3.8 to 3.10 (fresh standalone deployment, not an in-place upgrade)
- migration from classic mirrored queues to quorum queues
- new AWS NLB
Since we haven't changed anything on the client, we are not getting any exceptions, and the client never logs the same message ID twice, I guess the problem is in RabbitMQ or related to some strange behaviour of the NLB.
Thanks again for your time and for any hint or suggestion.
Regards