Quorum missing messages


Clément Delgrange

Aug 29, 2023, 10:53:52 AM
to rabbitmq-users
Hi, 

During a benchmark we realized that a quorum queue was missing some messages when publishing from a multi-threaded producer. We only started a single node for the benchmark; do you know if that could be the reason?

Steps to reproduce:
- Start a single node (version 3.12.4).
- Declare a DirectExchange and two queues (one classic and one quorum) and bind both with the same routing key (a sketch of this declaration follows below).
- Send 1,000,000 messages in parallel.
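
(Roughly, the topology declaration looks something like this — a simplified sketch with placeholder names, not our actual code:)

import org.springframework.amqp.core.BindingBuilder;
import org.springframework.amqp.core.DirectExchange;
import org.springframework.amqp.core.Queue;
import org.springframework.amqp.core.QueueBuilder;
import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;
import org.springframework.amqp.rabbit.core.RabbitAdmin;

public class TopologySketch {
    public static void main(String[] args) {
        CachingConnectionFactory cf = new CachingConnectionFactory("localhost");
        RabbitAdmin admin = new RabbitAdmin(cf);

        DirectExchange exchange = new DirectExchange("test");
        Queue classic = QueueBuilder.durable("cq").build();
        Queue quorum = QueueBuilder.durable("qq").quorum().build(); // sets x-queue-type=quorum

        admin.declareExchange(exchange);
        admin.declareQueue(classic);
        admin.declareQueue(quorum);
        // both queues are bound to the same direct exchange with the same routing key
        admin.declareBinding(BindingBuilder.bind(classic).to(exchange).with("key"));
        admin.declareBinding(BindingBuilder.bind(quorum).to(exchange).with("key"));

        cf.destroy();
    }
}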

The classic queue always ends up with 1,000,000 messages, but the quorum queue has a few messages missing (between 4 and 100, according to the admin UI). If we slow down the producers or send fewer messages (up to around 600,000), both queues behave correctly.

Do you have any idea what might be wrong with our setup?

PS: There is nothing in the RabbitMQ logs.

Thanks a lot
Clement


Michal Kuratczyk

Aug 29, 2023, 11:30:37 AM
to rabbitm...@googlegroups.com
Please provide executable reproduction steps.

A quick test based on your description (I assume by "send in parallel" you mean multiple publishers, so here I'm using 10 publishers, each sending 100k messages):

# declare the queues
perf-test -x 0 -z 1 -qq -u qq -e test -t direct -k key
perf-test -x 0 -z 1 -ad false -f persistent -u cq -e test -t direct -k key

# publish 10 x 100000 messages
perf-test -y 0 -c 100 -p -e test -t direct -k key -C 1000000

# check messages ready
rabbitmqctl list_queues
name    messages
cq      1000000
qq      1000000

Best,



--
Michał
RabbitMQ team

Clément Delgrange

Aug 30, 2023, 2:59:27 AM
to rabbitmq-users
Hi, thank you Michal for your help. 

We are using the Spring Rabbit client (v2.4.5), and by "parallel" I mean that we have 100 threads, each calling the RabbitTemplate#send method 10,000 times.
I've attached a demo app (Java 11) and its source file.
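
(For illustration, a rough sketch of this kind of load — not the attached App.java; exchange, routing key and payload are placeholders:)

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.springframework.amqp.core.Message;
import org.springframework.amqp.core.MessageBuilder;
import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;
import org.springframework.amqp.rabbit.core.RabbitTemplate;

public class LoadSketch {
    public static void main(String[] args) throws InterruptedException {
        CachingConnectionFactory cf = new CachingConnectionFactory("localhost");
        RabbitTemplate template = new RabbitTemplate(cf);

        ExecutorService pool = Executors.newFixedThreadPool(100);
        for (int t = 0; t < 100; t++) {
            pool.submit(() -> {
                Message message = MessageBuilder.withBody("hello".getBytes()).build();
                for (int i = 0; i < 10_000; i++) {
                    // no publisher confirms: send() returns once the frame has been handed to the channel
                    template.send("test", "key", message);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES); // wait for all send() calls to return
        cf.destroy();
    }
}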

Here is the result after running this app:
[Attachments: result.jpg (screenshot of the result), App.java]

Clément Delgrange

Aug 30, 2023, 3:03:43 AM
to rabbitmq-users
Here is a link to the Jar file: https://we.tl/t-9itAauAAx6

Michal Kuratczyk

Aug 30, 2023, 5:16:17 AM
to rabbitm...@googlegroups.com
You are not using publisher confirms. Please do and retry.
Additionally, examples like that are prone to race conditions where async tasks are still running when the application exits,
so make sure you wait for all these tasks before you exit (or just add Thread.sleep(10_000); as a quick workaround).
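
For reference, here is a minimal sketch of what that could look like with Spring AMQP's RabbitTemplate (assuming Spring AMQP 2.4.x and correlated confirms; names and timeouts are just examples):

import java.util.concurrent.TimeUnit;

import org.springframework.amqp.core.MessageBuilder;
import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;
import org.springframework.amqp.rabbit.connection.CorrelationData;
import org.springframework.amqp.rabbit.core.RabbitTemplate;

public class ConfirmedSendSketch {
    public static void main(String[] args) throws Exception {
        CachingConnectionFactory cf = new CachingConnectionFactory("localhost");
        // enable correlated publisher confirms on the connection factory
        cf.setPublisherConfirmType(CachingConnectionFactory.ConfirmType.CORRELATED);
        RabbitTemplate template = new RabbitTemplate(cf);

        CorrelationData correlation = new CorrelationData("msg-1");
        template.send("test", "key",
                MessageBuilder.withBody("hello".getBytes()).build(), correlation);

        // block until the broker confirms (or rejects) this publish
        CorrelationData.Confirm confirm = correlation.getFuture().get(10, TimeUnit.SECONDS);
        if (!confirm.isAck()) {
            throw new IllegalStateException("Broker nacked the message: " + confirm.getReason());
        }
        cf.destroy(); // only close once every publish has been confirmed
    }
}

For high-throughput publishing you would normally track outstanding confirms (e.g. via the template's confirm callback) rather than blocking on each message.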

Generally speaking, I'd highly recommend trying to reproduce the behaviour (and any other suspected server issues)
using perf-test. If you can't, then it's likely a client-side issue. If you can, then it's much easier for us to fix it.

Best,

On Wed, Aug 30, 2023 at 9:03 AM Clément Delgrange <cl.del...@gmail.com> wrote:
Here is a link to the Jar file: https://we.tl/t-9itAauAAx6



--
Michał
RabbitMQ team

Clément Delgrange

Aug 30, 2023, 10:40:53 AM
to rabbitmq-users
After some investigation, it seems related to the number of channels and threads on the client side. I think I can reproduce the error with this PerfTest command:

java -jar .\perf-test-2.19.0.jar -y 0 -x 1 -X 1000 -niot 1000 -p -e test -t direct -k key -C 100

When I run this command, I get the error "Error in producer (java.io.IOException: Frame enqueuing failed)". Maybe the Spring client ignores this error somewhere; I will take a look. But what I don't understand is why the queues end up with different numbers of messages if the error is on the client side. Once a message has reached the server, isn't it the server that dispatches it to the different queues?

Clément Delgrange

Aug 30, 2023, 10:47:58 AM
to rabbitmq-users
In addition, with this command java -jar .\perf-test-2.19.0.jar -y 0 -x 1 -X 500 -niot 3000 -p -e test -t direct -k key -C 100 I don't get any error, but the queues report different sizes at the end.

Arnaud Cogoluègnes

Aug 31, 2023, 5:28:10 AM
to rabbitmq-users
I reproduced the issue with the JAR file you provided, but I get the expected number of messages in both queues with a program of my own (which uses publish confirms and does some synchronization to avoid exiting before all messages are confirmed). So that's a client-side issue.

As for PerfTest, there must be something wrong with the way it stops producers once they have published their expected number of messages. It should wait for all messages to be confirmed, but apparently that is not the case. I'll investigate more.

Arnaud Cogoluègnes

Aug 31, 2023, 6:07:06 AM
to rabbitmq-users
OK, I reproduced the issue with PerfTest by using the same command as you, which does *not* use publisher confirms (I got confused between the -C and -c flags). Try adding -c 100 to your command; it means PerfTest will use publisher confirms and pause publishing whenever a producer reaches 100 outstanding confirms. With publisher confirms on, I got the expected number of messages in both queues.
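
For example, reusing the flags from your last command, that would look like this:

java -jar .\perf-test-2.19.0.jar -y 0 -x 1 -X 500 -niot 3000 -p -c 100 -e test -t direct -k key -C 100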

About the "Error in producer (java.io.IOException: Frame enqueuing failed)" error: the -niot flag enables NIO, which adds even more asynchronicity to the mix (the write requests are enqueued to be sent by the NIO loop asynchronously). The error means PerfTest accumulates write requests faster than the NIO loop can actually send them to the network (we may have hundreds of available threads, but we use only one of them because there's only 1 connection). NIO are useful in trying to reproduce this very issue because they are more "fire-and-forget" than blocking IO but they are not particularly appropriate for such a use case.

Clément Delgrange

Aug 31, 2023, 6:22:57 AM
to rabbitmq-users
And do you know why the classic queue always gets the correct number of messages but the quorum queue does not? We don't plan to use publisher confirms; from the documentation they seem to be useful only in case of server failures or when a queue reaches its limit. But here they shouldn't be needed, right?

Clément Delgrange

Aug 31, 2023, 8:33:28 AM
to rabbitmq-users
FYI, I've just tested with another RabbitMQ version (3.9) and it works well :(

Arnaud Cogoluègnes

Aug 31, 2023, 8:47:41 AM
to rabbitmq-users
>  We don't plan to use publish confirms, from the documentation they seem to be useful only in case of server failures or when a queue reaches its limit.

Where did you read that from? The first paragraph of the publish confirms documentation [1] covers why they may be needed.

If you're worried about losing messages, you should use publisher confirms. The fact that you observe oddities without them proves that they are useful.

Arnaud Cogoluègnes

Aug 31, 2023, 9:20:18 AM
to rabbitmq-users
> And do you know why classic queues always get the correct number of messages and not quorum queues?

It's likely the program exits and thus the connection and channels are closed while some messages are still on their way. A quorum queue ignores messages published by a closed channel and a classic queue does not. That's an implementation detail.

kjnilsson

Aug 31, 2023, 9:41:52 AM
to rabbitmq-users
Quorum queues must be used with publisher confirms and consumer acks; usage without them isn't something we'd ever recommend.

That said, in this case we can probably address the particular behaviour that is causing this, and we may do so in a future release.

Still, use confirms!

Cheers
Karl

Clément Delgrange

Aug 31, 2023, 11:31:25 AM
to rabbitmq-users
> Where did you read that from?

Indeed, after reading a lot about this, my brain finally retained what I wanted to be true ;).
Nonetheless, there was still some logic behind why we didn't want to use publisher confirms: we expected our network and the RabbitMQ cluster to be reliable enough that we would only lose an acceptable number of messages. But in the present case the message loss is systematic, and there is no network or server issue.

> It's likely the program exits and thus the connection and channels are closed while some messages are still on their way. A quorum queue ignores messages published by a closed channel and a classic queue does not. That's an implementation detail

This explains a lot! The Spring client uses a cache to store channels, with a default size of 25. Our simulation created more than 1,000 channels (one for each user request thread), so 975 of them were closed at the end of the simulation because they didn't fit in the cache. By increasing the cache size, the simulation works.
The fact that we see no issue when using publisher confirms seems to be more a side effect than an intended behaviour. In the latter case, wouldn't it have been more appropriate to receive a nack from the server with a reason like "ignored because the channel was closed", rather than an ack?
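
(A minimal sketch of that kind of change, assuming the CachingConnectionFactory is configured in code; the value 1000 is just an example:)

import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;
import org.springframework.amqp.rabbit.core.RabbitTemplate;

public class FactoryConfigSketch {
    static RabbitTemplate buildTemplate() {
        CachingConnectionFactory cf = new CachingConnectionFactory("localhost");
        cf.setChannelCacheSize(1000); // default is 25; size it for the number of concurrently publishing threads
        return new RabbitTemplate(cf);
    }
}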


> Quorum queues must be used with publisher confirms and consumer acks, usage without isn't something we'd ever recommend

I think we will reconsider the use of publisher confirms, thanks!

kjnilsson

Aug 31, 2023, 12:08:03 PM
to rabbitmq-users
We think the issue that is causing the observed behaviour is due to this: https://github.com/rabbitmq/ra/issues/393

This issue shouldn't be too hard to fix, so I'm sure we'll roll something out in the not-so-distant future.

Using confirms has benefits and is really just how modern messaging should work; e.g., our custom stream protocol doesn't even have an option not to use them, as we also use them for flow control.

Luke Bakken

Aug 31, 2023, 2:39:12 PM
to rabbitmq-users
If you care about data safety, you MUST use publisher confirmations.

Arnaud Cogoluègnes

Sep 1, 2023, 3:18:55 AM
to rabbitmq-users
> In the later case, wouldn't it have been more appropriate to receive a nack from the server with a cause saying: "ignored because the channel was closed" than an ack?

The channel is closed, so there's nothing to receive notification on at this point.
