Hi Mark,
I'm on vacation this week but have starred this conversation for follow-up.
I would normally say that the best way to help us help you is to provide code to demonstrate the issue, but since you're using MassTransit I'll have to see if I can reproduce it using the .NET client on its own.
> When publishing a (small) burst of messages (300-400, peaking at ~15/s) we sometimes see big delays (>5s, sometimes >10s) publishing messages.
I can probably abstract the code out if that helps; I have spent a fair bit of time digging into MassTransit. Apart from the queue setup it isn't doing that much: it creates a message, adds itself to a set of things waiting for an ack, then waits for a TaskCompletionSource to be completed, and it hands that task back to the client code as the result of publishing the message, so effectively if you wait on publish you wait on the ack. That code looks OK to me, and I think the library is pretty widely used.
I can reproduce the problem with something that basically does:
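(Roughly the following -- a minimal sketch using the RabbitMQ .NET client rather than the exact code; the host name, exchange name, payload size and message count are placeholders, and the confirm tracking just mirrors the TaskCompletionSource-per-publish pattern described above.)

using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Text;
using System.Threading.Tasks;
using RabbitMQ.Client;

class Program
{
    // Outstanding publishes keyed by publish sequence number, completed from BasicAcks/BasicNacks.
    static readonly ConcurrentDictionary<ulong, TaskCompletionSource<bool>> Outstanding = new();

    static async Task Main()
    {
        var factory = new ConnectionFactory { HostName = "my-rabbit-host" }; // placeholder
        using var connection = factory.CreateConnection();
        using var channel = connection.CreateModel();
        channel.ConfirmSelect();

        channel.BasicAcks += (_, ea) => Complete(ea.DeliveryTag, ea.Multiple, ack: true);
        channel.BasicNacks += (_, ea) => Complete(ea.DeliveryTag, ea.Multiple, ack: false);

        var body = Encoding.UTF8.GetBytes(new string('x', 500)); // ~500-byte payload
        var total = Stopwatch.StartNew();

        for (int i = 1; i <= 400; i++)
        {
            var tcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
            Outstanding[channel.NextPublishSeqNo] = tcs;

            var sw = Stopwatch.StartNew();
            channel.BasicPublish(exchange: "test-exchange", routingKey: "", basicProperties: null, body: body);
            await tcs.Task; // the publish only "completes" once the broker confirms it

            if (sw.ElapsedMilliseconds > 5000)
                Console.WriteLine($"******* WAITED FOR {sw.ElapsedMilliseconds}ms to publish message");
            if (i % 100 == 0)
                Console.WriteLine($"Sent {i} in {total.Elapsed.TotalSeconds:F2} (rate={i / total.Elapsed.TotalSeconds:F2}/s)");
        }
    }

    static void Complete(ulong deliveryTag, bool multiple, bool ack)
    {
        // A "multiple" ack confirms every outstanding publish up to and including deliveryTag.
        foreach (var seqNo in Outstanding.Keys)
        {
            if (seqNo == deliveryTag || (multiple && seqNo < deliveryTag))
            {
                if (Outstanding.TryRemove(seqNo, out var tcs))
                    tcs.TrySetResult(ack);
            }
        }
    }
}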
If I run 5-6 copies of this app then one of them will see a Publish() call take >5s, for which I print a warning. (I hope this is the same problem that we actually see and not something else; it looks a lot like it.) After this quick reproduction things settle down and there are steady message rates of about 25-30 msg/s/client.
I'm assuming that you have more than one queue in your system. Can you accurately describe the workload when the issue happens -
- How many queues exhibit the high confirm latency?
- Are these the queues that have just had a burst of 500-byte messages sent to them?
- What else is going on in your cluster? Other queues, monitoring, etc?
There is a lot in the cluster, but mostly it's not doing anything (this is largely a test cluster; we have a production cluster which exhibits the same issue). With the reproduction noted above I get a peak of ~150 messages per second, which seems quite low compared to what can be pushed through (if I turn off waiting for publish confirmations I can get sustained rates of >10k/s). We have Prometheus monitoring a whole bunch of stuff, and with that reproduction I can't really tell any difference in CPU / RAM usage -- there is plenty available of both (on this test cluster we have 3 nodes with 6 cores each, with total background load of about 0.5 cores/host and about 15 of 64 GB used on each host).
Thanks for your help,
Mark
Please note that if this issue is urgent, paid support is available - https://www.rabbitmq.com/#support
Thanks
Luke
On Sunday, October 9, 2022 at 11:49:01 AM UTC-7 mark....@trelica.com wrote:
I have updated the rabbit cluster to 3.10.8 and switched to quorum queues; however, high publish/wait-for-ack latency is still observed sometimes (>10s). It seems a bit harder to trigger than before. I'm a bit stuck as to where to go from here, so any suggestions would be gratefully received.
Thanks,
Mark
Hi Mark,
Thanks for all of the information.
If you could point out the place in MassTransit where confirms are waited on, I could probably do something similar using the .NET client. Or, even better, provide a RabbitMQ .NET client-based console app that demonstrates this issue.
I'll try to find time to put together an application that reproduces this error. It must not be common because I think we'd hear about it more often (the .NET client is used by many RabbitMQ users, as is Mass Transit).
Hi,
I'm trying to reproduce this without MassTransit, and while doing so I've noticed that some of the time (but not most of the time) stalls are correlated with rabbit running out of memory and writing a memory resource limit alarm to the logs. So I started looking at the memory usage, and it seems that it does run quite close to being out of memory all the time.


There are no messages in any queues at this point, and 238Mb of the 239Mb of quorum queue tables are the queue that my test app was shoving messages into. Is this normal? Things have remained in this state despite no message load for a good number of hours. I'm wondering if the stalls are maybe garbage collection? I tried triggering garbage collection (via rabbitmqctl force_gc, as per the documentation) but it didn't seem to affect the memory usage.
What operating system and version are you using to run RabbitMQ?
We have two 3-node x86-64 clusters, all running Ubuntu 22.04, microk8s 1.24, rabbitmqoperator/cluster-operator:2.0.0, and rabbitmq:3.10.8-management.
If you think it might help I can upgrade this to 3.11 (we use the delayed message plugin which has just been updated, although no delayed messages are involved in the latency issue).
Thanks,
Mark
I am not sure this represents a real problem for us (as we are using MT, this should just create one connection per process, and we probably aren't spawning 10 clients all at the same time), but it does seem to indicate a problem. By comparison, the normal time to establish a connection seems to be in the 90-100ms bracket. I don't see any RAM issues anymore (the highest on any node is 490Mb/1.3Gb) or anything logged other than the connections being accepted/broken.
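(For reference, this kind of connection-establishment time can be measured with something along these lines -- a minimal sketch, not the code from the test app; the host name, port and certificate path are placeholders:)

using System;
using System.Diagnostics;
using RabbitMQ.Client;

class ConnectTiming
{
    static void Main()
    {
        var factory = new ConnectionFactory
        {
            HostName = "my-rabbit-host",    // placeholder
            Port = 5671,                    // AMQPS
            Ssl = new SslOption
            {
                Enabled = true,
                ServerName = "my-rabbit-host",      // must match the server certificate
                CertPath = "/certs/client.pfx",     // placeholder client certificate
                CertPassphrase = "changeit"
            }
        };

        // Time how long it takes to open the (TLS) connection.
        var sw = Stopwatch.StartNew();
        using var connection = factory.CreateConnection();
        Console.WriteLine($"connection established in {sw.ElapsedMilliseconds}ms");
    }
}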
Thanks,
Mark
Hi Mark -
Are you starting 10 connections within the same process or are you starting 10 processes, each opening a single connection?
Some questions -
- What is "MT"?
- What OS are you running the client apps on? What version? What version of .NET? What version of the .NET client?
- What are the specs of the machine / VM you're running the client apps on?
- Is there anything "special" about your X509 certs? Long key lengths, etc?
TCP connections aren't free and TLS connections take even more resources so my guess is that this is just due to resource contention.
We are only talking about 10 connections, and the latency spikes are 50x the 'normal' connection latency of 100ms.
Thanks,
Mark
Also, is this a new test you are doing or one that you have done before, without seeing this issue?
It's my attempt to abstract the problem out of our code base. I've also applied your recommendations (upgrading the rabbit server to 3.10.10; switching to quorum queues and increasing the amount of memory available). So I suppose at this point it's sort of a new test? It's a similar pattern to the original problem (a big spike in latency) but in this round of testing I have not encountered it after connection startup (perhaps it would reproduce by having a bunch of clients generating traffic, and then starting new ones -- not sure).
Thanks,
Mark
Thanks for the information. I have to ask those questions because most of the time people are running in resource-constrained VMs in the cloud somewhere.
If you can share the code you're using to demonstrate the issue, that would be great. I can try to reproduce it locally.
Sure, I've put it on github: https://github.com/blushingpenguin/RabbitMQProblem
It currently doesn't declare the receiving end, just the exchange. In reality it is connected to a single quorum queue; on my local test cluster I just create one using the management UI and then connect the exchange to it after running the process once.
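(The equivalent setup done through the .NET client rather than the management UI would look something like the sketch below; the queue and exchange names and the exchange type are placeholders, not what the repro actually uses.)

using System.Collections.Generic;
using RabbitMQ.Client;

var factory = new ConnectionFactory { HostName = "my-rabbit-host" }; // placeholder
using var connection = factory.CreateConnection();
using var channel = connection.CreateModel();

channel.ExchangeDeclare(exchange: "test-exchange", type: ExchangeType.Fanout, durable: true);
channel.QueueDeclare(queue: "test-queue", durable: true, exclusive: false, autoDelete: false,
    arguments: new Dictionary<string, object> { { "x-queue-type", "quorum" } }); // quorum queue
channel.QueueBind(queue: "test-queue", exchange: "test-exchange", routingKey: "");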
For testing I'm running it from a pod in the cluster, having copied over some certs that can authenticate the connection, and then triggering it like this:
root@message-test:/app/publish# ./RabbitMQProblem & ./RabbitMQProblem & ./RabbitMQProblem & ./RabbitMQProblem & ./RabbitMQProblem & ./RabbitMQProblem & ./RabbitMQProblem & ./RabbitMQProblem & ./RabbitMQProblem & ./RabbitMQProblem &
...
I get output like the below when it goes wrong (it doesn't happen every time). I have been trying to trigger this on a cluster built on VMs on my local machine, but haven't managed it so far.
root@message-test:/app/publish#
******* WAITED FOR 228ms for conn start (228 conn, 0 confirm, 0 declare)
******* WAITED FOR 296ms for conn start (292 conn, 0 confirm, 4 declare)
******* WAITED FOR 324ms for conn start (320 conn, 0 confirm, 4 declare)
******* WAITED FOR 336ms for conn start (332 conn, 4 confirm, 0 declare)
******* WAITED FOR 364ms for conn start (360 conn, 0 confirm, 4 declare)
******* WAITED FOR 352ms for conn start (352 conn, 0 confirm, 0 declare)
******* WAITED FOR 396ms for conn start (396 conn, 0 confirm, 0 declare)
******* WAITED FOR 416ms for conn start (416 conn, 0 confirm, 0 declare)
******* WAITED FOR 316ms to publish message
******* WAITED FOR 316ms to publish message
******* WAITED FOR 716ms for conn start (716 conn, 0 confirm, 0 declare)
Sent 76 in 2.00 (rate=37.93/s)
Sent 75 in 2.02 (rate=37.13/s)
Sent 74 in 2.01 (rate=36.86/s)
Sent 75 in 2.01 (rate=37.30/s)
Sent 72 in 2.02 (rate=35.71/s)
Sent 76 in 2.01 (rate=37.89/s)
Sent 74 in 2.01 (rate=36.75/s)
Sent 74 in 2.00 (rate=36.97/s)
Sent 78 in 2.00 (rate=38.92/s)
******* WAITED FOR 320ms to publish message
Sent 158 in 4.02 (rate=39.30/s)
Sent 156 in 4.04 (rate=38.62/s)
Sent 154 in 4.01 (rate=38.38/s)
Sent 154 in 4.01 (rate=38.39/s)
Sent 157 in 4.02 (rate=39.08/s)
Sent 151 in 4.04 (rate=37.39/s)
Sent 154 in 4.02 (rate=38.27/s)
Sent 150 in 4.03 (rate=37.24/s)
Sent 159 in 4.01 (rate=39.67/s)
******* WAITED FOR 5364ms for conn start (5364 conn, 0 confirm, 0 declare)
Sent 249 in 6.05 (rate=41.18/s)
Sent 250 in 6.06 (rate=41.28/s)
Sent 248 in 6.02 (rate=41.22/s)
Sent 246 in 6.02 (rate=40.89/s)
Sent 248 in 6.03 (rate=41.12/s)
Sent 245 in 6.05 (rate=40.52/s)
Sent 247 in 6.03 (rate=40.96/s)
Sent 244 in 6.05 (rate=40.36/s)
Sent 257 in 6.02 (rate=42.69/s)
Thanks,
Mark