Hi Mark,
I'm on vacation this week but have starred this conversation for follow-up.
I would normally say that the best way to help us help you is to provide code to demonstrate the issue, but since you're using MassTransit I'll have to see if I can reproduce it using the .NET client on its own.
> When publishing a (small) burst of messages (300-400, peaking at ~15/s) we sometimes see big delays (>5s, sometimes >10s) publishing messages.
I can probably abstract the code out if that helps; I have spent a fair bit of time digging into MassTransit. Apart from the queue setup it isn't doing that much: it creates a message, adds itself to a set of things waiting for an ack, then waits for a TaskCompletionSource to be completed, and it hands that task back to the client code as the result of publishing the message, so effectively if you wait on publish you wait on the ack. That code looks OK to me, and I think the library is pretty widely used.
I can reproduce the problem with something that basically does:
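(Roughly the following -- a minimal sketch using the RabbitMQ .NET client rather than the exact code; the host name, exchange name, payload size and message count are placeholders, and the confirm tracking just mirrors the TaskCompletionSource-per-publish pattern described above.)

using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Text;
using System.Threading.Tasks;
using RabbitMQ.Client;

class Program
{
    // Outstanding publishes keyed by publish sequence number, completed from BasicAcks/BasicNacks.
    static readonly ConcurrentDictionary<ulong, TaskCompletionSource<bool>> Outstanding = new();

    static async Task Main()
    {
        var factory = new ConnectionFactory { HostName = "my-rabbit-host" }; // placeholder
        using var connection = factory.CreateConnection();
        using var channel = connection.CreateModel();
        channel.ConfirmSelect();

        channel.BasicAcks += (_, ea) => Complete(ea.DeliveryTag, ea.Multiple, ack: true);
        channel.BasicNacks += (_, ea) => Complete(ea.DeliveryTag, ea.Multiple, ack: false);

        var body = Encoding.UTF8.GetBytes(new string('x', 500)); // ~500-byte payload
        var total = Stopwatch.StartNew();

        for (int i = 1; i <= 400; i++)
        {
            var tcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
            Outstanding[channel.NextPublishSeqNo] = tcs;

            var sw = Stopwatch.StartNew();
            channel.BasicPublish(exchange: "test-exchange", routingKey: "", basicProperties: null, body: body);
            await tcs.Task; // the publish only "completes" once the broker confirms it

            if (sw.ElapsedMilliseconds > 5000)
                Console.WriteLine($"******* WAITED FOR {sw.ElapsedMilliseconds}ms to publish message");
            if (i % 100 == 0)
                Console.WriteLine($"Sent {i} in {total.Elapsed.TotalSeconds:F2} (rate={i / total.Elapsed.TotalSeconds:F2}/s)");
        }
    }

    static void Complete(ulong deliveryTag, bool multiple, bool ack)
    {
        // A "multiple" ack confirms every outstanding publish up to and including deliveryTag.
        foreach (var seqNo in Outstanding.Keys)
        {
            if (seqNo == deliveryTag || (multiple && seqNo < deliveryTag))
            {
                if (Outstanding.TryRemove(seqNo, out var tcs))
                    tcs.TrySetResult(ack);
            }
        }
    }
}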
If I run 5-6 copies of this app then one of them will see a Publish() call take >5s, for which I print a warning. (I hope this is the same problem that we actually see and not something else; it looks a lot like it.) After this quick reproduction things settle down and there are steady message rates of about 25-30 msg/s/client.
I'm assuming that you have more than one queue in your system. Can you accurately describe the workload when the issue happens -
- How many queues exhibit the high confirm latency?
- Are these the queues that have just had a burst of 500-byte messages sent to them?
- What else is going on in your cluster? Other queues, monitoring, etc?
There is a lot in the cluster, but mostly it's not doing anything (this is largely a test cluster; we have a production cluster which exhibits the same issue). With the reproduction noted above I get a peak of ~150 messages per second, which seems quite low compared to what can be pushed through (if I turn off waiting for publish confirmations I can get sustained rates of >10k/s). We have Prometheus monitoring a whole bunch of stuff, and with that reproduction I can't really tell any difference in CPU / RAM usage -- there is plenty available of both (on this test cluster we have 3 nodes with 6 cores each, with total background load of about 0.5 cores/host and about 15 of 64 GB used on each host).
Thanks for your help,
Mark
Please note that if this issue is urgent, paid support is available - https://www.rabbitmq.com/#support
Thanks
Luke
On Sunday, October 9, 2022 at 11:49:01 AM UTC-7 mark....@trelica.com wrote:
I have updated the rabbit cluster to 3.10.8 and switched to quorum queues; however, high publish/wait-for-ack latency is still observed sometimes (>10s). It seems a bit harder to trigger than before. I'm a bit stuck as to where to go from here, so any suggestions would be gratefully received.
Thanks,
Mark
Hi Mark,
Thanks for all of the information.
If you could point out the place in MassTransit where confirms are waited on, I could probably do something similar using the .NET client. Or, even better, provide a RabbitMQ .NET client-based console app that demonstrates this issue.
I'll try to find time to put together an application that reproduces this error. It must not be common because I think we'd hear about it more often (the .NET client is used by many RabbitMQ users, as is Mass Transit).
Hi,
I'm trying to reproduce this without MassTransit, and while doing so I've noticed that some of the time (but not most of the time) stalls are correlated with rabbit running out of memory and writing a memory resource limit alarm to the logs. So I started looking at the memory usage, and it seems that it does run quite close to being out of memory all the time.


There are no messages in any queues at this point, and 238Mb of the 239Mb of quorum queue tables are the queue that my test app was shoving messages into. Is this normal? Things have remained in this state despite no message load for a good number of hours. I'm wondering if the stalls are maybe garbage collection? I tried triggering garbage collection (via rabbitmqctl force_gc, as per the documentation) but it didn't seem to affect the memory usage.
What operating system and version are you using to run RabbitMQ?
We have two 3-node x86-64 clusters, all running Ubuntu 22.04, microk8s 1.24, rabbitmqoperator/cluster-operator:2.0.0, and rabbitmq:3.10.8-management.
If you think it might help I can upgrade this to 3.11 (we use the delayed message plugin which has just been updated, although no delayed messages are involved in the latency issue).
Thanks,
Mark
I am not sure this represents a real problem for us (as we are using MT, this should just create one connection per process, and we probably aren't spawning 10 clients all at the same time), but it does seem to indicate a problem. By comparison, the normal time to establish a connection seems to be in the 90-100ms bracket. I don't see any RAM issues anymore (the highest on any node is 490Mb/1.3Gb) or anything logged other than the connections being accepted/broken.
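(For reference, this kind of connection-establishment time can be measured with something along these lines -- a minimal sketch, not the code from the test app; the host name, port and certificate path are placeholders:)

using System;
using System.Diagnostics;
using RabbitMQ.Client;

class ConnectTiming
{
    static void Main()
    {
        var factory = new ConnectionFactory
        {
            HostName = "my-rabbit-host",    // placeholder
            Port = 5671,                    // AMQPS
            Ssl = new SslOption
            {
                Enabled = true,
                ServerName = "my-rabbit-host",      // must match the server certificate
                CertPath = "/certs/client.pfx",     // placeholder client certificate
                CertPassphrase = "changeit"
            }
        };

        // Time how long it takes to open the (TLS) connection.
        var sw = Stopwatch.StartNew();
        using var connection = factory.CreateConnection();
        Console.WriteLine($"connection established in {sw.ElapsedMilliseconds}ms");
    }
}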
Thanks,
Mark
Hi Mark -
Are you starting 10 connections within the same process or are you starting 10 processes, each opening a single connection?
Some questions -
- What is "MT"?
- What OS are you running the client apps on? What version? What version of .NET? What version of the .NET client?
- What are the specs of the machine / VM you're running the client apps on?
- Is there anything "special" about your X509 certs? Long key lengths, etc?
TCP connections aren't free and TLS connections take even more resources so my guess is that this is just due to resource contention.
We are only talking about 10 connections, and the latency spikes are 50x the 'normal' connection latency of 100ms.
Thanks,
Mark
Also, is this a new test you are doing or one that you have done before, without seeing this issue?
It's my attempt to abstract the problem out of our code base. I've also applied your recommendations (upgrading the rabbit server to 3.10.10; switching to quorum queues and increasing the amount of memory available). So I suppose at this point it's sort of a new test? It's a similar pattern to the original problem (a big spike in latency) but in this round of testing I have not encountered it after connection startup (perhaps it would reproduce by having a bunch of clients generating traffic, and then starting new ones -- not sure).
Thanks,
Mark
Thanks for the information. I have to ask those questions because most of the time people are running in resource-constrained VMs in the cloud somewhere.
If you can share the code you're using to demonstrate the issue, that would be great. I can try to reproduce it locally.
Sure, I've put it on github: https://github.com/blushingpenguin/RabbitMQProblem
It currently doesn't declare the receiving end, just the exchange. In reality it is connected to a single quorum queue; on my local test cluster I just create one using the management UI and then connect the exchange to it after running the process once.
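(The equivalent setup done through the .NET client rather than the management UI would look something like the sketch below; the queue and exchange names and the exchange type are placeholders, not what the repro actually uses.)

using System.Collections.Generic;
using RabbitMQ.Client;

var factory = new ConnectionFactory { HostName = "my-rabbit-host" }; // placeholder
using var connection = factory.CreateConnection();
using var channel = connection.CreateModel();

channel.ExchangeDeclare(exchange: "test-exchange", type: ExchangeType.Fanout, durable: true);
channel.QueueDeclare(queue: "test-queue", durable: true, exclusive: false, autoDelete: false,
    arguments: new Dictionary<string, object> { { "x-queue-type", "quorum" } }); // quorum queue
channel.QueueBind(queue: "test-queue", exchange: "test-exchange", routingKey: "");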
For testing I'm running it from a pod in the cluster, having copied over some certs that can authenticate the connection, and then triggering it like this:
root@message-test:/app/publish# ./RabbitMQProblem & ./RabbitMQProblem & ./RabbitMQProblem & ./RabbitMQProblem & ./RabbitMQProblem & ./RabbitMQProblem & ./RabbitMQProblem & ./RabbitMQProblem & ./RabbitMQProblem & ./RabbitMQProblem &
...
I get output like the below when it goes wrong (it doesn't happen every time). I have been trying to trigger this on a cluster built on VMs on my local machine, but haven't managed it so far.
root@message-test:/app/publish#
******* WAITED FOR 228ms for conn start (228 conn, 0 confirm, 0 declare)
******* WAITED FOR 296ms for conn start (292 conn, 0 confirm, 4 declare)
******* WAITED FOR 324ms for conn start (320 conn, 0 confirm, 4 declare)
******* WAITED FOR 336ms for conn start (332 conn, 4 confirm, 0 declare)
******* WAITED FOR 364ms for conn start (360 conn, 0 confirm, 4 declare)
******* WAITED FOR 352ms for conn start (352 conn, 0 confirm, 0 declare)
******* WAITED FOR 396ms for conn start (396 conn, 0 confirm, 0 declare)
******* WAITED FOR 416ms for conn start (416 conn, 0 confirm, 0 declare)
******* WAITED FOR 316ms to publish message
******* WAITED FOR 316ms to publish message
******* WAITED FOR 716ms for conn start (716 conn, 0 confirm, 0 declare)
Sent 76 in 2.00 (rate=37.93/s)
Sent 75 in 2.02 (rate=37.13/s)
Sent 74 in 2.01 (rate=36.86/s)
Sent 75 in 2.01 (rate=37.30/s)
Sent 72 in 2.02 (rate=35.71/s)
Sent 76 in 2.01 (rate=37.89/s)
Sent 74 in 2.01 (rate=36.75/s)
Sent 74 in 2.00 (rate=36.97/s)
Sent 78 in 2.00 (rate=38.92/s)
******* WAITED FOR 320ms to publish message
Sent 158 in 4.02 (rate=39.30/s)
Sent 156 in 4.04 (rate=38.62/s)
Sent 154 in 4.01 (rate=38.38/s)
Sent 154 in 4.01 (rate=38.39/s)
Sent 157 in 4.02 (rate=39.08/s)
Sent 151 in 4.04 (rate=37.39/s)
Sent 154 in 4.02 (rate=38.27/s)
Sent 150 in 4.03 (rate=37.24/s)
Sent 159 in 4.01 (rate=39.67/s)
******* WAITED FOR 5364ms for conn start (5364 conn, 0 confirm, 0 declare)
Sent 249 in 6.05 (rate=41.18/s)
Sent 250 in 6.06 (rate=41.28/s)
Sent 248 in 6.02 (rate=41.22/s)
Sent 246 in 6.02 (rate=40.89/s)
Sent 248 in 6.03 (rate=41.12/s)
Sent 245 in 6.05 (rate=40.52/s)
Sent 247 in 6.03 (rate=40.96/s)
Sent 244 in 6.05 (rate=40.36/s)
Sent 257 in 6.02 (rate=42.69/s)
Thanks,
Mark