RabbitMQ 3.6.5 consumer performance degradation


ni...@bluejeansnet.com

May 8, 2017, 2:42:30 PM
to rabbitmq-users
I'll preface this by saying this is by no means any kind of issue/bug report, and I already plan to update to the latest 3.6.

What I'm hoping to discover is whether anyone else has seen consumer performance run along quite happily for some time, and then slowly begin to degrade over a period of hours (in our case it's been about 16 hours)  to the point at which consuming from the queue stops entirely and never recovers.

This seems to be happening to individual long-lived consumers in isolation and at different times; queue performance remains typical throughout, and we only tend to notice once sufficiently many consumers have seized up that we start getting a queue backlog.

Restarting consumers fixes the trouble immediately; however, the way I've done this is by restarting our service on a node, so I can't absolutely rule out our application (Java) doing something wrong.

Thus my goal here is to hear if this sounds familiar, especially if it sounds like a bug from 3.6.5 that has already been fixed, in which case I won't sweat it too much until we have a maintenance window to upgrade.  Otherwise, I'll do some more running around with my hair on fire, because that usually helps.

Anyone seen something similar?

   Nick

Michael Klishin

May 8, 2017, 3:11:15 PM
to rabbitmq-users
It's hard to tell without having consumer metric charts.

We've seen a scenario that causes something similar, but it wasn't necessarily time-dependent. Our permanently running
environment with varying workload and periodic backlogs no longer reproduces it, or at least it goes
away quickly enough that it's not easy to spot when publisher rates vary by design.

We collect rates, network throughput, disk I/O and total Ready message backlog.

ni...@bluejeansnet.com

May 8, 2017, 3:29:07 PM
to rabbitmq-users
Alas I was too focused on getting the cow out of the ditch to grab the charts at the time.  If it happens again, I will do that.  In the scenario we've seen, the consumer never recovers once it drops below 1 message per minute.

I think I'll add some lightweight instrumentation to the activities we perform inside the consumer, with some loud logging in the event that they start taking abnormally long, and also schedule an upgrade to 3.6.9; 3.6.5 is pretty ancient at this point.
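
Something along these lines is what I have in mind. It's just a rough sketch; the class name, logger, and 5-second threshold are illustrative rather than anything we actually run:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    import java.util.function.Supplier;

    /** Times a unit of consumer work and logs loudly if it runs abnormally long. */
    public class LoudTimer {
        private static final Logger log = LoggerFactory.getLogger(LoudTimer.class);
        private static final long WARN_THRESHOLD_MS = 5_000; // illustrative threshold

        public static <T> T time(String operation, Supplier<T> work) {
            long start = System.nanoTime();
            try {
                return work.get();
            } finally {
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                if (elapsedMs > WARN_THRESHOLD_MS) {
                    log.warn("Consumer operation '{}' took {} ms", operation, elapsedMs);
                }
            }
        }
    }

The worker loop would then wrap each step, e.g. LoudTimer.time("process-job", () -> processor.handle(message)), where processor.handle is a hypothetical stand-in for whatever does the actual work, so a stalled or slow step announces itself in the logs.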

That way, we'll have all the bases covered and either the problem will go away entirely, or at least we'll have more information if it happens again.

   Nick

Michael Klishin

May 8, 2017, 3:37:10 PM
to rabbitm...@googlegroups.com
Thank you.

Do you observe ingress network traffic to those consumers degrade at a comparable pace?
What about consumer utilization as reported by RabbitMQ?







--
MK

Staff Software Engineer, Pivotal/RabbitMQ

ni...@bluejeansnet.com

May 10, 2017, 4:49:58 PM
to rabbitmq-users
Unfortunately, it looks like I disabled a lot of stats way back when there were memory leakage problems in the stats collector (that's how long it's been since we had any issues).

However, I saw the same behavior this afternoon on two of the consumers. Consumer utilization was reported as 0% with ~200 messages backlogged. Three out of five consumers were showing a significant uptick in the number of reductions per second on their channels, and the other two were showing effectively zero reductions per second. Restarting those consumers (on the client side) helped, and utilization jumped up to 100% after that.

I didn't have to restart all of the consumers to get things moving again, so I think we can (probably) rule out any problem with the queue itself.

Unfortunately, I don't think I can draw any conclusions from the available data about whether the client side wasn't sending an ACK/NACK or the server side wasn't delivering a message. However, our next release will contain more debugging messages on the client side to indicate how long certain operations are taking, so we should be able to home in on the problem should it continue to occur after the 3.6.9 update.

   Nick



Michael Klishin

May 10, 2017, 5:03:02 PM
to rabbitm...@googlegroups.com
That's a step forward in the investigation nevertheless.

When you say "messages backlogged", do you mean that they were ready for delivery or delivered but unacknowledged?

With 0% utilization, little or no reduction activity on channels that are only used to consume is more or less expected: those channels aren't doing any work. (To be honest and precise, channels don't really do the last-mile delivery to consumers; they delegate to their companion channel writer processes, but that still counts as "work" when it comes to counting reductions in the runtime.)

I agree with your conclusion that at this point it looks more likely to be an issue with the consumers rather than the queue, but let's see what adding more metrics (both for RabbitMQ and your apps) will reveal.



ni...@bluejeansnet.com

May 10, 2017, 6:45:48 PM
to rabbitmq-users
Correct; pardon the imprecise characterization on my part.  I meant messages that were in a Ready state.  (We think of it as a "backlog" because it's a work queue with jobs to perform in it.)

For this particular queue we set a prefetch (basic.qos) of 1, so we'll never have more unacknowledged messages than the number of consumers. I regret that I didn't think to note the number of messages in the Unacked state; next time I'll just take a screenshot like a reasonable person, so I don't have to remember anything. It might be interesting to see if that number is 3 rather than 5 in the case where two of the five consumers are having trouble, or if it remains 5 while things are processing.
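
For context, the consumer setup is essentially the stock Java-client pattern below (a simplified sketch; the "jobs" queue name and localhost are placeholders, not our real configuration):

    import com.rabbitmq.client.*;

    public class SingleJobConsumer {
        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("localhost"); // placeholder host
            Connection conn = factory.newConnection();
            Channel channel = conn.createChannel();

            // prefetch of 1: the broker sends this consumer at most one
            // unacknowledged message at a time, so Unacked can never exceed
            // the number of live consumers
            channel.basicQos(1);

            channel.basicConsume("jobs", false, new DefaultConsumer(channel) {
                @Override
                public void handleDelivery(String consumerTag, Envelope envelope,
                                           AMQP.BasicProperties properties, byte[] body)
                        throws java.io.IOException {
                    // ... perform the job ...
                    getChannel().basicAck(envelope.getDeliveryTag(), false);
                }
            });
        }
    }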

One thing that's somewhat interesting is that we probably should have been able to handle the load with just the 3 consumers that were working, yet it seemed like we were making no progress until the unresponsive consumers were restarted.  If it happens again, I'll test this theory by first just stopping those consumers without restarting them, and seeing if the other consumers begin to make progress.

Currently targeting May 20 for both an app and RabbitMQ update, so one way or another we'll know something more useful soon.

Thanks for your help as always!

   Nick



Michael Klishin

May 10, 2017, 7:45:33 PM
to rabbitm...@googlegroups.com
Since you already have N consumers working in parallel off of a single queue,
perhaps https://github.com/rabbitmq/rabbitmq-sharding/ could be used to improve delivery parallelism
and possibly sidestep the problem altogether.
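
Roughly, the client side would look something like this. It's a sketch based on the plugin README rather than anything specific to your setup; the "jobs" exchange name is an example, and it assumes the rabbitmq_sharding plugin is enabled with a "shards-per-node" policy applied to the exchange:

    import com.rabbitmq.client.*;

    public class ShardedConsumer {
        public static void main(String[] args) throws Exception {
            Connection conn = new ConnectionFactory().newConnection();
            Channel ch = conn.createChannel();

            // publishers use an exchange of the plugin-provided x-modulus-hash
            // type; the server-side shards-per-node policy creates the shard queues
            ch.exchangeDeclare("jobs", "x-modulus-hash", true);

            // consumers consume from the "pseudo-queue" named after the
            // exchange; the plugin attaches each consumer to one of the shards
            ch.basicQos(1);
            ch.basicConsume("jobs", false, new DefaultConsumer(ch) {
                @Override
                public void handleDelivery(String tag, Envelope env,
                                           AMQP.BasicProperties props, byte[] body)
                        throws java.io.IOException {
                    // ... process the job, then ack ...
                    getChannel().basicAck(env.getDeliveryTag(), false);
                }
            });
        }
    }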

By May 20th RabbitMQ 3.6.10 should be out ;)


ni...@bluejeansnet.com

May 11, 2017, 10:17:08 AM
to rabbitmq-users
I have some good news, which won't change my plans to upgrade (since we need to anyway) but sheds some light on the problem.

There was a rare and obscure condition that could cause our worker loop to throw an unchecked exception, which in turn would of course cause the loop to stop executing, leaving us with a consumer that had just consumed a message but never sent an ACK or a NACK for it. The nature of the exception was that it would kill multiple worker loops around the same time of day across the entire cluster, so from the RMQ server's side it looked like we had a bunch of consumers that were just stuck doing nothing.

I'm not 100% sure that was the cause of our trouble in this case, but I am about 95% sure of it.  I found the trouble when I went looking for all errors and exceptions in the log around the time that the trouble began.  Fortunately, I'd heeded some good advice I'd gotten long ago about long-running threads in Java: always provide an UncaughtExceptionHandler that logs uncaught exceptions (loudly); otherwise your thread can die and leave you guessing as to why.
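
For anyone curious, the pattern is just this (a minimal sketch; the class names and logger are illustrative, not our actual code):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class WorkerBootstrap {
        private static final Logger log = LoggerFactory.getLogger(WorkerBootstrap.class);

        public static void main(String[] args) {
            Runnable workerLoop = () -> {
                // stand-in for the real consume/process/ack loop
            };

            Thread worker = new Thread(workerLoop, "queue-worker");
            // log loudly if the loop ever dies from an unchecked exception, so a
            // silently stopped consumer shows up in our logs instead of just
            // going quiet on the broker side
            worker.setUncaughtExceptionHandler((t, e) ->
                    log.error("Worker thread {} died with an uncaught exception", t.getName(), e));
            worker.start();
        }
    }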

Hopefully this advice is helpful for the next person to read it, too :-)

Will look into the sharding plugin again; I had meant to do this some time ago, but there's always something with a higher priority.  It's likely to scale far better in the long run.

   Nick