RabbitMQ 3.6.5 - sudden and arbitrary ack slowness

anthon...@bmo.com

Aug 8, 2016, 7:58:47 PM

to rabbitmq-users

Hi all,

Having recently upgraded to RabbitMQ 3.6.5 from 3.6.2, I have been experiencing an odd behaviour, which admittedly I can't tie 100% to the upgrade, other than the fact that said upgrade was done over the weekend with no other changes. This is the impetus for another question I have about downgrading (https://groups.google.com/forum/#!topic/rabbitmq-users/x9xc_AOcVMY).

I have a standard set of central consumer queues, each of which can normally handle thousands of messages per second, sustained, pretty much all day. These consumers are set *not* to auto-ack; they acknowledge each message individually. After running for a period of time (maybe 20 minutes or so) since upgrading, the acknowledgement rate will suddenly plunge from multiples of 1000 acks/sec to around 150/sec. This then causes the queues in question to backlog increasingly, to the point where I have to purge the queues and restart the consumers. This same behaviour is not exhibited on 3.6.2 with the same client code.

I fully understand the above is hardly enough to go by but am a little unsure what information would be useful to furnish. If anyone has any insights I would be most appreciative.

Thank you,

Anthony

ni...@bluejeansnet.com

Aug 9, 2016, 12:10:38 PM

to rabbitmq-users

Any alarms or other messages reported in the regular or SASL logs around the times that you see this drop in performance? It might also be interesting to look at the "Top Processes" display in the Admin section of the UI to see if there's anything notable happening when you're running into trouble that isn't happening otherwise, either in terms of CPU usage or memory consumption.

It may also be a worthwhile exercise to update to Erlang 19, given that there were some bad mnesia problems fixed later in 18.3 (thanks to the RabbitMQ team in fact). With 18.3 we had a cluster that worked fine for months, only to suddenly experience a spate of issues over the span of about a week before returning to normal again. I'll go out on a limb and purely *speculate* that the problem might not be RabbitMQ version-specific, but possibly a consequence of having made any change at all (we've had that happen too)... it might be interesting to spin up a completely pristine cluster and see if you encounter the same problem.

I haven't seen this myself on 3.6.4 or 3.6.5 yet, however I'm not pooling my consumers (yet) in quite the way that you are. I'm also running in a Linux environment with Erlang 19. Because I have plans to move in the direction of pooled consumers, I'm interested to hear what you discover.

Nick

Auer, Anthony

Aug 9, 2016, 12:41:17 PM

to rabbitm...@googlegroups.com

I’m fairly certain there weren’t any other changes. I was wondering about upgrading to Erlang 19, but confess I’ve become somewhat gun-shy about “early-adoption” upgrades to x.0 versions (3.6.0 to name one)… I’ll give it a shot in one of my environments. Unfortunately I’m not in a position to spin up a “pristine” cluster (if by that you mean new server images).

One of the surprising things about this behaviour is that CPU and memory consumption are both relatively normal. It seems like Rabbit is somehow throttling acks and/or publishes for no immediately obvious reason.

I tried turning queue acks off, which helped a lot, though it isn’t a solution I can move to production. Even then there still seemed to be some sudden drops in throughput, without any obvious resource contention, which would result in issues for our clients. These would eventually resolve themselves.

I’ll report back anything interesting in the logs in a bit.

--
You received this message because you are subscribed to a topic in the Google Groups "rabbitmq-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/rabbitmq-users/cKNwiMJ64QE/unsubscribe.
To unsubscribe from this group and all its topics, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

ni...@bluejeansnet.com

Aug 9, 2016, 1:43:28 PM

to rabbitmq-users

By "pristine" I just meant one that hasn't previously existed as an earlier version of RabbitMQ, though certainly that can be a lot of work to set up. When we had our incident a few weeks ago, I had to manually kill some processes and delete some queues that kept giving us trouble to get everything back into an acceptable state; merely restarting clustered nodes one at a time did not actually solve the problem, as though some cruft were being left over under the hood or persisted because I never took down the entire cluster as a whole.

I can understand and appreciate hesitation to move to a .0 release of something, though Erlang seems more or less "evolutionary" rather than "revolutionary" at this point in its development. FWIW, we've observed no trouble with 19 yet, but we haven't gone into full production mode with it yet either.

It also might be interesting to look at the "maybe stuck" output when this is happening. Perhaps you're getting a stuck process, and things are waiting to time out trying to interact with it, thereby slowing everything down. Totally a guess on my part, but we did see that too when we were having mnesia troubles.
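For anyone following along, the "maybe stuck" check is normally run through rabbitmqctl eval. Below is a minimal Python sketch for capturing that output while the slowdown is happening. It assumes rabbitmqctl is on the PATH and that rabbit_diagnostics:maybe_stuck() is available in the RabbitMQ version in use; the helper names and the example node name are made up for illustration.

```python
import subprocess

# Erlang expression evaluated on the broker node; assumed to be available
# on the RabbitMQ version in use.
MAYBE_STUCK_EXPR = "rabbit_diagnostics:maybe_stuck()."


def maybe_stuck_command(node=None):
    """Build the rabbitmqctl argv for the 'maybe stuck' diagnostic.

    `node` is an optional node name such as 'rabbit@host1' (hypothetical).
    """
    argv = ["rabbitmqctl"]
    if node:
        argv += ["-n", node]
    argv += ["eval", MAYBE_STUCK_EXPR]
    return argv


def capture_maybe_stuck(node=None, timeout=120):
    """Run the diagnostic and return its text output (needs a live broker)."""
    result = subprocess.run(
        maybe_stuck_command(node),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout + result.stderr


print(maybe_stuck_command())
```

Only the command construction runs here; capture_maybe_stuck() needs a reachable node, and on Windows it would probably have to point at rabbitmqctl.bat explicitly. Saving the output from both a quiet period and a slow period makes the two easier to compare.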

Also, are you using HiPE? We haven't had problems with HiPE per se, but I have noticed that "maybe_stuck" produces a lot of spurious output when HiPE is turned on.

Nick

anthon...@bmo.com

Aug 9, 2016, 3:40:41 PM

to rabbitmq-users

On Tuesday, August 9, 2016 at 1:43:28 PM UTC-4, ni...@bluejeansnet.com wrote:

> By "pristine" I just meant one that hasn't previously existed as an earlier version of RabbitMQ, though certainly that can be a lot of work to set up. When we had our incident a few weeks ago, I had to manually kill some processes and delete some queues that kept giving us trouble to get everything back into an acceptable state; merely restarting clustered nodes one at a time did not actually solve the problem, as though some cruft were being left over under the hood or persisted because I never took down the entire cluster as a whole.

I see... in that case my cluster is as pristine as I can get without an OS-level reimage; I wiped Erlang and RabbitMQ right off the server, searching out stray db files/config in %appdata% directories, registry keys, the whole shot, but still see the activity.

> I can understand and appreciate hesitation to move to a .0 release of something, though Erlang seems more or less "evolutionary" rather than "revolutionary" at this point in its development. FWIW, we've observed no trouble with 19 yet, but we haven't gone into full production mode with it yet either.

You're probably right, I'll give that a shot and see what happens.

> It also might be interesting to look at the "maybe stuck" output when this is happening. Perhaps you're getting a stuck process, and things are waiting to time out trying to interact with it, thereby slowing everything down. Totally a guess on my part, but we did see that too when we were having mnesia troubles.

I have historically had my consumers individually ack each message; not having dived into the changelog details (yet), I'd be curious to see if the new RabbitMQ version is more sensitive to being bombarded with acks. To that end I've implemented batch-acking to see if that mitigates the issue.
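As a rough illustration of the batch-acking idea, here is a minimal Python sketch. It is not the poster's actual code. It assumes a channel object exposing basic_ack(delivery_tag=..., multiple=...), which is the shape of that call in the pika client; BatchAcker and RecordingChannel are hypothetical names, and the stand-in channel exists only so the example runs without a broker.

```python
class BatchAcker:
    """Acknowledge deliveries in batches instead of one ack per message.

    `channel` is any object exposing basic_ack(delivery_tag=..., multiple=...),
    the shape of the call in the pika client (an assumption here; adapt it
    to whatever client library you use).
    """

    def __init__(self, channel, batch_size=100):
        self.channel = channel
        self.batch_size = batch_size
        self.unacked = 0
        self.last_tag = None

    def processed(self, delivery_tag):
        """Record one processed delivery; ack once a full batch is pending."""
        self.unacked += 1
        self.last_tag = delivery_tag
        if self.unacked >= self.batch_size:
            self.flush()

    def flush(self):
        """Ack everything up to and including the last processed tag."""
        if self.unacked:
            self.channel.basic_ack(delivery_tag=self.last_tag, multiple=True)
            self.unacked = 0


class RecordingChannel:
    """Stand-in channel that records acks, so the sketch runs without a broker."""

    def __init__(self):
        self.acks = []

    def basic_ack(self, delivery_tag, multiple=False):
        self.acks.append((delivery_tag, multiple))


channel = RecordingChannel()
acker = BatchAcker(channel, batch_size=100)
for tag in range(1, 251):   # 250 deliveries, tags 1..250
    acker.processed(tag)
acker.flush()               # ack the final partial batch
print(channel.acks)         # [(100, True), (200, True), (250, True)]
```

Two caveats with this approach: the consumer prefetch has to be at least as large as the batch size, otherwise the broker stops delivering before a batch fills up, and a real consumer would also want to flush on a timer or when idle so that a partial batch does not sit unacknowledged.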

> Also, are you using HiPE? We haven't had problems with HiPE per se, but I have noticed that "maybe_stuck" produces a lot of spurious output when HiPE is turned on.

Last I checked HiPE was not an option on Windows.

anthon...@bmo.com

Aug 11, 2016, 8:06:11 AM

to rabbitmq-users

So, nothing in the logs other than the usual notifications of AMQP connections being terminated, and the same behaviour is exhibited on 3.6.5 and 3.6.4, with both Erlang 18.3 and 19. I should also say that "sudden and arbitrary" was probably a mischaracterization - it just seems that throughput is being throttled, more or less, during spikes in activity (a spike for us is 50k-100k/s published, with a temporary queue backlog lasting a few seconds. Our normal sustained rates are 3k-5k/s).
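To put rough numbers on why a drop in ack rate hurts so much at these volumes, here is a back-of-the-envelope sketch in Python. It assumes constant publish and ack rates, which real traffic will not have. The function names are just for illustration, and the 13,000 acks/sec "healthy" rate is a hypothetical figure; the other numbers are the ones quoted in this thread.

```python
def backlog_after(seconds, publish_rate, ack_rate, initial_backlog=0):
    """Queue depth after `seconds`, assuming constant rates (messages/sec)."""
    depth = initial_backlog + (publish_rate - ack_rate) * seconds
    return max(depth, 0)


def drain_time(backlog, publish_rate, ack_rate):
    """Seconds needed to clear `backlog`, or None if it can never drain."""
    surplus = ack_rate - publish_rate
    if surplus <= 0:
        return None
    return backlog / surplus


# A 50k-message spike on top of a 3k/s sustained publish rate.
print(drain_time(50_000, publish_rate=3_000, ack_rate=13_000))  # 5.0
print(drain_time(50_000, publish_rate=3_000, ack_rate=150))     # None
# Backlog growth over one minute when acks are stuck at 150/s.
print(backlog_after(60, publish_rate=3_000, ack_rate=150))      # 171000
```

In other words, once the ack rate falls below the sustained publish rate the backlog can only grow, which matches the need to purge the queues described earlier.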

I managed to figure out my downgrade issue (https://groups.google.com/forum/#!topic/rabbitmq-users/x9xc_AOcVMY) and have confirmed that on reversion to 3.6.2 this behaviour disappears. Moreover, the CPU usage on 3.6.5 during the perceived throttling is well below what I normally see during spikes on 3.6.2: we hit 90% or so on a 16-core VM with 3.6.2, and never crack 50% on 3.6.4 or .5. That leads me to suspect this change:

https://github.com/rabbitmq/rabbitmq-server/issues/612

Having only the layest of lay understandings of Erlang parameters, I noticed that the discussion of #612 mentions an option to change the default with an environment variable, but I'm not clear on which variable that would be (RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS?), what the appropriate flags would be to revert to the 3.6.2 default, or whether that is even what I would want. I'd be interested to try it though, to see if it's a dead end or worth further investigation...

Is there presently a way to do this, or is it slated for a future milestone?

As further colour, our setup is a two node cluster - ESXi VMs with 8 virtual cores each. We do use queue mirroring for these "pooled" consumers.

ni...@bluejeansnet.com

Aug 11, 2016, 3:36:50 PM

to rabbitmq-users

Maybe this commit is instructive:

https://github.com/rabbitmq/rabbitmq-server/pull/873/files

It looks like you can set a RABBITMQ_SCHEDULER_BIND_TYPE to override the default, but it also sounds like the default is "db" which is also the Erlang default.

Oddly though the testing in the discussion of the issue seems to indicate that setting +stbt db has different behavior than not setting +stbt at all... which is not what I'd have expected.

However I note a contradiction / confusing wording in the Erlang docs [emphasis mine]:

u: unbound - Schedulers will not be bound to logical processors, i.e., the operating system decides where the scheduler threads execute, and when to migrate them. This is the default.

db: default_bind - Binds schedulers the default way. Currently the default is thread_no_node_processor_spread (which might change in the future).

So you might try setting RABBITMQ_SCHEDULER_BIND_TYPE to "u" in your rabbitmq-defaults and see if that gives you back the same kind of behavior you saw in 3.6.2 and earlier. I think what the docs are trying to say is that the default behavior is not to bind at all; but if you DO choose to bind, then "db" chooses the default means of binding.
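For illustration, a small Python sketch of that suggestion: it validates a bind type against the values the Erlang +stbt flag accepts and builds an environment with RABBITMQ_SCHEDULER_BIND_TYPE set. The server_env helper is hypothetical, and how the variable actually reaches the node (startup script, service configuration, and so on) depends on the installation, so treat this as a sketch rather than a recipe.

```python
import os

# Bind types accepted by the Erlang VM's +stbt flag, per the erl documentation.
ERLANG_BIND_TYPES = {
    "u": "unbound",
    "ns": "no_spread",
    "ts": "thread_spread",
    "ps": "processor_spread",
    "s": "spread",
    "nnts": "no_node_thread_spread",
    "nnps": "no_node_processor_spread",
    "tnnps": "thread_no_node_processor_spread",
    "db": "default_bind",
}


def server_env(bind_type="u", base_env=None):
    """Return a copy of the environment with the scheduler bind type set."""
    if bind_type not in ERLANG_BIND_TYPES:
        raise ValueError(f"unknown scheduler bind type: {bind_type!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["RABBITMQ_SCHEDULER_BIND_TYPE"] = bind_type
    return env


env = server_env("u", base_env={})
print(env)  # {'RABBITMQ_SCHEDULER_BIND_TYPE': 'u'}
```

The resulting dictionary could be passed as the env argument of subprocess.Popen when starting the server from a script. For a node running as a Windows service, the variable would presumably need to be set where the service picks it up instead.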

Nick

Michael Klishin

Aug 11, 2016, 3:57:15 PM

to rabbitm...@googlegroups.com

It's worth mentioning that queue processes can be busy doing GC or paging to disk or doing other things (which are relatively rare) during which they won't deliver anything to consumers (or confirm back to publishing channels).

The fact that GC is per-queue and not global/stop-the-world doesn't mean it is entirely pauseless in the runtime.

--

MK

Staff Software Engineer, Pivotal/RabbitMQ

anthon...@bmo.com

Aug 11, 2016, 4:54:19 PM

to rabbitmq-users

This is perfect, thanks. I had seen the script, but short of modifying rabbitmq-defaults.bat (something I would only want to do absent a more accepted method), I wasn't sure what to replace "db" with. I'll try this out and see if it works.

anthon...@bmo.com

Aug 11, 2016, 5:00:36 PM

to rabbitmq-users

Thanks Michael, but I'm not sure what this is getting at. Are you suggesting an alternative explanation or an aside on top of Nick's comment?

I don't recall seeing any major garbage collection changes in the change logs for 3.6.[3-5].

Michael Klishin

Aug 11, 2016, 6:36:31 PM

to rabbitm...@googlegroups.com

Why would they be in release notes? It's the runtime that performs garbage collection and this didn't change since pre-1.0 days.

Queues don't guarantee identical throughput at all times, this includes operations such as acks.

With lazy queues the fluctuations will generally be noticeably lower but not entirely absent.

Auer, Anthony

Aug 11, 2016, 7:13:49 PM

to rabbitm...@googlegroups.com

The sorts of changes I imagined might be in the release notes would be something, well, akin to the scheduler behaviour change I cited – a change to default runtime parameters set by Rabbit, which did appear in the release notes.

In any case, how might simple queue throughput fluctuation explain a consistent throughput difference between versions (of RabbitMQ, not Erlang) at loads higher than a certain threshold? To say nothing of the consistently different CPU usage?

Michael Klishin

Aug 12, 2016, 3:47:32 AM

to rabbitm...@googlegroups.com

Anthony,

Perhaps we should take a step back. Every queue in RabbitMQ is backed by an Erlang process (to oversimplify). Erlang is a memory managed language, specifically it has a garbage collector which runs individually for every queue process. Both have been the case since day one for both RabbitMQ and Erlang.

Why would we mention that queue processes perform GC in RabbitMQ release notes? Does Cassandra mention that its runtime performs GC? Or even that it moves user data off heap where possible?

The need to perform GC, move messages to disk, internal flow control [1] and so on are behind the uneven throughput rate.

Nearly every user-facing RabbitMQ change is mentioned in the change log and for releases in the last 18 months, can be found on GitHub because all development happens there.

Throughput is a complicated topic. There is no hard and fast rule that explains why things change between versions. Sometimes changes have unexpected effects on some workloads.

Besides one Erlang VM scheduler flag change, I cannot think of any other change that is specifically aimed at reducing unnecessary overhead (e.g. due to ongoing scheduler migration between CPU cores).

A different CPU usage pattern might hint that the change above might be related. You can always use a different scheduler binding option by configuring VM flags and see. We know nothing about your CPU topology or workload, and picking a default that pleases everybody is impossible. We have debated what the default should be for months and concluded that nothing is worse than what the runtime chooses to do by default, sadly.

That change *is* in release notes.

HTH.

1. https://github.com/rabbitmq/internals/blob/master/credit_flow.md
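For readers unfamiliar with the credit flow document linked in [1], the general idea of credit-based flow control can be sketched roughly as follows. This is a toy model in Python, not RabbitMQ's implementation: the class name is made up and the credit numbers are only illustrative.

```python
class CreditedSender:
    """Toy model of credit-based flow control between two processes.

    The sender starts with `initial_credit` and spends one credit per
    message. The receiver grants `more_credit_after` credits back each time
    it has processed that many messages. At zero credit the sender blocks.
    """

    def __init__(self, initial_credit=200, more_credit_after=50):
        self.credit = initial_credit
        self.more_credit_after = more_credit_after
        self.in_flight = []          # sent but not yet processed
        self.processed_since_grant = 0

    @property
    def blocked(self):
        return self.credit == 0

    def send(self, message):
        """Send one message; returns False if the sender is blocked."""
        if self.blocked:
            return False
        self.credit -= 1
        self.in_flight.append(message)
        return True

    def receiver_process_one(self):
        """Receiver handles one message and may grant credit back."""
        if not self.in_flight:
            return
        self.in_flight.pop(0)
        self.processed_since_grant += 1
        if self.processed_since_grant == self.more_credit_after:
            self.credit += self.more_credit_after
            self.processed_since_grant = 0


sender = CreditedSender(initial_credit=200, more_credit_after=50)
sent = sum(sender.send(i) for i in range(300))   # receiver is stalled
print(sent, sender.blocked)                      # 200 True
for _ in range(50):                              # receiver catches up a bit
    sender.receiver_process_one()
print(sender.credit, sender.blocked)             # 50 False
```

The point of the model is that a receiver which pauses for any reason throttles everything upstream of it without any sign of CPU or memory pressure. That is one way throughput can dip while resource usage looks normal.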

--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Auer, Anthony

Aug 12, 2016, 7:10:54 AM

to rabbitm...@googlegroups.com

On Friday, August 12, 2016 at 3:47 AM, Michael Klishin wrote:

> Anthony,
>
> Perhaps we should take a step back. Every queue in RabbitMQ is backed by an Erlang process (to oversimplify). Erlang is a memory managed language, specifically it has a garbage collector which runs individually for every queue process. Both have been the case since day one for both RabbitMQ and Erlang.
>
> Why would we mention that queue processes perform GC in RabbitMQ release notes? Does Cassandra mention that its runtime performs GC? Or even that it moves user data off heap where possible?

I’m not sure, and I had no expectation you would. I think I was just confused because you mentioned queues and garbage collection yourself, and it wasn’t immediately clear to me how it tied back to the scheduler change and whether it was a likely cause or not.

> The need to perform GC, move messages to disk, internal flow control [1] and so on are behind the uneven throughput rate.

Thanks for all this feedback, Michael. This more or less sums up where my understanding of Erlang processes begins and ends. To be clear, I am not describing merely uneven throughput but a clear and consistent change in throughput behaviour under high (for us) load. I haven’t had the time to furnish more detailed metrics, but as a general indicator all versions prior to 3.6.4 – I’ve been using Rabbit since 3.3.2 – happily consumed a spike of 50,000+ messages in a few seconds, while under 3.6.4+ such a load has my users/developers calling me asking if the system is down. ;-}

The scheduler flag change is the only one which seems a remotely plausible cause of what I’m seeing, but obviously I can’t bolster that with much more evidence than a very high-level understanding of how context switching works, and how it might conceivably prevent full usage of cores and thus some de-parallelization of workload. In any case, it’s an easy thing to check just by trying the “unbound” configuration, and I will do so as soon as I can get the environment for exclusive use again.

> Nearly every user-facing RabbitMQ change is mentioned in the change log and for releases in the last 18 months, can be found on GitHub because all development happens there.

I have indeed found the change log very useful – discussions around changes are very robust. Certainly it helps people like me to winnow down possible causes.