I’m fairly certain there weren’t any other changes. I was wondering about upgrading to Erlang 19, but confess I’ve become somewhat gun-shy about “early-adoption” upgrades to x.0 versions (3.6.0 to name one)… I’ll give it a shot in one of my environments. Unfortunately I’m not in a position to spin up a “pristine” cluster (if by that you mean new server images).
One of the surprising things about this behaviour is that CPU and memory consumption are both relatively normal. It seems like Rabbit is somehow throttling acks and/or publishes for no immediately obvious reason.
I tried turning queue acks off, which helped a lot (though it isn’t a solution I can move to production). Even so, there still seemed to be some sudden drops in throughput, without any obvious resource contention, which would cause issues for our clients. These would eventually resolve themselves.
I’ll report back anything interesting in the logs in a bit.
--
You received this message because you are subscribed to a topic in the Google Groups "rabbitmq-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/rabbitmq-users/cKNwiMJ64QE/unsubscribe.
To unsubscribe from this group and all its topics, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
By "pristine" I just meant one that hasn't previously existed as an earlier version of RabbitMQ, though certainly that can be a lot of work to set up. When we had our incident a few weeks ago, I had to manually kill some processes and delete some queues that kept giving us trouble to get everything back into an acceptable state; merely restarting clustered nodes one at a time did not actually solve the problem, as though some cruft were being left over under the hood or persisted because I never took down the entire cluster as a whole.
I can understand and appreciate hesitation to move to a .0 release of something, though Erlang seems more or less "evolutionary" rather than "revolutionary" at this point in its development. FWIW, we've observed no trouble with 19 yet, but we haven't gone into full production mode with it yet either.
It also might be interesting to look at the "maybe stuck" output when this is happening. Perhaps you're getting a stuck process, and things are waiting to time out trying to interact with it, thereby slowing everything down. Totally a guess on my part, but we did see that too when we were having mnesia troubles.
Also, are you using HiPE? We haven't had problems with HiPE per se, but I have noticed that "maybe_stuck" produces a lot of spurious output when HiPE is turned on.
- u: unbound - Schedulers will not be bound to logical processors, i.e., the operating system decides where the scheduler threads execute, and when to migrate them. This is the default.
- db: default_bind - Binds schedulers the default way. Currently the default is thread_no_node_processor_spread (which might change in the future).
The sorts of changes I imagined might be in the release notes would be something, well, akin to the scheduler behaviour change I cited – a change to default runtime parameters set by Rabbit, which did appear in the release notes.
In any case, how might simple queue throughput fluctuation explain a consistent throughput difference between versions (of RabbitMQ, not Erlang) at loads higher than a certain threshold? To say nothing of the consistently different CPU usage?
--
MK
Staff Software Engineer, Pivotal/RabbitMQ
From: rabbitm...@googlegroups.com [mailto:rabbitm...@googlegroups.com] On Behalf Of Michael Klishin
Sent: Friday, August 12, 2016 3:47 AM
To: rabbitm...@googlegroups.com
Subject: Re: [rabbitmq-users] Re: RabbitMQ 3.6.5 - sudden and arbitrary ack slowness
Anthony,
Perhaps we should take a step back. Every queue in RabbitMQ is backed by an Erlang process
(to oversimplify). Erlang is a memory-managed language; specifically, it has a garbage collector
which runs individually for every process, including every queue process. Both have been the case
since day one for both RabbitMQ and Erlang.
Why would we mention that queue processes perform GC in RabbitMQ release notes? Does Cassandra
mention that its runtime performs GC? Or even that it moves user data off heap where possible?
I’m not sure, and I had no expectation you would. I think I was just confused because you mentioned queues and garbage collection yourself, and it wasn’t immediately clear to me how it tied back to the scheduler change and whether it was a likely cause or not.
The need to perform GC, move messages to disk, internal flow control [1] and so on are behind
the uneven throughput rate.
Thanks for all this feedback, Michael. This more or less sums up where my understanding of Erlang processes begins and ends. To be clear, I am not describing merely uneven throughput but a clear and consistent change in throughput behaviour under high (for us) load. I haven’t had the time to furnish more detailed metrics, but as a general indicator all versions prior to 3.6.4 – I’ve been using Rabbit since 3.3.2 – happily consumed a spike of 50,000+ messages in a few seconds, while under 3.6.4+ such a load has my users/developers calling me asking if the system is down. ;-}
The scheduler flag change is the only one which seems a remotely plausible cause of what I’m seeing, but obviously I can’t bolster that with much more evidence than a very high-level understanding of how context switching works, and how it might conceivably prevent full usage of cores and thus some de-parallelization of workload. In any case, it’s an easy thing to check just by trying the “unbound” configuration, and I will do so as soon as I can get the environment for exclusive use again.
Nearly every user-facing RabbitMQ change is mentioned in the change log and, for releases in the last 18 months,
can be found on GitHub, because all development happens there.
I have indeed found the change log very useful – discussions around changes are very robust. Certainly it helps people like me to winnow down possible causes.
Throughput is a complicated topic. There is no hard and fast rule that explains why things change
between versions. Sometimes changes have unexpected effects on some workloads.
Besides one Erlang VM scheduler flag change, I cannot think of any other change that is specifically
aimed at reducing unnecessary overhead (e.g. due to ongoing scheduler migration between CPU cores).
A different CPU usage pattern might hint that the change above might be related. You can always use a different
scheduler binding option by configuring VM flags and see. We know nothing about your CPU topology or workload,
and picking a default that pleases everybody is impossible. We have debated what the default should be for months
and concluded that nothing is worse than what the runtime chooses to do by default, sadly.
That change *is* in release notes.
Yes, I read that discussion, and the conclusion certainly seems to make sense to me. Unfortunately I know little more about our CPU topology than you do, beyond processor type, speed and core count. On top of that is the further confusion introduced by the vagaries of VMware/ESXi configuration, the details of which I am not privy to. If there is any information I can furnish about our setup once I’ve tested this, please let me know what would be most useful to you and the community and I’ll try to track it down with our VMware and Wintel groups.
HTH.
On Fri, Aug 12, 2016 at 2:13 AM, Auer, Anthony <Anthon...@bmo.com> wrote:
The sorts of changes I imagined might be in the release notes would be something, well, akin to the scheduler behaviour change I cited – a change to default runtime parameters set by Rabbit, which did appear in the release notes.
In any case, how might simple queue throughput fluctuation explain a consistent throughput difference between versions (of RabbitMQ, not Erlang) at loads higher than a certain threshold? To say nothing of the consistently different CPU usage?
I did – in particular I set:
RABBITMQ_SCHEDULER_BIND_TYPE=u
And have verified the effect: `rabbitmqctl eval "erlang:system_info(scheduler_bindings)."` yields `{unbound,unbound,unbound,unbound,unbound,unbound,unbound,unbound}`
`rabbitmqctl eval 'erlang:system_info(cpu_topology).'` yields
[{processor,[{core,{logical,0}},
             {core,{logical,1}},
             {core,{logical,2}},
             {core,{logical,3}}]},
 {processor,[{core,{logical,4}},
             {core,{logical,5}},
             {core,{logical,6}},
             {core,{logical,7}}]}]
and `rabbitmqctl eval 'erlang:system_info(cpu_topology_detected).'` yields
Error: {badarg,[{erlang,system_info,[cpu_topology_detected],[]},
                {erl_eval,do_apply,6,[{file,"erl_eval.erl"},{line,670}]},
                {rpc,'-handle_call_call/6-fun-0-',5,
                     [{file,"rpc.erl"},{line,206}]}]}
I tried looking around the Erlang docs but don’t seem to be able to find anything on the argument “cpu_topology_detected”, so I’m not sure if there is an alternative syntax.
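The `badarg` above is most likely because `cpu_topology_detected` is not an atom that `erlang:system_info/1` accepts. As far as the Erlang docs go, the detected topology is requested with the tuple `{cpu_topology, detected}` (`defined` and `used` are the other variants). Below is a minimal sketch that only builds the corrected `rabbitmqctl eval` argument list. The helper name `topology_eval_args` is hypothetical, and it assumes `rabbitmqctl` is on the PATH of whatever runs the command.

```python
# Variants of {cpu_topology, Kind} accepted by erlang:system_info/1.
TOPOLOGY_KINDS = ("defined", "detected", "used")


def topology_eval_args(kind: str = "detected") -> list:
    """Build the argument list for a `rabbitmqctl eval` CPU topology query.

    Hypothetical helper: it only constructs the command, it does not run it.
    """
    if kind not in TOPOLOGY_KINDS:
        raise ValueError(f"unknown topology kind: {kind!r}")
    # The argument is a tuple, not a single atom such as cpu_topology_detected.
    expr = f"erlang:system_info({{cpu_topology, {kind}}})."
    return ["rabbitmqctl", "eval", expr]


# The list can be handed to subprocess.run(...) on a host with RabbitMQ installed.
print(topology_eval_args())
```

Passing the expression as a single list element avoids the quoting differences between Windows and Unix shells.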
I can’t think of a reason the following would be relevant but in case it is, I had the atom index table count jacked up in response to the table exhaustion issue from 3.6.2, and didn’t get rid of it on upgrade. I had also jacked up the total file handle count from previous versions as well, when I thought we were exhausting file count (this later turned out to be a misreported number on Windows – to my knowledge this is still open). There didn’t seem to be a good reason to reduce them again.
Thus my additional ERL args are
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS=+Q 65536 +t 4190304
From: rabbitm...@googlegroups.com [mailto:rabbitm...@googlegroups.com] On Behalf Of Michael Klishin
Sent: Sunday, August 14, 2016 12:33 AM
To: rabbitm...@googlegroups.com
Ah ok, that explains it. I should optimize my email read/write lock ;-]
From: rabbitm...@googlegroups.com [mailto:rabbitm...@googlegroups.com] On Behalf Of Michael Klishin
Sent: Sunday, August 14, 2016 12:35 AM
To: rabbitm...@googlegroups.com
I am doing the same, but then `rabbitmqctl eval "erlang:system_info(scheduler_bindings)."` evaluates to
{0,1}
instead of the unbound that you are seeing. Not sure what I might be doing differently.
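For comparing the two results in this thread: as far as I understand the Erlang docs, `scheduler_bindings` returns one tuple element per scheduler, and each element is either `unbound` or the integer id of a logical processor. On that reading, `{0,1}` would mean two schedulers bound to logical processors 0 and 1, i.e. the unbound setting has not taken effect on that node. Below is a minimal sketch that summarizes such output, assuming it has already been captured as a string. The function names are hypothetical.

```python
def parse_scheduler_bindings(output: str) -> list:
    """Parse an Erlang tuple such as '{unbound,unbound}' or '{0,1}'.

    Returns one entry per scheduler: None for 'unbound', otherwise the
    integer id of the logical processor the scheduler is bound to.
    """
    inner = output.strip().strip("{}")
    items = [item.strip() for item in inner.split(",") if item.strip()]
    return [None if item == "unbound" else int(item) for item in items]


def summarize(output: str) -> str:
    """Describe how many schedulers are bound, and to which processors."""
    bindings = parse_scheduler_bindings(output)
    bound = [b for b in bindings if b is not None]
    if not bound:
        return f"{len(bindings)} schedulers, all unbound"
    return f"{len(bindings)} schedulers, {len(bound)} bound to logical processors {bound}"


print(summarize("{unbound,unbound,unbound,unbound,unbound,unbound,unbound,unbound}"))
print(summarize("{0,1}"))
```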