--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
You received this message because you are subscribed to a topic in the Google Groups "rabbitmq-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/rabbitmq-users/0OVgPKKY1jU/unsubscribe.
To unsubscribe from this group and all its topics, send an email to rabbitmq-user...@googlegroups.com.
--
MK
Staff Software Engineer, Pivotal/RabbitMQ
A queue is a stateful entity that reacts to cluster membership changes. Promoting a different replica takes time, even if it's only milliseconds for a replica that is in sync or for an empty queue. The process isn't sequential (at least I cannot immediately think of a part where it would be), but it modifies shared cluster state, which involves acquiring write locks in the internal data store. So yes, it is generally expected that the more queues there are, the more work will happen on other nodes when one node leaves. Two data points are not enough to conclude whether this is a linear relationship.
A better question to ask here is: is this migration necessary? Can it be avoided entirely? Is there a way to prevent queues from migrating?
The answer is: there should be a way, but currently there's only a workaround that involves disabling mirroring or messing with mirroring setups so that the node undergoing maintenance next won't have any queue masters. Some examples have definitely been discussed on this list before. [3] mentions some relevant settings (you can tell a node that its replica set involves specific nodes, explicitly moving it away from the node that's about to be shut down).
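As a rough illustration of the "replica set involves specific nodes" workaround, here is a minimal Python sketch that builds a classic mirrored queue policy using the `ha-mode: nodes` and `ha-params` keys and prints the matching `rabbitmqctl set_policy` command. The policy name `maintenance-pin`, the pattern `^` and the node name `rabbit@node-b` are assumptions made up for this example, and how quickly masters actually move after the policy changes should be checked against the mirroring documentation for the version in use.

```python
import json
import shlex


def pinned_mirror_policy(nodes):
    """Build a policy definition that restricts queue replicas to `nodes`."""
    if not nodes:
        raise ValueError("at least one node is required")
    # "ha-mode": "nodes" with "ha-params" listing node names is the
    # classic mirrored queue way of naming an explicit replica set.
    return {"ha-mode": "nodes", "ha-params": list(nodes)}


def set_policy_command(name, pattern, definition, apply_to="queues"):
    """Return the rabbitmqctl invocation as an argument list."""
    return [
        "rabbitmqctl", "set_policy",
        "--apply-to", apply_to,
        name, pattern, json.dumps(definition),
    ]


# Hypothetical values: keep replicas on node B only, so that node A
# holds no queue masters by the time it is stopped for maintenance.
definition = pinned_mirror_policy(["rabbit@node-b"])
command = set_policy_command("maintenance-pin", "^", definition)
print(shlex.join(command))
```

The command is only printed, not executed, so it can be reviewed first. The policy would need to be reverted (or replaced with the original one) once the node is back.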
There is no consensus on what alternative would work better. Quorum queues will have a different leader election and sync implementation, which is a lot less drastic and, in some ways, significantly more efficient. [1] covers this in detail.
We believe that Blue/Green deployments are generally the best way to do upgrades, but automating them is often non-trivial. It would make this particular issue mostly irrelevant, since your apps will gradually migrate to a different cluster and won't be affected nearly as much.
On Thursday, January 17, 2019 at 12:54:45 PM UTC+3, João Nuno Silva wrote:

Thank you for the detailed comments. Re-reading my question, I see that it's confusing because I omitted some important details. Let me try to clarify the question.

Motivation

Assume I have an N-node cluster with ha=all (for simplicity we can consider N=2, nodes A and B). I want to perform a rolling restart of all machines to apply an OS security patch. I want to do this while minimizing perceived downtime from the point of view of publishers/subscribers.

Context

I created 1 + 1000 queues. The 1 is the queue under test and the 1000 are empty queues which are not receiving messages. Node A is the master of all these queues. I already restarted node B and applied the security patch (stop_app, service rabbit stop, reboot machine and apply patch, service rabbit start, start_app). Node B is fully replicated but is not the master of any queue. At this point I want to apply the patch on node A. When I do a stop_app on node A, node B will become the master of all queues.

Problem

This process is taking time proportional to the number of queues.

Scenario 1) If I just have 1 queue (the queue under test), this master migration is immediate. The measured latency between publish and subscribe is just a couple of milliseconds.

Scenario 2) If I have the other 1000 queues (although these are not receiving messages), the latency increases to about two seconds.

Note that the publisher and subscriber configuration is the same in both scenarios: an AutorecoveringConnection with networkRecoveryInterval set to 100ms and a publish rate of 1 msg/s. I agree that this is not a perfect benchmarking setup, but it consistently reproduces the offending behavior we're seeing in production.

Questions

1) Is it expected that master migration time is proportional to the number of queues? If not, should I file a bug? Give you the example code I'm using to repro this?
2) Is there a better way to perform these maintenance operations without downtime?
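
The publish-to-subscribe latency comparison described in the quoted message could be computed along these lines. This is a minimal sketch, not the poster's actual reproduction code: it assumes the publisher and the subscriber each log a message id with a timestamp in seconds, and the function names and sample timestamps are hypothetical.

```python
def publish_to_consume_latencies(published, received):
    """Map message id -> seconds between publish and delivery.

    Only messages that were actually delivered are included.
    """
    return {
        msg_id: received[msg_id] - sent_at
        for msg_id, sent_at in published.items()
        if msg_id in received
    }


def summarize(published, received):
    """Report the worst observed latency and any undelivered message ids."""
    latencies = publish_to_consume_latencies(published, received)
    undelivered = sorted(set(published) - set(received))
    worst = max(latencies.values()) if latencies else None
    return {"worst_latency_s": worst, "undelivered": undelivered}


# Hypothetical log data: one message published per second; message 2 is
# delayed while queue masters migrate, and message 4 never arrives.
published = {1: 0.0, 2: 1.0, 3: 2.0, 4: 3.0}
received = {1: 0.002, 2: 3.0, 3: 3.001}

summary = summarize(published, received)
print(summary)
```

Running the same summary for the 1-queue and the 1 + 1000-queue scenarios gives two comparable worst-case numbers. More queue counts in between would be needed to say anything about the shape of the relationship.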