We have been testing this scenario on RabbitMQ 3.9.5 and it is reproducible. The issue is almost exactly the same as the one described here
https://github.com/rabbitmq/rabbitmq-server/issues/2749 with the exception that here a rebalance request initiates the sequence. Otherwise, classic queues with an ha-mode exactly policy of 2 (fewer than the 3 nodes in the cluster) behave similarly.
What has been observed:
1. 3 node cluster. ha-mode exactly 2.
2. Classic mirrored queue on A (leader) and B (replica).
3. Create a backlog in the queue (1M events of 1 KB each in our tests).
4. Initiate a rebalance via the Management API.
5. The rebalance drops A, B is promoted to leader, and C is added as the new replica and starts syncing.
6. If B goes down before the sync completes, C is promoted to leader and adds A as the replica. The queue is now empty.
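The window in steps 5 and 6 can be made visible by checking the mirror sync state before and after a rebalance. Below is a minimal sketch, assuming the queue object returned by the Management API (`GET /api/queues/{vhost}/{name}`) carries `slave_nodes` and `synchronised_slave_nodes` for classic mirrored queues, which is our understanding for 3.9.x. The helper names and the sample node names are hypothetical.

```python
from typing import Any, Dict, List


def unsynced_mirrors(queue_info: Dict[str, Any]) -> List[str]:
    """Return mirrors that are listed for the queue but not yet synchronised.

    Assumes `queue_info` is the decoded JSON for one queue from the
    Management API, with `slave_nodes` and `synchronised_slave_nodes` keys.
    """
    mirrors = queue_info.get("slave_nodes") or []
    synced = set(queue_info.get("synchronised_slave_nodes") or [])
    return [node for node in mirrors if node not in synced]


def has_synced_mirror(queue_info: Dict[str, Any]) -> bool:
    """True only if the queue has at least one mirror and all mirrors are in sync."""
    mirrors = queue_info.get("slave_nodes") or []
    return bool(mirrors) and not unsynced_mirrors(queue_info)


# State after step 5: B leads, C has been added but is still syncing.
after_rebalance = {
    "name": "backlog",
    "node": "rabbit@B",
    "slave_nodes": ["rabbit@C"],
    "synchronised_slave_nodes": [],
}

# State before step 4: A leads, B is a fully synchronised mirror.
before_rebalance = {
    "name": "backlog",
    "node": "rabbit@A",
    "slave_nodes": ["rabbit@B"],
    "synchronised_slave_nodes": ["rabbit@B"],
}
```

While `has_synced_mirror` is false, the leader holds the only complete copy of the backlog, so losing it in that window matches the empty queue seen in step 6.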
I realize that this is all legal given that our ha-promote-on-failure is left at its default of always, but should the rebalance be more cautious here and ensure the new replica is in sync before dropping an existing one? Although I suppose that would mean the ha-mode exactly 2 policy is in violation during the sync?
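For reference, here is a sketch of the policy in question as a Management API request body (`PUT /api/policies/{vhost}/{name}`), with `ha-promote-on-failure` as the one knob being discussed. The pattern `^backlog\.` and the function name are hypothetical, and `ha-sync-mode: automatic` is an assumption based on the new replica starting to sync on its own in step 5. Setting `when-synced` should keep an unsynchronised mirror from being promoted, at the cost of the queue becoming unavailable until a synchronised copy returns; we have not verified how it interacts with rebalance.

```python
import json
from typing import Any, Dict


def mirror_policy(pattern: str, promote_on_failure: str = "always") -> Dict[str, Any]:
    """Build a classic mirrored queue policy body with ha-mode exactly 2.

    `promote_on_failure` must be "always" (the default behaviour) or
    "when-synced" (refuse to promote a mirror that is not in sync).
    """
    if promote_on_failure not in ("always", "when-synced"):
        raise ValueError("promote_on_failure must be 'always' or 'when-synced'")
    return {
        "pattern": pattern,
        "apply-to": "queues",
        "priority": 0,
        "definition": {
            "ha-mode": "exactly",
            "ha-params": 2,
            # Assumption: automatic sync, as implied by step 5.
            "ha-sync-mode": "automatic",
            "ha-promote-on-failure": promote_on_failure,
        },
    }


current = mirror_policy(r"^backlog\.")
cautious = mirror_policy(r"^backlog\.", promote_on_failure="when-synced")
print(json.dumps(cautious, indent=2))
```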
I just want to make sure our understanding of what is happening here is correct, and to ask whether there are other steps we could take to avoid this type of issue short of transitioning to quorum queues or forcing replication to all nodes (which won't scale for us on larger deployments).
Would this behavior be considered a bug candidate, or is it defined system behavior given the choice of ha-mode policy with classic queues and the leader promotion settings?