I've since learned that the crashes happened again while the daily disk backup snapshots were being taken. Latency in file system operations during the snapshot could be the cause.
> to switching to an OS where this issue does not exhibit itself (to our knowledge)
If that's your recommended action, this might very well be what we try next. It's not a quick or free solution: it will require doubling the number of hosts in the infrastructure (currently all other software in the solution runs on Windows). And if you have doubts about the Windows implementation, as you seem to, maybe consider adding a warning on the website saying that the Windows version has known issues?
For reference, here is our cluster configuration:

```
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_classic_config
cluster_formation.classic_config.nodes.1 = rabbit@Developpement
cluster_formation.classic_config.nodes.2 = rabbit@server2
cluster_formation.classic_config.nodes.3 = rabbit@server3
cluster_partition_handling = autoheal
```
And the HA policy we apply:

```
Name: HA
Pattern: .*
Apply to: Exchanges and queues
ha-mode: exactly
ha-params: 2
ha-sync-mode: automatic
max-length-bytes: 50000000
```
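For completeness, this should be the same policy as applied from the command line (assuming "Exchanges and queues" corresponds to `--apply-to all`; on Windows this would be `rabbitmqctl.bat`, with different quoting):

```
rabbitmqctl set_policy HA ".*" \
  '{"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"automatic","max-length-bytes":50000000}' \
  --apply-to all
```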
> pauses a virtual machine, which is known to not be handled well by RabbitMQ

I would not have expected a pause/resume operation on a VM to be any different from a network failure or a system shutdown in terms of how RabbitMQ tries to resume its operation once the VM is back. I believe running RabbitMQ on VMs (even on AWS/Azure hosts) is quite common, and that taking snapshots of servers is also a common operation.
1. How do you propose we handle server snapshots if RabbitMQ can't survive a short pause?
2. Is there a way to quickly recover from the situation once it has failed, without resetting all nodes? (See the `force_boot` sketch after this list.)
3. Is there something that can be done from the client library to recover from it? I see a lot of this exception in my application logs:

   ```
   RabbitMQ.Client.Exceptions.OperationInterruptedException: The AMQP operation was interrupted: AMQP close-reason, initiated by Peer, code=404, text="NOT_FOUND - home node 'rabbit@Developpement' of durable queue 'nsb.delay-level-27' in vhost '/' is down or inaccessible", classId=50, methodId=10, cause=
   ```

   Maybe the client library could catch the NOT_FOUND error and try to re-create the queues/exchanges in this situation (see the sketch below)? Is there a precise description I could come up with and post as an issue on the ParticularSoftware GitHub that would be clear and would help keep the cluster valid?
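On question 2, here is what I found so far, untested on our cluster, so treat it as an assumption: if a node refuses to boot because it is waiting for a peer that was not the last node to shut down, `rabbitmqctl force_boot` should let it boot unconditionally, which would avoid resetting every node. Would that be appropriate here?

```
# Run on the stuck node while RabbitMQ is stopped (rabbitmqctl.bat on Windows),
# then start the service again; the node boots without waiting for its peers.
rabbitmqctl force_boot
```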
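On question 3, would enabling automatic connection and topology recovery in the .NET client, plus catching the 404, be a reasonable direction? A minimal sketch of what I have in mind, against the RabbitMQ.Client v6 API (the queue name is taken from the log above, but the declare flags and arguments are guesses, since NServiceBus declares those queues itself):

```csharp
using System;
using RabbitMQ.Client;
using RabbitMQ.Client.Exceptions;

class RecoverySketch
{
    static void Main()
    {
        var factory = new ConnectionFactory
        {
            HostName = "Developpement",
            // Reconnect automatically after network failures / broker restarts.
            AutomaticRecoveryEnabled = true,
            // Re-declare queues, exchanges and bindings after the connection recovers.
            TopologyRecoveryEnabled = true,
            NetworkRecoveryInterval = TimeSpan.FromSeconds(10)
        };

        using var connection = factory.CreateConnection();
        var channel = connection.CreateModel();

        try
        {
            // Passive declare: throws with reply code 404 if the queue's
            // home node is down or the queue does not exist.
            channel.QueueDeclarePassive("nsb.delay-level-27");
        }
        catch (OperationInterruptedException ex) when (ex.ShutdownReason?.ReplyCode == 404)
        {
            // The 404 closes the channel, so open a fresh one before re-declaring.
            // These flags/arguments are assumptions; they must match the
            // original declaration or the broker will reject the declare.
            channel = connection.CreateModel();
            channel.QueueDeclare("nsb.delay-level-27",
                durable: true, exclusive: false, autoDelete: false, arguments: null);
        }
    }
}
```

I realize the re-declare can itself fail with another 404 while the queue's home node is still down, so this would only help once a live node can host the queue.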