One interesting thing we found while investigating this issue is that force-leave commands do not complete when they cannot communicate with the nodes they are attempting to remove. This means a new command or workflow may be needed so that removing an unreachable node from the local WAN list is treated as a valid operation. This is a shortfall of the current implementation.
What can be done for your cluster?
In this particular instance, there's not much that can be done. You could adjust the reconnect_timeout_wan setting in your configuration file and do a rolling restart of your server processes. This would cut the wait before the failed nodes are reaped from 72 hours to 8 hours, but a rolling process restart has its own downsides. I would not recommend this, and quite honestly, it's a user experience we need to improve.
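For reference only, here is a minimal sketch of what that setting might look like, assuming your servers use a JSON agent configuration file (the HCL form would be equivalent). The value 8h is the lowest that reconnect_timeout_wan accepts, and the change only takes effect after each server process is restarted.

```json
{
  "reconnect_timeout_wan": "8h"
}
```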
Here's what I can do to get things rolling:
1. We will reopen "Accept datacenter as a param for force-leave from WAN pool" so that the -wan flag can be added, mark it with "type/.:umbrella:", and have the Consul team investigate further. As mentioned in the ticket, there are a number of technical blockers that need to be considered when messing with the WAN pool. Please watch that ticket and :+1: :)
2. I have opened up "Update force-leave documentation to reflect LAN vs WAN behavior" to track making the documentation for consul force-leave more accurate.
I hope this helps! Thank you so much for your detailed information and please watch the tickets linked here for more info.
Best,
Jono Sosulska
Digital Developer Advocate || Community