How to "consul force-leave" members of WAN pool?

M Moore

Jun 11, 2020, 1:09:02 PM6/11/20
to Consul
I have many Consul clusters to monitor, and sometimes people or vendors I have no influence over will shut off the accounts that contain these clusters in a crude way, giving the remote cluster's servers no chance to notify the WAN gossip pool that they are leaving and will never return. Their datacentre names still appear in the catalog, which gives monitoring systems false hope that there is a target to monitor, and produces false-positive alerts about a datacentre not responding to queries.

In the documentation, I can identify the rudely departed datacentres with `consul members -wan | grep failed`, but not with `consul catalog datacenters` nor with `http://localhost:8500/v1/catalog/datacenters`.
https://www.consul.io/docs/commands/members  (see the "-wan" parameter at the bottom of the page).
The documentation also states: "When [consul force-leave is] run on a server that is part of a WAN gossip pool, force-leave can remove failed servers in other datacenters from the WAN pool. The identifying node-name in a WAN pool is [node-name].[datacenter]." https://www.consul.io/docs/commands/force-leave.html

When I run `consul force-leave ${remote_wan_node}.${remote_dc}`, the command returns success, but `consul members -wan | grep ${remote_wan_node}` shows the node is still in the WAN pool.
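
Concretely, here is roughly what I am doing today (the node and datacentre names below are placeholders for my real ones):

```sh
# The rudely departed servers show up here with status "failed":
consul members -wan | grep failed

# ...but their datacentre is still listed by both of these:
consul catalog datacenters
curl -s http://localhost:8500/v1/catalog/datacenters

# Attempting to remove one of them; the command reports success...
remote_wan_node=consul-server-1   # placeholder node name
remote_dc=dead-dc                 # placeholder datacentre name
consul force-leave "${remote_wan_node}.${remote_dc}"

# ...yet the node is still present in the WAN pool:
consul members -wan | grep "${remote_wan_node}"
```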

How do I force-leave a failed member of the WAN pool that I know will never return?
Please do not say "Just turn your back on the mess and ignore alerts for 72 hours or more, it will clean itself."

Jono Sosulska

Jun 12, 2020, 7:26:18 PM6/12/20
to consu...@googlegroups.com
Hi Moses,

Thanks for bringing this up, and I am sorry to hear about your Consul deployment breaking in this way :/ I spent some time reproducing this setup and was able to replicate the behavior you saw. I'd like to add some context that helped me understand it better, and go from there.
Consul maintains multiple lists of members, and those lists are maintained by Serf. Consul uses multiple gossip pools so that Serf's LAN performance can be retained while still using it over a WAN to link multiple datacenters together. The force-leave command issues a "leave" against the LAN Serf pool, but does not affect membership in the WAN Serf pool. In other words, running `consul force-leave [node-name].[datacenter]` today does not remove the node from the WAN pool, even though the documentation implies it should. The force-leave documentation could be clearer about what is and is not possible here.
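
As a rough illustration of the two pools (the server and datacenter names here are just examples):

```sh
# LAN pool: members of the local datacenter only.
consul members

# WAN pool: servers from every federated datacenter,
# identified as <node-name>.<datacenter>.
consul members -wan

# force-leave issues a "leave" for the LAN Serf member; the matching
# WAN entry (e.g. server-1.dc2) is not removed today, even though the
# documented <node-name>.<datacenter> form suggests it would be.
consul force-leave server-1.dc2
```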

One of the interesting things I found while investigating this is that force-leave does not complete when it cannot communicate with the node it is trying to remove. A new command or workflow may need to be created so that removing an unreachable node from the local WAN list is treated as a valid operation. This is a shortfall of the current implementation.

What can be done for your cluster?
In this particular instance, there's not much that can be done. You could set reconnect_timeout_wan in your server configuration and do a rolling restart of your server processes. That would cut the wait down from 72 hours to 8 hours, but a rolling process restart has its own downsides. I would not recommend it, and quite honestly, it's a user experience we need to improve.
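
If you do decide to go that route anyway, here is a minimal sketch of the change, assuming a JSON config directory and systemd-managed agents (adjust both for your setup):

```sh
# On each Consul server, add the WAN reconnect timeout to the agent config.
# 8h is the minimum value Consul accepts for this setting.
cat > /etc/consul.d/reconnect.json <<'EOF'
{
  "reconnect_timeout_wan": "8h"
}
EOF

# Restart the servers one at a time, waiting for each to rejoin and for
# the cluster to report a healthy leader before moving to the next one.
systemctl restart consul
consul operator raft list-peers
```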

Here's what I can do to get things rolling:
1. We will reopen "Accept datacenter as a param for force-leave from WAN pool" so the -wan flag can be added, mark it as a type/umbrella issue, and have the Consul team investigate further. As mentioned in the ticket, there are a number of technical blockers to consider when messing with the WAN pool. Please watch that ticket and give it a :+1: :)
2. I have opened "Update force-leave documentation to reflect LAN vs WAN behavior" to track making the consul force-leave documentation more accurate.

I hope this helps! Thank you so much for your detailed information and please watch the tickets linked here for more info.
Best,
Jono Sosulska

Digital Developer Advocate || Community
jo...@hashicorp.com           || @Jono

