Consul reap time (reconnect_timeout) configurations


RamG

Feb 25, 2019, 3:09:54 PM
to Consul
Is there any configuration to set a different reap interval for Consul server nodes and client nodes? I have a requirement to clean up client nodes at a 5m to 10m interval, but the Consul docs say it must be >= 8h. What are the implications/drawbacks of setting "reconnect_timeout" to 5m or 10m? One more question: is it possible to update these configurations through the API?

Thanks,
Ram

RamG

Feb 28, 2019, 4:18:53 PM
to Consul
Any help or comments on this? Is it CPU intensive, or is there some other reason this value is not recommended to be set very low (5m or 10m)?

Paul Banks

Mar 1, 2019, 7:05:23 AM
to consu...@googlegroups.com
Hey,

The docs for this do describe a bunch of the drawbacks:

This controls how long it takes for a failed node to be completely removed from the cluster. This defaults to 72 hours and it is recommended that this is set to at least double the maximum expected recoverable outage time for a node or network partition. WARNING: Setting this time too low could cause Consul servers to be removed from quorum during an extended node failure or partition, which could complicate recovery of the cluster. The value is a time with a unit suffix, which can be "s", "m", "h" for seconds, minutes, or hours. The value must be >= 8 hours.
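For reference, this option lives in the agent configuration file. A minimal sketch of a JSON config fragment setting it to the documented minimum (the value shown here is illustrative, not a recommendation):

```json
{
  "reconnect_timeout": "8h"
}
```

The agent would pick this up from its config directory on start; the value must carry a unit suffix ("s", "m", or "h") as the docs above describe.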

The basic tradeoff is not CPU so much as cluster stability. If a network partition makes one agent appear failed for 10 minutes while it is actually still up, and it gets reaped and another agent is allowed to join with the same name, then when the partition heals you will have two agents competing with each other; gossip will pick one or the other, which may cause flapping health checks and other instability problems.

For client agents, though, that might be fine. The biggest risk, as noted in the docs above, is that a server node partitioned for that long but not dead would get _removed_ from the quorum instead of just marked as failed. For example, if you have 3 servers and one is removed, you now have a quorum of 2, which is still just about OK; but if either of those fails and is removed, you are down to one server, and since quorum is now 1 it will continue to accept writes which may later be lost, since they exist only on that one node, breaking consistency guarantees.
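The quorum arithmetic above can be sketched with Raft's majority rule (a plain illustration of the reasoning, not Consul code):

```python
def quorum(servers: int) -> int:
    """Majority needed for Raft consensus: floor(n/2) + 1."""
    return servers // 2 + 1

# 3 servers: quorum of 2, so the cluster tolerates one failure.
print(quorum(3))  # 2
# If a partitioned server is reaped, membership shrinks to 2 and
# quorum is still 2 -- no remaining tolerance for failure.
print(quorum(2))  # 2
# Reap another and a single server forms a quorum of 1: it keeps
# accepting writes that exist nowhere else, breaking consistency.
print(quorum(1))  # 1
```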

If you have some way to know that you'll never have temporary outages longer than a few minutes, then it would be safe, but that is almost never the case.

There is also an open issue for a "force reap" command, where as an operator you can explicitly clean up a node that you know isn't coming back; that would probably be a better solution for your case. It is actually possible now by deleting the node via the API, but a handy command for it would be great.
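Deleting a node via the API can be done with the catalog deregister endpoint. A sketch, assuming a local agent on the default port and a hypothetical node name "web-01" in datacenter "dc1":

```shell
# Removes the node (and all its services and checks) from the catalog.
curl --request PUT \
  --data '{"Datacenter": "dc1", "Node": "web-01"}' \
  http://127.0.0.1:8500/v1/catalog/deregister
```

Note that if the agent on that node is still alive and gossiping, it will re-register itself, so this is only appropriate for nodes you know are gone for good.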

Hope this helps.


RamG

Mar 1, 2019, 10:24:33 AM
to Consul
Paul,

Thank you, yes, this will help to tune that value.

Ram