Hey folks,
I've been doing some exploratory work on automatically recovering the etcd cluster backing a Kubernetes control plane when one of its members becomes unhealthy.
The flow I would like to achieve is:
- We know an etcd member has gone unhealthy
- We create a new control plane node and a new etcd member on the new node
- We add this new etcd member to the cluster as a learner
- We can see the learner is ready to be promoted
- We remove the failed member from the voting members
- We promote the learner to a voting member
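
To make the flow above concrete, here is a rough Go sketch of the ordering I have in mind. The `membership` interface, its method names, the `fakeCluster` type and the peer URL are all hypothetical and only there so the example runs on its own; they loosely echo the member add/promote/remove operations but are not the real etcd client API. Note that the fake accepts a learner while a voter is unhealthy, which is the behaviour I would like, not what the strict check does today.

```go
package main

import (
	"errors"
	"fmt"
)

// membership is a hypothetical, simplified stand-in for the etcd membership
// operations; the signatures are reduced for illustration only.
type membership interface {
	AddLearner(peerURL string) (uint64, error)
	LearnerReady(id uint64) bool
	Remove(id uint64) error
	Promote(id uint64) error
}

// replaceMember runs the desired order: add a learner, check that it is ready,
// remove the failed voter, then promote the learner.
func replaceMember(m membership, failedID uint64, newPeerURL string) (uint64, error) {
	learnerID, err := m.AddLearner(newPeerURL)
	if err != nil {
		return 0, fmt.Errorf("add learner: %w", err)
	}
	if !m.LearnerReady(learnerID) {
		return learnerID, errors.New("learner not ready for promotion yet")
	}
	if err := m.Remove(failedID); err != nil {
		return learnerID, fmt.Errorf("remove failed member: %w", err)
	}
	if err := m.Promote(learnerID); err != nil {
		return learnerID, fmt.Errorf("promote learner: %w", err)
	}
	return learnerID, nil
}

// fakeCluster is an in-memory model used only to exercise the flow.
type fakeCluster struct {
	voters   map[uint64]bool // voter ID -> healthy
	learners map[uint64]bool // learner ID -> caught up
	nextID   uint64
}

// newFakeCluster models three voters where member 3 has become unhealthy.
func newFakeCluster() *fakeCluster {
	return &fakeCluster{
		voters:   map[uint64]bool{1: true, 2: true, 3: false},
		learners: map[uint64]bool{},
		nextID:   4,
	}
}

// AddLearner assumes (as the desired behaviour) that an unhealthy voter does
// not block adding a learner. The fake marks the learner as caught up at once.
func (c *fakeCluster) AddLearner(peerURL string) (uint64, error) {
	id := c.nextID
	c.nextID++
	c.learners[id] = true
	return id, nil
}

func (c *fakeCluster) LearnerReady(id uint64) bool { return c.learners[id] }

func (c *fakeCluster) Remove(id uint64) error {
	if _, ok := c.voters[id]; !ok {
		return fmt.Errorf("member %d is not a voter", id)
	}
	delete(c.voters, id)
	return nil
}

func (c *fakeCluster) Promote(id uint64) error {
	if !c.learners[id] {
		return fmt.Errorf("member %d is not a ready learner", id)
	}
	delete(c.learners, id)
	c.voters[id] = true
	return nil
}

func main() {
	c := newFakeCluster()
	id, err := replaceMember(c, 3, "https://10.0.0.4:2380") // hypothetical peer URL
	fmt.Println(id, err, len(c.voters))
}
```

The point of this ordering is that the number of voting members never grows past three, and the failed member is only removed once its replacement is known to be ready.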
Currently, we can't do this, because the strict config check requires us to delete the failed member before it will allow a learner member to be added.
The second check verifies that the current server has recently connected to all voting peers, to avoid breaking the quorum if some member is unhealthy. It was my understanding that learner members don't affect the quorum, so the addition of a learner shouldn't be subject to this check.
I would expect this check to be present in the promotion part, but not in the add.
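
To spell out the arithmetic behind that expectation, here is a small Go sketch. It assumes, per my understanding above, that quorum is a strict majority of voting members and that learners are not counted; the function names `quorum` and `hasQuorum` are just for illustration.

```go
package main

import "fmt"

// quorum returns the number of voting members needed for a strict majority.
func quorum(voters int) int {
	return voters/2 + 1
}

// hasQuorum reports whether the healthy voters can still form a majority.
// Learners are deliberately not a parameter: under the assumption above
// they do not change the voter count.
func hasQuorum(voters, healthyVoters int) bool {
	return healthyVoters >= quorum(voters)
}

func main() {
	// Three voters, one of them failed. Adding a learner leaves this unchanged.
	fmt.Println(quorum(3), hasQuorum(3, 2)) // 2 true

	// Adding a new *voting* member that has not started yet: four voters,
	// still only two healthy, so the majority is lost.
	fmt.Println(quorum(4), hasQuorum(4, 2)) // 3 false
}
```

So the connectivity check looks necessary when the voter count changes (add as voter, or promote), whereas adding a learner leaves the numbers as they were.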
Can anyone help me understand whether this is a bug, or whether there's a genuine reason why adding a learner might affect the quorum?
Thanks,
Joel