Adding learner members when the cluster is unhealthy

Joel Speed

Aug 4, 2022, 11:30:06 AM
to etcd-dev
Hey folks,

I've been doing some exploratory work on automatically recovering an etcd cluster backing a Kubernetes control plane when a member of the etcd cluster goes unhealthy.

The flow I would like to be able to achieve is:
- We know an etcd member has gone unhealthy
- We create a new control plane node and a new etcd member on the new node
- We add this new etcd member to the cluster as a learner
- We can see the learner is ready to be promoted
- We remove the failed member from the voting members
- We promote the learner to a voting member
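
As a rough sketch of the ordering above: the `memberAPI` interface, its method names, and `fakeCluster` are hypothetical names invented for illustration and are not the etcd client API. The sketch only shows the intended sequence of steps and where the flow stops if the learner is not yet ready.

```go
package main

import (
	"errors"
	"fmt"
)

// memberAPI is a hypothetical, simplified stand-in for a cluster client.
// The method names and signatures are assumptions made for this sketch.
type memberAPI interface {
	AddLearner(peerURL string) (uint64, error)
	IsLearnerReady(id uint64) (bool, error)
	RemoveMember(id uint64) error
	PromoteLearner(id uint64) error
}

var errLearnerNotReady = errors.New("learner not ready for promotion")

// replaceFailedMember runs the desired flow in order: add the learner,
// confirm it is ready, remove the failed voter, then promote the learner.
func replaceFailedMember(c memberAPI, failedID uint64, newPeerURL string) error {
	learnerID, err := c.AddLearner(newPeerURL)
	if err != nil {
		return fmt.Errorf("add learner: %w", err)
	}
	ready, err := c.IsLearnerReady(learnerID)
	if err != nil {
		return fmt.Errorf("check learner: %w", err)
	}
	if !ready {
		// Stop before touching the voting membership.
		return errLearnerNotReady
	}
	if err := c.RemoveMember(failedID); err != nil {
		return fmt.Errorf("remove failed member: %w", err)
	}
	if err := c.PromoteLearner(learnerID); err != nil {
		return fmt.Errorf("promote learner: %w", err)
	}
	return nil
}

// fakeCluster records the calls it receives so the ordering can be inspected.
type fakeCluster struct {
	ready bool
	calls []string
}

func (f *fakeCluster) AddLearner(string) (uint64, error) {
	f.calls = append(f.calls, "add-learner")
	return 42, nil
}

func (f *fakeCluster) IsLearnerReady(uint64) (bool, error) {
	f.calls = append(f.calls, "check-ready")
	return f.ready, nil
}

func (f *fakeCluster) RemoveMember(uint64) error {
	f.calls = append(f.calls, "remove")
	return nil
}

func (f *fakeCluster) PromoteLearner(uint64) error {
	f.calls = append(f.calls, "promote")
	return nil
}

func main() {
	f := &fakeCluster{ready: true}
	err := replaceFailedMember(f, 1, "https://new-node:2380")
	fmt.Println(f.calls, err)
}
```

In this sketch the failed member is only removed once the learner reports ready, so the voting membership is never changed while the replacement is still catching up.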

Currently, we can't do this, as we have to delete the failed member before the strict config check will allow us to add a learner member.

Reviewing the code (https://github.com/etcd-io/etcd/blob/ae36a577d7becbdeebf1f0fb665573b721e435f8/server/etcdserver/server.go#L1333-L1361) I can see that there are two checks. The first ignores learner members, but the second does not.

The second check verifies that the current server has connected to all voting peers recently, to avoid breaking the quorum if some member is unhealthy. It was my understanding that learner members wouldn't affect the quorum and so this check shouldn't be affected by the addition of a learner?
I would expect this check to be present in the promotion part, but not in the add.
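
To make the question concrete, here is a simplified model of the two checks as described above. It is not the etcd source: the `member` type, `checkAdd`, and the `skipConnectivityForLearners` switch are hypothetical names, and the notions of "started" and "connected" are reduced to booleans. With the switch off, the model behaves as described in this post; with it on, it shows the relaxed behaviour being asked about.

```go
package main

import (
	"errors"
	"fmt"
)

// member is a simplified view of a cluster member from the perspective of
// the server handling the add request (an assumption for this sketch).
type member struct {
	name      string
	isLearner bool
	started   bool // the member has joined and started
	connected bool // this server has heard from the peer recently
}

var (
	errNotEnoughStarted = errors.New("not enough started members")
	errUnhealthy        = errors.New("unhealthy cluster")
)

// quorum returns the number of voters needed for a majority.
func quorum(voters int) int { return voters/2 + 1 }

// votingMembers returns the members that are not learners.
func votingMembers(ms []member) []member {
	var out []member
	for _, m := range ms {
		if !m.isLearner {
			out = append(out, m)
		}
	}
	return out
}

// checkAdd models the two checks applied before a member is added.
func checkAdd(cluster []member, add member, skipConnectivityForLearners bool) error {
	voters := votingMembers(cluster)

	// First check: protects quorum, and only applies to new voting members.
	if !add.isLearner {
		started := 0
		for _, v := range voters {
			if v.started {
				started++
			}
		}
		if started < quorum(len(voters)+1) {
			return errNotEnoughStarted
		}
	}

	// Second check: every voting peer must have been reachable recently.
	// As described in the post, this also applies when adding a learner.
	if add.isLearner && skipConnectivityForLearners {
		return nil
	}
	for _, v := range voters {
		if !v.connected {
			return errUnhealthy
		}
	}
	return nil
}

func main() {
	// Three voters, one of which has failed and is unreachable.
	cluster := []member{
		{name: "a", started: true, connected: true},
		{name: "b", started: true, connected: true},
		{name: "c", started: true, connected: false},
	}
	learner := member{name: "d", isLearner: true}

	fmt.Println(checkAdd(cluster, learner, false)) // unhealthy cluster
	fmt.Println(checkAdd(cluster, learner, true))  // <nil>
}
```

In this model the learner never enters the quorum calculation, so the only thing blocking the add is the connectivity check on the existing voters.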

Can anyone help me understand whether this is a bug, or whether there's a genuine reason why adding a learner might affect the quorum?

Thanks,
Joel