Recover a failed member

cindyxi...@gmail.com

Jan 3, 2019, 7:05:08 PM
to etcd-dev
If a member of an etcd cluster fails, say from disk corruption, what is the process to recover that node? Assume the same machine/IP will be reused and only the disk is fixed.
It seems snapshot and restore is meant for recovering a whole cluster.

When the node is added back to the cluster, my understanding is that the leader will bring the new member up to date by sending it a snapshot.
Is that correct?

Thanks
Cindy

Joe Betz

Jan 3, 2019, 7:09:58 PM
to cindyxi...@gmail.com, etcd-dev
Hi Cindy,

Have a look at https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/recovery.md if you haven't already, and then ping us with any questions that were not answered to your satisfaction there.

-Joe

cindyxi...@gmail.com

Jan 3, 2019, 7:50:49 PM
to etcd-dev
Thanks Joe.

If I understand the document correctly, it covers recovering the whole cluster.

I'd like to understand the exact process when a failed member is re-added to the cluster.
When a new member with the same peer IP as before and an empty data-dir is added to an existing cluster, what is the flow?

Does the leader send the whole db file (or the latest snapshot) plus the WAL, after which the new member loads the db and re-applies the entries?

Chance Zibolski

Jan 3, 2019, 8:19:20 PM
to cindyxi...@gmail.com, etcd-dev

cindyxi...@gmail.com

Jan 4, 2019, 2:56:00 PM
to etcd-dev
Yes, this is very helpful! Thanks Chance and Joe.
Can someone help explain the internal implementation? I mean adding a member with an empty data-dir: is it the same as restoring from an existing snapshot?

Sam Batschelet

Jan 5, 2019, 1:01:56 PM
to etcd-dev
> I meant adding a member with empty data-dir, is it the same as restore from an existing snapshot?

No. If you add a member through the cluster membership API (etcdctl member add) and start etcd with a blank data-dir, the new member must sync the contents of the db across the network. With a large db this can be disruptive to the cluster. If you instead restore from a snapshot, every member is started with the same db, so once the cluster forms quorum no additional work is required. I think most of the time removing the failed member and adding a new one is fine, and it lets the cluster keep functioning without downtime.

- Sam
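
For readers who want the remove-and-re-add path spelled out, here is a minimal sketch in Python that only builds the command lines involved; it does not contact a cluster. The helper name replace_member_commands, the member ID, the member names, the IP addresses, and the data directory are all hypothetical placeholders. The sketch assumes etcdctl is using the v3 API, and it leaves out client URL and TLS flags that a real deployment would likely need.

```python
import shlex
from typing import Dict, List


def replace_member_commands(
    endpoints: List[str],
    failed_member_id: str,
    name: str,
    peer_url: str,
    cluster_peers: Dict[str, str],
    data_dir: str,
) -> List[List[str]]:
    """Build the commands that swap a failed member for a fresh one.

    cluster_peers maps every member name (including the replacement)
    to its peer URL. All values are placeholders chosen by the caller.
    """
    etcdctl = ["etcdctl", "--endpoints=" + ",".join(endpoints)]
    initial_cluster = ",".join(
        f"{n}={u}" for n, u in sorted(cluster_peers.items())
    )
    return [
        # 1. Remove the failed member, addressed by its member ID.
        etcdctl + ["member", "remove", failed_member_id],
        # 2. Register the replacement; reusing the old peer URL is fine.
        etcdctl + ["member", "add", name, "--peer-urls=" + peer_url],
        # 3. Start etcd on the repaired machine with an EMPTY data dir.
        #    "existing" tells it to join the running cluster, after
        #    which it syncs the db from its peers over the network.
        [
            "etcd",
            "--name", name,
            "--data-dir", data_dir,
            "--listen-peer-urls", peer_url,
            "--initial-advertise-peer-urls", peer_url,
            "--initial-cluster", initial_cluster,
            "--initial-cluster-state", "existing",
        ],
    ]


if __name__ == "__main__":
    # Hypothetical three-member cluster in which m3 lost its disk.
    commands = replace_member_commands(
        endpoints=["http://10.0.0.1:2379", "http://10.0.0.2:2379"],
        failed_member_id="8211f1d0f64f3269",
        name="m3",
        peer_url="http://10.0.0.3:2380",
        cluster_peers={
            "m1": "http://10.0.0.1:2380",
            "m2": "http://10.0.0.2:2380",
            "m3": "http://10.0.0.3:2380",
        },
        data_dir="/var/lib/etcd",
    )
    for cmd in commands:
        print(shlex.join(cmd))
```

The first two commands run against the surviving members, so a majority of the cluster must still be healthy for them to succeed. The third runs on the repaired machine, and the data directory it points at should be empty before etcd starts.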