Using bootkube recover to restore from etcd snapshot

David Taylor

Jan 16, 2018, 11:06:19 AM
to CoreOS User
Hello,

I am deploying Tectonic on bare-metal, and we are using our second cluster to run through disaster recovery scenarios. I have opened an issue on the bootkube GitHub repo (https://github.com/kubernetes-incubator/bootkube/issues/834), and I am hoping someone here can provide some suggestions. 

Long story short (full details are in the GitHub issue): I periodically take snapshots from a remote machine using etcdctl snapshot save and store them in S3. I simulated a total failure by destroying every master node and rebuilding with terraform. Right now nothing is running on the controllers besides etcd; I have destroyed everything else (pods, containers, etc.). After following the instructions in https://coreos.com/tectonic/docs/latest/troubleshooting/bootkube_recovery_tool.html, I run the bootkube recover command, but it fails with an error about keys that do not exist (the full output is in the GitHub issue linked above).
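For reference, the backup side looks roughly like this (endpoints, certificate paths, and the bucket name are placeholders for our real values):

    # taken from a remote machine that has the etcd client certificates
    ETCDCTL_API=3 etcdctl snapshot save /tmp/etcd-snapshot.db \
      --endpoints=https://<controller-ip>:2379 \
      --cacert=/path/to/etcd-client-ca.crt \
      --cert=/path/to/etcd-client.crt \
      --key=/path/to/etcd-client.key

    # upload the snapshot to S3
    aws s3 cp /tmp/etcd-snapshot.db s3://<backup-bucket>/etcd/etcd-snapshot-$(date +%F).db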


Perhaps I am going about this totally wrong, but I would like to be able to completely blow away the cluster, rebuild with terraform, then restore from an etcd backup.

I find it strange that those keys don't even exist in our operational cluster. Am I taking backups incorrectly, or is there some other command I need to run to get a functional backup? I am at a total loss here. If I can provide any other helpful details, let me know; any help would be greatly appreciated.

David Taylor

Jan 16, 2018, 3:30:01 PM
to CoreOS User
When deploying with the Tectonic installer and terraform, etcd runs via rkt directly on the controller nodes, not inside Kubernetes. In bootkube terminology, "self-hosted" etcd means etcd running inside Kubernetes rather than on the node itself, so this etcd is not self-hosted. If I instead treat the etcd running on the controllers as an external cluster and pass the corresponding flag to bootkube recover, the error goes away.
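For anyone who hits the same thing, the sequence I am working with looks roughly like this. Paths, endpoints, and the bucket name are placeholders, and the exact bootkube recover flags should be double-checked against bootkube recover --help for your version:

    # copy the latest snapshot down from S3 onto a controller
    aws s3 cp s3://<backup-bucket>/etcd/etcd-snapshot.db /tmp/etcd-snapshot.db

    # restore it into a fresh data directory (the directory must not already exist)
    ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-snapshot.db \
      --data-dir=/var/lib/etcd

    # start etcd on the controller against the restored data directory (rkt/systemd, as before),
    # then run bootkube recover treating it as an external etcd cluster:
    bootkube recover \
      --recovery-dir=/tmp/recovered \
      --etcd-servers=https://127.0.0.1:2379 \
      --kubeconfig=/etc/kubernetes/kubeconfig

From there the recovered assets get handed back to bootkube start as described in the Tectonic recovery doc linked in my first message.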