The etcd operator is an awesome addition for the stability of any service that relies on it, including k8s itself. If etcd is safe, the cluster is safe.
Since etcd can run outside Kubernetes, keeping etcd safe is really a problem independent of Kubernetes. The project I'm working on (Rook) depends on etcd and has a requirement to run in both a Kubernetes environment and a standalone environment. We have started implementing what amounts to a very basic etcd operator that will manage the health of the etcd cluster, but we want to replace it with your much more complete operator. We would benefit from the etcd operator both now and going forward.
What would it take to factor out the management of etcd from the dependency on Kubernetes? Looking at the code, it seems we could define an interface, or interfaces, that describe how the operator interacts with a generalized cluster. These would expose methods such as "enumerate etcd members", "start instance", "stop instance", and the other operations that Kubernetes takes care of today. The etcd operator would become a library to be used by different types of clusters. In every environment where etcd runs, the cluster would benefit from a common implementation of monitoring etcd health, growing/shrinking the membership, backup/restore, and more.
This means that all references to Kubernetes would be factored out to a new package. For the k8s scenario, the etcd operator would be initialized with the Kubernetes cluster implementation. In the Rook scenario, the etcd operator would be initialized with the Rook cluster implementation.
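To make the idea concrete, here is a rough sketch in Go of what such an abstraction might look like. Every name in it (`Cluster`, `Operator`, `NewOperator`, `Resize`, `memoryCluster`) is hypothetical and made up for illustration; none of it is taken from the etcd-operator code. The in-memory backend only stands in for a Kubernetes or Rook implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// Cluster is a hypothetical interface covering the operations a platform
// (Kubernetes, Rook, ...) would have to provide to the operator.
type Cluster interface {
	// Members enumerates the names of the current etcd members.
	Members() ([]string, error)
	// StartInstance launches a new etcd instance and returns its name.
	StartInstance() (string, error)
	// StopInstance shuts down the named etcd instance.
	StopInstance(name string) error
}

// Operator holds only platform-neutral logic and depends on Cluster alone.
type Operator struct {
	cluster Cluster
}

// NewOperator injects the platform-specific Cluster implementation.
func NewOperator(c Cluster) *Operator {
	return &Operator{cluster: c}
}

// Resize grows or shrinks the membership to the desired size. A real
// operator would also wait for each member to become healthy before
// changing the next one; that is omitted here.
func (o *Operator) Resize(desired int) error {
	if desired < 1 {
		return errors.New("desired size must be at least 1")
	}
	members, err := o.cluster.Members()
	if err != nil {
		return err
	}
	// Grow: start instances until the desired size is reached.
	for n := len(members); n < desired; n++ {
		if _, err := o.cluster.StartInstance(); err != nil {
			return err
		}
	}
	// Shrink: stop the most recently listed members first.
	for n := len(members); n > desired; n-- {
		if err := o.cluster.StopInstance(members[n-1]); err != nil {
			return err
		}
	}
	return nil
}

// memoryCluster is a toy backend used in place of a real platform.
type memoryCluster struct {
	members []string
	next    int
}

func (m *memoryCluster) Members() ([]string, error) {
	// Return a copy so callers cannot mutate internal state.
	return append([]string(nil), m.members...), nil
}

func (m *memoryCluster) StartInstance() (string, error) {
	name := fmt.Sprintf("etcd-%d", m.next)
	m.next++
	m.members = append(m.members, name)
	return name, nil
}

func (m *memoryCluster) StopInstance(name string) error {
	for i, n := range m.members {
		if n == name {
			m.members = append(m.members[:i], m.members[i+1:]...)
			return nil
		}
	}
	return fmt.Errorf("no such member %q", name)
}

func main() {
	backend := &memoryCluster{}
	op := NewOperator(backend)
	if err := op.Resize(3); err != nil {
		panic(err)
	}
	if err := op.Resize(1); err != nil {
		panic(err)
	}
	members, _ := backend.Members()
	fmt.Println(members)
}
```

In this sketch the Kubernetes-specific code would live entirely behind its own `Cluster` implementation, and Rook would supply a second one, so `Operator` itself never needs to import a Kubernetes package.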
Is there any reason the operator couldn't run outside Kubernetes given this abstraction?
Another level of abstraction to consider is the operator pattern itself. In our clusters, we effectively have a Ceph operator that manages the distributed storage subsystems. Currently the etcd and Prometheus operators don't appear to share any common operator library. Is there a planned operator library, or is the k8s management all they are expected to have in common? Perhaps the right abstraction would become obvious from the etcd refactoring suggested above, but it may turn out to be a separate concern. Thoughts on this?
Thanks!
Travis Nielsen