Run etcd Production Cluster in Docker on Mesos and Marathon

Tom O'Connor

Aug 24, 2016, 7:14:25 PM
to CoreOS User
Hi,

I'm pretty new to etcd, but not to Docker/Mesos/Marathon.  I'm trying to set up an etcd cluster that runs in Docker containers, deployed via Marathon.  I followed the steps listed here to create a 3 node cluster in Docker, and I initially had some success.
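
For reference, each member is launched roughly like this (the IPs, image tag and cluster token are placeholders for our real values):

    # one of the three members, on a host with IP 10.0.1.11; host networking for simplicity
    # (some image tags want the binary path, /usr/local/bin/etcd, before the flags)
    docker run -d --net=host --name etcd0 quay.io/coreos/etcd \
      --name etcd0 \
      --advertise-client-urls http://10.0.1.11:2379 \
      --listen-client-urls http://0.0.0.0:2379 \
      --initial-advertise-peer-urls http://10.0.1.11:2380 \
      --listen-peer-urls http://0.0.0.0:2380 \
      --initial-cluster etcd0=http://10.0.1.11:2380,etcd1=http://10.0.1.12:2380,etcd2=http://10.0.1.13:2380 \
      --initial-cluster-token etcd-cluster-1 \
      --initial-cluster-state new

However, I have some questions.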

  1. Does anyone run etcd like this, or is installing it on the host OS the only configuration that's really supported for production?
  2. I'm surprised the instructions don't mount the data-dir as a volume on the host. To try that, I poked around a running quay.io/coreos/etcd container with
    docker exec -it <cid> /bin/ls
    but got "exec: "/bin/ls": stat /bin/ls: no such file or directory". That's a pretty stripped-down container. Does anyone know what base image it uses and why it doesn't have a shell or even ls? Because of that, I'm not sure where inside the container to mount the volume.
  3. Restarting a container causes it to fail to rejoin the cluster. I see in the Clustering Guide that subsequent runs should ignore the --initial-cluster flags. I suspect this isn't happening because a container without a mapped volume is ephemeral. I was able to get it to rejoin by setting --initial-cluster-state existing. Am I missing something else here?
Thanks,
Tom

Rob Szumski

Aug 25, 2016, 1:22:14 PM
to Tom O'Connor, CoreOS User
  1. Does anyone run etcd like this, or is installing it on the host OS the only configuration that's really supported for production?
Running in containers is definitely a supported config. 
  2. I'm surprised the instructions don't mount the data-dir as a volume on the host. To try that, I poked around a running quay.io/coreos/etcd container with
    docker exec -it <cid> /bin/ls
    but got "exec: "/bin/ls": stat /bin/ls: no such file or directory". That's a pretty stripped-down container. Does anyone know what base image it uses and why it doesn't have a shell or even ls? Because of that, I'm not sure where inside the container to mount the volume.
It depends how you want to handle failure. If you can make sure the IPs don’t change when the container is restarted, and the host storage is durable, that will be fine. Otherwise, if that job fails, you can use etcdctl to remove that instance, and then add a new one.
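
For example, replacing a dead member looks roughly like this (the member ID, name and peer URL are placeholders):

    # find the ID of the dead member
    etcdctl member list
    # drop it from the cluster
    etcdctl member remove 272e204152
    # register the replacement before starting it
    etcdctl member add etcd3 http://10.0.1.14:2380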

The default storage location is /var/lib/etcd2, that is where you should mount your host volume.
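
e.g. if you map -v /var/lib/etcd0-data:/var/lib/etcd2 on the docker run, you should be able to sanity-check it from the host (paths are just an example):

    # WAL and snapshots should show up under the host path once the member has started
    ls /var/lib/etcd0-data/member/wal /var/lib/etcd0-data/member/snap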

This container only contains the static binary, so it doesn’t have a shell. There is a fatter container that is auto-built via Git here: quay.io/coreos/etcd-git:master
  3. Restarting a container causes it to fail to rejoin the cluster. I see in the Clustering Guide that subsequent runs should ignore the --initial-cluster flags. I suspect this isn't happening because a container without a mapped volume is ephemeral. I was able to get it to rejoin by setting --initial-cluster-state existing. Am I missing something else here?
Correct. Without having looked at the logs, this is most certainly due to the data directory not surviving the restart.
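
With a persistent data dir the --initial-cluster* flags are ignored on later starts, so a plain restart should just rejoin. Without one, you would remove/re-add the member with etcdctl as above and then start the replacement with --initial-cluster-state existing, roughly like this (names, IPs and host path are placeholders):

    # replacement member joining the already-running cluster
    docker run -d --net=host --name etcd2 \
      -v /var/lib/etcd2-data:/var/lib/etcd2 \
      quay.io/coreos/etcd \
      --name etcd2 --data-dir /var/lib/etcd2 \
      --advertise-client-urls http://10.0.1.13:2379 \
      --listen-client-urls http://0.0.0.0:2379 \
      --initial-advertise-peer-urls http://10.0.1.13:2380 \
      --listen-peer-urls http://0.0.0.0:2380 \
      --initial-cluster etcd0=http://10.0.1.11:2380,etcd1=http://10.0.1.12:2380,etcd2=http://10.0.1.13:2380 \
      --initial-cluster-state existing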


Tom O'Connor

Aug 25, 2016, 4:02:57 PM
to CoreOS User, ichas...@gmail.com
I got this working beautifully.  Thanks for the advice.

In our case we have constrained the Mesos hosts we wanted to deploy etcd to, so we have durable storage and known IPs to bootstrap.

It turned out that the container doesn't actually have its data dir set to /var/lib/etcd2, so I simply added an env var, injected into the container, that sets the data dir to /var/lib/etcd2, and then the data showed up on the host persistently.  Containers can be restarted and rejoin the cluster as expected.  Hopefully this will help someone else who tries to do the same thing.
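
Concretely, the Marathon app boils down to the docker-run equivalent of something like this (host path, names and IPs are ours; substitute your own):

    # the env var tells etcd where to write; the volume maps that path onto the host
    docker run -d --net=host --name etcd0 \
      -e ETCD_DATA_DIR=/var/lib/etcd2 \
      -v /var/lib/etcd-data:/var/lib/etcd2 \
      quay.io/coreos/etcd \
      --name etcd0 \
      --advertise-client-urls http://10.0.1.11:2379 \
      --listen-client-urls http://0.0.0.0:2379 \
      --initial-advertise-peer-urls http://10.0.1.11:2380 \
      --listen-peer-urls http://0.0.0.0:2380 \
      --initial-cluster etcd0=http://10.0.1.11:2380,etcd1=http://10.0.1.12:2380,etcd2=http://10.0.1.13:2380 \
      --initial-cluster-state new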