Etcd cannot start

Charles Lescot

Aug 10, 2015, 5:13:31 AM
to Deis user list
Hi,
On 3 bare-metal servers, I've activated a CoreOS install (723.3.0) offered by my hosting provider.


I've tried to install Deis (v1.9) on these 3 servers.
On login, the Deis welcome message appears, but etcd does not seem to start.
I've tried registering a new discovery token, but the problem remains...


'journalctl -b -u etcd' :
Aug 10 08:57:04 sd-71577 systemd[1]: Started etcd2 container.
Aug 10 08:57:04 sd-71577 docker[7066]: 2015/08/10 08:57:04 etcdmain: setting maximum number of CPUs to 1, total number of available CPUs is 8
Aug 10 08:57:04 sd-71577 docker[7066]: 2015/08/10 08:57:04 etcdmain: listening for peers on http://0.0.0.0:2380
Aug 10 08:57:04 sd-71577 docker[7066]: 2015/08/10 08:57:04 etcdmain: listening for peers on http://0.0.0.0:7001
Aug 10 08:57:04 sd-71577 docker[7066]: 2015/08/10 08:57:04 etcdmain: listening for client requests on http://0.0.0.0:2379
Aug 10 08:57:04 sd-71577 docker[7066]: 2015/08/10 08:57:04 etcdmain: listening for client requests on http://0.0.0.0:4001
Aug 10 08:57:05 sd-71577 docker[7066]: 2015/08/10 08:57:05 etcdmain: stopping listening for client requests on http://0.0.0.0:4001
Aug 10 08:57:05 sd-71577 docker[7066]: 2015/08/10 08:57:05 etcdmain: stopping listening for client requests on http://0.0.0.0:2379
Aug 10 08:57:05 sd-71577 docker[7066]: 2015/08/10 08:57:05 etcdmain: stopping listening for peers on http://0.0.0.0:7001
Aug 10 08:57:05 sd-71577 docker[7066]: 2015/08/10 08:57:05 etcdmain: stopping listening for peers on http://0.0.0.0:2380
Aug 10 08:57:05 sd-71577 docker[7066]: 2015/08/10 08:57:05 etcdmain: member "1f06ce04d9a646ff80625ff426d15b68" has previously registered with discovery service token (https://discovery.etcd.i
Aug 10 08:57:05 sd-71577 docker[7066]: 2015/08/10 08:57:05 etcdmain: But etcd could not find vaild cluster configuration in the given data dir (/var/lib/etcd2).
Aug 10 08:57:05 sd-71577 docker[7066]: 2015/08/10 08:57:05 etcdmain: Please check the given data dir path if the previous bootstrap succeeded
Aug 10 08:57:05 sd-71577 docker[7066]: 2015/08/10 08:57:05 etcdmain: or use a new discovery token if the previous bootstrap failed.
Aug 10 08:57:05 sd-71577 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
Aug 10 08:57:05 sd-71577 systemd[1]: etcd.service: Unit entered failed state.
Aug 10 08:57:05 sd-71577 systemd[1]: etcd.service: Failed with result 'exit-code'.

'systemctl status -l etcd':
etcd.service - etcd2 container
   Loaded: loaded (/etc/systemd/system/etcd.service; static; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Mon 2015-08-10 09:02:35 UTC; 819ms ago
  Process: 8710 ExecStop=/usr/bin/docker stop $ETCD_NAME (code=exited, status=1/FAILURE)
  Process: 8695 ExecStart=/usr/bin/docker run --net=host --rm --volume=${ETCD_DATA_DIR}:/var/lib/etcd2 --volume=/usr/share/ca-certificates:/etc/ssl/certs:ro -p 4001:4001 -p 2380:2380 -p 2379:2379 -p 7001:7001 --name ${ETCD_NAME} ${ETCD_IMAGE} -name ${ETCD_NAME} -data-dir /var/lib/etcd2 -advertise-client-urls http://${COREOS_PRIVATE_IPV4}:2379,http://${COREOS_PRIVATE_IPV4}:4001 -listen-client-urls http://0.0.0.0:2379,http://0.0.0.0:4001 -initial-advertise-peer-urls http://${COREOS_PRIVATE_IPV4}:2380,http://${COREOS_PRIVATE_IPV4}:7001 -listen-peer-urls http://0.0.0.0:2380,http://0.0.0.0:7001 --heartbeat-interval ${ETCD_HEARTBEAT_INTERVAL} --election-timeout ${ETCD_ELECTION_TIMEOUT} --discovery https://discovery.etcd.io/9b6b1f3137e5626fa2b197ad283030ae (code=exited, status=1/FAILURE)
  Process: 8687 ExecStartPre=/bin/sh -c docker inspect $ETCD_NAME >/dev/null 2>&1 && docker rm -f $ETCD_NAME || true (code=exited, status=0/SUCCESS)
  Process: 8681 ExecStartPre=/bin/sh -c docker history $ETCD_IMAGE >/dev/null 2>&1 || docker pull $ETCD_IMAGE (code=exited, status=0/SUCCESS)
 Main PID: 8695 (code=exited, status=1/FAILURE)

Aug 10 09:02:35 sd-71577 systemd[1]: etcd.service: Unit entered failed state.
Aug 10 09:02:35 sd-71577 systemd[1]: etcd.service: Failed with result 'exit-code'.


On the discovery URL, only 2 servers appear.
'fleetctl list-machines':
Error retrieving list of active machines: googleapi: Error 503: fleet server unable to communicate with etcd
'systemctl status -l etcd2':
etcd2.service
   Loaded: masked (/dev/null)
   Active: inactive (dead)

Do you have any tips to solve my problem?


best regards,

Charles.

Lorieri

Aug 10, 2015, 2:24:28 PM
to Charles Lescot, Deis user list
Hi Charles,

I could never recover from an etcd crash :(
That is my main problem with Deis.


[]s
-lorieri

--
You received this message because you are subscribed to the Google Groups "Deis user list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to deis-users+...@googlegroups.com.
To post to this group, send email to deis-...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/deis-users/17122925-a93c-450a-a5ea-c0eb740a8407%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Chris Armstrong

Aug 10, 2015, 2:46:41 PM
to Lorieri, Charles Lescot, Deis user list
Hi Charles,

Looks like your error is here:

Aug 10 08:57:05 sd-71577 docker[7066]: 2015/08/10 08:57:05 etcdmain: member "1f06ce04d9a646ff80625ff426d15b68" has previously registered with discovery service token (https://discovery.etcd.i
Aug 10 08:57:05 sd-71577 docker[7066]: 2015/08/10 08:57:05 etcdmain: But etcd could not find vaild cluster configuration in the given data dir (/var/lib/etcd2).
Aug 10 08:57:05 sd-71577 docker[7066]: 2015/08/10 08:57:05 etcdmain: Please check the given data dir path if the previous bootstrap succeeded
Aug 10 08:57:05 sd-71577 docker[7066]: 2015/08/10 08:57:05 etcdmain: or use a new discovery token if the previous bootstrap failed.


You said you tried a new discovery token - are you sure you did? How did you apply the new token? Once the clusters are booted that's typically a bit more difficult to change - you'll have to change the discovery URL parameter to etcd and remove the etcd data directory.

If possible, I'd suggest playing with things on a public cloud so you can remove hosts and redeploy the cluster without penalty. Then, once you're comfortable with things, give it a go on bare metal.

Chris



--
Chris Armstrong | Deis Team Lead | Engine Yard t: @carmstrong_afk | gh: carmstrong

Chris Armstrong

Aug 10, 2015, 3:08:31 PM
to Lorieri, Charles Lescot, Deis user list
Lorieri,

That should be fixed with 1.9.0 now that we're using etcd2! etcd 0.x had a ton of problems and would be dead in the water if the cluster lost its leader.

Chris

Lorieri

Aug 10, 2015, 3:09:48 PM
to Chris Armstrong, Charles Lescot, Deis user list
Cool :)
I did not give up anyway :p

Charles Lescot

Aug 10, 2015, 5:11:04 PM
to Lorieri, deis-...@googlegroups.com, Chris Armstrong

Hi,
I've tested another token by replacing the previous one in the user_data file and rebooting each server.
But I have not removed any directory.
Which one needs to be cleaned?
Are there any other steps to follow?
Best regards,
Charles

Chris Armstrong

Aug 11, 2015, 1:25:16 PM
to Charles Lescot, Lorieri, Deis user list
Have you confirmed with systemctl cat etcd that the discovery URL is actually updated for the service? The user-data would have to be reprocessed. 

You're effectively creating a new etcd cluster, so you'll also want to purge the etcd data directory (which I believe is /var/lib/etcd2 on bare metal). 
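
A minimal sketch of the two steps described above, assuming the unit is named etcd and the data directory is /var/lib/etcd2 as in the output earlier in this thread. The function names, the DRY_RUN switch and the use of find are illustrative choices, not part of Deis or etcd. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch only. Unit name and data dir come from this thread; adjust for your hosts.
ETCD_UNIT="${ETCD_UNIT:-etcd}"
ETCD_DATA_DIR="${ETCD_DATA_DIR:-/var/lib/etcd2}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = really run them

# Read unit text on stdin and print the URL that follows "--discovery".
discovery_url() {
  sed -n 's/.*--discovery[ =]\([^ ]*\).*/\1/p' | head -n 1
}

# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Stop etcd, empty the data directory, start etcd again (hypothetical helper).
reset_member() {
  run systemctl stop "$ETCD_UNIT"
  run find "$ETCD_DATA_DIR" -mindepth 1 -delete
  run systemctl start "$ETCD_UNIT"
}
```

On a node, `systemctl cat etcd | discovery_url` should print the new token URL. Once it does on every node, `DRY_RUN=0 reset_member` (as root) would perform the purge. Changing the discovery URL itself still has to happen in the user-data and be reprocessed; this sketch does not do that.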

Charles Lescot

Aug 17, 2015, 4:01:49 AM
to Deis user list
Hi Chris,
I've confirmed with 'systemctl cat etcd' on each server that the discovery URL has been updated.
On this URL, only 2 servers were listed, but I cleaned the /var/lib/etcd2 directory and rebooted the server that was not listed, and now the third server appears.

But the main problem remains:
it seems that etcd is not running on 2 servers.
When I try 'etcdctl --debug member list' on the first and second servers, here is the output:
dial tcp 127.0.0.1:2379: connection refused

On the third server, the output is different:

no endpoints available

When I do:
curl -L http://127.0.0.1:4001/version

the third server answers:
{"etcdserver":"2.1.1","etcdcluster":"not_decided"}


the first and the second ones:
Failed to connect to 127.0.0.1 port 4001: Connection refused
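
The /version replies above fall into a few distinct cases. The sketch below is one way to sort them from a shell, using the same port 4001 endpoint as in this message. The function names are made up for illustration, and reading "not_decided" as "etcd is up but the cluster has not formed" is an interpretation, not something confirmed in this thread.

```shell
#!/bin/sh
# Classify the body returned by GET /version (empty body = nothing answered).
classify_version_reply() {
  case "$1" in
    "") echo "unreachable" ;;
    *'"etcdcluster":"not_decided"'*) echo "running, cluster not formed" ;;
    *'"etcdcluster":'*) echo "running, cluster formed" ;;
    *) echo "unexpected reply" ;;
  esac
}

# Query one host (IP or name given as $1) and print its status on one line.
check_member() {
  body="$(curl -s -m 2 "http://$1:4001/version" || true)"
  echo "$1: $(classify_version_reply "$body")"
}
```

Calling `check_member` once per node IP from any one of the servers would give a one-line status for each node.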


Maybe I need to specify the addresses of the first and second servers on the third one, with the etcdctl --peers option?

best regards,

Charles.

Charles Lescot

Aug 17, 2015, 6:35:28 AM
to Deis user list
Hi,
I've regenerated a discovery token and rechecked with 'systemctl cat etcd|grep discovery' on each server; the new token is in place.
The same problem remains.

Only one server is listening on port 4001; 3 servers are listed on the discovery URL.
When I execute
'curl -L http://127.0.0.1:4001/v2/stats/leader' on the server listening on port 4001:
{"message":"not current leader"}

On the other servers:
curl: (7) Failed to connect to 127.0.0.1 port 4001: Connection refused

From the servers that are not listening on port 4001, I can curl the third server on port 4001 using its private or public IP, so communication between them is good.


best regards,

Charles.

Kingdon Barrett

Aug 17, 2015, 6:50:57 AM
to Charles Lescot, Deis user list

I think this is related to your version of CoreOS. I'll bet you are on the stable channel, where docker 1.7.1 has not landed yet...



Charles Lescot

Aug 17, 2015, 7:07:30 AM
to Kingdon Barrett, Deis user list
Hi,
The Docker version on the servers is 1.6.2 (docker -v), and the CoreOS version (cat /etc/os-release) is 723.3.0.

The Deis documentation (http://docs.deis.io/en/latest/installing_deis/baremetal/) says:
"Deis runs on CoreOS version 494.5.0 or later in the Stable channel."
So CoreOS should not be the problem?

Best regards,

Charles.

--
Charles Lescot.
société MEROVIA

Charles Lescot

Aug 17, 2015, 10:16:39 AM
to Deis user list
According to Joshua Anderson and Kingdon Barrett, there are some incompatibilities between the latest stable CoreOS release, which ships Docker 1.6.2, and Deis 1.9 (Docker 1.7.1 is the minimum).
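
To check a node against the Docker 1.7.1 minimum mentioned here, a small comparison helper can be sketched as follows. The function names are hypothetical, and the parsing assumes that `docker -v` prints a line starting with "Docker version X.Y.Z".

```shell
#!/bin/sh
# Print the X.Y.Z part of a "Docker version X.Y.Z, build ..." line.
parse_docker_version() {
  printf '%s\n' "$1" | sed -n 's/^Docker version \([0-9][0-9.]*\).*/\1/p'
}

# Succeed if dotted version $1 is greater than or equal to $2.
# Compares the first three numeric fields only.
version_ge() {
  lowest="$(printf '%s\n%s\n' "$1" "$2" | sort -t. -k1,1n -k2,2n -k3,3n | head -n 1)"
  [ "$lowest" = "$2" ]
}

# Report whether this node meets the minimum given as $1 (default 1.7.1).
check_docker() {
  have="$(parse_docker_version "$(docker -v 2>/dev/null)")"
  if version_ge "${have:-0}" "${1:-1.7.1}"; then
    echo "docker ${have} is new enough"
  else
    echo "docker ${have:-not found} is older than ${1:-1.7.1}"
  fi
}
```

Running `check_docker` on each server only reports the installed version against the stated minimum; it does not prove that the Docker version is the cause of the etcd failure.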

So I've created an issue to update the documentation (https://github.com/deis/deis/issues/4282).


I hope it will save time for others who try to install Deis 1.9 on bare metal...



best regards,

Charles.


Kingdon Barrett

Aug 17, 2015, 10:21:28 AM
to Charles Lescot, Deis user list
This was actually true until last week.

For anyone following this thread, the version currently required is 647.2.

Some cloud providers likely will not allow you to start an image that is on an off-channel release.

I planned to try this later today, using coreos-install on DigitalOcean to bring some new nodes back down to an acceptable release. That should work, but I have heard that coreos-install is effectively disabled on some cloud providers, or at least that on some of them you can't change the kernel from the one provided by the initial image.

This would make sense on a xen-backed host, I think, since the kernel is not actually loaded from the disk image.
--
Kingdon Barrett <kin...@tuesdaystudios.com>

Charles Lescot

Aug 17, 2015, 10:30:16 AM
to Deis user list
Hi,
Thanks for your response.
In my case, the servers are bare metal and installation is automated by the hosting provider, which sticks to the CoreOS stable channel, for stability and security reasons I suppose...

I hope that CoreOS will quickly ship a stable release with Docker 1.7.1,
or that Deis will provide a fix for this problem...


Best regards,


Charles.



Charles Lescot

Aug 17, 2015, 11:46:36 AM
to Deis user list
To go back to this etcd problem: the Docker version does not appear to be proven as the cause...

Does anyone have any other tips to solve this problem?

best regards,


Charles.


Kingdon Barrett

Aug 17, 2015, 1:56:40 PM
to Charles Lescot, Deis user list

I think it does,

Deis 1.9 has introduced etcd2 which was not on CoreOS stable channel yet...

To control the version of etcd deployed, I think they have masked the system etcd and added a new etcd2 hosted inside of containers.

This was planned to go away later, when etcd2 had landed in CoreOS, and I don't know if that has happened yet.  But presumably deis will want to go on supporting old releases of CoreOS that were supported before 1.9, I think, regardless of whether they are still current in any channel.  This may persist for some time.

Anyone from the core team can probably answer more precisely about the ongoing designs of the release coordination, I just attended the open roadmap meeting earlier this month and think I know what I'm talking about.  I am not a deis core developer.

-Kingdon



Charles Lescot

Aug 17, 2015, 2:06:07 PM
to Deis user list, lescot....@gmail.com
Hi,
Your explanation makes this clearer to me (etcd2 masked).

'systemctl status -l etcd2' returns:


etcd2.service
   Loaded: masked (/dev/null)
   Active: inactive (dead)

I think Deis releases should rely on the CoreOS stable channel to avoid this kind of issue...

thanks Kingdon!

Charles.

Chris Armstrong

Aug 17, 2015, 3:55:37 PM
to Charles Lescot, Deis user list
Kingdon is exactly correct.

> I think deis releases should rely on coreos stable channel to avoid these kind of issues....

We do, for exactly these reasons :) All of our provision scripts specify a CoreOS release version in the stable channel. I know our docs for bare metal could call this out more clearly, so I've submitted a PR: https://github.com/deis/deis/pull/4283



Kingdon Barrett

Aug 17, 2015, 3:56:31 PM
to Charles Lescot, Deis user list

I certainly think that was the intention, and these are just breaking changes.


Charles Lescot

Aug 17, 2015, 5:54:28 PM
to Deis user list
Hi,
just to refine the documentation:
http://docs.deis.io/en/latest/installing_deis/baremetal/ contains:
"Please get the source and refer to the scripts in contrib/bare-metal while following this documentation."
This points to https://github.com/deis/deis/tree/master/contrib/bare-metal, which points... back to http://docs.deis.io/en/latest/installing_deis/baremetal/

So we cannot reach the contrib/bare-metal scripts because of this infinite documentation loop...


best regards,

Charles.


Chris Armstrong

Aug 17, 2015, 8:15:06 PM
to Charles Lescot, Deis user list
I just PRed this: https://github.com/deis/deis/pull/4289

For future reference, our documentation is in the repo under docs/ - we'd welcome correction PRs!

Chris


Kingdon Barrett

Aug 18, 2015, 9:39:54 PM
to Chris Armstrong, Charles Lescot, Deis user list
Is this fixed?  It looks fixed

I was just able to bring up Deis on 723.3.0, the latest on the stable channel.

I'm not sure what changed.  Not the CoreOS release.  Fleet units maybe.





--
Kingdon Barrett <kin...@tuesdaystudios.com>

Charles Lescot

Aug 19, 2015, 6:39:25 AM
to Deis user list, carms...@engineyard.com, lescot....@gmail.com
Hi,
The docs will be fixed by the PR from Chris Armstrong.
But my problem remains with the 723.3.0 CoreOS release and Deis 1.9...

I've posted this problem on the CoreOS mailing list, without a response so far...


best regards,

Charles.

Kingdon Barrett

Aug 19, 2015, 8:02:03 AM
to Charles Lescot, Deis user list, carms...@engineyard.com

I am still on v1.9.0 but deis is working on stable CoreOS. Try refresh-units and start again?


Lorieri

Aug 27, 2015, 10:53:12 AM
to Kingdon Barrett, Charles Lescot, Deis user list, Chris Armstrong
For the first time, I could recover from an etcd failure :)

https://github.com/coreos/etcd/issues/815

I removed /var/lib/etcd/standby_info and restarted etcd.

PS: I'm still on the old etcd, not etcd2.

Chris Armstrong

Aug 27, 2015, 12:21:23 PM
to Lorieri, Kingdon Barrett, Charles Lescot, Deis user list
Awesome! I suspect that'll work for standby nodes, but not nodes that are part of the master cluster. Glad you got it sorted!