etcd Cluster Issues

abdulazee...@gmail.com

Apr 3, 2019, 11:52:59 AM
to CoreOS User
I'm trying to set up a 2-node etcd cluster (I know it's not recommended, but I'm just doing a proof of concept until I get my 3rd piece of hardware) with Terraform and Matchbox. I'm using the Matchbox example on Git as a guide (HERE). The only changes I made are the domain name (to match mine) and "Environment="ETCD_IMAGE_TAG=v3.2.0"" to v3.3.10, which is the current version running on CoreOS 1967.6.0.

The 2 servers are up, but the etcd-member service fails to start. The errors are shown below:

Apr 03 15:23:00 follower systemd[1]: Failed to start etcd (System Application Container).
Apr 03 15:23:10 follower systemd[1]: etcd-member.service: Service hold-off time over, scheduling restart.
Apr 03 15:23:10 follower systemd[1]: etcd-member.service: Scheduled restart job, restart counter is at 3.
Apr 03 15:23:10 follower systemd[1]: Stopped etcd (System Application Container).
Apr 03 15:23:10 follower systemd[1]: Starting etcd (System Application Container)...
Apr 03 15:23:10 follower etcd-wrapper[1359]: ++ id -u etcd
Apr 03 15:23:10 follower etcd-wrapper[1359]: + exec /usr/bin/rkt run --uuid-file-save=/var/lib/coreos/etcd-member-wrapper.uuid --trust-keys-from-https --mount volume=coreos-systemd-dir,target=/run/systemd/system --volume core>


The output of netstat -tupln shows that ports 2379 and 2380 are not listening either:

core@leader ~ $ netstat -tupln
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp6       0      0 :::22                   :::*                    LISTEN      -
udp        0      0 192.168.200.30:68       0.0.0.0:*                           -
udp6       0      0 fe80::d294:66ff:fe8:546 :::*                                -



core@leader ~ $ etcdctl cluster-health
cluster may be unhealthy: failed to list members
Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:4001: connect: connection refused
; error #1: dial tcp 127.0.0.1:2379: connect: connection refused

error #0: dial tcp 127.0.0.1:4001: connect: connection refused
error #1: dial tcp 127.0.0.1:2379: connect: connection refused


abdulazee...@gmail.com

Apr 3, 2019, 12:00:11 PM
to CoreOS User
core@leader ~ $ systemctl status etcd-member
● etcd-member.service - etcd (System Application Container)
   Loaded: loaded (/usr/lib/systemd/system/etcd-member.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/etcd-member.service.d
           └─40-etcd-cluster.conf
   Active: activating (start) since Wed 2019-04-03 10:01:42 UTC; 46s ago
  Process: 2681 ExecStartPre=/usr/bin/rkt rm --uuid-file=/var/lib/coreos/etcd-member-wrapper.uuid (code=exited, status=254)
  Process: 2679 ExecStartPre=/usr/bin/mkdir --parents /var/lib/coreos (code=exited, status=0/SUCCESS)
 Main PID: 2695 (rkt)
    Tasks: 18 (limit: 32767)
   Memory: 36.1M
   CGroup: /system.slice/etcd-member.service
           └─2695 /usr/bin/rkt run --uuid-file-save=/var/lib/coreos/etcd-member-wrapper.uuid --trust-keys-from-https --mount volume=coreos-systemd-dir,target=/run/systemd/system --volume coreos-systemd-dir,kind=host,source=>

Apr 03 10:01:42 leader systemd[1]: Starting etcd (System Application Container)...
Apr 03 10:01:42 leader etcd-wrapper[2695]: ++ id -u etcd
Apr 03 10:01:42 leader etcd-wrapper[2695]: + exec /usr/bin/rkt run --uuid-file-save=/var/lib/coreos/etcd-member-wrapper.uuid --trust-keys-from-https --mount volume=coreos-systemd-dir,target=/run/systemd/system --volume core>

Seán C. McCord

Apr 3, 2019, 12:33:55 PM
to coreos-user
Please understand that running a 2-node cluster is not simply "not recommended."  etcd is not like an RDBMS, which operates as a primary and a failover.  Running a 2-node cluster is fundamentally _worse_ than running a single-node cluster, since either node being down will render the database inoperative.  Run 3 or 1, but NOT 2.  The communication between members is a critical design and operational component of etcd.

That said, your problem appears to be that you are not specifying your bindings and endpoints.  By default, etcd listens only on localhost.  Moreover, for a cluster to form, you _must_ specify the parameters for its construction.  That is, it needs to know who the members of that cluster are.  This can be done through a third-party discovery tool (such as the one provided by CoreOS) or through explicit specification.  In general, you need to specify both the binding ports _and_ the advertised ports (which may or may not be the same thing, depending on your networking) for both the membership network _and_ the client network.

Start small:  just run a single node system, specifying the client service bindings.  Once you have that understood and operational outside of a single system (that is, properly binding to an external interface, serving external clients), you can add the cluster membership parameters.
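For illustration, a minimal single-node drop-in along those lines might look like the following (the node name and hostnames here are placeholders, not values from your setup):

[Service]
Environment="ETCD_NAME=node1"
# Bindings: where etcd listens
Environment="ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379"
Environment="ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2380"
# Endpoints: what etcd advertises to clients and peers
Environment="ETCD_ADVERTISE_CLIENT_URLS=http://node1.example:2379"
Environment="ETCD_INITIAL_ADVERTISE_PEER_URLS=http://node1.example:2380"
# Membership: who is in the cluster, and that it is a new one
Environment="ETCD_INITIAL_CLUSTER=node1=http://node1.example:2380"
Environment="ETCD_INITIAL_CLUSTER_STATE=new"

Once it is up, you can verify it is serving external clients from another machine with something like:

etcdctl --endpoints http://node1.example:2379 cluster-health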



--
You received this message because you are subscribed to the Google Groups "CoreOS User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to coreos-user...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
Seán C. McCord
ule...@gmail.com
CyCore Systems

abdulazee...@gmail.com

Apr 3, 2019, 12:39:10 PM
to CoreOS User
Thanks for the reply, Seán.

Like I mentioned earlier, I'm only running a 2-node cluster because I'm awaiting the delivery of my 3rd piece of hardware, so this is just a test environment.

With respect to specifying the bindings and endpoints, could you please point me to a resource that shows how to do that? It would be greatly appreciated. Thanks for your prompt reply once again.

Azeem

abdulazee...@gmail.com

Apr 3, 2019, 12:52:48 PM
to CoreOS User
Both servers have the following parameters:

Environment="ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379"
Environment="ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2380"


The documentation says that specifying 0.0.0.0 makes etcd listen on the given port on all interfaces.

I also specified the members of the cluster using the environment variable below and I can confirm that the DNS names are resolving without issues:

Environment="ETCD_INITIAL_CLUSTER=node2=http://node2.sp.swarm:2380,node3=http://node3.sp.swarm:2380"

Maybe I'm missing something else, though.
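For what it's worth, once etcd does start with these listen URLs, it should show up in netstat as bound on all interfaces:

core@leader ~ $ sudo netstat -tupln | grep -E '2379|2380'

Right now, that returns nothing on either node.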

Seán C. McCord

Apr 3, 2019, 1:02:48 PM
to CoreOS User
That looks right so far.  You still need to finish out the membership parameters.  Take a look at the clustering operation guide here:  https://coreos.com/etcd/docs/latest/op-guide/clustering.html




abdulazee...@gmail.com

Apr 3, 2019, 1:08:39 PM
to CoreOS User
Thanks, Seán. I'll check out the documentation and see if there's an anomaly in my config.

By the way, here's the result of the "systemctl cat etcd-member" command for both nodes:

core@leader ~ $ systemctl cat etcd-member
# /usr/lib/systemd/system/etcd-member.service
[Unit]
Description=etcd (System Application Container)
Wants=network-online.target network.target
After=network-online.target
Conflicts=etcd.service
Conflicts=etcd2.service

[Service]
Type=notify
Restart=on-failure
RestartSec=10s
TimeoutStartSec=0
LimitNOFILE=40000

Environment="ETCD_IMAGE_TAG=v3.3.10"
Environment="ETCD_NAME=%m"
Environment="ETCD_USER=etcd"
Environment="ETCD_DATA_DIR=/var/lib/etcd"
Environment="RKT_RUN_ARGS=--uuid-file-save=/var/lib/coreos/etcd-member-wrapper.uuid"

ExecStartPre=/usr/bin/mkdir --parents /var/lib/coreos
ExecStartPre=-/usr/bin/rkt rm --uuid-file=/var/lib/coreos/etcd-member-wrapper.uuid
ExecStart=/usr/lib/coreos/etcd-wrapper $ETCD_OPTS
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/lib/coreos/etcd-member-wrapper.uuid

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/etcd-member.service.d/40-etcd-cluster.conf
[Service]
Environment="ETCD_IMAGE_TAG=v3.3.10"
Environment="ETCD_NAME=node2"
Environment="ETCD_ADVERTISE_CLIENT_URLS=http://node2.sp.swarm:2379"
Environment="ETCD_INITIAL_ADVERTISE_PEER_URLS=http://node2.sp.swarm:2380"
Environment="ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379"
Environment="ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2380"
Environment="ETCD_INITIAL_CLUSTER=node2=http://node2.sp.swarm:2380,node3=http://node3.sp.swarm:2380"
Environment="ETCD_STRICT_RECONFIG_CHECK=true"


core@follower ~ $ systemctl cat etcd-member
# /usr/lib/systemd/system/etcd-member.service
[Unit]
Description=etcd (System Application Container)
Wants=network-online.target network.target
After=network-online.target
Conflicts=etcd.service
Conflicts=etcd2.service

[Service]
Type=notify
Restart=on-failure
RestartSec=10s
TimeoutStartSec=0
LimitNOFILE=40000

Environment="ETCD_IMAGE_TAG=v3.3.10"
Environment="ETCD_NAME=%m"
Environment="ETCD_USER=etcd"
Environment="ETCD_DATA_DIR=/var/lib/etcd"
Environment="RKT_RUN_ARGS=--uuid-file-save=/var/lib/coreos/etcd-member-wrapper.uuid"

ExecStartPre=/usr/bin/mkdir --parents /var/lib/coreos
ExecStartPre=-/usr/bin/rkt rm --uuid-file=/var/lib/coreos/etcd-member-wrapper.uuid
ExecStart=/usr/lib/coreos/etcd-wrapper $ETCD_OPTS
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/lib/coreos/etcd-member-wrapper.uuid

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/etcd-member.service.d/40-etcd-cluster.conf
[Service]
Environment="ETCD_IMAGE_TAG=v3.3.10"
Environment="ETCD_NAME=node3"
Environment="ETCD_ADVERTISE_CLIENT_URLS=http://node3.sp.swarm:2379"
Environment="ETCD_INITIAL_ADVERTISE_PEER_URLS=http://node3.sp.swarm:2380"
Environment="ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379"
Environment="ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2380"
Environment="ETCD_INITIAL_CLUSTER=node2=http://node2.sp.swarm:2380,node3=http://node3.sp.swarm:2380"
Environment="ETCD_STRICT_RECONFIG_CHECK=true"


Seán C. McCord

Apr 3, 2019, 1:18:55 PM
to CoreOS User
You will still need to bootstrap:  ETCD_INITIAL_CLUSTER_STATE=new
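That is, add the line to the same drop-in shown in your systemctl cat output (/etc/systemd/system/etcd-member.service.d/40-etcd-cluster.conf):

Environment="ETCD_INITIAL_CLUSTER_STATE=new"

and then reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart etcd-member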



abdulazee...@gmail.com

Apr 3, 2019, 6:21:26 PM
to CoreOS User
I added the bootstrap parameter ETCD_INITIAL_CLUSTER_STATE=new and I still get the same error.

I'm now trying just a 1-node etcd cluster for simplicity, and I still get the same error (see below). Ports 2379 and 2380 are still not open:

core@leader ~ $ sudo netstat -tupln
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp6       0      0 :::22                   :::*                    LISTEN      1/systemd
udp        0      0 192.168.200.30:68       0.0.0.0:*                           1043/systemd-networ
udp6       0      0 fe80::d294:66ff:fe8:546 :::*                                1043/systemd-networ



Environment="ETCD_STRICT_RECONFIG_CHECK=true"
Environment="ETCD_INITIAL_CLUSTER_STATE=new"


core@leader ~ $ systemctl status etcd-member
● etcd-member.service - etcd (System Application Container)
   Loaded: loaded (/usr/lib/systemd/system/etcd-member.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/etcd-member.service.d
           └─40-etcd-cluster.conf
   Active: activating (start) since Wed 2019-04-03 16:20:49 UTC; 26s ago
  Process: 2295 ExecStartPre=/usr/bin/rkt rm --uuid-file=/var/lib/coreos/etcd-member-wrapper.uuid (code=exited, status=254)
  Process: 2292 ExecStartPre=/usr/bin/mkdir --parents /var/lib/coreos (code=exited, status=0/SUCCESS)
 Main PID: 2307 (rkt)
    Tasks: 19 (limit: 32767)
   Memory: 37.1M
   CGroup: /system.slice/etcd-member.service
           └─2307 /usr/bin/rkt run --uuid-file-save=/var/lib/coreos/etcd-member-wrapper.uuid --trust-keys-from-https --mount volume=coreos-systemd-dir,target=/run/systemd/system --volume coreos-systemd-dir,kind=host,source=>

Apr 03 16:20:49 leader systemd[1]: Starting etcd (System Application Container)...
Apr 03 16:20:49 leader etcd-wrapper[2307]: ++ id -u etcd
Apr 03 16:20:49 leader etcd-wrapper[2307]: + exec /usr/bin/rkt run --uuid-file-save=/var/lib/coreos/etcd-member-wrapper.uuid --trust-keys-from-https --mount volume=coreos-systemd-dir,target=/run/systemd/system --volume core>



Apr 03 16:20:39 leader systemd[1]: Failed to start etcd (System Application Container).
Apr 03 16:20:49 leader systemd[1]: etcd-member.service: Service hold-off time over, scheduling restart.
Apr 03 16:20:49 leader systemd[1]: etcd-member.service: Scheduled restart job, restart counter is at 28.
Apr 03 16:20:49 leader systemd[1]: Stopped etcd (System Application Container).
Apr 03 16:20:49 leader systemd[1]: Starting etcd (System Application Container)...
Apr 03 16:20:49 leader etcd-wrapper[2307]: ++ id -u etcd
Apr 03 16:20:49 leader etcd-wrapper[2307]: + exec /usr/bin/rkt run --uuid-file-save=/var/lib/coreos/etcd-member-wrapper.uuid --trust-keys-from-https --mount volume=coreos-systemd-dir,target=/run/systemd/system --volume core>
Apr 03 16:21:49 leader systemd[1]: etcd-member.service: Main process exited, code=exited, status=254/n/a
Apr 03 16:21:49 leader systemd[1]: etcd-member.service: Failed with result 'exit-code'.
Apr 03 16:21:49 leader systemd[1]: Failed to start etcd (System Application Container).


abdulazee...@gmail.com

Apr 4, 2019, 11:41:57 AM
to CoreOS User
Hi Sean,

I've added all the possible parameters and the etcd-member service is still not coming up. I have a 3-node etcd cluster running on AWS on Ubuntu 18.04 without any issues, so I'm just wondering where I got it wrong in this case.

I even switched the URLs from domain names to IP addresses, and it's still the same error (for a 1-node cluster on CoreOS bare metal with Terraform v0.11.13, provider.matchbox v0.2.3, and Matchbox v0.7.1). It looks like the rkt command is not executing, but I don't know how to further troubleshoot rkt.

Apr 04 09:22:20 leader systemd[1]: Starting etcd (System Application Container)...
Apr 04 09:22:20 leader etcd-wrapper[2306]: ++ id -u etcd
Apr 04 09:22:20 leader etcd-wrapper[2306]: + exec /usr/bin/rkt run --uuid-file-save=/var/lib/coreos/etcd-member-wrapper.uuid --trust-keys-from-https --mount volume=coreos-systemd-dir,target=/run/systemd/system --volume coreos-systemd-dir,kind=host,source=/run/systemd/system,readOnly=true --mount volume=coreos-notify,target=/run/systemd/notify --volume coreos-notify,kind=host,source=/run/systemd/notify --set-env=NOTIFY_SOCKET=/run/systemd/notify --volume coreos-data-dir,kind=host,source=/var/lib/etcd,readOnly=false --volume coreos-etc-ssl-certs,kind=host,source=/etc/ssl/certs,readOnly=true --volume coreos-usr-share-certs,kind=host,source=/usr/share/ca-certificates,readOnly=true --volume coreos-etc-hosts,kind=host,source=/etc/hosts,readOnly=true --volume coreos-etc-resolv,kind=host,source=/etc/resolv.conf,readOnly=true --mount volume=coreos-data-dir,target=/var/lib/etcd --mount volume=coreos-etc-ssl-certs,target=/etc/ssl/certs --mount volume=coreos-usr-share-certs,target=/usr/share/ca-certificates --mount volume=coreos-etc-hosts,target=/etc/hosts --mount volume=coreos-etc-resolv,target=/etc/resolv.conf --inherit-env --stage1-from-dir=stage1-fly.aci quay.io/coreos/etcd:v3.3.10 --user=232 --
Apr 04 09:23:21 leader systemd[1]: etcd-member.service: Main process exited, code=exited, status=254/n/a
Apr 04 09:23:21 leader systemd[1]: etcd-member.service: Failed with result 'exit-code'.
Apr 04 09:23:21 leader systemd[1]: Failed to start etcd (System Application Container).
Apr 04 09:23:31 leader systemd[1]: etcd-member.service: Service hold-off time over, scheduling restart.
Apr 04 09:23:31 leader systemd[1]: etcd-member.service: Scheduled restart job, restart counter is at 29.
Apr 04 09:23:31 leader systemd[1]: Stopped etcd (System Application Container).


abdulazee...@gmail.com

Apr 4, 2019, 3:15:14 PM
to CoreOS User
UPDATE

So I started the etcd-member service with the same parameters on a CoreOS node I spun up on AWS EC2, and it worked without any issues.

I found out that the issue was that I was running CoreOS on a closed network without internet access, and the service is trying to pull the etcd image from the remote Quay repository. I don't have a solution for this yet, and I would like to keep my network closed. I do have a local Docker repository running on Nexus, though; I don't know if I can host the etcd image there. Any suggestions?
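For reference, rkt can fetch images from a Docker registry directly, so hosting the image on the local Nexus may be workable if the wrapper's image URL can be pointed at it. A rough, untested sketch (the registry hostname and port are placeholders, and whether the wrapper honors an ETCD_IMAGE_URL override should be verified in the wrapper script first):

# check how the wrapper builds the image URL
grep IMAGE /usr/lib/coreos/etcd-wrapper

# fetch from a local (HTTP, unsigned) Docker registry
sudo rkt --insecure-options=image fetch docker://nexus.example.local:5000/coreos/etcd:v3.3.10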