Periodic failures every few minutes on an HA etcd cluster running on EKS


Selvi K

Sep 21, 2023, 8:25:37 PM
to etcd-dev
Hello,

I am running a 3-replica etcd cluster on a single-node EKS cluster (one m5.8xlarge), as a test before I move to multiple nodes. I am using etcd version 3.5.6 and EBS persistent volumes on the cluster nodes. The etcd StatefulSet and its services all come up fine, and I can verify via the "member list" and "endpoint health" commands that the cluster is healthy. However, every few minutes one or more of the three etcd pods go into CrashLoopBackOff, and I see many connectivity-related error messages in the etcd pod logs.
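
For reference, the two health checks mentioned above can be sketched roughly as follows. This is a minimal sketch, not the exact invocation used: it assumes pods named etcd-0 through etcd-2 behind a headless service called "etcd" (the names that appear in the startup flags), plain HTTP on client port 2379, and that etcdctl is run from somewhere that can resolve those names, such as inside one of the pods. The function name check_cluster is hypothetical.

```shell
# Assumption: three members reachable as etcd-<i>.etcd on client port 2379.
ENDPOINTS=""
for i in 0 1 2; do
  # Append each client URL, comma-separated, without a leading comma.
  ENDPOINTS="${ENDPOINTS:+${ENDPOINTS},}http://etcd-${i}.etcd:2379"
done

# Hypothetical helper: list the members, then probe every endpoint's health.
check_cluster() {
  etcdctl --endpoints="${ENDPOINTS}" member list -w table
  etcdctl --endpoints="${ENDPOINTS}" endpoint health
}
```

Calling check_cluster while one pod is restarting should show which endpoint fails the health probe, which can then be matched against the member IDs in the logs.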

These are the flags I am using to start up etcd:
PEERS="etcd-0=http://etcd-0.etcd:2380,etcd-1=http://etcd-1.etcd:2380,etcd-2=http://etcd-2.etcd:2380" && exec /usr/local/bin/etcd \
  --listen-peer-urls http://0.0.0.0:2380 \
  --listen-client-urls http://0.0.0.0:2379 \
  --advertise-client-urls http://${ETCD_NAME}.etcd:2379 \
  --initial-advertise-peer-urls http://${ETCD_NAME}:2380 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster ${PEERS} \
  --initial-cluster-state new \
  --data-dir /var/run/etcd/${ETCD_NAME}

Here are some of the errors in the pod logs:

1.
{"level":"info","ts":"2023-09-22T00:17:24.691Z","caller":"rafthttp/peer.go:335","msg":"stopped remote peer","remote-peer-id":"b429c86e3cd4e077"}
{"level":"warn","ts":"2023-09-22T00:17:24.692Z","caller":"rafthttp/http.go:413","msg":"failed to find remote peer in cluster","local-member-id":"2e80f96756a54ca9","remote-peer-id-stream-handler":"2e80f96756a54ca9","remote-peer-id-from":"7fd61f3f79d97779","cluster-id":"718fb68f6a80fda9"}

2.
{"level":"info","ts":"2023-09-22T00:17:24.690Z","caller":"rafthttp/peer.go:335","msg":"stopped remote peer","remote-peer-id":"7fd61f3f79d97779"}
{"level":"info","ts":"2023-09-22T00:17:24.690Z","caller":"rafthttp/peer.go:330","msg":"stopping remote peer","remote-peer-id":"b429c86e3cd4e077"}
{"level":"warn","ts":"2023-09-22T00:17:24.690Z","caller":"rafthttp/stream.go:286","msg":"closed TCP streaming connection with remote peer","stream-writer-type":"stream MsgApp v2","remote-peer-id":"b429c86e3cd4e077"}

3.
{"level":"info","ts":"2023-09-22T00:17:24.695Z","caller":"embed/etcd.go:568","msg":"stopping serving peer traffic","address":"[::]:2380"}
{"level":"info","ts":"2023-09-22T00:17:25.695Z","caller":"embed/etcd.go:573","msg":"stopped serving peer traffic","address":"[::]:2380"}
{"level":"info","ts":"2023-09-22T00:17:25.695Z","caller":"embed/etcd.go:375","msg":"closed etcd server","name":"etcd-0","data-dir":"/var/run/etcd/etcd-0","advertise-peer-urls":["http://etcd-0:2380"],"advertise-client-urls":["http://etcd-0.etcd:2379"]}

Any hints to troubleshoot this further would be appreciated!

Thanks,
Selvi