I'm writing today to find out whether I'm experiencing an isolated incident somehow of my own making, or whether others have seen similar problems.
Environment Overview
Kubernetes 1.7.11
etcd v3.0.17
Cloud Provider AWS
Cluster sizes range from 15 to 30 c4.8xlarge nodes, with about 250-700 pods per cluster
Discrete k8s clusters for dev/test/prod
Our etcd clusters are 5-member clusters running on c4.large or c4.xlarge instances, depending on the cluster.
On Friday, January 5th, we upgraded our dev cluster from CoreOS 1562.1.0 to 1649.0.0. Initial tests of the cluster seemed fine. After the upgrade, however, the etcd dataset grew from what had been a steady-state size of 214M to 601M. During this time, two of the five etcd pods failed their liveness probes and were restarted.
The liveness probe is:
livenessProbe:
  httpGet:
    path: /health
    port: 2379
  initialDelaySeconds: 30
  timeoutSeconds: 15
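For reference, that probe just hits etcd's plain HTTP health endpoint, i.e. roughly what you'd get with curl against a healthy member (hostname is one of ours, shown for illustration):

$ curl http://etcd-ip-10-20-2-124.ec2.internal:2379/health
{"health": "true"}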
These two etcd members were never able to rejoin the cluster after restarting, and all attempts to remove them from the cluster or to get keys returned the error:
Error: context deadline exceeded
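To be concrete, these are the sorts of invocations that failed; the endpoint is one member's client URL from the manifest below, and the member ID shown is made up:

$ ETCDCTL_API=3 etcdctl --endpoints=http://10.20.0.37:2379 member remove 8e9e05c52164694d
Error: context deadline exceeded
$ ETCDCTL_API=3 etcdctl --endpoints=http://10.20.0.37:2379 get /registry/namespaces/default
Error: context deadline exceeded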
In this case we ended up recovering the cluster from a snapshot using our in-house backup/recovery tools.
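Our backup/recovery tools are in house, so I can't share them, but for comparison, the stock etcdctl v3 equivalent of what we do would look roughly like this per member (snapshot path is illustrative; names/URLs are taken from the manifest below):

$ ETCDCTL_API=3 etcdctl --endpoints=http://10.20.0.37:2379 snapshot save /backups/etcd-snapshot.db
$ ETCDCTL_API=3 etcdctl snapshot restore /backups/etcd-snapshot.db \
    --name ip-10-20-2-124 \
    --initial-cluster ip-10-20-2-124=http://10.20.0.37:2380,ip-10-20-9-53=http://10.20.9.110:2380,ip-10-20-6-181=http://10.20.7.187:2380,ip-10-20-13-226=http://10.20.14.217:2380,ip-10-20-2-173=http://10.20.2.38:2380 \
    --initial-advertise-peer-urls http://10.20.0.37:2380 \
    --data-dir /etcd_data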
In the 2+ years of running these clusters we had never had an etcd outage before, so we wrote it off as an unfortunate event that needed further research, and proceeded to upgrade our test cluster from CoreOS 1562.1.0 to 1649.0.0. The test cluster's dataset did not show the same growth (it held at 417M), but within 30 minutes of the upgrade the cluster hit the same problem as the dev cluster: 2 etcd pods failed the liveness probe and never rejoined the cluster after restarting. We even removed the liveness probes, but the cluster never became usable. We also saw the same problem on the remaining healthy members: we could not perform any administrative actions to remove members from the cluster.

At this point we recovered the etcd cluster to a known-good snapshot, only to have it go south again. We tried one more recovery and hit the same situation. After rolling CoreOS back to 1562.1.0 and doing another etcd restore, the cluster is stable and has continued to run fine. So yeah, we'll hold off on our prod upgrade.
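In case anyone wants to compare dataset sizes: both the on-disk data dir and etcd's own reported backend size are easy to check. A quick sketch, using our hostPath on the node and one member's client URL from the manifest below:

$ du -sh /etcd
$ ETCDCTL_API=3 etcdctl --endpoints=http://10.20.0.37:2379 endpoint status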
FWIW, here is an example pod manifest:
kind: Pod
apiVersion: v1
metadata:
  name: etcd
  labels:
    scheduling: static
spec:
  containers:
  - name: etcd
    image: quay.io/coreos/etcd:v3.0.17
    env:
    - name: GOMAXPROCS
      value: "2"
    - name: ETCD_DATA_DIR
      value: /etcd_data
    - name: ETCD_LISTEN_CLIENT_URLS
      value: http://etcd-ip-10-20-2-124.ec2.internal:2379
    - name: ETCD_LISTEN_PEER_URLS
      value: http://etcd-ip-10-20-2-124.ec2.internal:2380
    - name: ETCD_INITIAL_CLUSTER
      value: ip-10-20-2-124=http://10.20.0.37:2380,ip-10-20-9-53=http://10.20.9.110:2380,ip-10-20-6-181=http://10.20.7.187:2380,ip-10-20-13-226=http://10.20.14.217:2380,ip-10-20-2-173=http://10.20.2.38:2380
    - name: ETCD_INITIAL_ADVERTISE_PEER_URLS
      value: http://10.20.0.37:2380
    - name: ETCD_ADVERTISE_CLIENT_URLS
      value: http://10.20.0.37:2379
    - name: ETCD_NAME
      value: ip-10-20-2-124
    - name: ETCD_HEARTBEAT_INTERVAL
      value: "225"
    - name: ETCD_ELECTION_TIMEOUT
      value: "2250"
    ports:
    - name: etcd-peer
      hostPort: 2380
      containerPort: 2380
    - name: etcd-client
      hostPort: 2379
      containerPort: 2379
    volumeMounts:
    - name: etcd
      mountPath: "/etcd_data"
      readOnly: false
    livenessProbe:
      httpGet:
        path: /health
        port: 2379
      initialDelaySeconds: 30
      timeoutSeconds: 15
  volumes:
  - name: etcd
    hostPath:
      path: "/etcd"
  restartPolicy: Always
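And after the rollback plus restore, health checks pass from every member again, e.g.:

$ ETCDCTL_API=3 etcdctl --endpoints=http://10.20.0.37:2379 endpoint health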
So, are we alone in our experience, or have others hit something similar?
Thanks. G.