etcd issues after updating OS for meltdown patch

weathe...@gmail.com

unread,

Jan 11, 2018, 1:36:31 PM1/11/18

to CoreOS User

I'm writing today to find out if I am experiencing an isolated incident somehow of my own making or if others have seen similar problems.

Environment Overview
Kubernetes 1.7.11
etcd v3.0.17
Cloud Provider AWS
Cluster sizes range from 15 C4.8xlarge to 30 C4.8xlarge with about 250 - 700 pods per cluster
Discrete k8s clusters for dev/test/prod
Our etcd clusters are 5 member clusters running on c4.large or c4.xlarge depending on the cluster.

On friday january 5th we upgraded our dev cluster from CoreOS 1562.1.0 to 1649.0.0.   Initially tests of the cluster seemed fine.   After the upgrade the etcd dataset grew from what had been a steady state size of 214M to 601M.   During this time two of five etcd pods failed their liveness probes and got restarted.

The liveness probe is

 livenessProbe:
      httpGet:
        path: /health
        port: 2379
      initialDelaySeconds: 30
      timeoutSeconds: 15

These two etcd members were never able to rejoin the cluster upon restarting and all attempts to remove them from the cluster or get keys returned the error

Error: context deadline exceeded

In this case we ended up recovering the cluster from a snapshot using our in house backup/recovery tools.

In the 2+ years of running these clusters we had never had an etcd outage before but we wrote it off as an unfortunate event which needs to be researched further and proceeded to upgrade our test cluster from CoreOS 1562.1.0 to 1649.0.0. The test cluster did not show an growth in the dataset of 417M but within 30 minutes of the upgrade the cluster experienced the same problem as the dev cluster where 2 etcd pods failed the liveness probe and never rejoined the cluster upon restarting. We even removed the liveness probes but the cluster never became usable. We saw the same problems with the remaining healthy members that we could not do any administrative actions to remove members from the cluster. At this point we recovered the etcd cluster to a known good snapshot only to have it go south again. We tried one more recovery and experienced the same situation. Upon rolling CoreOS back to 1562.1.0 and doing another etcd restore the cluster is stable and has continued to run fine. So ya we'll hold off on our prd upgrade.)

FWIW here is an example pod manifest.

kind: Pod
apiVersion: v1
metadata:
  name: etcd
  labels:
    scheduling: static
spec:
  containers:
  - name: etcd
    image: quay.io/coreos/etcd:v3.0.17
    env:
    - name: GOMAXPROCS
      value: "2"
    - name: ETCD_DATA_DIR
      value: /etcd_data
    - name: ETCD_LISTEN_CLIENT_URLS
      value: http://etcd-ip-10-20-2-124.ec2.internal:2379
    - name: ETCD_LISTEN_PEER_URLS
      value: http://etcd-ip-10-20-2-124.ec2.internal:2380
    - name: ETCD_INITIAL_CLUSTER
      value: ip-10-20-2-124=http://10.20.0.37:2380,ip-10-20-9-53=http://10.20.9.110:2380,ip-10-20-6-181=http://10.20.7.187:2380,ip-10-20-13-226=http://10.20.1
4.217:2380,ip-10-20-2-173=http://10.20.2.38:2380
    - name: ETCD_INITIAL_ADVERTISE_PEER_URLS
      value: http://10.20.0.37:2380
    - name: ETCD_ADVERTISE_CLIENT_URLS
      value: http://10.20.0.37:2379
    - name: ETCD_NAME
      value: ip-10-20-2-124
    - name: ETCD_HEARTBEAT_INTERVAL
      value: "225"
    - name: ETCD_ELECTION_TIMEOUT
      value: "2250"
    ports:
    - name: etcd-peer
      hostPort: 2380
      containerPort: 2380
    - name: etcd-client
      hostPort: 2379
      containerPort: 2379
    volumeMounts:
    - name: etcd
      mountPath: "/etcd_data"
      readOnly: false
    livenessProbe:
      httpGet:
        path: /health
        port: 2379
      initialDelaySeconds: 30
      timeoutSeconds: 15
  volumes:
  - name: etcd
    hostPath:
      path: "/etcd"
  restartPolicy: Always

So are we alone in our experience or have others hit something similar?
Thanks. G.

Rob Szumski

unread,

Jan 11, 2018, 2:35:43 PM1/11/18

to weathe...@gmail.com, CoreOS User

Thanks for the report. Have you taken a look at the metrics that etcd tracks internally? https://coreos.com/etcd/docs/latest/metrics.html

I’d be very interested to see what they are pre and post update. etcd is very disk latency sensitive, and I wonder if the syscalls are being impacted by the patch enough that etcd needs to be re-tuned. Please let us know what you find.

- Rob

--
You received this message because you are subscribed to the Google Groups "CoreOS User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to coreos-user...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Benjamin Gilbert

unread,

Jan 11, 2018, 4:55:26 PM1/11/18

to CoreOS User, weathe...@gmail.com, Rob Szumski

On Thu, Jan 11, 2018 at 11:35 AM, Rob Szumski <rob.s...@coreos.com> wrote:

I’d be very interested to see what they are pre and post update. etcd is very disk latency sensitive, and I wonder if the syscalls are being impacted by the patch enough that etcd needs to be re-tuned.

There were also a number of other changes between 1562.1.0 and 1649.0.0, notably a bump from kernel 4.13.5 to 4.14.11. Could you try with Container Linux 1632.0.0? That would rule out the 4.14 kernel as a cause. You could also try disabling the Meltdown fix in 1649.0.0 by adding

set linux_append="$linux_append pti=off"

to /usr/share/oem/grub.cfg and rebooting.

--Benjamin Gilbert

weathe...@gmail.com

unread,

Jan 12, 2018, 10:37:05 AM1/12/18

to CoreOS User

Benjamin.
Good points. We will look into capturing metrics' per Rob's suggestion and then move between the different versions of the OS in case as you point out we are hitting something in a previous kernel patch.

Appreciate the input from both of you!

weathe...@gmail.com

unread,

Jan 17, 2018, 6:12:02 PM1/17/18

to CoreOS User

Rob.
So we have finally reverted back to the last known good OS across the board and are collecting etcd metrics into our metrics sink (datadog). I was thinking of running the etcd benchmark tool (https://github.com/coreos/etcd/tree/master/tools/benchmark) to apply some pressure to the cluster before doing the next round of OS updates. I was not able to get the tool to compile in the master branch (https://gist.github.com/someword/a81bd81aab37c52bdf789d9182771ea7). Do you guys build/publish the benchmark tool anywhere? I didn't see it in the quay etcd image or in the tar.gz release for etcd. Do you think running the benchmark would be useful instead of just relying on the load of the cluster operations?

Thanks.

Xiang Li

unread,

Jan 17, 2018, 6:39:46 PM1/17/18

to CoreOS User

On Wednesday, January 17, 2018 at 3:12:02 PM UTC-8, weathe...@gmail.com wrote:

Rob.
So we have finally reverted back to the last known good OS across the board and are collecting etcd metrics into our metrics sink (datadog). I was thinking of running the etcd benchmark tool (https://github.com/coreos/etcd/tree/master/tools/benchmark) to apply some pressure to the cluster before doing the next round of OS updates. I was not able to get the tool to compile in the master branch (https://gist.github.com/someword/a81bd81aab37c52bdf789d9182771ea7).

I can compile it without a problem on current master or the previous release branch. I am not sure what problem you hit.

Do you guys build/publish the benchmark tool anywhere?

No, we do not publish the complied version of benchmark tool.

Rob Szumski

unread,

Jan 17, 2018, 6:43:39 PM1/17/18

to Xiang Li, CoreOS User

Once you get it running, getting a proper baseline during your normal load and under more load would be a great starting point. Let us know what you find and we can keep digging.

- Rob

weathe...@gmail.com

unread,

Jan 18, 2018, 5:09:24 PM1/18/18

to CoreOS User

On Wednesday, January 17, 2018 at 3:39:46 PM UTC-8, Xiang Li wrote:

I can compile it without a problem on current master or the previous release branch. I am not sure what problem you hit.

I set a new GOPATH and checked out all of the dependencies into the new GOPATH and tried to compile benchmark on the v3.2.9 tag but it fails. Here is a gist that shows the problem i'm hitting with compiling the benchmark tool https://gist.github.com/someword/f2f9eb04ed7886e5675ce5a197e0c8dc