Debugging Etcd Failures

Matt Hughes

Dec 13, 2016, 10:23:03 PM
to CoreOS User
I'm running a 5-node etcd cluster at version 2.3.7 on CoreOS 1185.3.0.

While the system appears to be functioning fine most days, my logs are filled with messages like this almost every day on every node:

Nov 14 21:34:48 master3 etcd2[1177]: failed to read f36e349c2084f22d on stream MsgApp v2 (unexpected EOF)
Nov 14 21:35:52 master3 etcd2[1177]: failed to dial f36e349c2084f22d on stream Message (dial tcp 96.118.54.205:7001: i/o timeout)
Nov 14 21:36:26 master3 etcd2[1177]: got unexpected response error (etcdserver: request timed out)
Nov 14 21:36:28 master3 etcd2[1177]: got unexpected response error (etcdserver: request timed out) [merged 1 repeated lines in 1.97s]
Nov 15 06:02:25 master3 etcd2[1177]: got unexpected response error (etcdserver: request timed out) [merged 4 repeated lines in 2s]
Nov 15 06:12:45 master3 etcd2[1177]: failed to read aa8facee16728b33 on stream MsgApp v2 (read tcp 96.118.51.103:49182->96.118.54.183:2380: i/o timeout)
Nov 16 06:37:34 master3 etcd2[1177]: etcdserver: request timed out, possibly due to previous leader failure

These machines are VMs in the same DC, with ping times of ~0.1 ms. Any clues as to what I can investigate to get to the bottom of this? All of these nodes are reachable from outside the DC; my Prometheus client hasn't reported any downtime for any of the 5 nodes.
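One way to start narrowing this down is to tally the messages by peer and by hour, to see whether the failures cluster around one member or one time of day. The following is only a rough sketch: the regular expressions are written against the message shapes in the excerpt above and may miss other etcd log formats, and `summarize` is a made-up helper name, not part of any etcd tooling.

```python
import re
from collections import Counter

# Patterns derived from the log excerpt above (assumption: other
# etcd versions may word these messages differently).
PEER_RE = re.compile(r"failed to (read|dial) ([0-9a-f]{16}) on stream")
TIMEOUT_RE = re.compile(r"request timed out")
# Captures month, day and hour, e.g. "Nov 14 21" from "Nov 14 21:34:48".
STAMP_RE = re.compile(r"^(\w{3}\s+\d+ \d{2}):")


def summarize(lines):
    """Count peer stream failures per (peer id, operation) and all
    matched error lines per hour."""
    by_peer = Counter()
    by_hour = Counter()
    for line in lines:
        peer = PEER_RE.search(line)
        if peer:
            by_peer[(peer.group(2), peer.group(1))] += 1
        elif not TIMEOUT_RE.search(line):
            continue  # not one of the messages we are tracking
        stamp = STAMP_RE.match(line)
        if stamp:
            by_hour[stamp.group(1)] += 1
    return by_peer, by_hour
```

Feeding it the saved journal output for the etcd2 process on each node (for example `summarize(open("etcd2.log"))`, where `etcd2.log` is a hypothetical file name) gives two counters that can be compared across nodes.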

Xiang Li

Dec 13, 2016, 11:49:01 PM
to CoreOS User
This looks like a network issue to me. It is unlikely to be an etcd issue, though that is possible. My suggestion is to closely monitor the etcd TCP connections between the peers for a couple of days and gather more information on why the i/o timeouts happen. There should be only one or two connections between each pair of peers, so it won't be hard to monitor.
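As a rough illustration of that kind of monitoring, here is a minimal Python sketch that repeatedly opens a fresh TCP connection to each peer port and logs slow or failed attempts. Note that this is a stand-in: it probes with new connections rather than observing etcd's own long-lived peer streams, so it can only show whether the network path misbehaves at the same times as the log errors. The function names, the 2-second timeout, and the 0.5-second "slow" threshold are all assumptions to adjust.

```python
import socket
import time


def probe(host, port, timeout=2.0):
    """Try one TCP connect; return (ok, elapsed_seconds, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, None
    except OSError as exc:  # includes timeouts and refused connections
        return False, time.monotonic() - start, str(exc)


def watch(peers, interval=5.0, slow=0.5, rounds=None):
    """Probe every (host, port) in peers each interval and print
    attempts that fail or take longer than `slow` seconds.
    Runs forever unless `rounds` is given."""
    done = 0
    while rounds is None or done < rounds:
        for host, port in peers:
            ok, elapsed, err = probe(host, port)
            if not ok or elapsed > slow:
                stamp = time.strftime("%Y-%m-%d %H:%M:%S")
                status = "slow" if ok else "FAILED: " + err
                print(f"{stamp} {host}:{port} {elapsed:.3f}s {status}")
        done += 1
        time.sleep(interval)
```

Run on each node with the other members' peer addresses, for example `watch([("96.118.54.205", 7001), ("96.118.54.183", 2380)])` using the addresses from the log excerpt, and compare the timestamps it prints with the etcd error timestamps.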

It would be great if you could create GitHub issues (https://github.com/coreos/etcd/issues/new) for tracking down etcd-related bugs in the future. Thanks.