I'm running a 5-node etcd 2.3.7 cluster on CoreOS 1185.3.0.
The system appears to be functioning fine most days, but my logs fill with messages like these almost every day on every node:
Nov 14 21:34:48 master3 etcd2[1177]: failed to read f36e349c2084f22d on stream MsgApp v2 (unexpected EOF)
Nov 14 21:35:52 master3 etcd2[1177]: failed to dial f36e349c2084f22d on stream Message (dial tcp 96.118.54.205:7001: i/o timeout)
Nov 14 21:36:26 master3 etcd2[1177]: got unexpected response error (etcdserver: request timed out)
Nov 14 21:36:28 master3 etcd2[1177]: got unexpected response error (etcdserver: request timed out) [merged 1 repeated lines in 1.97s]
Nov 15 06:02:25 master3 etcd2[1177]: got unexpected response error (etcdserver: request timed out) [merged 4 repeated lines in 2s]
Nov 15 06:12:45 master3 etcd2[1177]: failed to read aa8facee16728b33 on stream MsgApp v2 (read tcp 96.118.51.103:49182->96.118.54.183:2380: i/o timeout)
Nov 16 06:37:34 master3 etcd2[1177]: etcdserver: request timed out, possibly due to previous leader failure
These machines are VMs in the same DC with ping times of ~0.1 ms. All five nodes are also reachable from outside the DC, and my Prometheus client hasn't reported any downtime for any of them. Any clues as to what I can investigate to get to the bottom of this?
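One thing I've started doing is tallying which peer IDs show up in the read/dial failures, to see whether the problem clusters on a single member or is spread across the cluster. This is just a quick sketch I run over `journalctl -u etcd2` output (the helper name and regex are my own, not anything from etcd):

```python
import re
from collections import Counter

# Matches the peer ID in etcd2 rafthttp failure lines such as:
#   "failed to read f36e349c2084f22d on stream MsgApp v2 (unexpected EOF)"
#   "failed to dial f36e349c2084f22d on stream Message (dial tcp ...: i/o timeout)"
PEER_RE = re.compile(r"failed to (?:read|dial) ([0-9a-f]{16}) on stream")

def count_peer_failures(log_lines):
    """Tally stream read/dial failures per peer ID."""
    counts = Counter()
    for line in log_lines:
        m = PEER_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

If one ID dominates the counts, I'd focus on that VM's host; if the failures are evenly spread, it looks more like a shared network or disk-latency issue.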