Etcd heartbeats exceeding default timeout


weathe...@gmail.com

Sep 20, 2017, 6:45:02 PM
to CoreOS User
Hello.
We are running CoreOS version 1492.1.0
We run etcd in a static pod via the kubelet. The instances are AWS m4.xlarge in some clusters and m4.large in others. The backing store is a dedicated 20GB EBS volume (gp2), so it gets the baseline 100 IOPS.
These etcd instances only have Kubernetes masters and flannel connecting to them.
All of our keys are still in v2 format.
We have 280 keys in etcd:
etcdctl ls --recursive / |wc -l
280

The disk usage for the wal and snap directories:
sudo ls -alh  /etcd/member/wal
total 428M
drwx------. 2 root root 4.0K Sep 20 20:21 .
drwx------. 4 root root 4.0K Sep 20 20:21 ..
-rw-------. 1 root root  62M Sep 20 20:21 0.tmp
-rw-------. 1 root root  62M Sep 20 09:30 0000000000000035-00000000002dcb0f.wal
-rw-------. 1 root root  62M Sep 20 12:57 0000000000000036-00000000002ea962.wal
-rw-------. 1 root root  62M Sep 20 16:24 0000000000000037-00000000002f87ba.wal
-rw-------. 1 root root  62M Sep 20 19:51 0000000000000038-0000000000306609.wal
-rw-------. 1 root root  62M Sep 20 22:40 0000000000000039-00000000003143bf.wal
-rw-------. 1 root root  62M Sep 20 19:51 1.tmp
core@ip-10-21-1-17 ~ $ sudo ls -alh /etcd/member/snap
total 24M
drwx------. 2 root root 4.0K Sep 20 22:23 .
drwx------. 4 root root 4.0K Sep 20 20:21 ..
-rw-r--r--. 1 root root 4.7M Sep 20 19:57 0000000000000029-0000000000314a74.snap
-rw-r--r--. 1 root root 5.3M Sep 20 20:36 000000000000002b-0000000000317185.snap
-rw-r--r--. 1 root root 5.2M Sep 20 21:11 000000000000002b-0000000000319896.snap
-rw-r--r--. 1 root root 4.2M Sep 20 21:47 000000000000002b-000000000031bfa7.snap
-rw-r--r--. 1 root root 4.6M Sep 20 22:22 000000000000002b-000000000031e6b8.snap
-rw-------. 1 root root  17M Sep 20 22:22 db
core@ip-10-21-1-17 ~ $


We were in the process of migrating from 2.3.2 to 3.0.17 when we noticed these warning messages.

2017-09-20 16:16:28.101633 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 68.292879ms)
2017-09-20 16:16:28.101733 W | etcdserver: server is likely overloaded
2017-09-20 16:16:28.101744 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 68.411161ms)
2017-09-20 16:16:28.101749 W | etcdserver: server is likely overloaded
2017-09-20 16:16:28.101757 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 68.427665ms)
2017-09-20 16:16:28.101762 W | etcdserver: server is likely overloaded
2017-09-20 16:16:28.101768 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 68.437652ms)
2017-09-20 16:16:28.101773 W | etcdserver: server is likely overloaded


These messages only show up on the leader node, and in each cluster we see 3-10 episodes a day in which these warnings are logged for 10-80 seconds at a time.

I'm curious to understand what sort of severity I should associate with these messages, and how I can go about determining the cause (e.g. CPU saturation, disk latency, volume of etcd operations).
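For anyone else trying to narrow this down: etcd exposes Prometheus metrics on its client port, and the WAL fsync and backend commit histograms are a quick way to rule disk latency in or out. A sketch, assuming the default client port 2379 with no TLS, and using the fio invocation commonly cited for checking whether a disk is fast enough for etcd (the size/bs values are a starting point, not gospel):

```shell
# Inspect etcd's own disk-latency histograms (assumes default port, no TLS).
curl -s http://127.0.0.1:2379/metrics | \
    grep -E 'wal_fsync_duration_seconds|backend_commit_duration_seconds'

# Benchmark fsync latency with fio using a write pattern similar to the WAL:
# small sequential writes with an fdatasync after each. Point --directory at
# a scratch directory on the SAME EBS volume as the WAL, not the live
# /etcd/member/wal directory itself.
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/etcd/fio-test --size=22m --bs=2300 --name=etcd-disk-check
```

If the 99th percentile of `wal_fsync_duration_seconds` sits anywhere near the heartbeat interval, the gp2 volume's 100 IOPS baseline is a plausible culprit.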

For us etcd just works. We rev the version occasionally and dump the data for backups, but we have not had any problems that we are aware of. Possibly it's been so out of sight, out of mind that we have neglected routine care and feeding?

Any thoughts or ideas appreciated.
TIA


Alex Crawford

Sep 21, 2017, 4:22:55 PM
to weathe...@gmail.com, CoreOS User
On 09/20, weathe...@gmail.com wrote:
> We are in the process of migrating from 2.3.2 to 3.0.17 when we noticed
> these warning messages.
>
> 2017-09-20 16:16:28.101633 W | etcdserver: failed to send out heartbeat on
> time (exceeded the 100ms timeout for 68.292879ms)
> 2017-09-20 16:16:28.101733 W | etcdserver: server is likely overloaded
> 2017-09-20 16:16:28.101744 W | etcdserver: failed to send out heartbeat on
> time (exceeded the 100ms timeout for 68.411161ms)
> 2017-09-20 16:16:28.101749 W | etcdserver: server is likely overloaded
> 2017-09-20 16:16:28.101757 W | etcdserver: failed to send out heartbeat on
> time (exceeded the 100ms timeout for 68.427665ms)
> 2017-09-20 16:16:28.101762 W | etcdserver: server is likely overloaded
> 2017-09-20 16:16:28.101768 W | etcdserver: failed to send out heartbeat on
> time (exceeded the 100ms timeout for 68.437652ms)
> 2017-09-20 16:16:28.101773 W | etcdserver: server is likely overloaded
>
>
>
> These messages only show up on the leader node and in each cluster we get 3
> - 10 times a day where for 10 - 80 seconds those warnings are logged.
>
> I'm curious to understand what sort of severity I should associate with
> these messages and how I can go about determining what the cause is (eg:
> cpu saturation, disk latency, volume of etcd operations, etc).
>
> For us etcd just works. We rev the version occasionally and dump the data
> for backups but we have not had any problems that we are aware of.
> Possibly it's been so out of mind we have neglected routine care and
> feeding?

It might be sufficient to increase the heartbeat interval. The etcd docs
have a section on tuning [1] which will help determine a reasonable
value.
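As a sketch, the values below are illustrative only (the tuning guide's rule of thumb is an election timeout of roughly ten times the heartbeat interval; defaults are 100ms and 1000ms):

```shell
# Illustrative values, not a recommendation: raise the heartbeat interval
# and election timeout together. Since etcd here runs in a static pod, these
# would go into the pod manifest's command/args (or the equivalent
# ETCD_HEARTBEAT_INTERVAL / ETCD_ELECTION_TIMEOUT environment variables).
etcd --heartbeat-interval=300 --election-timeout=3000
```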

-Alex

[1]: https://coreos.com/etcd/docs/latest/tuning.html

weathe...@gmail.com

Sep 21, 2017, 5:38:23 PM
to CoreOS User

> It might be sufficient to increase the heartbeat interval. The etcd docs
> have a section on tuning [1] which will help determine a reasonable
> value.


Alex,
Thanks for the info.  I was worried tuning that parameter would just be masking a problem.
I'll give it a go!
