On 09/20, weathe...@gmail.com wrote:
> We are in the process of migrating from 2.3.2 to 3.0.17 when we noticed
> these warning messages.
>
> 2017-09-20 16:16:28.101633 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 68.292879ms)
> 2017-09-20 16:16:28.101733 W | etcdserver: server is likely overloaded
> 2017-09-20 16:16:28.101744 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 68.411161ms)
> 2017-09-20 16:16:28.101749 W | etcdserver: server is likely overloaded
> 2017-09-20 16:16:28.101757 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 68.427665ms)
> 2017-09-20 16:16:28.101762 W | etcdserver: server is likely overloaded
> 2017-09-20 16:16:28.101768 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 68.437652ms)
> 2017-09-20 16:16:28.101773 W | etcdserver: server is likely overloaded
>
> These messages only show up on the leader node. In each cluster, 3 to 10
> times a day, the warnings are logged for 10 to 80 seconds at a stretch.
>
> I'm curious to understand what sort of severity I should associate with
> these messages and how I can go about determining what the cause is (e.g.
> CPU saturation, disk latency, volume of etcd operations, etc.).
>
> For us etcd just works. We rev the version occasionally and dump the data
> for backups but we have not had any problems that we are aware of.
> Possibly it's been so out of mind we have neglected routine care and
> feeding?
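
To judge severity, it may help to first measure how often and how badly the
heartbeats are late. Below is a rough Python sketch, not anything that ships
with etcd: the function names parse_warnings and group_bursts are made up for
illustration, the 5-second gap that separates two bursts is an arbitrary
choice, and the regex assumes the log format matches the excerpt you posted,
with each warning on a single line.

```python
import re
from datetime import datetime, timedelta

# Assumed format, taken from the excerpt in this thread:
# "<date> <time> W | etcdserver: failed to send out heartbeat on time
#  (exceeded the 100ms timeout for 68.292879ms)" on one line.
WARNING = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{1,6}) W \| etcdserver: "
    r"failed to send out heartbeat on time "
    r"\(exceeded the \d+ms timeout for (?P<value>[\d.]+)(?P<unit>µs|ms|s)\)"
)

# Only these three duration units are handled (an assumption of this sketch).
TO_MS = {"µs": 0.001, "ms": 1.0, "s": 1000.0}


def parse_warnings(lines):
    """Return (timestamp, overshoot in ms) for every late-heartbeat warning."""
    events = []
    for line in lines:
        m = WARNING.match(line.strip())
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            overshoot_ms = float(m.group("value")) * TO_MS[m.group("unit")]
            events.append((ts, overshoot_ms))
    return events


def group_bursts(events, max_gap=timedelta(seconds=5)):
    """Merge warnings that are at most max_gap apart into one burst."""
    bursts = []
    for ts, overshoot_ms in sorted(events):
        if bursts and ts - bursts[-1]["end"] <= max_gap:
            burst = bursts[-1]
            burst["end"] = ts
            burst["count"] += 1
            burst["max_overshoot_ms"] = max(burst["max_overshoot_ms"], overshoot_ms)
        else:
            bursts.append(
                {"start": ts, "end": ts, "count": 1, "max_overshoot_ms": overshoot_ms}
            )
    return bursts
```

Running group_bursts(parse_warnings(open(path))) over the leader's log, where
path is wherever your etcd log ends up, gives one entry per burst with its
start, end, warning count, and worst overshoot. Lining those windows up
against CPU, disk, and request-rate graphs for the same host should narrow
down which of the causes you listed is in play.
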
It might be sufficient to increase the heartbeat interval. The etcd docs