etcdserver: request timed out and server overload messages


Yaroslav Molochko

Apr 8, 2016, 10:42:08 AM
to etcd-dev
I have etcd 2.3.1 in a cluster of 3 servers with enough RAM and CPU. etcd is used by Kubernetes 1.2.1 (at the moment). I've discovered the following strange messages:

Apr 08 10:53:10 ip-172-16-236-25 etcd[28754]: c095da2c1856e37e is starting a new election at term 21943
Apr 08 10:53:10 ip-172-16-236-25 etcd[28754]: c095da2c1856e37e became candidate at term 21944
Apr 08 10:53:10 ip-172-16-236-25 etcd[28754]: c095da2c1856e37e received vote from c095da2c1856e37e at term 21944
Apr 08 10:53:10 ip-172-16-236-25 etcd[28754]: c095da2c1856e37e [logterm: 21943, index: 9486432] sent vote request to 499cd6aebfad0c5d at term 21944
Apr 08 10:53:10 ip-172-16-236-25 etcd[28754]: c095da2c1856e37e [logterm: 21943, index: 9486432] sent vote request to 65b064a223e15a04 at term 21944
Apr 08 10:53:10 ip-172-16-236-25 etcd[28754]: raft.node: c095da2c1856e37e lost leader 65b064a223e15a04 at term 21944
Apr 08 10:53:10 ip-172-16-236-25 etcd[28754]: c095da2c1856e37e received vote from 499cd6aebfad0c5d at term 21944
Apr 08 10:53:10 ip-172-16-236-25 etcd[28754]: c095da2c1856e37e [quorum:2] has received 2 votes and 0 vote rejections
Apr 08 10:53:10 ip-172-16-236-25 etcd[28754]: c095da2c1856e37e became leader at term 21944
Apr 08 10:53:10 ip-172-16-236-25 etcd[28754]: raft.node: c095da2c1856e37e elected leader c095da2c1856e37e at term 21944
Apr 08 10:53:10 ip-172-16-236-25 etcd[28754]: c095da2c1856e37e [term: 21944] ignored a MsgApp message with lower term from 65b064a223e15a04 [term: 21943]
Apr 08 10:53:10 ip-172-16-236-25 etcd[28754]: c095da2c1856e37e [term: 21944] ignored a MsgApp message with lower term from 65b064a223e15a04 [term: 21943]
Apr 08 10:53:10 ip-172-16-236-25 etcd[28754]: c095da2c1856e37e [term: 21944] ignored a MsgApp message with lower term from 65b064a223e15a04 [term: 21943]
Apr 08 10:53:10 ip-172-16-236-25 etcd[28754]: c095da2c1856e37e [term: 21944] ignored a MsgApp message with lower term from 65b064a223e15a04 [term: 21943]
Apr 08 10:53:10 ip-172-16-236-25 etcd[28754]: c095da2c1856e37e [term: 21944] ignored a MsgApp message with lower term from 65b064a223e15a04 [term: 21943]
Apr 08 10:53:11 ip-172-16-236-25 etcd[28754]: failed to send out heartbeat on time (deadline exceeded for 115.39771ms)
Apr 08 10:53:11 ip-172-16-236-25 etcd[28754]: server is likely overloaded
Apr 08 10:53:11 ip-172-16-236-25 etcd[28754]: failed to send out heartbeat on time (deadline exceeded for 115.48784ms)
Apr 08 10:53:11 ip-172-16-236-25 etcd[28754]: server is likely overloaded
Apr 08 10:53:12 ip-172-16-236-25 etcd[28754]: failed to send out heartbeat on time (deadline exceeded for 251.318547ms)
Apr 08 10:53:12 ip-172-16-236-25 etcd[28754]: server is likely overloaded
Apr 08 10:53:12 ip-172-16-236-25 etcd[28754]: failed to send out heartbeat on time (deadline exceeded for 251.393104ms)
Apr 08 10:53:12 ip-172-16-236-25 etcd[28754]: server is likely overloaded
Apr 08 10:53:12 ip-172-16-236-25 etcd[28754]: failed to send out heartbeat on time (deadline exceeded for 82.422922ms)
Apr 08 10:53:12 ip-172-16-236-25 etcd[28754]: server is likely overloaded
Apr 08 10:53:12 ip-172-16-236-25 etcd[28754]: failed to send out heartbeat on time (deadline exceeded for 82.483315ms)
Apr 08 10:53:12 ip-172-16-236-25 etcd[28754]: server is likely overloaded
Apr 08 10:53:15 ip-172-16-236-25 etcd[28754]: failed to send out heartbeat on time (deadline exceeded for 1.631112849s)
Apr 08 10:53:15 ip-172-16-236-25 etcd[28754]: server is likely overloaded
Apr 08 10:53:15 ip-172-16-236-25 etcd[28754]: failed to send out heartbeat on time (deadline exceeded for 1.631203813s)
Apr 08 10:53:15 ip-172-16-236-25 etcd[28754]: server is likely overloaded
Apr 08 10:53:16 ip-172-16-236-25 etcd[28754]: failed to send out heartbeat on time (deadline exceeded for 192.682138ms)
Apr 08 10:53:16 ip-172-16-236-25 etcd[28754]: server is likely overloaded
Apr 08 10:53:16 ip-172-16-236-25 etcd[28754]: failed to send out heartbeat on time (deadline exceeded for 192.754833ms)
Apr 08 10:53:16 ip-172-16-236-25 etcd[28754]: server is likely overloaded
Apr 08 10:53:21 ip-172-16-236-25 etcd[28754]: failed to send out heartbeat on time (deadline exceeded for 391.778675ms)
Apr 08 10:53:21 ip-172-16-236-25 etcd[28754]: server is likely overloaded
Apr 08 10:53:21 ip-172-16-236-25 etcd[28754]: failed to send out heartbeat on time (deadline exceeded for 391.848191ms)
Apr 08 10:53:21 ip-172-16-236-25 etcd[28754]: server is likely overloaded
Apr 08 10:53:21 ip-172-16-236-25 etcd[28754]: failed to send out heartbeat on time (deadline exceeded for 12.510431ms)
Apr 08 10:53:21 ip-172-16-236-25 etcd[28754]: server is likely overloaded
Apr 08 10:53:21 ip-172-16-236-25 etcd[28754]: failed to send out heartbeat on time (deadline exceeded for 12.580452ms)
Apr 08 10:53:21 ip-172-16-236-25 etcd[28754]: server is likely overloaded
Apr 08 10:54:30 ip-172-16-236-25 etcd[28754]: start to snapshot (applied: 9486922, lastsnap: 9481921)
Apr 08 10:54:30 ip-172-16-236-25 etcd[28754]: saved snapshot at index 9486922
Apr 08 10:54:30 ip-172-16-236-25 etcd[28754]: compacted raft log at 9481922
Apr 08 10:54:52 ip-172-16-236-25 etcd[28754]: purged file /srv/data/etcd/etcd25/member/snap/00000000000055b6-000000000090609d.snap successfully
Apr 08 10:58:50 ip-172-16-236-25 etcd[28754]: failed to send out heartbeat on time (deadline exceeded for 836.788045ms)
Apr 08 10:58:50 ip-172-16-236-25 etcd[28754]: server is likely overloaded
Apr 08 10:58:50 ip-172-16-236-25 etcd[28754]: failed to send out heartbeat on time (deadline exceeded for 836.855399ms)
Apr 08 10:58:50 ip-172-16-236-25 etcd[28754]: server is likely overloaded
Apr 08 10:58:51 ip-172-16-236-25 etcd[28754]: failed to send out heartbeat on time (deadline exceeded for 275.352301ms)
Apr 08 10:58:51 ip-172-16-236-25 etcd[28754]: server is likely overloaded
Apr 08 10:58:51 ip-172-16-236-25 etcd[28754]: failed to send out heartbeat on time (deadline exceeded for 275.401125ms)
Apr 08 10:58:51 ip-172-16-236-25 etcd[28754]: server is likely overloaded
Apr 08 10:58:53 ip-172-16-236-25 etcd[28754]: c095da2c1856e37e [term: 21944] received a MsgVote message with higher term from 65b064a223e15a04 [term: 21945]
Apr 08 10:58:53 ip-172-16-236-25 etcd[28754]: c095da2c1856e37e became follower at term 21945
Apr 08 10:58:53 ip-172-16-236-25 etcd[28754]: c095da2c1856e37e [logterm: 21944, index: 9488545, vote: 0] rejected vote from 65b064a223e15a04 [logterm: 21944, index: 9488539] at term
Apr 08 10:58:53 ip-172-16-236-25 etcd[28754]: raft.node: c095da2c1856e37e lost leader c095da2c1856e37e at term 21945
Apr 08 10:58:53 ip-172-16-236-25 etcd[28754]: c095da2c1856e37e [logterm: 21944, index: 9488545, vote: 0] rejected vote from 499cd6aebfad0c5d [logterm: 21944, index: 9488539] at term
Apr 08 10:58:54 ip-172-16-236-25 etcd[28754]: c095da2c1856e37e [term: 21945] received a MsgVote message with higher term from 499cd6aebfad0c5d [term: 21946]
Apr 08 10:58:54 ip-172-16-236-25 etcd[28754]: c095da2c1856e37e became follower at term 21946
Apr 08 10:58:54 ip-172-16-236-25 etcd[28754]: c095da2c1856e37e [logterm: 21944, index: 9488545, vote: 0] rejected vote from 499cd6aebfad0c5d [logterm: 21944, index: 9488539] at term
Apr 08 10:58:54 ip-172-16-236-25 etcd[28754]: found conflict at index 9488540 [existing term: 21944, conflicting term: 21946]
Apr 08 10:58:54 ip-172-16-236-25 etcd[28754]: replace the unstable entries from index 9488540
Apr 08 10:58:54 ip-172-16-236-25 etcd[28754]: raft.node: c095da2c1856e37e elected leader 499cd6aebfad0c5d at term 21946
Apr 08 10:58:58 ip-172-16-236-25 etcd[28754]: got unexpected response error (etcdserver: request timed out)
Apr 08 10:59:00 ip-172-16-236-25 etcd[28754]: etcdserver: request timed out, possibly due to previous leader failure
Apr 08 10:59:00 ip-172-16-236-25 etcd[28754]: got unexpected response error (etcdserver: request timed out) [merged 1 repeated lines in 1.96s]
Apr 08 11:06:30 ip-172-16-236-25 etcd[28754]: segmented wal file /srv/data/etcd/etcd25/member/wal/0000000000000121-000000000090d40c.wal is created
Apr 08 11:06:53 ip-172-16-236-25 etcd[28754]: purged file /srv/data/etcd/etcd25/member/wal/000000000000011c-00000000008e7a26.wal successfully
Apr 08 11:07:38 ip-172-16-236-25 etcd[28754]: start to snapshot (applied: 9491923, lastsnap: 9486922)
Apr 08 11:07:38 ip-172-16-236-25 etcd[28754]: saved snapshot at index 9491923
Apr 08 11:07:38 ip-172-16-236-25 etcd[28754]: compacted raft log at 9486923
Apr 08 11:07:52 ip-172-16-236-25 etcd[28754]: purged file /srv/data/etcd/etcd25/member/snap/00000000000055b6-0000000000907426.snap successfully

I see similar log messages all day long. It is time to move this configuration to production, but these timeouts bother me. What can I tune to make this more stable? Or is this not a problem at all, and can I simply forget about it? My etcd nodes run under systemd with the following command:
/usr/local/bin/etcd --addr 172.16.236.25:4001 --bind-addr 172.16.236.25:4001 --initial-cluster-state new --initial-cluster-token mycluster --listen-peer-urls http://172.16.236.25:7001 --initial-advertise-peer-urls http://172.16.236.25:7001 --snapshot-count=5000 --discovery-srv k8s.otlabs.fr

What tuning do you suggest to fix the issue?
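From the etcd tuning documentation I understand the main knobs are --heartbeat-interval and --election-timeout; would raising those be the right direction? For example (the values below are only an illustration, not measured against our actual round-trip times):

/usr/local/bin/etcd --heartbeat-interval=300 --election-timeout=3000 \
  ...rest of the flags as above...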




memory.png
cpu.png

Brandon Philips

Apr 8, 2016, 1:17:16 PM
to Yaroslav Molochko, etcd-dev
Is the CPU load Y-axis out of 1.0? That is, does 1.0 mean 100%?

What sort of network are these machines on? What is the latency like?

Xiang Li

Apr 8, 2016, 1:17:56 PM
to Yaroslav Molochko, etcd-dev
It is likely caused by CPU. Can you monitor the etcd process's CPU usage? The raft pipeline is serialized, so it cannot utilize all of your cores even if you have more than one.


Yaroslav Molochko

Apr 8, 2016, 2:14:30 PM
to etcd-dev, ono...@gmail.com
Yes, it was > 100%, but we have 4 cores on that server, so it should be nothing. Most importantly, we did not get warnings about an overloaded server during that time.
The nodes are connected with Open vSwitch, but I doubt the network is to blame, as the issues occur at different times on different nodes, so it is something local to each node.

Yaroslav Molochko

Apr 8, 2016, 2:15:40 PM
to etcd-dev, ono...@gmail.com
Thanks for the hint! I've put the etcd process under monitoring with pidstat. Hopefully I will have some graphs for you by tomorrow.
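For the record, the command I am using is roughly this (pidstat is from the sysstat package; -u reports CPU usage, -h prints each sample on one line, which is easier to plot):

pidstat -u -h -p $(pgrep -x etcd) 1 > /tmp/etcd_cpu.log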

Xiang Li

Apr 8, 2016, 2:18:20 PM
to Yaroslav Molochko, etcd-dev
One more thing: can you increase the resolution of the data? Right now it is a 5-minute average, which loses a lot of detail.


Yaroslav Molochko

Apr 8, 2016, 2:26:59 PM
to Xiang Li, etcd-dev
Sure, I'm gathering data every second.

Yaroslav Molochko

Apr 8, 2016, 4:26:29 PM
to Xiang Li, etcd-dev
Actually, I've been doing some preparation: I was watching the CPU on one screen while running tail -f on the logs on another, and I got:

Apr 08 20:19:41 ip-172-16-236-25 etcd[28754]: failed to send out heartbeat on time (deadline exceeded for 22.148662ms)
Apr 08 20:19:41 ip-172-16-236-25 etcd[28754]: server is likely overloaded

But etcd was at around 4% CPU, with an average of 2% and a maximum of roughly 10%. I have not done the plotting yet, but it seems CPU load does not really matter in this case.
What else can I do to debug this?

Xiang Li

Apr 8, 2016, 4:28:42 PM
to Yaroslav Molochko, etcd-dev
What does your environment look like?

Yaroslav Molochko

Apr 8, 2016, 4:36:52 PM
to Xiang Li, etcd-dev
I have:
Ubuntu Xenial with the 4.4.0-15-generic kernel.
The etcd nodes also run the Kubernetes master processes: kube-apiserver, kube-proxy, kubelet, kube-scheduler, and kube-controller-manager.
These etcd servers also have salt-master and salt-minion on them.
We have etcd version 2.3.1 running on 3 nodes, with autodiscovery based on DNS SRV records. The snapshot count is set to 5000. Ping between nodes is:
rtt min/avg/max/mdev = 0.102/0.306/0.734/0.229 ms

The systemd service for etcd looks like:
[Unit]
Description=etcd key-value store
Documentation=https://github.com/coreos/etcd

[Service]
User=etcd
Type=notify
Environment=ETCD_DATA_DIR=/srv/data/etcd/etcd25
Environment=ETCD_NAME=etcd25
ExecStartPre=/bin/mkdir -p $ETCD_DATA_DIR
ExecStartPre=/bin/chown etcd -R $ETCD_DATA_DIR
ExecStart=/usr/local/bin/etcd \
--addr 172.16.236.25:4001 \
--bind-addr 172.16.236.25:4001 \
--initial-cluster-state new \
--initial-cluster-token mycluster \
--listen-peer-urls http://172.16.236.25:7001 \
--initial-advertise-peer-urls http://172.16.236.25:7001 \
--snapshot-count=5000 \
--discovery-srv 'k8s.otlabs.fr'
Restart=always
RestartSec=10s
LimitNOFILE=40000

[Install]
WantedBy=multi-user.target

What else might you be interested in?

Xiang Li

Apr 8, 2016, 4:59:14 PM
to Yaroslav Molochko, etcd-dev
Can you get second-level CPU monitoring for both the entire machine and the etcd process? Can you also provide the etcd log for a roughly 10-minute window around when the overload happened? Thanks!
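Something along these lines should do (just an illustration; mpstat and pidstat both come with the sysstat package):

mpstat -P ALL 1 > machine_cpu.log &                    # whole machine, per core, 1-second samples
pidstat -u -h -p $(pgrep -x etcd) 1 > etcd_cpu.log &   # etcd process only, 1-second samples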

Yaroslav Molochko

Apr 9, 2016, 5:19:57 PM
to Xiang Li, etcd-dev
So, I've visualized it (there were too many processes; it was impossible to see anything without being able to toggle individual lines on demand), and it seems the problems occur when salt-minion is running at the same time etcd is trying to do something and they are on the same core. I could not see anything in the etcd log at all, but I am attaching it as well, as you requested.

Thanks for the hints. I will try to put salt-minion and etcd on different dedicated cores and see what the outcome is; roughly what I have in mind is sketched below.
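For the record, the pinning would be done with systemd drop-ins (CPUAffinity= is the standard systemd directive; the core numbers are just an example for our 4-core machines):

# /etc/systemd/system/etcd.service.d/affinity.conf
[Service]
CPUAffinity=0

# /etc/systemd/system/salt-minion.service.d/affinity.conf
[Service]
CPUAffinity=1 2 3

followed by systemctl daemon-reload and a restart of both services.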
etcd_server_overloaded.txt
23_48.html
23_40.html
22_35.html

Xiang Li

Apr 9, 2016, 7:04:43 PM
to Yaroslav Molochko, etcd-dev
Thanks for the data you provided. Let us know if the problem still exists after you move salt-minion to another core.

Brandon Philips

Apr 10, 2016, 9:44:49 PM
to Xiang Li, Yaroslav Molochko, etcd-dev
Interesting. It is probably worth adding documentation on this if it ends up being the case.

Yaroslav Molochko

Apr 11, 2016, 1:40:32 PM
to Xiang Li, etcd-dev
Apparently my wild guess about etcd overlapping with salt-minion was incorrect; it did not help :(