Error recovering a failed machine in an etcd cluster: raft: tocommit(113) is out of range [lastInde

Vicky Singh

May 10, 2018, 5:46:53 PM
to CoreOS User
I have a 5-machine etcd cluster running version 3.3.5.

I lost the disk of one of the machines. I executed the commands listed below to get a snapshot from another machine and restore the state. However, during startup I get the following error. This can be simulated by deleting the infra1.etcd directory.

2018-05-10 21:37:36.372723 I | rafthttp: established a TCP streaming connection with peer f94df198dfd1fae9 (stream Message writer)
2018-05-10 21:37:36.375263 I | rafthttp: peer e25bd2572f40d862 became active
2018-05-10 21:37:36.375272 I | rafthttp: established a TCP streaming connection with peer e25bd2572f40d862 (stream Message writer)
2018-05-10 21:37:36.376419 I | rafthttp: established a TCP streaming connection with peer f94df198dfd1fae9 (stream MsgApp v2 writer)
2018-05-10 21:37:36.376708 I | rafthttp: established a TCP streaming connection with peer e25bd2572f40d862 (stream MsgApp v2 writer)
2018-05-10 21:37:36.379499 I | rafthttp: peer 5085423e29a03b70 became active
2018-05-10 21:37:36.379508 I | rafthttp: established a TCP streaming connection with peer 5085423e29a03b70 (stream MsgApp v2 writer)
2018-05-10 21:37:36.379554 I | rafthttp: established a TCP streaming connection with peer 5085423e29a03b70 (stream Message writer)
2018-05-10 21:37:36.384789 I | rafthttp: established a TCP streaming connection with peer 5085423e29a03b70 (stream MsgApp v2 reader)
2018-05-10 21:37:36.384837 I | rafthttp: established a TCP streaming connection with peer 5085423e29a03b70 (stream Message reader)
2018-05-10 21:37:36.401149 I | etcdserver: ef4b0eeaaa716a7 initialzed peer connection; fast-forwarding 8 ticks (election ticks 10) with 3 active peer(s)
2018-05-10 21:37:36.409883 I | raft: ef4b0eeaaa716a7 [term: 1] received a MsgHeartbeat message with higher term from e25bd2572f40d862 [term: 23]
2018-05-10 21:37:36.409895 I | raft: ef4b0eeaaa716a7 became follower at term 23
2018-05-10 21:37:36.409905 C | raft: tocommit(113) is out of range [lastIndex(5)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(113) is out of range [lastIndex(5)]. Was the raft log corrupted, truncated, or lost?

goroutine 103 [running]:
/tmp/etcd-release-3.3.5/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x16d
/tmp/etcd-release-3.3.5/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/log.go:191 +0x15c
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*raft).handleHeartbeat(0xc420226100, 0x8, 0xef4b0eeaaa716a7, 0xe25bd2572f40d862, 0x17, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/tmp/etcd-release-3.3.5/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:1194 +0x54
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.stepFollower(0xc420226100, 0x8, 0xef4b0eeaaa716a7, 0xe25bd2572f40d862, 0x17, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/tmp/etcd-release-3.3.5/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:1140 +0x439
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*raft).Step(0xc420226100, 0x8, 0xef4b0eeaaa716a7, 0xe25bd2572f40d862, 0x17, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/tmp/etcd-release-3.3.5/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:868 +0x1465
/tmp/etcd-release-3.3.5/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/node.go:323 +0x113e
/tmp/etcd-release-3.3.5/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/node.go:223 +0x321
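
For reading the panic line: going by the stack trace (handleHeartbeat calling into raft/log.go), tocommit appears to be the commit index carried by the leader's heartbeat, and lastIndex the last entry in the local raft log of the restored member. The sketch below pulls both numbers out of the message and prints the gap. The function name raft_gap is hypothetical, used only for illustration, and is not part of etcd or etcdctl.

```shell
#!/bin/sh
# Hypothetical helper (not part of etcd): extract the two indexes from a
# raft "tocommit(...) is out of range [lastIndex(...)]" message.
raft_gap() {
  line=$1
  # tocommit(N): commit index the follower was asked to advance to
  tocommit=$(printf '%s\n' "$line" | sed -n 's/.*tocommit(\([0-9][0-9]*\)).*/\1/p')
  # lastIndex(M): last entry present in the local raft log
  last=$(printf '%s\n' "$line" | sed -n 's/.*lastIndex(\([0-9][0-9]*\)).*/\1/p')
  if [ -z "$tocommit" ] || [ -z "$last" ]; then
    echo "no raft index information found" >&2
    return 1
  fi
  # missing = number of entries the local log is short of the commit index
  echo "tocommit=$tocommit lastIndex=$last missing=$((tocommit - last))"
}

raft_gap 'panic: tocommit(113) is out of range [lastIndex(5)]. Was the raft log corrupted, truncated, or lost?'
```

For the panic above this reports a gap of 108 entries between what the cluster considers committed and what the restored infra1 log contains.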

1. Command to get the snapshot: ETCDCTL_API=3 ./etcdctl --endpoints 10.0.0.1:2379 snapshot save snapshot.db

2. Command to restore the snapshot: sudo ETCDCTL_API=3 ./etcdctl  snapshot --data-dir ./infra1.etcd restore ~/snapshot.db  --name infra1 --initial-cluster infra0=http://10.0.0.1:2380,infra1=http://10.0.0.2:2380,infra2=http://10.0.0.3:2380,infra3=http://10.0.0.4:2380,infra4=http://10.0.0.5:2380 --initial-cluster-token etcd-cluster-1 --initial-advertise-peer-urls http://10.0.0.2:2380

3. Command to start the service on the restored machine: etcd --name infra1 --initial-advertise-peer-urls http://10.0.0.2:2380 \
  --listen-peer-urls http://10.0.0.2:2380 \
  --advertise-client-urls http://10.0.0.2:2379 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster-state existing

4. Commands to start the service on each machine:

etcd --name infra0 --initial-advertise-peer-urls http://10.0.0.1:2380 \
  --listen-peer-urls http://10.0.0.1:2380 \
  --advertise-client-urls http://10.0.0.1:2379 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster-state new

etcd --name infra1 --initial-advertise-peer-urls http://10.0.0.2:2380 \
  --listen-peer-urls http://10.0.0.2:2380 \
  --advertise-client-urls http://10.0.0.2:2379 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster-state new

./etcd --name infra2 --initial-advertise-peer-urls http://10.0.0.3:2380 \
  --listen-peer-urls http://10.0.0.3:2380 \
  --advertise-client-urls http://10.0.0.3:2379 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster-state new

./etcd --name infra3 --initial-advertise-peer-urls http://10.0.0.4:2380 \
  --listen-peer-urls http://10.0.0.4:2380 \
  --advertise-client-urls http://10.0.0.4:2379 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster-state new

./etcd --name infra4 --initial-advertise-peer-urls http://10.0.0.5:2380 \
  --listen-peer-urls http://10.0.0.5:2380 \
  --advertise-client-urls http://10.0.0.5:2379 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster-state new


Vicky