Consul cluster in a weird state; can't figure out how to recover


Jay Christopherson
May 9, 2016, 1:07:46 PM
to Consul
I have three instances of Consul running and they've somehow ended up in a weird state after recovering. I'm not sure how to fix this, and I'd rather not wipe and restart if I can avoid it. Each instance thinks that another one has failed, but nothing has changed network-wise, so I can't figure out why they're complaining:


bash-4.3# consul members
Node           Address              Status  Type    Build  Protocol  DC
consul-prod    10.123.100.225:8301  alive   server  0.5.2  2         prod
consul-prod-2  10.123.100.200:8301  alive   server  0.5.2  2         prod
consul-prod-1  10.123.100.183:8301  failed  server  0.5.2  2         prod



bash-4.3# consul members
Node           Address              Status  Type    Build  Protocol  DC
consul-prod-2  10.123.100.200:8301  alive   server  0.5.2  2         prod
consul-prod    10.123.100.225:8301  alive   server  0.5.2  2         prod
consul-prod-1  10.123.100.183:8301  failed  server  0.5.2  2         prod



bash-4.3# consul members
Node           Address              Status  Type    Build  Protocol  DC
consul-prod    10.123.100.225:8301  failed  server  0.5.2  2         prod
consul-prod-1  10.123.100.183:8301  alive   server  0.5.2  2         prod
consul-prod-2  10.123.100.200:8301  alive   server  0.5.2  2         prod
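
One thing I realize: telnet only proves TCP, and serf gossip on 8301 uses UDP as well as TCP, so the flapping could be dropped UDP even with TCP fine. I put together a quick probe to run from each server against the other two (assumes netcat is installed; hosts are just my three servers):

```shell
# telnet to 8300 only exercises TCP; serf gossip on 8301 needs UDP too.
probe() {
  # probe <host>: report serf TCP and UDP reachability on port 8301
  nc -z -w2 "$1" 8301    && echo "$1 8301/tcp ok" || echo "$1 8301/tcp blocked?"
  nc -z -u -w2 "$1" 8301 && echo "$1 8301/udp ok" || echo "$1 8301/udp blocked?"
}
probe 10.123.100.200    # run from each server against the other two
probe 10.123.100.225
```

Caveat: a UDP probe can report "ok" even when packets are silently dropped, so checking firewall rules on each host is worth doing as well.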


On 183, I see this in the log:

2016/05/09 17:05:03 [INFO] serf: EventMemberJoin: consul-prod 10.123.100.225
2016/05/09 17:05:03 [INFO] consul: adding server consul-prod (Addr: 10.123.100.225:8300) (DC: prod)
2016/05/09 17:05:05 [ERR] http: Request /v1/session/create, error: rpc error: Check 'serfHealth' is in critical state
2016/05/09 17:05:05 [ERR] http: Request /v1/session/create, error: rpc error: Check 'serfHealth' is in critical state
2016/05/09 17:05:05 [ERR] http: Request /v1/session/create, error: rpc error: Check 'serfHealth' is in critical state
2016/05/09 17:05:05 [ERR] http: Request /v1/session/create, error: rpc error: Check 'serfHealth' is in critical state
2016/05/09 17:05:05 [ERR] http: Request /v1/session/create, error: rpc error: Check 'serfHealth' is in critical state
2016/05/09 17:05:05 [INFO] memberlist: Suspect consul-prod has failed, no acks received
2016/05/09 17:05:07 [INFO] memberlist: Suspect consul-prod has failed, no acks received
2016/05/09 17:05:09 [INFO] memberlist: Suspect consul-prod has failed, no acks received
2016/05/09 17:05:10 [ERR] http: Request /v1/session/create, error: rpc error: Check 'serfHealth' is in critical state
2016/05/09 17:05:10 [ERR] http: Request /v1/session/create, error: rpc error: Check 'serfHealth' is in critical state
2016/05/09 17:05:10 [ERR] http: Request /v1/session/create, error: rpc error: Check 'serfHealth' is in critical state
2016/05/09 17:05:10 [ERR] http: Request /v1/session/create, error: rpc error: Check 'serfHealth' is in critical state
2016/05/09 17:05:10 [ERR] http: Request /v1/session/create, error: rpc error: Check 'serfHealth' is in critical state
2016/05/09 17:05:10 [ERR] http: Request /v1/session/create, error: rpc error: Check 'serfHealth' is in critical state
2016/05/09 17:05:10 [ERR] http: Request /v1/session/create, error: rpc error: Check 'serfHealth' is in critical state
2016/05/09 17:05:10 [ERR] http: Request /v1/session/create, error: rpc error: Check 'serfHealth' is in critical state
2016/05/09 17:05:10 [ERR] http: Request /v1/session/create, error: rpc error: Check 'serfHealth' is in critical state
2016/05/09 17:05:10 [ERR] http: Request /v1/session/create, error: rpc error: Check 'serfHealth' is in critical state
2016/05/09 17:05:10 [INFO] memberlist: Marking consul-prod as failed, suspect timeout reached
2016/05/09 17:05:10 [INFO] serf: EventMemberFailed: consul-prod 10.123.100.225
2016/05/09 17:05:10 [INFO] consul: removing server consul-prod (Addr: 10.123.100.225:8300) (DC: prod)
2016/05/09 17:05:11 [ERR] http: Request /v1/kv/docker/network/v1.0/endpoint_count/4b4a7ef9ca7a1f04439f18f0ad41792669f05af3f0658b8ffeb8c0a4ba2a87ec/?index=1063117&wait=15000ms, error: No cluster leader
2016/05/09 17:05:11 [ERR] http: Request /v1/kv/docker/network/v1.0/endpoint_count/4b4a7ef9ca7a1f04439f18f0ad41792669f05af3f0658b8ffeb8c0a4ba2a87ec/?index=1063117&wait=15000ms, error: No cluster leader
2016/05/09 17:05:11 [ERR] http: Request /v1/kv/docker/nodes?index=1482195&recurse=&wait=15000ms, error: No cluster leader
2016/05/09 17:05:11 [ERR] http: Request /v1/kv/docker/nodes?index=1482195&recurse=&wait=15000ms, error: No cluster leader
2016/05/09 17:05:11 [ERR] http: Request /v1/kv/docker/network/v1.0/endpoint_count/4b4a7ef9ca7a1f04439f18f0ad41792669f05af3f0658b8ffeb8c0a4ba2a87ec/?index=1063117&wait=15000ms, error: No cluster leader
2016/05/09 17:05:11 [INFO] serf: EventMemberJoin: consul-prod 10.123.100.225
2016/05/09 17:05:11 [INFO] consul: adding server consul-prod (Addr: 10.123.100.225:8300) (DC: prod)
2016/05/09 17:05:11 [INFO] memberlist: Suspect consul-prod has failed, no acks received
2016/05/09 17:05:14 [INFO] memberlist: Suspect consul-prod has failed, no acks received
2016/05/09 17:05:15 [INFO] memberlist: Suspect consul-prod has failed, no acks received


On 225, I see this:

2016/05/09 17:04:56 [ERR] raft: Failed to heartbeat to 10.123.100.145:8300: dial tcp 10.123.100.145:8300: no route to host
2016/05/09 17:04:56 [ERR] raft: Failed to AppendEntries to 10.123.100.145:8300: dial tcp 10.123.100.145:8300: no route to host
2016/05/09 17:04:56 [WARN] memberlist: Refuting a suspect message (from: consul-prod-2)
2016/05/09 17:04:58 [WARN] memberlist: Refuting a suspect message (from: consul-prod-2)
2016/05/09 17:05:00 [WARN] memberlist: Refuting a suspect message (from: consul-prod-2)
2016/05/09 17:05:02 [WARN] memberlist: Refuting a suspect message (from: consul-prod-2)
2016/05/09 17:05:04 [WARN] memberlist: Refuting a suspect message (from: consul-prod-2)
2016/05/09 17:05:07 [WARN] memberlist: Refuting a suspect message (from: consul-prod-2)
2016/05/09 17:05:09 [ERR] raft: Failed to AppendEntries to 10.123.100.145:8300: dial tcp 10.123.100.145:8300: no route to host
2016/05/09 17:05:09 [ERR] raft: Failed to heartbeat to 10.123.100.145:8300: dial tcp 10.123.100.145:8300: no route to host
2016/05/09 17:05:09 [WARN] memberlist: Refuting a suspect message (from: consul-prod-2)
2016/05/09 17:05:11 [WARN] memberlist: Refuting a suspect message (from: consul-prod)
2016/05/09 17:05:11 [INFO] serf: EventMemberJoin: consul-prod-1 10.123.100.183
2016/05/09 17:05:11 [INFO] consul: adding server consul-prod-1 (Addr: 10.123.100.183:8300) (DC: prod)
2016/05/09 17:05:11 [INFO] consul: member 'consul-prod-1' joined, marking health alive
2016/05/09 17:05:18 [WARN] memberlist: Refuting a dead message (from: consul-prod-2)
2016/05/09 17:05:19 [INFO] memberlist: Marking consul-prod-1 as failed, suspect timeout reached
2016/05/09 17:05:19 [INFO] serf: EventMemberFailed: consul-prod-1 10.123.100.183
2016/05/09 17:05:19 [INFO] consul: removing server consul-prod-1 (Addr: 10.123.100.183:8300) (DC: prod)
2016/05/09 17:05:19 [INFO] consul: member 'consul-prod-1' failed, marking health critical
2016/05/09 17:05:22 [WARN] memberlist: Refuting a suspect message (from: consul-prod-2)
2016/05/09 17:05:22 [ERR] raft: Failed to heartbeat to 10.123.100.145:8300: dial tcp 10.123.100.145:8300: no route to host
2016/05/09 17:05:22 [ERR] raft: Failed to AppendEntries to 10.123.100.145:8300: dial tcp 10.123.100.145:8300: no route to host
2016/05/09 17:05:24 [WARN] memberlist: Refuting a suspect message (from: consul-prod-2)
2016/05/09 17:05:26 [WARN] memberlist: Refuting a suspect message (from: consul-prod-2)
2016/05/09 17:05:30 [WARN] memberlist: Refuting a suspect message (from: consul-prod-2)
2016/05/09 17:05:33 [WARN] memberlist: Refuting a suspect message (from: consul-prod-2)
2016/05/09 17:05:35 [WARN] memberlist: Refuting a suspect message (from: consul-prod-2)
2016/05/09 17:05:36 [ERR] raft: Failed to heartbeat to 10.123.100.145:8300: dial tcp 10.123.100.145:8300: no route to host
2016/05/09 17:05:36 [ERR] raft: Failed to AppendEntries to 10.123.100.145:8300: dial tcp 10.123.100.145:8300: no route to host
2016/05/09 17:05:38 [WARN] memberlist: Refuting a suspect message (from: consul-prod-2)
2016/05/09 17:05:41 [WARN] memberlist: Refuting a suspect message (from: consul-prod)
2016/05/09 17:05:41 [INFO] serf: EventMemberJoin: consul-prod-1 10.123.100.183
2016/05/09 17:05:41 [INFO] consul: adding server consul-prod-1 (Addr: 10.123.100.183:8300) (DC: prod)
2016/05/09 17:05:41 [INFO] consul: member 'consul-prod-1' joined, marking health alive
2016/05/09 17:05:44 [WARN] memberlist: Refuting a suspect message (from: consul-prod-1)
2016/05/09 17:05:48 [INFO] serf: EventMemberFailed: consul-prod-1 10.123.100.183
2016/05/09 17:05:48 [INFO] consul: removing server consul-prod-1 (Addr: 10.123.100.183:8300) (DC: prod)
2016/05/09 17:05:48 [INFO] consul: member 'consul-prod-1' failed, marking health critical
2016/05/09 17:05:49 [ERR] raft: Failed to heartbeat to 10.123.100.145:8300: dial tcp 10.123.100.145:8300: no route to host
2016/05/09 17:05:49 [ERR] raft: Failed to AppendEntries to 10.123.100.145:8300: dial tcp 10.123.100.145:8300: no route to host


10.123.100.145 is an old node that no longer exists, and I don't know how to get rid of it. It doesn't show up in the UI, and I don't see any CLI command for deleting non-existent nodes. I've checked all the networking, and I can telnet to port 8300 on each host from each of the other hosts without issue.
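
Digging through the docs, it looks like a dead server that is already gone from serf but still lingers in the raft peer set (which would explain the "Failed to heartbeat to 10.123.100.145" lines) has to be removed via the outage-recovery procedure: stop all the servers, rewrite `raft/peers.json` under each server's data dir to list only the live peers, then restart. If I have this right, something like:

```shell
# Stop the consul agent on all three servers first. Then, on each
# server, overwrite raft/peers.json in the data dir so it lists only
# the live peers. In 0.5.x the file is a JSON array of "ip:port" raft
# addresses. "peers.json" below stands in for <data-dir>/raft/peers.json.
cat > peers.json <<'EOF'
["10.123.100.225:8300","10.123.100.200:8300","10.123.100.183:8300"]
EOF
# Restart the servers afterwards; 10.123.100.145 should no longer be dialed.
```

The data-dir location depends on the `-data-dir` flag each agent was started with, so the path here is a placeholder, not gospel.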

So I'm not certain where to go from here; any assistance would be appreciated.

Jay Christopherson
May 9, 2016, 1:29:46 PM
to Consul
I ended up stopping the cluster, removing 10.123.100.183 from peers.json, and then restarting the cluster and bringing in a different host as the third member. That seemed to work, but I would prefer to bring 10.123.100.183 back as a member. I'm a little leery of trying it, though, since I don't know why it had trouble staying a member in the first place.
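
If I do work up the nerve to bring 10.123.100.183 back, my plan is to wipe its old state first so it rejoins as a fresh server instead of replaying stale raft/serf data. Roughly this (the data-dir path and join targets are guesses at my setup, so review before running):

```shell
# Generates a rejoin script for consul-prod-1; review before running.
# /var/consul and the -retry-join targets are assumptions.
cat > rejoin-consul-prod-1.sh <<'EOF'
#!/bin/sh
# Run on 10.123.100.183 with the consul agent stopped.
rm -rf /var/consul/raft /var/consul/serf    # drop stale cluster state
exec consul agent -server -data-dir=/var/consul \
  -retry-join 10.123.100.225 -retry-join 10.123.100.200
EOF
chmod +x rejoin-consul-prod-1.sh
```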