Validating recovery for Consul Replicate run by consul lock in a docker container


Joe N.

Apr 13, 2017, 3:18:35 PM
to Consul
Hi,

I'm quite new to Consul and Docker (~3 months experience), so please read the following with that in mind.

I'm trying to determine the best way to test recovery for Consul Replicate when the Consul agent that holds the lock goes down.

This is how I've set up my local test environment:
  • Two clusters (DC1 and DC2) with three nodes (VMs) each running a Consul Docker container
  • DC2 runs consul-replicate pointed at DC1, using "consul lock" to ensure only one instance of replication is running in the DC2 cluster. See the startup command below:
/bin/consul lock -verbose locks/consul-replicate /bin/consul-replicate -config /consul/config/consul-replicate-config.hcl
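The wrap-a-child pattern this relies on — acquire a lock, run the replicator as a child, release the lock when the child exits — can be sketched locally. The mkdir-based lock below is only an illustrative stand-in for Consul's KV lock (mkdir is atomic, so at most one wrapper holds it); all paths and names here are made up for the sketch, not part of Consul:

```shell
#!/bin/sh
# Stand-in for the "consul lock <prefix> <child>" pattern:
# acquire an exclusive lock, run a child job, release the lock on exit.
# mkdir is atomic, so only one wrapper at a time can hold the lock.

LOCKDIR="${TMPDIR:-/tmp}/consul-replicate-demo.lock"

run_with_lock() {
  # "$@" is the child command; in the real setup it would be:
  #   consul-replicate -config /consul/config/consul-replicate-config.hcl
  if mkdir "$LOCKDIR" 2>/dev/null; then
    echo "lock acquired, starting child"
    "$@"
    status=$?
    rmdir "$LOCKDIR"   # release, mirroring consul lock's cleanup path
    echo "child exited with status $status"
    return "$status"
  else
    echo "lock held elsewhere, standing by"
    return 1
  fi
}
```

A second wrapper started while the lock directory exists takes the "standing by" branch, which is the local analogue of the other two DC2 nodes blocking on lock acquisition.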

My first test was to "kill -9" the Consul agent on DC2 N1. This test passed: N1 released the lock, and another node in DC2 took over the replication job.

Now this is where I need your help ...

My second test was to kill the N1 Docker container (using the "docker stop" command); however, in this case I got the following error on N1:

Received interrupt, cleaning up...
2017/04/12 21:19:29.548594 [INFO] (runner) stopping
2017/04/12 21:19:29.548604 [DEBUG] (watcher) stopping all views
Shutdown triggered, killing child
Terminating child pid 61
Error running handler: exit status 12
exit status 12
Child terminated
Lock release failed: failed to release lock: Unexpected response code: 500 (leadership lost while committing log)

This causes the "consul lock" processes on the other two nodes to exit with the following error message:

Attempting lock acquisition
Lock acquisition failed: failed to read lock: Unexpected response code: 500

Why does this happen, and how can we protect against losing the consul lock and, with it, consul-replicate?

Any thoughts would be much appreciated.

Joe

Ray

May 3, 2017, 4:44:45 PM
to Consul
We are also experiencing a similar situation. Does anyone have any ideas?
Thanks,

James Phillips

May 9, 2017, 7:12:06 PM
to consu...@googlegroups.com
> Lock release failed: failed to release lock: Unexpected response code: 500 (leadership lost while committing log)

Can you tell from the logs what's going on with the Consul server when
this happens? Stopping consul-replicate shouldn't be causing a loss of
leadership in any way.

Joe N.

May 12, 2017, 7:23:25 AM
to Consul
Hi James, thanks for your reply. Just to clarify, we are not having an issue with loss of Consul leadership.

A few notes: 
  1. The issue described in the original post only happens when consul-replicate is running on the elected Consul leader. In other words, when consul-replicate is running on a node that is NOT the Consul leader, the consul-replicate job DOES transfer to another node when that node's container is stopped.
  2. We are using Consul 0.7.5
  3. The times on the three nodes are off from each other by about 6-10 seconds.
A bit more info that may help clarify things:

DC2 Cluster Info

Three Nodes:

  • dc2consul1 (192.168.99.103)
  • dc2consul2 (192.168.99.104)
  • dc2consul3 (192.168.99.105)

DC2 Cluster Initial State

  • Running Nodes: dc2consul1 (192.168.99.103), dc2consul2 (192.168.99.104), dc2consul3 (192.168.99.105)
  • Elected Leader: dc2consul1
  • Nodes running consul agent: dc2consul1 (192.168.99.103), dc2consul2 (192.168.99.104), dc2consul3 (192.168.99.105)
  • Nodes running consul lock: dc2consul1 (192.168.99.103), dc2consul2 (192.168.99.104), dc2consul3 (192.168.99.105)
  • Node running consul-replicate: dc2consul1 (192.168.99.103)

Action

  • Kill the Docker container on dc2consul1 (192.168.99.103) with command "docker stop consul"

DC2 Cluster Final State

  • Running Nodes: dc2consul2 (192.168.99.104), dc2consul3 (192.168.99.105)
  • Elected Leader: dc2consul2 (192.168.99.104)
  • Nodes running consul agent: dc2consul2 (192.168.99.104), dc2consul3 (192.168.99.105)
  • Nodes running consul lock: none
  • Node running consul-replicate: none

Logs After Docker Stop

dc2consul1 (192.168.99.103)

consul.log

==> Caught signal: terminated
2017/05/11 18:20:12 [INFO] agent: requesting shutdown
2017/05/11 18:20:12 [INFO] consul: shutting down server
2017/05/11 18:20:12 [WARN] serf: Shutdown without a Leave
2017/05/11 18:20:12 [ERR] dns: error starting tcp server: accept tcp [::]:8600: use of closed network connection
2017/05/11 18:20:12 [WARN] serf: Shutdown without a Leave
2017/05/11 18:20:12 [ERR] consul.kvs: Apply failed: leadership lost while committing log
2017/05/11 18:20:12 [ERR] http: Request PUT /v1/kv/locks/consul-replicate/.lock?flags=3304740253564472344&release=d81b3ef2-a583-2c58-933f-bd5399bfb831, error: leadership lost while committing log from=127.0.0.1:40608
2017/05/11 18:20:12 [INFO] raft: aborting pipeline replication to peer {Voter 192.168.99.104:8300 192.168.99.104:8300}
2017/05/11 18:20:12 [INFO] raft: aborting pipeline replication to peer {Voter 192.168.99.105:8300 192.168.99.105:8300}
2017/05/11 18:20:12 [INFO] agent: shutdown complete

consul-replicate.log

2017/05/11 18:20:12.893512 [INFO] (runner) stopping
2017/05/11 18:20:12.893533 [DEBUG] (watcher) stopping all views
Shutdown triggered, killing child
Terminating child pid 43
Error running handler: exit status 12
exit status 12
Child terminated
Lock release failed: failed to release lock: Unexpected response code: 500 (leadership lost while committing log)

dc2consul2 (192.168.99.104)

consul.log

2017/05/11 18:19:58 [ERR] http: Request GET /v1/kv/locks/consul-replicate/.lock?index=2637&wait=15000ms, error: rpc error: EOF from=127.0.0.1:51716
2017/05/11 18:20:00 [WARN] raft: Rejecting vote request from 192.168.99.105:8300 since we have a leader: 192.168.99.103:8300
2017/05/11 18:20:01 [WARN] raft: Rejecting vote request from 192.168.99.105:8300 since we have a leader: 192.168.99.103:8300
2017/05/11 18:20:01 [INFO] memberlist: Suspect dc2consul1 has failed, no acks received
2017/05/11 18:20:01 [WARN] raft: Heartbeat timeout from "192.168.99.103:8300" reached, starting election
2017/05/11 18:20:01 [INFO] raft: Node at 192.168.99.104:8300 [Candidate] entering Candidate state in term 61
2017/05/11 18:20:01 [ERR] raft: Failed to make RequestVote RPC to {Voter 192.168.99.103:8300 192.168.99.103:8300}: dial tcp 192.168.99.103:8300: getsockopt: connection refused
2017/05/11 18:20:01 [INFO] raft: Node at 192.168.99.104:8300 [Follower] entering Follower state (Leader: "")
2017/05/11 18:20:02 [WARN] raft: Heartbeat timeout from "" reached, starting election
2017/05/11 18:20:02 [INFO] raft: Node at 192.168.99.104:8300 [Candidate] entering Candidate state in term 63
2017/05/11 18:20:02 [ERR] raft: Failed to make RequestVote RPC to {Voter 192.168.99.103:8300 192.168.99.103:8300}: dial tcp 192.168.99.103:8300: getsockopt: connection refused
2017/05/11 18:20:02 [INFO] raft: Election won. Tally: 2
2017/05/11 18:20:02 [INFO] raft: Node at 192.168.99.104:8300 [Leader] entering Leader state
2017/05/11 18:20:02 [INFO] raft: Added peer 192.168.99.103:8300, starting replication
2017/05/11 18:20:02 [INFO] raft: Added peer 192.168.99.105:8300, starting replication
2017/05/11 18:20:02 [INFO] consul: cluster leadership acquired
2017/05/11 18:20:02 [INFO] consul: New leader elected: dc2consul2
2017/05/11 18:20:02 [INFO] raft: pipelining replication to peer {Voter 192.168.99.105:8300 192.168.99.105:8300}
2017/05/11 18:20:02 [ERR] raft: Failed to AppendEntries to {Voter 192.168.99.103:8300 192.168.99.103:8300}: dial tcp 192.168.99.103:8300: getsockopt: connection refused
2017/05/11 18:20:02 [ERR] raft: Failed to AppendEntries to {Voter 192.168.99.103:8300 192.168.99.103:8300}: dial tcp 192.168.99.103:8300: getsockopt: connection refused
2017/05/11 18:20:02 [ERR] raft: Failed to AppendEntries to {Voter 192.168.99.103:8300 192.168.99.103:8300}: dial tcp 192.168.99.103:8300: getsockopt: connection refused

consul-replicate.log

Lock acquisition failed: failed to read lock: Unexpected response code: 500

dc2consul3 (192.168.99.105)

consul.log

2017/05/11 18:20:06 [ERR] http: Request GET /v1/kv/locks/consul-replicate/.lock?index=2637&wait=15000ms, error: rpc error: EOF from=127.0.0.1:50638
2017/05/11 18:20:08 [WARN] raft: Heartbeat timeout from "192.168.99.103:8300" reached, starting election
2017/05/11 18:20:08 [INFO] raft: Node at 192.168.99.105:8300 [Candidate] entering Candidate state in term 61
2017/05/11 18:20:08 [ERR] raft: Failed to make RequestVote RPC to {Voter 192.168.99.103:8300 192.168.99.103:8300}: dial tcp 192.168.99.103:8300: getsockopt: connection refused
2017/05/11 18:20:08 [INFO] memberlist: Suspect dc2consul1 has failed, no acks received
2017/05/11 18:20:09 [WARN] raft: Election timeout reached, restarting election
2017/05/11 18:20:09 [INFO] raft: Node at 192.168.99.105:8300 [Candidate] entering Candidate state in term 62
2017/05/11 18:20:09 [ERR] raft: Failed to make RequestVote RPC to {Voter 192.168.99.103:8300 192.168.99.103:8300}: dial tcp 192.168.99.103:8300: getsockopt: connection refused
2017/05/11 18:20:10 [INFO] raft: Node at 192.168.99.105:8300 [Follower] entering Follower state (Leader: "")
2017/05/11 18:20:10 [INFO] consul: New leader elected: dc2consul2
2017/05/11 18:20:12 [INFO] memberlist: Suspect dc2consul1 has failed, no acks received
2017/05/11 18:20:13 [INFO] memberlist: Marking dc2consul1 as failed, suspect timeout reached (0 peer confirmations)
2017/05/11 18:20:13 [INFO] serf: EventMemberFailed: dc2consul1 192.168.99.103
2017/05/11 18:20:13 [INFO] consul: Removing LAN server dc2consul1 (Addr: tcp/192.168.99.103:8300) (DC: dc2)
2017/05/11 18:20:14 [INFO] memberlist: Suspect dc2consul1 has failed, no acks received
2017/05/11 18:20:27 [INFO] memberlist: Suspect dc2consul1.dc2 has failed, no acks received
2017/05/11 18:20:50 [INFO] serf: EventMemberFailed: dc2consul1.dc2 192.168.99.103
2017/05/11 18:20:50 [INFO] consul: Removing WAN server dc2consul1.dc2 (Addr: tcp/192.168.99.103:8300) (DC: dc2)
2017/05/11 18:20:52 [INFO] serf: attempting reconnect to dc2consul1 192.168.99.103:8301
2017/05/11 18:21:02 [INFO] memberlist: Suspect dc2consul1.dc2 has failed, no acks received
2017/05/11 18:21:22 [INFO] serf: attempting reconnect to dc2consul1 192.168.99.103:8301
2017/05/11 18:21:22 [INFO] serf: attempting reconnect to dc2consul1.dc2 192.168.99.103:8302

consul-replicate.log

Lock acquisition failed: failed to read lock: Unexpected response code: 500

James Phillips

May 24, 2017, 6:34:42 PM
to consu...@googlegroups.com
Hi Joe, sorry for the delay in getting back to you. It looks like this
error is propagating back to `consul lock`:

> 2017/05/11 18:19:58 [ERR] http: Request GET /v1/kv/locks/consul-replicate/.lock?index=2637&wait=15000ms, error: rpc error: EOF from=127.0.0.1:51716

We don't have a built-in retry for that case, so it's causing `consul
lock` to exit. If you rerun the process when it exits (e.g., under a
process supervisor), it would recover on its own. If you'd like to
open a GitHub issue for Consul about this, we could see about making
the lock path more robust with some retries.
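For example, a minimal supervisor loop might look like the sketch below. The retry interval and the choice to stop once the command exits cleanly are assumptions for illustration only:

```shell
#!/bin/sh
# Supervisor loop: re-run a command until it exits cleanly, so a
# transient failure (e.g. a 500 while the cluster elects a new leader)
# doesn't permanently stop the lock holder on this node.
supervise() {
  # "$@" is the command to keep alive; in this setup it would be:
  #   consul lock -verbose locks/consul-replicate \
  #     consul-replicate -config /consul/config/consul-replicate-config.hcl
  until "$@"; do
    echo "command exited with status $?, retrying in 1s" >&2
    sleep 1
  done
}
```

A production setup would more likely restart unconditionally (a `while true` loop, a systemd unit with `Restart=always`, or the container's restart policy) rather than stop on a clean exit; the `until` form above just keeps the sketch bounded.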

-- James

Joe N.

Jun 2, 2017, 8:31:34 AM
to Consul