How to gracefully transfer leadership from one server to another?


Steve H

Mar 14, 2017, 10:02:39 AM
to Consul
Hi, 

In our 5-server cluster, we are observing downtime of the KV service on the remaining servers when we stop the agent on the server that currently holds the leadership role. The downtime appears to last for the duration of the leadership election among the remaining servers. This is only a couple of seconds, but it seems odd that there would be any downtime at all in a cluster with 5 (or 4 during a reboot) servers?

We're running 0.7.5 on Ubuntu 16.04.

# consul members
Node           Address           Status  Type    Build  Protocol  DC
STEVE-DESKTOP  10.2.101.15:8301  alive   client  0.7.5  2         DC1
NODE1          10.2.101.24:8301  alive   server  0.7.5  2         DC1
NODE2          10.2.101.8:8301   alive   server  0.7.5  2         DC1
NODE3          10.2.101.11:8301  alive   server  0.7.5  2         DC1
NODE4          10.2.101.12:8301  alive   server  0.7.5  2         DC1
NODE5          10.2.101.13:8301  alive   server  0.7.5  2         DC1


The config on each of the servers is as follows (with differing node names): 
{
    "data_dir": "/opt/consul",
    "datacenter": "DC1",
    "log_level": "INFO",
    "node_name": "NODEx",
    "performance": {
        "raft_multiplier": 1
    },
    "rejoin_after_leave": true,
    "server": true,
    "ui": true
}



To provide an example of the downtime we observe, NODE1 is currently the leader:
# consul info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 1
build:
        prerelease =
        revision = '21f2d5a
        version = 0.7.5
consul:
        bootstrap = false
        known_datacenters = 1
        leader = true
        leader_addr = 10.2.101.8:8300
        server = true
raft:
        applied_index = 131168
        commit_index = 131168
        fsm_pending = 0
        last_contact = 0
        last_log_index = 131168
        last_log_term = 129
        last_snapshot_index = 124469
        last_snapshot_term = 129
        latest_configuration = [{Suffrage:Voter ID:10.2.101.11:8300 Address:10.2.101.11:8300} {Suffrage:Voter ID:10.2.101.8:8300 Address:10.2.101.8:8300} {Suffrage:Voter ID:10.2.101.24:8300 Address:10.2.101.24:8300} {Suffrage:Voter ID:10.2.101.12:8300 Address:10.2.101.12:8300} {Suffrage:Voter ID:10.2.101.13:8300 Address:10.2.101.13:8300}]
        latest_configuration_index = 94583
        num_peers = 4
        protocol_version = 1
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Leader
        term = 129
runtime:
        arch = amd64
        cpu_count = 16
        goroutines = 100
        max_procs = 16
        os = linux
        version = go1.7.5
serf_lan:
        encrypted = false
        event_queue = 0
        event_time = 29
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 26
        members = 6
        query_queue = 0
        query_time = 1
serf_wan:
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 2
        members = 1
        query_queue = 0
        query_time = 1


On NODE2 we run a contrived continuous loop putting and getting a value in the KV store:
while true
do 
    consul kv get test/test
    consul kv put test/test "$(date)"
done

This outputs a continuous stream of something like this: 
...
Tue 14 Mar 13:44:56 GMT 2017
Success! Data written to: test/test
Tue 14 Mar 13:44:57 GMT 2017
Success! Data written to: test/test
...

If we then follow the instructions in "Stopping an Agent" on https://www.consul.io/docs/agent/basics.html and send 
kill -INT consul_pid
on NODE1 

and on NODE2 we get:
...
Tue 14 Mar 13:46:50 GMT 2017
Success! Data written to: test/test
Error querying Consul agent: Unexpected response code: 500
Error! Failed writing data: Unexpected response code: 500 (rpc error: failed to get conn: dial tcp 10.2.101.24:8300: getsockopt: connection refused)
Error querying Consul agent: Unexpected response code: 500
Error! Failed writing data: Unexpected response code: 500 (rpc error: failed to get conn: dial tcp 10.2.101.24:8300: getsockopt: connection refused)
Error querying Consul agent: Unexpected response code: 500
Error! Failed writing data: Unexpected response code: 500 (rpc error: failed to get conn: dial tcp 10.2.101.24:8300: getsockopt: connection refused)
Error querying Consul agent: Unexpected response code: 500
Error! Failed writing data: Unexpected response code: 500 (rpc error: failed to get conn: dial tcp 10.2.101.24:8300: getsockopt: connection refused)
Error querying Consul agent: Unexpected response code: 500
Success! Data written to: test/test
Tue 14 Mar 13:46:50 GMT 2017
Success! Data written to: test/test
Tue 14 Mar 13:46:53 GMT 2017
Success! Data written to: test/test
... 

If we just perform a straight reboot of the leader server the number of 500 errors is much higher, so it seems that systemd is probably sending the default SIGTERM and doing a less graceful shutdown.

So our questions are:
- How can we make the shutdown of the agent service that currently holds leadership more graceful?
- Is there a way to force an election without shutting down the current leader?
- Is there an alternate signal that will wait for the leadership election to complete before the current leader goes offline?
- Should we lower our expectations of the KV service's availability and account for downtime everywhere we use it?
- Is it possible to make the kv command retry a few times in the event of a 500 error?
- Have we totally missed something?

Thanks in advance for any assistance! 

Best Regards

Steve

Paul Archer

Mar 14, 2017, 9:24:36 PM
to Consul
I'm new to consul myself, but I think this is what you are looking for:
$ consul maint -help
Usage: consul maint [options]

  Places a node or service into maintenance mode. During maintenance mode,
  the node or service will be excluded from all queries through the DNS
  or API interfaces, effectively taking it out of the pool of available
  nodes. This is done by registering an additional critical health check.

  When enabling maintenance mode for a node or service, you may optionally
  specify a reason string. This string will appear in the "Notes" field
  of the critical health check which is registered against the node or
  service. If no reason is provided, a default value will be used.

  Maintenance mode is persistent, and will be restored in the event of an
  agent restart. It is therefore required to disable maintenance mode on
  a given node or service before it will be placed back into the pool.

  By default, we operate on the node as a whole. By specifying the
  "-service" argument, this behavior can be changed to enable or disable
  only a specific service.

  If no arguments are given, the agent's maintenance status will be shown.
  This will return blank if nothing is currently under maintenance.

Options:

  -enable                    Enable maintenance mode.
  -disable                   Disable maintenance mode.
  -reason=<string>           Text string describing the maintenance reason
  -service=<serviceID>       Control maintenance mode for a specific service ID
  -token=""                  ACL token to use. Defaults to that of agent.
  -http-addr=127.0.0.1:8500  HTTP address of the Consul agent.

Steve H

Mar 15, 2017, 7:15:56 AM
to Consul
Thanks Paul, 

We had looked at the "maint" option, but maybe we're still missing something. On the leader we run:

# consul maint -enable
Node maintenance is now enabled

However, this doesn't cause an election to be held, and leadership stays with the server even though it is in maintenance. If we then stop the agent process on the server, we get the same 500 errors until the leadership election has taken place. Are we using the wrong options for maint, or is there a second step to take after putting a node into maintenance?

Thanks & best regards

Steve

James Phillips

Mar 15, 2017, 11:01:25 AM
to consu...@googlegroups.com
Hi Steve,

Consul currently doesn't have a mechanism to gracefully transfer
leadership (and consul maint unfortunately doesn't help here). Having
the current leader leave the cluster will kick off an election, and
there will be a brief period without a leader while that transpires.

We do have retry logic in the RPC client that attempts to hide this
from callers, having them experience just a longer request time. Are
your failing KV writes happening against Consul agents in client mode,
or are they happening against Consul server agents? Since you have the
Raft multiplier set to 1, I'd expect that the 500 errors won't make it
to the clients very often, but that logic may not be present if you
are doing KV writes directly against a Consul server since it's a
slightly different code path.

In general, we do recommend that your app has some retry logic since
it may experience legit 500 errors for a while if a Consul server is
suddenly lost. If you'd like to open a Github issue, we could look at
adding a more graceful mechanism for planned operations that take out
a leader.
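
The client-side retry James recommends can be sketched as a small shell wrapper. This is a minimal sketch only: the `retry` helper name, the attempt limit, and the fixed 1-second backoff are illustrative choices, not Consul features.

```shell
#!/bin/sh
# Sketch: retry a command a few times before giving up, so brief
# leaderless windows (e.g. during an election) surface as latency
# rather than hard failures. Helper name and backoff are assumptions.
retry() {
  max=$1
  shift
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      return 1            # still failing after max attempts: give up
    fi
    attempt=$((attempt + 1))
    sleep 1               # wait out the leadership election
  done
  return 0
}

# Against a live cluster, usage might look like:
#   retry 5 consul kv put test/test "$(date)"
```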

-- James

Steve H

Mar 15, 2017, 12:57:39 PM
to Consul
Thanks James, 

We see similar behaviour when using the RPC client, with the local agent in either server or client mode, during the leadership election. With my contrived tight loop of gets and puts we get output like this from the agent running in client mode:

(Where batch-1lb/10.2.101.24 is the current leader and the other names/IPs are of other servers.)
...
Success! Data written to: test/test
Wed, Mar 15, 2017  4:27:36 PM
Error! Failed writing data: Unexpected response code: 500 (rpc error: rpc error: stream closed)
Error querying Consul agent: Unexpected response code: 500
Error! Failed writing data: Unexpected response code: 500 (rpc error: failed to get conn: dial tcp 10.2.101.24:8300: connectex: No connection could be made because the target machine actively refused it.)
Error querying Consul agent: Unexpected response code: 500
Error! Failed writing data: Unexpected response code: 500 (rpc error: rpc error: failed to get conn: dial tcp 10.2.101.24:8300: getsockopt: connection refused)
Error querying Consul agent: Unexpected response code: 500
Success! Data written to: test/test
Wed, Mar 15, 2017  4:27:38 PM
...

In the logs for the agent we see this sort of thing :
...
    2017/03/15 16:27:36 [ERR] consul: RPC failed to server 10.2.101.13:8300: rpc error: rpc error: stream closed
    2017/03/15 16:27:36 [ERR] http: Request PUT /v1/kv/test/test, error: rpc error: rpc error: stream closed from=127.0.0.1:54101
    2017/03/15 16:27:36 [ERR] consul: RPC failed to server 10.2.101.12:8300: rpc error: rpc error: stream closed
    2017/03/15 16:27:36 [ERR] http: Request GET /v1/kv/test/test, error: rpc error: rpc error: stream closed from=127.0.0.1:54104
    2017/03/15 16:27:37 [ERR] consul: RPC failed to server 10.2.101.24:8300: rpc error: failed to get conn: dial tcp 10.2.101.24:8300: connectex: No connection could be made because the target machine actively refused it.
    2017/03/15 16:27:37 [ERR] http: Request PUT /v1/kv/test/test, error: rpc error: failed to get conn: dial tcp 10.2.101.24:8300: connectex: No connection could be made because the target machine actively refused it. from=127.0.0.1:54108
    2017/03/15 16:27:37 [ERR] consul: RPC failed to server 10.2.101.8:8300: rpc error: stream closed
    2017/03/15 16:27:37 [ERR] http: Request GET /v1/kv/test/test, error: rpc error: stream closed from=127.0.0.1:54113
    2017/03/15 16:27:38 [ERR] consul: RPC failed to server 10.2.101.11:8300: rpc error: rpc error: failed to get conn: dial tcp 10.2.101.24:8300: getsockopt: connection refused
    2017/03/15 16:27:38 [ERR] http: Request PUT /v1/kv/test/test, error: rpc error: rpc error: failed to get conn: dial tcp 10.2.101.24:8300: getsockopt: connection refused from=127.0.0.1:54116
    2017/03/15 16:27:38 [ERR] consul: RPC failed to server 10.2.101.13:8300: rpc error: rpc error: failed to get conn: dial tcp 10.2.101.24:8300: getsockopt: connection refused
    2017/03/15 16:27:38 [ERR] http: Request GET /v1/kv/test/test, error: rpc error: rpc error: failed to get conn: dial tcp 10.2.101.24:8300: getsockopt: connection refused from=127.0.0.1:54119
    2017/03/15 16:27:39 [INFO] consul: New leader elected: compute-3lb
    2017/03/15 16:27:40 [INFO] memberlist: Suspect batch-1lb has failed, no acks received
...

We seem to achieve the least downtime by issuing a leave command to the server agent instead of sending a SIGINT. This seems to result in only one 500 error and then a block until the leadership election is complete. I'm thinking that we'll change /lib/systemd/system/consul.service to have:

ExecStop=/usr/local/bin/consul leave

and make sure that "rejoin_after_leave" is always true in our setup. We'll also put retry logic in place in order to deal with both expected and unexpected downtime of the service. 
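
For anyone following along, a sketch of what that unit change could look like. Only the ExecStop line comes from the discussion above; the rest of the unit (paths, config dir, restart policy) is an assumed typical layout, not our actual service file.

```ini
# /lib/systemd/system/consul.service (sketch)
[Unit]
Description=Consul agent
After=network.target

[Service]
ExecStart=/usr/local/bin/consul agent -config-dir=/etc/consul.d
ExecStop=/usr/local/bin/consul leave
Restart=on-failure

[Install]
WantedBy=multi-user.target
```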

I'll write this all up in a github issue so that this info is there for others if needed and if there is scope to make things more graceful in future that would be grand. 

Thanks again and best regards

Steve

Jason W

Mar 22, 2017, 12:43:26 PM
to Consul
Steve,

Thank you for taking the time to set up a reproducible test. We too are new to Consul and were struggling with the same issue. We assumed that it was our ignorance of the best way to set this up, but your work confirms that we did the right thing and that this is indeed an issue.

Not being able to have the leader gracefully leave the cluster is a bit disappointing to me. As a core service, I would expect such a thing to be transparent to the clients. I'm not looking forward to adding retry code to every place in my infrastructure that interacts with Consul.

If you find anything else out I'd be interested in hearing about it. Thank you again for creating a test.

JW

Nick Wales

Mar 23, 2017, 12:44:44 PM
to Consul
Rather than adding retry code, you can add ?stale to the requests, which will allow any Consul server to respond, not just the leader.

If you're expecting a high volume of traffic to your KV store, this is also useful for balancing the load. The details are in here: https://www.consul.io/docs/agent/http/kv.html
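
A stale read over the HTTP API just appends the documented ?stale query parameter to the KV endpoint. A quick sketch (the address and key below are placeholders, and `CONSUL_HTTP_ADDR` is an assumed environment override):

```shell
#!/bin/sh
# Build a KV read URL with the ?stale query parameter, which lets any
# server answer the read rather than forwarding it to the leader.
# The address and key are placeholders for illustration.
consul_addr="${CONSUL_HTTP_ADDR:-http://127.0.0.1:8500}"
key="test/test"
url="$consul_addr/v1/kv/$key?stale"
echo "$url"
# Against a live cluster you would fetch it with:
#   curl -s "$url"
```

Note that stale reads can return slightly out-of-date values, which is the trade-off for staying available while there is no leader.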