Cluster Recovery simply doesn't work.


Jay

unread,
Feb 25, 2017, 3:40:31 PM2/25/17
to consu...@googlegroups.com

  So I tried once again to do a rolling update of an ASG consul cluster. And once again, it completely failed, and I'm left with a dead cluster and all KV data lost.

  I have yet to be able to recover a failed cluster by following the documentation, which is pretty frustrating. I also have yet to do a graceful upgrade of an ASG that actually succeeds in doing what it's supposed to. It seems that if the leader dies, the cluster is likely toast. Even with 5 nodes, all originally booted with bootstrap_expect set to 3, they sometimes fail to elect a new leader with 4 nodes still up and communicating.

  Is there some undocumented trick to get recovery to work? The operator commands flat-out fail when there's no leader, so they're pretty much useless. Following the documentation and putting a raft peers file in place just leads to consul failing to even start:

consul[12446]: ==> Starting Consul agent...
consul[12446]: ==> Error starting agent: Failed to start Consul server: Failed to start Raft: recovery failed: refused to recover cluster with no initial state, this is probably an operator error

  This seems really bad, and very fickle. Is there a documented, works-every-time way to configure an ASG so that rolling updates can be done without the cluster completely blowing up?

-- Frustrated

Brian Lalor

unread,
Feb 25, 2017, 4:30:53 PM2/25/17
to consu...@googlegroups.com
I re-spin my consul cluster ASGs on a pretty regular basis. So far I do it manually, but I need to come up with an automated system. At a high level: after updating the launch configuration, I put all existing instances into Standby, which causes new instances to be launched. I wait for all the new instances to join the cluster, then terminate the old ones one at a time, saving the leader for last so that there is only a single new election (you obviously can't prevent elections entirely). This has never failed me.
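At the CLI level, that procedure looks roughly like this (the ASG name and instance IDs are placeholders, and exact flags may differ by AWS CLI version):

```shell
# Put the old instances into Standby without decrementing desired
# capacity, so the ASG launches replacement instances.
aws autoscaling enter-standby \
    --auto-scaling-group-name consul-servers \
    --instance-ids i-0aaaa i-0bbbb i-0cccc \
    --no-should-decrement-desired-capacity

# Wait until the replacements show up as "alive" servers.
consul members

# Then terminate the old instances one at a time (leader last),
# waiting for the cluster to settle between each one.
aws ec2 terminate-instances --instance-ids i-0aaaa
```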

-- 
Brian Lalor
bla...@bravo5.org
--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/consul/issues
IRC: #consul on Freenode
---
You received this message because you are subscribed to the Google Groups "Consul" group.
To unsubscribe from this group and stop receiving emails from it, send an email to consul-tool...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/consul-tool/CAJsMCEhFWX0%2BCJ4LHUe%3DNQ%3DizsRknik3wT9P-JCdmF%3DcT7i5%3DQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Jay

unread,
Feb 25, 2017, 4:55:37 PM2/25/17
to consu...@googlegroups.com

   New bit of info, maybe:

Feb 25 21:45:47 host consul[9203]: 2017/02/25 21:45:47 [ERR] agent: coordinate update error: No cluster leader
Feb 25 21:45:50 host consul[9203]: 2017/02/25 21:45:50 [ERR] consul.acl: Failed to get policy from ACL datacenter: No cluster leader
Feb 25 21:45:55 host consul[9203]: 2017/02/25 21:45:55 [ERR] agent: failed to sync remote state: No cluster leader

[ec2-user@host consul.d]$ consul operator raft --list-peers -stale=true
Operator "raft" subcommand failed: Unexpected response code: 403 (Permission denied)

   Is it possible there's a bug or restriction in the recovery process when ACLs are enabled? We have them enabled, but haven't actually defined any policies beyond the root token. I also can't find any way to pass the master token from the CLI to see whether that would get around the 403 above and let me poke at the raft state.


James Phillips

unread,
Feb 27, 2017, 12:23:06 PM2/27/17
to consu...@googlegroups.com
Hi Jay,

Sorry about your experience and frustration with this.

> [ec2-user@host consul.d]$ consul operator raft --list-peers -stale=true
> Operator "raft" subcommand failed: Unexpected response code: 403 (Permission denied)

That 403 error is ACL-related; you can use -token=&lt;token&gt; to supply a
token with operator read rights (your master token has that).
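For example (substitute your actual master token; the value here is a placeholder):

```shell
consul operator raft -list-peers -token=<your-master-token>
```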

> consul[12446]: ==> Error starting agent: Failed to start Consul server: Failed to start Raft: recovery failed: refused to recover cluster with no initial state, this is probably an operator error

That message is generated when you set a peers.json on a fresh server
(nothing in the data-dir's "raft" directory). Consul's assuming this
is an error because usually you want to recover a server with data;
fresh servers can be joined into an existing cluster with a "consul
join". Is it possible that the data-dir got wiped, or this was a
newly-spawned server? I think the fix there is to only set that file
on the servers that were part of the cluster you are trying to recover
and that have valid Raft data in their data-dir, and then join the
replacement servers to the cluster in the usual way.
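As a concrete example, peers.json for that era of Consul is just a JSON array of the surviving servers' Raft addresses (the IPs here are placeholders), placed in each surviving server's data-dir under raft/ before restarting:

```json
[
  "10.0.1.10:8300",
  "10.0.1.11:8300",
  "10.0.1.12:8300"
]
```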

As Brian pointed out, you do have to terminate the old servers one by
one, and make sure they have left the cluster and have been removed
from the Raft peers; if not, you can issue a "consul force-leave
<terminated server name>". We are working on some automation in 0.8 to
do that force leave for you. That's really the main hazard of running
in an ASG: if a server dies unexpectedly and gets replaced, it won't
be removed from the Raft peers for 72 hours unless you issue a
force-leave.
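For example, if "consul members" still shows a terminated server as failed (the node name here is a placeholder):

```shell
consul force-leave ip-10-0-1-10
```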

You should never get an outage when taking down just 1 of 5 servers,
though. That makes me think that the cluster may have been in a
degraded state before the upgrade started. Do you think it's possible
that there could have been some previously failed servers in the Raft
peers, or that the servers didn't all get joined as you expected? The
"consul operator raft -list-peers" output should help answer that once
you pass the token.

Rolling upgrades and recovery should both be really reliable, so I
hope this helps and that we can get you into a good state.

-- James