Hi Jay,
Sorry about your experience and frustration with this.
> [ec2-user@host consul.d]$ consul operator raft --list-peers -stale=true
> Operator "raft" subcommand failed: Unexpected response code: 403 (Permission denied)
That 403 error is ACL-related; you can pass -token=<token> to supply a
token with operator read rights (your master token will have that).
> consul[12446]: ==> Error starting agent: Failed to start Consul server: Failed to start Raft: recovery failed: refused to recover cluster with no initial state, this is probably an operator error
That message is generated when you set a peers.json on a fresh server
(nothing in the data-dir's "raft" directory). Consul assumes this is
an error because you usually want to recover a server that has data;
fresh servers can be joined into an existing cluster with a "consul
join". Is it possible that the data-dir got wiped, or this was a
newly-spawned server? I think the fix there is to only set that file
on the servers that were part of the cluster you are trying to recover
and that have valid Raft data in their data-dir, and then join the
replacement servers to the cluster in the usual way.
As Brian pointed out, you do have to terminate the old servers one by
one, and make sure they have left the cluster and have been removed
from the Raft peers; if not, you can issue a "consul force-leave
<terminated server name>". We are working on some automation in 0.8 to
do that force-leave for you. That's really the main hazard of running
in an ASG: if a server dies unexpectedly and gets replaced, it won't
get removed from the Raft peers until 72 hours pass or you force-leave it.
You should never get an outage when taking down just 1 of 5 servers,
though. That makes me think that the cluster may have been in a
degraded state before the upgrade started. Do you think it's possible
that there could have been some previously failed servers in the Raft
peers, or that the servers didn't all get joined as you expected? The
"consul operator raft -list-peers" output should help answer that once
you pass the token.
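
To make the quorum reasoning above concrete, here is a rough sketch in
Python. It is purely illustrative and not Consul code: the function
names are made up, and it assumes only the standard Raft majority rule
(more than half of the configured peers must be reachable), with dead
servers that were never removed still counting as peers.

```python
def quorum(peer_count: int) -> int:
    """Votes needed for a Raft majority among peer_count configured peers."""
    return peer_count // 2 + 1


def has_quorum(live_servers: int, stale_peers: int = 0) -> bool:
    """Return True if the live servers still form a majority.

    stale_peers counts dead servers that are still listed in the Raft
    peer set (hypothetical example: terminated instances that never left).
    """
    return live_servers >= quorum(live_servers + stale_peers)


# Healthy 5-server cluster with one server taken down: 4 live of 5 peers.
print(has_quorum(live_servers=4, stale_peers=1))  # True

# Already-degraded cluster: 2 of the 5 peers were dead before the upgrade,
# then one more is taken down, leaving 2 live of 5 peers.
print(has_quorum(live_servers=2, stale_peers=3))  # False
```

So with a clean peer set, losing 1 of 5 leaves plenty of margin; it
takes leftover dead peers for a single shutdown to cost you quorum.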
Rolling upgrades and recovery should both be really reliable, so I
hope this helps and that we can get you into a good state.
-- James