Hi,
I'm implementing a Consul/Vault cluster on AWS and I'm currently testing the Consul snapshot feature. To test a snapshot restore on Consul, I injected some data into Vault (kv secrets, policies, userpass...) and took a snapshot of Consul from one of the Consul servers. I then destroyed my Consul/Vault cluster and recreated it from scratch. Finally, I connected to the freshly started Consul leader and restored the previously taken snapshot.
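For reference, this is roughly the sequence I ran; the Vault data and the snapshot file name below are only illustrative examples, not my exact values:

    # inject some test data into Vault (examples only)
    vault kv put secret/test foo=bar
    vault policy write test-policy test-policy.hcl
    vault auth enable userpass

    # take a snapshot from one of the Consul servers
    consul snapshot save backup.snap

    # ...destroy and recreate the Consul/Vault cluster from scratch...

    # restore on the freshly started Consul leader
    consul snapshot restore backup.snap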
The 'consul snapshot restore' CLI command seemed to work as expected, and when I connected to the Vault UI after the restore, I was able to retrieve all the information that had been stored in Vault (kv secrets, policies, userpass...). Nevertheless, there were a few error logs on the Consul leader server that I am worried about:
2018/09/04 07:45:41 [INFO] snapshot: Creating new snapshot at /opt/consul/data/raft/snapshots/2-5624-1536047141786.tmp
2018/09/04 07:45:41 [INFO] raft: Copied 24962 bytes to local snapshot
2018/09/04 07:45:41 [INFO] raft: Restored user snapshot (index 5624)
2018/09/04 07:45:41 [ERR] raft: Failed to get log at index 272: log not found
2018/09/04 07:45:41 [INFO] raft: aborting pipeline replication to peer {Voter 9ca74286-53cd-ea2d-8c94-5ca7e4a33575 X.X.X.76:8300}
2018/09/04 07:45:41 [ERR] raft: Failed to get log at index 272: log not found
2018/09/04 07:45:41 [INFO] raft: aborting pipeline replication to peer {Voter 571aa34c-5b8a-8f63-7b72-ba7871afc21e X.X.X.100:8300}
2018/09/04 07:45:41 [ERR] raft: Failed to get log at index 272: log not found
2018/09/04 07:45:41 [ERR] raft: Failed to get log at index 272: log not found
2018/09/04 07:45:41 [INFO] raft: pipelining replication to peer {Voter 9ca74286-53cd-ea2d-8c94-5ca7e4a33575 X.X.X.76:8300}
2018/09/04 07:45:41 [INFO] raft: pipelining replication to peer {Voter 571aa34c-5b8a-8f63-7b72-ba7871afc21e X.X.X.100:8300}
On the Consul follower server, no error logs were seen:
2018/09/04 07:45:41 [INFO] snapshot: Creating new snapshot at /opt/consul/data/raft/snapshots/2-5624-1536047141875.tmp
2018/09/04 07:45:41 [INFO] raft: Copied 24962 bytes to local snapshot
2018/09/04 07:45:41 [INFO] raft: Installed remote snapshot
I'd like your opinion on the severity of those raft error logs, since the restore seems to be fine as far as data integrity in Vault is concerned.
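For what it's worth, besides checking the data through the Vault UI, the raft state after the restore can be inspected with standard Consul CLI commands like the ones below; the raft section of 'consul info' from the leader is included in the context that follows:

    consul operator raft list-peers
    consul info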
Here is a bit of context on my setup:
==== 'consul version' ====
Client: Consul v1.2.1
Server: Consul v1.2.1
==== 'consul info' ====
Client:
  agent:
    check_monitors = 0
    check_ttls = 1
    checks = 1
    services = 1
  build:
    prerelease =
    revision = 39f93f01
    version = 1.2.1
  consul:
    known_servers = 3
    server = false
  runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 47
    max_procs = 2
    os = linux
    version = go1.10.1
  serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 2
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 7
    members = 6
    query_queue = 0
    query_time = 1
Server:
  agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
  build:
    prerelease =
    revision = 39f93f01
    version = 1.2.1
  consul:
    bootstrap = false
    known_datacenters = 1
    leader = true
    leader_addr = X.X.X.85:8300
    server = true
  raft:
    applied_index = 11746
    commit_index = 11746
    fsm_pending = 0
    last_contact = 0
    last_log_index = 11746
    last_log_term = 2
    last_snapshot_index = 5624
    last_snapshot_term = 2
    latest_configuration = [{Suffrage:Voter ID:571aa34c-5b8a-8f63-7b72-ba7871afc21e Address:X.X.X.100:8300} {Suffrage:Voter ID:9ca74286-53cd-ea2d-8c94-5ca7e4a33575 Address:X.X.X.76:8300} {Suffrage:Voter ID:c44a96da-6b21-fadb-0ffd-88068ec0e88c Address:X.X.X.85:8300}]
    latest_configuration_index = 1
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 2
  runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 118
    max_procs = 2
    os = linux
    version = go1.10.1
  serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 2
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 7
    members = 6
    query_queue = 0
    query_time = 1
  serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 4
    members = 3
    query_queue = 0
    query_time = 1
==== Operating system and Environment details ====
Amazon Linux AMI release 2018.03
Consul members:
Node                 Address         Status  Type    Build  Protocol  DC         Segment
i-XXXXXXXXXXXXXXXXX  X.X.X.76:8301   alive   server  1.2.1  2         us-east-1  <all>
i-XXXXXXXXXXXXXXXXX  X.X.X.100:8301  alive   server  1.2.1  2         us-east-1  <all>
i-XXXXXXXXXXXXXXXXX  X.X.X.85:8301   alive   server  1.2.1  2         us-east-1  <all>
i-XXXXXXXXXXXXXXXXX  X.X.X.108:8301  alive   client  1.2.1  2         us-east-1  <default>
i-XXXXXXXXXXXXXXXXX  X.X.X.94:8301   alive   client  1.2.1  2         us-east-1  <default>
i-XXXXXXXXXXXXXXXXX  X.X.X.78:8301   alive   client  1.2.1  2         us-east-1  <default>
Content of Consul server config file:
{
  "advertise_addr": "X.X.X.85",
  "bind_addr": "X.X.X.85",
  "bootstrap_expect": 3,
  "client_addr": "0.0.0.0",
  "datacenter": "us-east-1",
  "node_name": "i-XXXXXXXXXXXXXXXXX",
  "retry_join": [
    "provider=aws region=us-east-1 tag_key=consul-cluster tag_value=auto-join"
  ],
  "server": true,
  "encrypt": "*****",
  "verify_outgoing": true,
  "verify_incoming": true,
  "verify_server_hostname": true,
  "ca_path": "*****",
  "cert_file": "*****",
  "key_file": "*****",
  "ui": true
}
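For completeness, Vault points at this Consul cluster through the usual consul storage stanza in its own config (shown here in JSON form); the address and path below are the standard defaults, not necessarily my exact values:

    {
      "storage": {
        "consul": {
          "address": "127.0.0.1:8500",
          "path": "vault/"
        }
      }
    }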