Failed to start Raft: failed to load any existing snapshots


Sudeep Deshpande

Jan 25, 2016, 6:44:54 PM
to Consul
Hi,

With Consul 0.5.0, Consul sometimes fails to restart with this error:
>> Error starting agent: Failed to start Consul server: Failed to start Raft: failed to load any existing snapshots

Btw, the consul data resides on a mounted volume. A dirty workaround is cleaning up the data folder, starting consul and making it rejoin one of the cluster nodes.

Any pointers?

Thanks,
Sudeep

Armon Dadgar

Jan 26, 2016, 1:28:48 PM
to consu...@googlegroups.com, Sudeep Deshpande
Sudeep,

Do you have any other logs? It's possible Consul is failing to load the snapshot either
due to on-disk corruption (checksum failure) or the data being from a newer version of
Consul that it does not understand. I would also try upgrading to Consul 0.6.X.

Best Regards,
Armon Dadgar

Sudeep Deshpande

Jan 26, 2016, 2:28:41 PM
to Consul, deshpande...@gmail.com
Hi Armon,

You are right. The versions are different. It's a mixed bunch of 0.6 and 0.5. I will have them all upgraded to 0.6. Can I do this without cleaning up the data directory? I have some KV data in it for which I currently do not have a rebuilder, and I want to retain as much of it as possible.

An unrelated question: we are working on a cloud app and currently have around 30 managed nodes (5 Consul servers and the rest are Consul agents) in the same DC. Are there any performance benchmarks for large Consul rings?

Btw, your promptness is infectious :) and you guys are building a great culture in the open source community. Kudos to you and your entire team!!!

Thanks,
Sudeep

Armon Dadgar

Jan 26, 2016, 2:32:58 PM
to consu...@googlegroups.com, Sudeep Deshpande, deshpande...@gmail.com
Sudeep,

You should be able to update to Consul 0.6 in place without cleaning up the data directory
if you were running 0.5.1 or 0.5.2 previously. If you were running something older you will
need to run the migration tool to upgrade the data to the 0.6 format.

There are many Consul clusters running with thousands of nodes in a single datacenter,
so with only ~30 nodes you shouldn’t have to worry about it. Writes are limited by disk IO and
reads by CPU, so you can continue to scale up the servers for quite some time.

Hope that helps!

Best Regards,
Armon Dadgar

Sudeep Deshpande

Jan 27, 2016, 1:29:04 AM
to Consul, deshpande...@gmail.com
Thanks!

Motty Porat

Aug 2, 2017, 7:16:13 AM
to Consul, deshpande...@gmail.com
Hi Armon,
I got this error message too, and in my case corruption is the most probable cause. (We are testing a hard reboot of a server.)
Is it possible (or is a fix planned) for Consul to heal from this condition, for example by having the server start with empty data and letting the other servers replicate into it?

Thanks,
Motty

Armon Dadgar

Aug 2, 2017, 1:14:55 PM
to consu...@googlegroups.com, deshpande...@gmail.com
Hey Motty,

We’ve made a lot of changes since Consul 0.5.X. What version are you running now?

Best Regards,
Armon Dadgar

Rom Freiman

Aug 2, 2017, 4:46:39 PM
to Consul, deshpande...@gmail.com
Hey Armon,
My name is Rom and I'm working with Motty.

We just upgraded to v0.8.4 (from 0.6.4) and started running into this issue again (after some research, it seems that it's not related to the upgrade).
What actually happens is that we have tests that crash nodes where Consul is running, and apparently our crash timing coincides exactly with Consul's Raft performing snapshots (approx. 10 minutes after Consul starts). BTW, is there any sequence of events that causes the snapshot to be written to disk? Or is it time dependent? Where is it configured?
Anyhow, looking into the Raft code and digging through some Linux blogs, I started wondering whether your snapshot reaping (from .tmp to full directory) is safe enough.
According to some sources, at least if I got it right, you should fsync both the source dir and the dest dir while renaming, otherwise crash consistency is not guaranteed.
What happens in our case is that the dir gets renamed (without .tmp) and the metafile is written, but the state.bin file is empty.


But again, I might be wrong.

Thanks,
Rom

pre...@hashicorp.com

Aug 3, 2017, 6:01:56 PM
to Consul, deshpande...@gmail.com
Thanks for reporting this, Rom. You are right about this being an issue with not calling sync correctly. https://github.com/hashicorp/raft/issues/229 addresses this, and we aim to get the fix into Consul's upcoming minor release.
