Hi,
I have a question about the reaping time (the 8-hour limit in version 0.7.0). Let me explain our environment and the issue we are observing:
We run Consul in Docker containers on OpenShift (a platform as a service). We have 3 Consul servers that form a cluster with bootstrap-expect set to 3. A script takes care of finding the Consul servers and performing a join to form the cluster.
When a Consul server crashes and goes into the failed state, Raft keeps the failed node's IP address in its peers.json file. In our OpenShift environment a new Consul server respawns automatically whenever one crashes, but the new server gets a new IP address. If we encounter several such crashes within 8 hours on 0.7.0, we end up with many failed-node entries in peers.json, to the point where the cluster cannot achieve quorum and cannot elect a leader. The only way we have found to recover from this is to respawn all 3 Consul servers, which leads to data loss and an outage.
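To make the arithmetic concrete, here is a minimal sketch of how stale entries eat into quorum. It assumes the usual Raft majority rule (more than half of the peer set) and that every failed-node entry still counts as a peer until it is reaped; the function names are only illustrative.

```python
def quorum_size(peer_count):
    """Majority needed for a peer set of the given size (floor(n/2) + 1)."""
    return peer_count // 2 + 1


def has_quorum(live_servers, stale_entries):
    """True if the live servers alone can still form a majority.

    Assumption: each stale entry in peers.json still counts toward the
    peer set size, so it raises the majority that must be reached.
    """
    total_peers = live_servers + stale_entries
    return live_servers >= quorum_size(total_peers)


if __name__ == "__main__":
    # 3 live servers, growing number of stale entries left by respawns.
    for stale in range(0, 5):
        total = 3 + stale
        print(stale, total, quorum_size(total), has_quorum(3, stale))
```

Under these assumptions, 3 live servers tolerate up to 2 stale entries; a third stale entry raises the required majority to 4 and the cluster can no longer elect a leader.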
As I understand it, HashiCorp chose 8 hours as the minimum so that users cannot set this value too low. In a worldwide cluster, network issues can put nodes into the failed state; with a low value, Raft would remove their entries and they would then have to be rejoined to the cluster manually.
But not allowing the parameter to be set lower causes problems for environments like ours, where all 3 Consul servers are on the same virtual network and the maximum recovery time is a matter of seconds. A value as high as 8 hours causes issues whenever we respawn a Consul server.
We would like to hear your feedback on this. It would be very helpful if HashiCorp could review the limits on the reaping time parameter and allow users to set the value that best matches their environment.
As a quick fix, we are working on using OpenShift probes and volumes to constantly check the Consul member list and the peers.json file. If a Consul server is in the failed state and its entry is present in peers.json, we clean peers.json and respawn the Consul server using the existing data with the corrected peers.json. We would like your feedback on this approach, which is meant to avoid failed-node entries in peers.json causing a stall where we cannot achieve quorum or elect a leader.
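The cleaning step of this quick fix could look roughly like the sketch below. It assumes peers.json is a JSON array of "ip:port" strings and that the list of failed IPs has already been collected from the Consul member list; the function name and sample addresses are hypothetical, and the probe wiring and file handling are left out.

```python
import json


def split_peers(peers_json_text, failed_ips):
    """Split peers.json entries into (keep, stale).

    Assumption: peers.json holds a JSON array of "ip:port" strings.
    failed_ips is the set of IPs reported as failed by the member list.
    """
    peers = json.loads(peers_json_text)
    failed = set(failed_ips)
    keep, stale = [], []
    for peer in peers:
        ip = peer.rsplit(":", 1)[0]  # drop the port
        (stale if ip in failed else keep).append(peer)
    return keep, stale


if __name__ == "__main__":
    # Hypothetical contents: two healthy servers and one crashed one.
    text = '["10.0.0.1:8300", "10.0.0.2:8300", "10.0.0.9:8300"]'
    keep, stale = split_peers(text, ["10.0.0.9"])
    print("stale entries:", stale)
    print("cleaned peers.json:", json.dumps(keep))
```

In this sketch the cleaned list would only be written back while the affected Consul server is stopped, before it is respawned with the existing data directory.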
Thanks,
Pradeep NSS
--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
GitHub Issues: https://github.com/hashicorp/consul/issues
IRC: #consul on Freenode
---
You received this message because you are subscribed to the Google Groups "Consul" group.
To unsubscribe from this group and stop receiving emails from it, send an email to consul-tool...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/consul-tool/9705bed9-a11c-4e63-a151-dabaad92a5d4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.