Reaping time in consul


nss pradeep

Jul 1, 2016, 9:03:45 AM
to Consul

Hi,


I have a question about the reaping time (the 8-hour limit in version 0.7.0). Let me explain the issue we are observing and the environment we are using:


We are running Consul in Docker containers in an OpenShift environment (Platform as a Service). We have three Consul servers that form a cluster with bootstrap-expect set to 3, and a script that takes care of finding the Consul servers and performing a join to form the cluster.


When a Consul server crashes and goes into the failed state, Raft keeps the failed node's IP in its peers.json file. In our OpenShift environment a new Consul server respawns automatically whenever one crashes, but the new server gets a new IP address. If we encounter multiple such crashes within 8 hours on version 0.7.0, we end up with many failed node entries in peers.json, to the point where we can no longer achieve quorum or elect a leader. The only way we have found to recover from this scenario is to respawn all three Consul servers, which leads to loss of data and an outage.


My understanding is that HashiCorp preferred 8 hours to keep users from setting this value too low: in a world-wide cluster, transient network issues can push nodes into the failed state, and an aggressive reap would remove their entries from Raft, requiring manual action to rejoin them to the cluster.


But by not making the parameter configurable to a lower value, it causes issues for environments such as ours, where all three Consul servers sit on the same virtual network and the maximum time to recover is a matter of seconds. A value as high as 8 hours causes issues whenever we respawn a Consul server.


We would like to hear your feedback on this. It would be very helpful if HashiCorp could review the limitations on configuring the reaping time parameter and allow users to define the value that best matches their environment.


As a quick fix, we are working with OpenShift's probes and volumes features to continuously check the Consul member list against the peers.json file. If we find any Consul server in the failed state whose entry is present in peers.json, we clean up peers.json and respawn that Consul server with the corrected file. We would appreciate your feedback on this approach to avoiding the scenario where failed node entries in peers.json leave us unable to achieve quorum and elect a leader.
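The core of that check can be sketched in a few lines. This is only a rough sketch, assuming the pre-raft-protocol-v3 peers.json format (a JSON array of "ip:port" strings) and the standard column layout of `consul members` output; the function names are ours:

```python
import json

def stale_peer_entries(members_output, peers_json):
    """Return peers.json entries whose IP belongs to a failed member.

    Assumes peers.json is a JSON array of "ip:port" strings and that
    `consul members` prints columns: Node, Address, Status, ...
    """
    failed_ips = set()
    for line in members_output.splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) >= 3 and parts[2] == "failed":
            failed_ips.add(parts[1].split(":")[0])
    return [p for p in json.loads(peers_json) if p.split(":")[0] in failed_ips]
```

If this returns a non-empty list, the probe triggers the cleanup and respawn.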


Thanks,

Pradeep NSS

James Phillips

Jul 1, 2016, 2:24:14 PM
to consu...@googlegroups.com
Hi Pradeep,

You are correct - we've been super hesitant to allow lower reap times because the logic isn't really aware of the requirements of quorum so it can lead to undesired configurations after long partition events (servers that could have rejoined would now be evicted).

We are working on a design for an automatic controller that will let the leader take care of this, and be aware of the stability of the cluster. It should allow you to configure a desired number of servers and it will evict dead servers for you *iff* the cluster is in a stable state. I don't have a timeline for this yet, but it should be Q3 some time.

As a workaround, if you have probes that can look for failed servers, you should be able to issue a force-leave command (https://www.consul.io/docs/commands/force-leave.html) to evict it from the cluster without having to take any downtime or modify the peers.json. This is basically kicking the reap to happen early.
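A minimal version of such a probe might look like this (a sketch only, not tested against your setup; it assumes the consul binary is on the PATH and the standard `consul members` column layout):

```python
import subprocess

def failed_members(members_output):
    """Pick out node names whose status column reads "failed"."""
    nodes = []
    for line in members_output.splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) >= 3 and parts[2] == "failed":
            nodes.append(parts[0])
    return nodes

if __name__ == "__main__":
    out = subprocess.run(["consul", "members"],
                         capture_output=True, text=True).stdout
    for node in failed_members(out):
        # Kick the reap early instead of waiting for the timeout.
        subprocess.run(["consul", "force-leave", node])
```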

Hope that helps!

-- James

--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/consul/issues
IRC: #consul on Freenode
---
You received this message because you are subscribed to the Google Groups "Consul" group.
To unsubscribe from this group and stop receiving emails from it, send an email to consul-tool...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/consul-tool/9705bed9-a11c-4e63-a151-dabaad92a5d4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

David Adams

Jul 1, 2016, 2:41:42 PM
to consu...@googlegroups.com
I've been running into issues with this on hosts that are autoscaled (by AWS ASGs), where over the course of 12-16 hours each day we start and stop several hundred hosts. However, it seems that most if not all of the time, when the hosts are scaled down the consul agent doesn't get a chance to send a leave notice, so those hosts are marked as failed. So I put in place an hourly job to find "failed" hosts which have also been terminated according to the EC2 API and issue a force-leave for them. This does set their status to "left", but the reaping is still falling way behind, with typically over a hundred hosts listed in the "left" state even after several hours of no scale-downs. We're seeing other issues (memory leaks, it looks like) with the consul agent on a few long-running hosts in the same datacenter, and I'm trying to determine if these things are related.

We're still on 0.6.3, if that matters. I'm still collecting info on this, because only in the last few days have we started noticing actual problems related to the consul agent on any hosts (after running in this configuration for several months), so I don't have much more context to give. But this topic seems closely related, so I thought I'd bring it up to find out if there are any known issues I might be running into here.
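For reference, the matching step of my hourly job is roughly this (a simplified sketch; the real job pulls terminated instance ids from the EC2 API, e.g. via boto, and our node names embed the instance id, which may not match your naming scheme):

```python
def nodes_to_force_leave(failed_nodes, terminated_ids):
    """Return failed consul members whose EC2 instance is gone.

    Assumes the node name contains the instance id (e.g. "web-i-0abc123"),
    so a failed node matching a terminated id is safe to force-leave.
    """
    return [node for node in failed_nodes
            if any(iid in node for iid in terminated_ids)]
```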

James Phillips

Jul 1, 2016, 2:54:41 PM
to consu...@googlegroups.com
Hi David,

For non-server agents we were thinking about possibly letting the agents give their desired reap time for cases like this. With ASGs we've definitely seen cases where they can't leave cleanly.

There was a goroutine leak in 0.6.3 that was fixed in 0.6.4 related to TCP pings, so it was common to hit this in large clusters where you'd get non-trivial lost packets and invoke the TCP handler. This is the change that pulled in the fix - https://github.com/hashicorp/consul/pull/1802.

-- James

nss pradeep

Jul 5, 2016, 9:34:19 AM
to Consul
Hi James,

Thank you for the response. 

Regarding the automatic controller: we expect a leader to take care of evicting the dead servers, but suppose we are in a state with more than one dead server in peers.json, which prevents the new Consul servers from forming a quorum and electing a leader. Once we are in that state, even the automatic controller will not work, since we will not be able to establish a leader, right?

Looking forward to your response.

Thanks,
Pradeep NSS

James Phillips

Jul 5, 2016, 9:40:24 AM
to consu...@googlegroups.com
Hi Pradeep,

That's correct - you'd always need a majority of nodes to make automatic config changes. Once you are in an outage state a human operator would need to help clean up. The automation should make this much less likely to happen, though, by dealing with these issues shortly after they happen.
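To make the arithmetic concrete (a quick illustration, not Consul code): Raft needs a strict majority of the peer set, so stale entries raise the bar for the healthy servers.

```python
def quorum(peer_set_size):
    """Votes Raft needs to elect a leader: a strict majority of the peer set."""
    return peer_set_size // 2 + 1

# 3 healthy servers alone: peer set is 3, quorum is 2 -> a leader can be elected.
# 3 healthy servers plus 3 stale failed entries: peer set is 6, quorum is 4,
# and the 3 healthy servers can no longer elect a leader on their own.
```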

-- James

nss pradeep

Jul 6, 2016, 9:08:54 AM
to Consul
Hi James,

Going back to the reaping time: in an environment such as ours, where all three Consul servers sit on the same OpenShift virtual network, network issues will not last more than a few seconds. The Consul servers respawn (with a new IP address) in case of a crash and join the cluster, so if we have several crashes within the 8-hour window (currently 72 hours), failed node entries accumulate in peers.json and we lose the ability to form a quorum. So without the ability to set the reaping time to a lower value, the situation will not improve for such environments even with the automatic controller.

Could you please share your feedback on this?

Thanks,
Pradeep NSS

David Adams

Jul 6, 2016, 9:57:56 AM
to consu...@googlegroups.com
James,
Just upgraded to 0.6.4, and goroutine counts appear to be under control. We had some hosts with tens or hundreds of thousands of goroutines when running 0.6.3, but after 16 hours on 0.6.4, pretty much every agent in our biggest DC (~500 hosts at the moment, though it grows and shrinks throughout the day) has exactly 33. Which is an incredibly striking difference.

To follow up on the topic here, how soon should hosts that have "left" be reaped? I have an hourly job that issues a force-leave for scaled-down hosts, but the actual EventMemberReap doesn't happen until 24 hours after the host was marked as "failed" and doesn't seem to coincide with the force-leave timing at all.

-dave

James Phillips

Jul 6, 2016, 1:26:01 PM
to consu...@googlegroups.com
Hi Pradeep,

The automatic controller would see that the cluster is stable and that the desired number of healthy servers are present, and then would kick the failed servers out of the cluster for you, which would clean up peers.json as well. The reap time would no longer matter for your servers.
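The eviction rule would be roughly of this shape (a sketch of the idea only; the design is still being worked out, and all the names here are made up):

```python
def should_evict_failed(healthy, desired, failed):
    """Evict dead servers only when the cluster is stable: the desired
    number of healthy servers is present, and the healthy servers hold
    quorum over the current peer set on their own."""
    if failed == 0:
        return False  # nothing to clean up
    quorum = (healthy + failed) // 2 + 1
    return healthy >= desired and healthy >= quorum
```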

-- James

James Phillips

Jul 6, 2016, 1:45:07 PM
to consu...@googlegroups.com
Hi David,

Glad 0.6.4 fixed it!

For your second question, you are correct that it uses the time from when the node was failed as the basis of the 24 hour timeout. This doesn't get pushed out when you do the force-leave operation.

-- James

nss pradeep

Jul 6, 2016, 2:13:55 PM
to Consul
Hi James,

Will the automatic controller watch for failed servers, or will it run every "x" seconds to check for failed servers and clear them?

Thanks,
Pradeep NSS

James Phillips

Jul 6, 2016, 2:17:03 PM
to consu...@googlegroups.com
I'm still working out the details but I think it will run periodically, like once per minute.

nss pradeep

Jul 7, 2016, 9:42:33 AM
to Consul
Hi James,

Thank you for answering my questions. Is there any way I can track this feature to know when it will be available?

Thanks,
Pradeep NSS

James Phillips

Jul 7, 2016, 11:55:48 AM
to consu...@googlegroups.com
Sure, I created https://github.com/hashicorp/consul/issues/2171 and will ping that issue with updates!

Phani Piduri

Jul 16, 2016, 7:16:20 PM
to Consul
Hi Pradeep, would you be able to provide some direction on deploying Consul on OpenShift?
So far we have used Consul for service discovery and registration for microservices in our non-PaaS environment, just running the standalone Spring Boot apps.
Now that we have dockerized all of these apps, we would like to deploy them on OpenShift with Consul as the service discovery tool.
Is there a sample script you could share for setting up the Consul cluster? I assume you have used Registrator as well. I appreciate all your help.

When I run oc new-app with the consul image, the pod itself does not come up successfully.

If anyone else on this group can suggest best practices for achieving this, I would appreciate it.

Thanks !!

nss pradeep

Jul 17, 2016, 11:17:00 AM
to consu...@googlegroups.com
Hi Phani,

Did you check the logs when the pod failed? What is the status when you run "oc get pod"?

Thanks,
Pradeep NSS


nss pradeep

Aug 8, 2016, 1:46:02 PM
to Consul