Serf Health Status Failures


David Petzel

Nov 4, 2014, 4:56:31 PM
to consu...@googlegroups.com
I've done some searching but so far come up dry. We are in the early stages of rolling out Consul in our AWS regions. Right now the regions are NOT communicating (we have not set that up yet), so we essentially have two independent installations, both exhibiting similar symptoms: the `Serf Health Status` check on random nodes seems to randomly (but frequently) go into error and then come back out of error, giving it a "flapping" effect.

Being super new to Consul, I'm wondering if anyone has seen this, and if not, what's the best way to narrow down what might be happening. I've been reviewing the logs, but nothing is jumping out at me as a "cause", just information letting me know it's happening.

Thanks

Brian Lalor

Nov 4, 2014, 6:15:17 PM
to David Petzel, consu...@googlegroups.com
When this happens to me, it usually means that the host doesn't allow access to the Consul LAN ports.

--
Brian Lalor
bla...@bravo5.org

David Petzel

Nov 4, 2014, 7:50:42 PM
to consu...@googlegroups.com, david...@gmail.com
Thanks Brian, you might be onto something here. After reading your reply, I went back and reviewed things, and I see the following:
 
From the Consul Docs: "Serf LAN (Default 8301). This is used to handle gossip in the LAN. Required by all agents, TCP and UDP."

So our current security group rules allow:
* All Consul **servers** can talk to all agents
* All agents can talk to all Consul servers
* Agents in prod can talk to agents in prod
* Agents in QA can talk to agents in QA
* Our QA and prod nodes are part of the same cluster (i.e. sharing the same Consul servers)
* QA agents can **not** talk to prod agents.

I don't fully understand how all the magic happens, but I could see things failing depending on which agent was testing which agent... I'll see about getting my security groups updated and report back after that.
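For reference, a minimal sketch of the rule change this implies, assuming boto3 and a single, hypothetical security group that every agent (servers and clients, prod and QA alike) belongs to; per the docs quoted above, the Serf LAN port 8301 needs both TCP and UDP open between all members:

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical group ID -- substitute the security group shared by every
# Consul agent. The rules reference the group itself, so any member can
# reach any other member on the Serf LAN port.
AGENT_SG = "sg-0123456789abcdef0"

for proto in ("tcp", "udp"):
    ec2.authorize_security_group_ingress(
        GroupId=AGENT_SG,
        IpPermissions=[{
            "IpProtocol": proto,
            "FromPort": 8301,
            "ToPort": 8301,
            "UserIdGroupPairs": [{"GroupId": AGENT_SG}],
        }],
    )
```

The same two rules can of course be added from the console or CLI; the point is simply that both protocols are opened agent-to-agent, not just agent-to-server.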

Edwin Fuquen

Nov 5, 2014, 10:45:09 AM
to consu...@googlegroups.com, david...@gmail.com
I've actually been seeing the exact same flapping behavior you've described, but our security group is set up to allow all communication on all ports for boxes within the same security group, which all our Consul servers/agents are on. Also, even if it were a blocked UDP port, I'm not sure how it would make sense for a box to keep coming into and out of a cluster; if it really were a result of ports being blocked, I wouldn't expect the Consul server to have any connectivity to the cluster at all. The fact that it is able to connect and then continues to repeatedly fall out of and rejoin the cluster would lead me to think it's something else (possibly network related, but not blocked ports).

What I've also noticed on my end is that this problem gets worse with time. Whenever I initially start a cluster it's fine, but after a few days things start to go downhill and boxes start flapping into and out of the cluster constantly.

David Petzel

Nov 5, 2014, 11:01:11 AM
to consu...@googlegroups.com, david...@gmail.com
In my configuration, the agents all have access to the Consul servers.

A quote from the Consul Docs:
The "serfHealth" check is special, in that all nodes automatically have this check. When a node joins the Consul cluster, it is part of a distributed failure detection provided by Serf. If a node fails, it is detected and the status is automatically changed to "critical".


Based on that information and the behavior we are seeing, I don't think the issue is "blocked ports" to the Consul servers per se, but the fact that not all nodes can talk to each other. So depending on how that distributed check works, I could imagine a scenario like this:
* Some production agents check on another production agent - It shows as up
* Next time around some QA agents check on that production agent - Now it shows as down, since in my case QA can't currently talk to production
* Next time we're back to some production agents checking.

So it is interesting that you're seeing similar issues without the connectivity restrictions I have in my environment.

Armon Dadgar

Nov 5, 2014, 1:40:10 PM
to David Petzel, consu...@googlegroups.com, david...@gmail.com
Hey,

The distributed failure detection does require that ALL nodes can communicate with each other over port 8301.
This means not just client <-> server, but client <-> client as well. This is so that the servers don’t have the burden
of all the health checking.

In the case of having QA / Prod environments, we typically recommend that they each have their own datacenter,
and not to overload them into the same datacenter. That way you get a clean namespace separation, and you
don’t run into strange networking issues when separating the clusters.
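In config terms, a rough sketch of that separation, assuming JSON config files and made-up datacenter names, is simply giving each environment its own `datacenter` value (and its own set of servers) rather than one shared gossip pool:

```python
import json

# Made-up datacenter names; each environment gets its own Consul
# datacenter with its own servers instead of sharing one LAN gossip pool.
for dc in ("prod", "qa"):
    config = {
        "datacenter": dc,
        # ...remaining agent settings for that environment...
    }
    with open("consul-%s.json" % dc, "w") as f:
        json.dump(config, f, indent=2)
```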

Edwin, the behavior you are describing is typical when the TCP port is open but not
the UDP port. The gossip system uses both TCP+UDP, so the UDP failure detector may be failing all the
time, and the TCP anti-entropy is able to recover, causing a constant flapping scenario.

Best Regards,
Armon Dadgar
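For anyone debugging the same symptom, a rough connectivity probe along these lines (the peer IPs are placeholders, and the port assumes the default) can confirm the TCP half of 8301. Note that a UDP send which doesn't raise an error only proves the packet left the sending host, so tcpdump on the far side remains the reliable check for the UDP half:

```python
import socket

# Hypothetical peer addresses -- substitute the private IPs of the agents
# that are flapping against each other.
PEERS = ["10.0.1.10", "10.0.2.20"]
SERF_LAN_PORT = 8301  # default Serf LAN port; adjust if customized

for host in PEERS:
    # TCP: a refused or timed-out connect here is conclusive.
    try:
        s = socket.create_connection((host, SERF_LAN_PORT), timeout=2)
        s.close()
        print("%s:%d TCP connect ok" % (host, SERF_LAN_PORT))
    except (socket.error, socket.timeout) as exc:
        print("%s:%d TCP connect FAILED: %s" % (host, SERF_LAN_PORT, exc))

    # UDP: sending never blocks on reachability, so this only shows the
    # packet was handed to the network; verify arrival with tcpdump.
    u = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        u.sendto(b"probe", (host, SERF_LAN_PORT))
        print("%s:%d UDP datagram sent (delivery not verified)" % (host, SERF_LAN_PORT))
    finally:
        u.close()
```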

David Petzel

Nov 5, 2014, 2:28:45 PM
to consu...@googlegroups.com, david...@gmail.com
Thanks Armon,
I'm pretty sure I've got my connectivity squared away; I've been capturing packets for a bit now, and I'm seeing connections across environments and such.

The bad news is that I'm still getting this flapping. 
I've gisted a sampling of the data I'm looking at here: https://gist.github.com/dpetzel/c48b2256dfce407adb15

Is there anything specific I can look at in more detail to help narrow down what is going on?

Armon Dadgar

Nov 5, 2014, 10:02:28 PM
to David Petzel, consu...@googlegroups.com, david...@gmail.com
Based on the logs, it doesn’t look like that particular pair that you ran tcpdump on is the problem.
On the machine that is being booted, you should see messages like:

Refuting suspect from X
Refuting dead from X

And on the other side you should see messages like:

Suspect X of failure

These messages will help establish the pairs of nodes having issues. The message you see in the
logs is just a plain Join/Fail which could be gossiped from any node, but doesn’t help pinpoint
the origin.

Best Regards,
Armon Dadgar
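A throwaway script along these lines (nothing Consul-specific, it just tallies lines containing the words Armon mentions) can make the problem pairs easier to spot when there are more than a couple of agent logs to read:

```python
import re
import sys
from collections import Counter

# Tally every "Refuting ..." / "Suspect ..." line in an agent log so the
# nodes that keep accusing each other stand out.
# Usage: python triage.py /path/to/consul.log
pattern = re.compile(r"(refut|suspect)\w*\s+.*", re.IGNORECASE)

counts = Counter()
with open(sys.argv[1]) as log:
    for line in log:
        m = pattern.search(line)
        if m:
            counts[m.group(0).strip()] += 1

for message, n in counts.most_common(20):
    print("%6d  %s" % (n, message))
```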


David Petzel

Nov 5, 2014, 10:28:12 PM
to consu...@googlegroups.com, david...@gmail.com
Thanks Armon, 
I think I finally have this kicked. Things have been looking stable for about 30 minutes now. In the end I had two different issues:
1) The first was the issue described above: my security groups were not permissive enough for agent <==> agent communication.
2) This one was much more convoluted. I have a mixture of "normal" agents (running natively on the OS) and some agents running as Docker containers. In our environment we customized the ports on both types of agents. On my Docker containers, however, I left the default ports configured and simply forwarded our custom ports to the default ports in the container. On the surface this seemed to work, as they would register. Once I fixed #1, it became clear the only nodes still having issues were the Docker containers. Looking through the logs on those, I could see log messages where it was trying to communicate with the default ports rather than our custom ports. So it "feels" like an agent announces to the group what port it is listening on, and other nodes honor that and try to use it (makes sense). I ended up just updating my container image so that the agent is listening on our custom ports inside the container, and after I rolled out that change I've not had any issues.
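For anyone in the same boat, the fix boils down to giving the containerized agent the same ports configuration the native agents use, so the ports it advertises to its peers match the ports it actually listens on. A sketch, with made-up port numbers and a conventional config path:

```python
import json

# Made-up custom ports -- substitute whatever your native agents use.
# The agent advertises these ports to the rest of the cluster, so the
# agent inside the container must listen on them directly rather than
# listening on the defaults behind a Docker port remap.
custom_ports = {
    "serf_lan": 8311,  # LAN gossip, TCP + UDP
    "serf_wan": 8312,
    "server": 8310,
}

with open("/etc/consul.d/ports.json", "w") as f:
    json.dump({"ports": custom_ports}, f, indent=2)
```

The container then publishes those same numbers straight through (e.g. `-p 8311:8311/tcp -p 8311:8311/udp`) instead of remapping custom host ports onto the defaults.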

Thanks again for your help and for the confirmation on my questions.