Consul - SerfHealth keeps flapping up and down


Rishi Dhupar

May 12, 2015, 7:41:59 PM
to consu...@googlegroups.com
I just started playing with Consul on about 10 nodes. I must say it was extremely easy, and I will continue to evaluate it!

One of the nodes keeps spamming error messages, and consul-alert keeps detecting its SerfHealth check flapping up and down. I read in other topics that this may have to do with network ACLs, but I don't think that is my problem. I have verified that there are no iptables rules on either the suspect node or the healthy server. I am running only one server, just to play around with Consul. I compared packets using tcpdump, and every packet, UDP or TCP, that is sent or received matches the other box exactly. I then compared this unhealthy box to a healthy one; traffic-wise they look very similar.

Here is what Consul on the bad box keeps logging:

2015/05/12 19:24:54 [WARN] memberlist: Refuting a suspect message (from: a-01)
2015/05/12 19:24:57 [WARN] memberlist: Refuting a suspect message (from: b-01)
2015/05/12 19:25:02 [WARN] memberlist: Refuting a suspect message (from: c-01)
2015/05/12 19:25:04 [WARN] memberlist: Refuting a suspect message (from: d-01)
2015/05/12 19:25:07 [WARN] memberlist: Refuting a suspect message (from: d-01)
2015/05/12 19:25:23 [WARN] memberlist: Refuting a suspect message (from: f-01)
2015/05/12 19:25:36 [WARN] memberlist: Refuting a dead message (from: f-01)
2015/05/12 19:25:53 [WARN] memberlist: Refuting a suspect message (from: g-01)
2015/05/12 19:26:57 [WARN] memberlist: Refuting a suspect message (from: h-02)
2015/05/12 19:27:35 [WARN] memberlist: Refuting a dead message (from: f-01)
2015/05/12 19:27:37 [WARN] memberlist: Refuting a suspect message (from: a-01)
2015/05/12 19:27:40 [WARN] memberlist: Refuting a suspect message (from: d-01)
2015/05/12 19:27:43 [WARN] memberlist: Refuting a suspect message (from: a-01)


19:39:35.128231 IP 192.168.20.11.42206 > 192.168.20.15.8300: Flags [.], ack 2672, win 168, options [nop,nop,TS val 1264064041 ecr 1903407511], length 0
19:39:48.579813 IP 192.168.20.11.8301 > 192.168.20.15.8301: UDP, length 40
19:39:48.580795 IP 192.168.20.15.8301 > 192.168.20.11.8301: UDP, length 11
19:39:50.180139 IP 192.168.20.11.8301 > 192.168.20.15.8301: UDP, length 155
19:39:53.581889 IP 192.168.20.11.8301 > 192.168.20.15.8301: UDP, length 40
19:39:53.582821 IP 192.168.20.15.8301 > 192.168.20.11.8301: UDP, length 11
19:40:05.086614 IP 192.168.20.11.42206 > 192.168.20.15.8300: Flags [P.], seq 345:357, ack 2672, win 168, options [nop,nop,TS val 1264093999 ecr 1903407511], length 12
19:40:05.087517 IP 192.168.20.15.8300 > 192.168.20.11.42206: Flags [P.], seq 2672:2684, ack 357, win 130, options [nop,nop,TS val 1903437511 ecr 1264093999], length 12
19:40:05.087538 IP 192.168.20.11.42206 > 192.168.20.15.8300: Flags [.], ack 2684, win 168, options [nop,nop,TS val 1264094000 ecr 1903437511], length 0
19:40:05.087546 IP 192.168.20.15.8300 > 192.168.20.11.42206: Flags [P.], seq 2684:2696, ack 357, win 130, options [nop,nop,TS val 1903437511 ecr 1264093999], length 12
19:40:05.087552 IP 192.168.20.11.42206 > 192.168.20.15.8300: Flags [.], ack 2696, win 168, options [nop,nop,TS val 1264094000 ecr 1903437511], length 0
19:40:05.087772 IP 192.168.20.11.42206 > 192.168.20.15.8300: Flags [P.], seq 357:369, ack 2696, win 168, options [nop,nop,TS val 1264094000 ecr 1903437511], length 12
19:40:05.127435 IP 192.168.20.15.8300 > 192.168.20.11.42206: Flags [.], ack 369, win 130, options [nop,nop,TS val 1903437551 ecr 1264094000], length 0
19:40:05.244973 IP 192.168.20.11.42206 > 192.168.20.15.8300: Flags [F.], seq 369, ack 2696, win 168, options [nop,nop,TS val 1264094157 ecr 1903437551], length 0
19:40:05.245743 IP 192.168.20.15.8300 > 192.168.20.11.42206: Flags [F.], seq 2696, ack 370, win 130, options [nop,nop,TS val 1903437669 ecr 1264094157], length 0
19:40:05.245761 IP 192.168.20.11.42206 > 192.168.20.15.8300: Flags [.], ack 2697, win 168, options [nop,nop,TS val 1264094158 ecr 1903437669], length 0
19:40:07.579912 IP 192.168.20.11.8301 > 192.168.20.15.8301: UDP, length 40
19:40:07.580895 IP 192.168.20.15.8301 > 192.168.20.11.8301: UDP, length 11
19:40:17.579864 IP 192.168.20.11.8301 > 192.168.20.15.8301: UDP, length 40
19:40:17.580813 IP 192.168.20.15.8301 > 192.168.20.11.8301: UDP, length 11


Any ideas on how to debug this?

Armon Dadgar

May 12, 2015, 7:53:11 PM
to consu...@googlegroups.com, Rishi Dhupar
Hey Rishi,

This indicates that UDP traffic is likely not flowing correctly between all the nodes.
The failure detector is, at its core, a UDP PING followed by an ACK. If the ACK message
is dropped (due to ACLs, firewalls, iptables, NAT, etc.), then the machine appears to be
dead, and you will see this behavior.
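
To make the ping-then-ack idea concrete, here is a minimal Go sketch of a probe that sends one UDP datagram and waits for a reply until a deadline. It is not the memberlist wire protocol: the "ping"/"ack" payloads and the names `probe` and `startResponder` are made up for illustration. It only shows why a dropped reply makes a healthy node look dead.

```go
package main

import (
	"bytes"
	"net"
	"time"
)

// probe sends a "ping" datagram to addr and reports whether an "ack"
// comes back before the timeout expires.
func probe(addr string, timeout time.Duration) bool {
	conn, err := net.Dial("udp", addr)
	if err != nil {
		return false
	}
	defer conn.Close()

	if _, err := conn.Write([]byte("ping")); err != nil {
		return false
	}
	// UDP gives no delivery guarantee, so bound the wait for the reply.
	if err := conn.SetReadDeadline(time.Now().Add(timeout)); err != nil {
		return false
	}
	buf := make([]byte, 16)
	n, err := conn.Read(buf)
	if err != nil {
		// Timeout or ICMP error: from here the peer looks dead,
		// even if only the reply was dropped on the way back.
		return false
	}
	return bytes.Equal(buf[:n], []byte("ack"))
}

// startResponder answers every datagram with "ack" on a loopback port
// chosen by the OS. It returns the bound address and a stop function.
func startResponder() (string, func(), error) {
	pc, err := net.ListenPacket("udp", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}
	go func() {
		buf := make([]byte, 16)
		for {
			_, from, err := pc.ReadFrom(buf)
			if err != nil {
				return // socket closed by stop
			}
			pc.WriteTo([]byte("ack"), from)
		}
	}()
	return pc.LocalAddr().String(), func() { pc.Close() }, nil
}
```

On loopback, `probe` returns true while the responder is running and false once it has been stopped. A firewall that drops only the reply direction would produce the same false result, which is the situation described above.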

Look at the logs on one of the machines that suspects the failure. For example:

memberlist: Refuting a suspect message (from: a-01)

This line indicates that “a-01” thinks this node has failed, so the logs on “a-01” will provide more information.
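
When many such lines pile up, it can help to tally which peers are doing the accusing. The following Go sketch counts the `from:` names in log text shaped like the lines quoted in this thread. The function name `accusers` and the regular expression are illustrative assumptions based only on those quoted lines, not on any documented log format.

```go
package main

import (
	"regexp"
	"strings"
)

// refuteRe matches lines such as
// "[WARN] memberlist: Refuting a suspect message (from: a-01)".
// The pattern is derived from the log lines quoted in this thread.
var refuteRe = regexp.MustCompile(`memberlist: Refuting a (?:suspect|dead) message \(from: ([^)]+)\)`)

// accusers counts, per peer name, how many suspect or dead messages
// the local node had to refute.
func accusers(logText string) map[string]int {
	counts := make(map[string]int)
	for _, line := range strings.Split(logText, "\n") {
		if m := refuteRe.FindStringSubmatch(line); m != nil {
			counts[m[1]]++
		}
	}
	return counts
}
```

If nearly every peer shows up in the result, the problem is more likely on the accused node itself than on any single accuser.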

That will give you a good place to start, but the cause is very likely UDP.

Best Regards,

Armon Dadgar

Rishi Dhupar

May 12, 2015, 10:46:53 PM
to consu...@googlegroups.com, ris...@gmail.com
Thanks for the quick reply. I forgot that the gossip protocol is more broadcast-based. All 8 other nodes are complaining about this box, so I went to one of them and do see some more info. I am assuming UDP port 8301 (Serf LAN) is the traffic you were referring to.

    2015/05/12 22:25:54 [INFO] serf: EventMemberJoin: bad-box-01 192.168.21.113
    2015/05/12 22:26:08 [INFO] memberlist: Suspect bad-box-01 has failed, no acks received
    2015/05/12 22:26:09 [INFO] memberlist: Marking bad-box-01 as failed, suspect timeout reached
    2015/05/12 22:26:09 [INFO] serf: EventMemberFailed: bad-box-01 192.168.21.113
    2015/05/12 22:26:10 [INFO] serf: EventMemberJoin: bad-box-01 192.168.21.113
    2015/05/12 22:26:11 [INFO] memberlist: Suspect bad-box-01 has failed, no acks received
    2015/05/12 22:26:22 [INFO] memberlist: Suspect bad-box-01 has failed, no acks received
    2015/05/12 22:26:35 [INFO] serf: EventMemberFailed: bad-box-01 192.168.21.113
    2015/05/12 22:26:54 [INFO] serf: EventMemberJoin: bad-box-01 192.168.21.113

I ran tcpdump on UDP port 8301 on the bad box and see traffic going in both directions. I did the same on 4 different boxes and see UDP port 8301 traffic going in both directions to and from the bad box. I will have the network guys check whether there is some ACL in place, but that wouldn't make sense based on the data I have collected with tcpdump.

I just noticed something interesting: the IP address printed in the log of a box that cannot connect isn't valid. 192.168.21.113 does not even exist on bad-box-01. The catalog/nodes endpoint also shows the wrong IP address for bad-box-01. How would that happen, and how would I fix it?
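
For anyone wanting to repeat that catalog check programmatically, here is a small Go sketch that pulls one node's registered address out of a catalog/nodes response body. It assumes the response is a JSON array whose entries carry `Node` and `Address` fields. The name `catalogAddress` is hypothetical, and fetching the body is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// catalogNode keeps only the two fields of a catalog entry used here.
// Any other fields in the response are ignored by the decoder.
type catalogNode struct {
	Node    string
	Address string
}

// catalogAddress returns the address the catalog has registered for node.
func catalogAddress(body []byte, node string) (string, error) {
	var nodes []catalogNode
	if err := json.Unmarshal(body, &nodes); err != nil {
		return "", fmt.Errorf("decoding catalog response: %w", err)
	}
	for _, n := range nodes {
		if n.Node == node {
			return n.Address, nil
		}
	}
	return "", fmt.Errorf("node %q not found in catalog", node)
}
```

The body could come from an HTTP GET against a local agent, for example `http://127.0.0.1:8500/v1/catalog/nodes` if the agent uses the default HTTP port. The returned address can then be compared with what the box itself reports.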





Rishi Dhupar

May 12, 2015, 11:14:50 PM
to consu...@googlegroups.com
Somehow Consul is getting the wrong IP address for this particular box: an address that doesn't even exist on it. The subnet that the IP belongs to does exist, though. For now I worked around the problem by setting advertise_addr on this box. I will try to look into how Consul is picking up this incorrect IP.

Rishi Dhupar

May 13, 2015, 7:15:53 AM
to consu...@googlegroups.com
I was able to debug this a bit; I think I was mostly excited to finally write some Go code.

It appears Consul is picking up an IP address from an interface that is down. I have created an issue for this and might attempt a patch.
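
As a rough illustration of that failure mode, the Go sketch below picks an address only from interfaces that are up. The selection rule (first private IPv4 address) and the names `iface`, `systemIfaces`, and `firstPrivateIPv4` are assumptions for this example. This is not Consul's code and not the patch mentioned above.

```go
package main

import "net"

// iface is a simplified, hypothetical view of one network interface.
type iface struct {
	Name  string
	Up    bool
	Addrs []string // CIDR notation, e.g. "192.168.20.15/24"
}

// systemIfaces reads the host's interfaces, including ones that are down.
func systemIfaces() ([]iface, error) {
	nis, err := net.Interfaces()
	if err != nil {
		return nil, err
	}
	var out []iface
	for _, ni := range nis {
		addrs, err := ni.Addrs()
		if err != nil {
			return nil, err
		}
		entry := iface{Name: ni.Name, Up: ni.Flags&net.FlagUp != 0}
		for _, a := range addrs {
			entry.Addrs = append(entry.Addrs, a.String())
		}
		out = append(out, entry)
	}
	return out, nil
}

// isRFC1918 reports whether ip lies in one of the private IPv4 ranges.
func isRFC1918(ip net.IP) bool {
	for _, block := range []string{"10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"} {
		if _, n, err := net.ParseCIDR(block); err == nil && n.Contains(ip) {
			return true
		}
	}
	return false
}

// firstPrivateIPv4 returns the first private IPv4 address found on an
// interface that is up. Addresses on down interfaces are skipped.
func firstPrivateIPv4(ifaces []iface) (string, bool) {
	for _, i := range ifaces {
		if !i.Up {
			continue // an address here is configured but not reachable
		}
		for _, a := range i.Addrs {
			ip, _, err := net.ParseCIDR(a)
			if err != nil {
				continue
			}
			if ip4 := ip.To4(); ip4 != nil && isRFC1918(ip4) {
				return ip4.String(), true
			}
		}
	}
	return "", false
}
```

Passing the result of `systemIfaces` to `firstPrivateIPv4` applies the rule to the real host. Without the `Up` check, an address left configured on a down interface could be returned first, which would match the symptom seen in this thread.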


Thanks.