ARP Cache Issues

Bryan Rockwood

Aug 7, 2015, 1:50:41 PM
to CoreOS User
Greetings,

I am currently experiencing a problem with the ARP cache on a set of CoreOS instances running release 723.3.0 on EC2.  We have a service that spins up and destroys EC2 instances running fleet on a subnet that talks to an Etcd 2 cluster of three machines on the same subnet.  The first time an instance spins up, everything works fine.  The second time, if Amazon gives that new instance a recently used IP (recently used as in within the last hour), the new instance cannot communicate with the Etcd server.  If I run 'ip neigh', I can see that the machine shows as REACHABLE but the MAC address is not correct.  If I then run 'ip -s -s neigh flush all', the new instance's fleet service immediately connects and 'ip neigh' shows the correct MAC address.  Honestly, I'm stumped at how to handle this.  Would it be acceptable to put a timer unit in place that runs a flush every five minutes?  Or is there a configuration option I could flip that would help with this situation?
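If it helps to make the question concrete, the workaround I have in mind is roughly the cloud-config below (the unit names are made up, and it obviously only masks the symptom rather than fixing whatever is holding on to the stale entry):

#cloud-config
coreos:
  units:
    - name: flush-arp.service
      content: |
        [Unit]
        Description=Flush the neighbour (ARP) cache

        [Service]
        Type=oneshot
        # adjust the path if ip lives elsewhere on your image (check with: command -v ip)
        ExecStart=/usr/bin/ip -s -s neigh flush all
    - name: flush-arp.timer
      command: start
      content: |
        [Unit]
        Description=Run flush-arp.service every five minutes

        [Timer]
        OnCalendar=*:0/5

        [Install]
        WantedBy=timers.target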

Bryan

eugene.y...@coreos.com

Aug 7, 2015, 2:46:28 PM
to CoreOS User
Hi Bryan,

Can you "cat /proc/sys/net/ipv4/neigh/default/gc_stale_time" and look at the value. That should be the ARP timeout in seconds.
If that value is within the range of time that you're having problems then you can try dialing it down. Please see https://coreos.com/os/docs/latest/other-settings.html#tuning-sysctl-parameters
for instructions on how to change sysctls.
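For reference, the shape of it from that doc is roughly the following; the file name and the value 30 here are just an example, not a recommendation:

#cloud-config
write_files:
  - path: /etc/sysctl.d/10-arp.conf
    content: |
      # example: consider a neighbour entry stale sooner than the default 60s
      net.ipv4.neigh.default.gc_stale_time = 30
coreos:
  units:
    - name: systemd-sysctl.service
      command: restart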

Thanks,
Eugene

Bryan Rockwood

Aug 8, 2015, 8:52:35 AM
to CoreOS User
Eugene,

Thank you for replying.  When I cat 'gc_stale_time', it returns '60', which I take to mean 60 seconds.  But if I wait 15 minutes and spin up another instance, the old MAC address still shows up in the list as STALE.  Here's the output:

10.1.22.134 dev eth0 lladdr 0a:c5:39:83:68:d5 STALE

which I confirmed is the MAC address of the previous instance with that IP.  The new MAC address that AWS handed out is 0a:ee:63:1a:d8:05.  Are there any other sysctl parameters I could tweak to help with this issue?
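For reference, everything under that tree can be listed like this, in case one of the other knobs turns out to be relevant:

# print each neighbour-table tunable together with its current value
grep . /proc/sys/net/ipv4/neigh/default/*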

Bryan

eugene.y...@coreos.com

Aug 9, 2015, 10:39:48 PM
to CoreOS User
Hi Bryan,

Can you run "watch -n 0.5 ip neigh" in one terminal and try pinging the new instance in a different terminal. Does that entry transition from STALE to any other state? It should transition to something like INCOMPLETE or DELAY,PROBE and then to REACHALE/FAILED. What do you see? Can you also run tcpdump via toolbox (https://github.com/coreos/toolbox) to see if the ARP probes are sent out.

-Eugene

Bryan Rockwood

Aug 10, 2015, 11:48:37 AM
to CoreOS User
Eugene,

I've done some more testing and I've been able to narrow the issue down a bit.  When I destroy the EC2 instance, the Etcd server will report the instance's IP as having gone STALE:

10.1.22.134 dev eth0 lladdr 0a:97:23:77:15:67 STALE

When I bring a new instance up with that same IP, it will stay STALE until one of two things happens.  1) If I attempt to ping the new instance from the Etcd server (which has the IP 10.1.22.5, by the way) per your instructions below, the entry will go through the process of STALE to DELAY to FAILED to REACHABLE and things work fine.  2) On the other hand, if I try to initiate communication from the new instance with the old IP first, the entry will go from STALE to REACHABLE on the Etcd server but show the old MAC address.  Then the new instance is unable to communicate with the Etcd server.  When I use the following cloud-config, it will basically lock the Etcd server into reporting REACHABLE until I flush the table manually.

#cloud-config
coreos:
  units:
    - name: fleet.service
      command: start
  fleet:
    public-ip: $private_ipv4

Would it still help to run tcpdump via the toolbox, or does this shed any light on the problem?
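In the meantime I'm clearing it by hand.  I assume deleting just the one entry on the Etcd server would work the same as the blanket flush, something like:

ip neigh del 10.1.22.134 dev eth0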

Bryan

eugene.y...@coreos.com

Aug 10, 2015, 6:37:04 PM
to CoreOS User
Hi Bryan,

This is very odd. It would be less surprising if it were the other way around.

Can you run tcpdump on the new instance when you initiate a connection to the etcd server and see if the packet leaving the box has the correct (new) MAC address? I guess you should also confirm that it arrives with the same MAC on the etcd server.
If so, this becomes an even bigger mystery as to where that old MAC comes from.
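Roughly, on each box via toolbox (the -e flag prints the Ethernet header so you can see the MACs directly):

# on the new instance: show link-layer headers on traffic to/from the etcd box
tcpdump -n -e -i eth0 host 10.1.22.5

# on the etcd box: same thing, filtered on the client's IP, to compare the source MAC on arrival
tcpdump -n -e -i eth0 host 10.1.22.134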

-Eugene

Bryan Rockwood

Aug 10, 2015, 7:49:11 PM
to CoreOS User
Eugene,

Attached are the cap files from both boxes.  The 10.1.22.5.cap file is from the Etcd server and the 10.1.22.134.cap file is from the client with the old IP.  As setup, I waited until 10.1.22.134 was reported as STALE on the Etcd server, then started tcpdump on both boxes via the toolbox.  On 10.1.22.134, I ran tcpdump as:

tcpdump -i eth0 -v -w capture.cap port not 22

and on 10.1.22.5, I ran it as:

tcpdump -i eth0 -w capture.cap -vvv port not 22 and port not 4001 and not port 7001 and port not 2380 

to filter out some of the noise.

I then tried to curl 10.1.22.5:2379, then tried to ping 10.1.22.5, curled again, and then ran a final ping.  The entire time, 'ip neigh' on 10.1.22.5 reported the entry as REACHABLE.

Bryan
10.1.22.5.cap.gz
10.1.22.134.cap

eugene.y...@coreos.com

Aug 11, 2015, 12:35:47 AM
to CoreOS User
Bryan,

Your logs confirm that the 10.1.22.5 node gets packets with the new MAC but replies to the old MAC. The ARP entry is not being updated. What is supposed to happen is that after 10.1.22.5 sends a packet (a SYN+ACK in this case) to the wrong MAC, the entry should transition to the DELAY state. Since it does not get a response within a few seconds, it should then transition to the PROBE state and send an ARP probe. I don't see that ARP packet being sent from 10.1.22.5, and I have no idea how the entry transitions to REACHABLE.
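For reference, the timers that drive that DELAY -> PROBE transition sit next to gc_stale_time; the defaults noted below are from memory, so double-check on your boxes:

# seconds an entry sits in DELAY before moving to PROBE (usually 5)
cat /proc/sys/net/ipv4/neigh/eth0/delay_first_probe_time

# unicast ARP probes sent in PROBE before the entry is marked FAILED (usually 3)
cat /proc/sys/net/ipv4/neigh/eth0/ucast_solicit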

An interesting unrelated fact: the DELAY state is there to save sending the ARP probe if a SYN+ACK comes back (proving reachability). After combing the kernel source, though, I can't find the place where that happens.

I did find this issue which sounds basically the same: https://forums.aws.amazon.com/thread.jspa?messageID=575277
They propose setting net.ipv4.neigh.default.gc_thresh1=0. This should make the garbage collector more aggressive about removing stale entries. Even that, however, wouldn't be foolproof, as the entry would need to stick around in the STALE state long enough for the GC to run.
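If you go that route, it's worth checking the GC interval too, since the collector only runs periodically (values below are examples; persist them via a sysctl.d file as in the doc I linked earlier):

# current threshold and GC interval (gc_interval defaults to 30 seconds, if I remember right)
sysctl net.ipv4.neigh.default.gc_thresh1 net.ipv4.neigh.default.gc_interval

# the workaround from that thread, applied at runtime
sysctl -w net.ipv4.neigh.default.gc_thresh1=0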

This issue should probably be posted on the Linux kernel networking mailing list.

-Eugene

Bryan Rockwood

Aug 11, 2015, 10:34:18 AM
to CoreOS User
Eugene,

Thanks a million for helping me to debug this.  I'll try adding gc_thresh1=0 to the cloud-config for the Etcd cluster.  I also found the following post with a response from Amazon support:


Basically, they have been baking that setting into their Amazon Linux AMIs since the 2014.09 release.  Is there a place I could submit an issue so we could discuss doing the same for the CoreOS AMI?

Bryan

Brandon Philips

Aug 11, 2015, 6:34:19 PM
to Bryan Rockwood, coreos-user, Michael Marineau, Alex Crawford
github.com/coreos/bugs/issues would be the place to file the issue; we could add something to our AMI to make this setting a default on AWS. I'm CC'ing a couple of people from the OS team to make sure the bug gets filed and so they can chime in if needed.

Brandon
