Hello,
I recently enabled Consul in a datacenter with a medium-sized
LAN (around 1,300 nodes) and noticed a large number of ARP requests
being sent (1,900 requests/sec).
Here is a screenshot:
https://snapshot.raintank.io/dashboard/snapshot/GiinHMNMvny3IT7RRdMRZ3Bf8wxpvoAp
As far as I understand, Consul mainly communicates over UDP for its
gossip. This leads to a very large ARP cache (it is correctly
sized on my servers, though) and also to a lot of stale entries
in that cache.
The reason is that a Consul agent does not interact frequently with
every other agent, so the ARP cache entries expire between exchanges.
The default expiration is a random value around 30 sec (+/-50%), so it
is normal that entries expire.
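
To make the +/-50% randomization concrete, here is a minimal Python
sketch of the range involved. It assumes the reachable time is drawn
uniformly between half and one and a half times the base value, which
is how I understand the Linux behaviour; the function names are made up
for illustration.

```python
import random


def reachable_time_range_ms(base_ms):
    """Return the (low, high) bounds of the randomized reachable time.

    Assumption: the value is drawn between 0.5x and 1.5x the base value.
    """
    return base_ms // 2, (3 * base_ms) // 2


def sample_reachable_time_ms(base_ms, rng=random):
    """Draw one reachable time, uniformly within the bounds above."""
    low, high = reachable_time_range_ms(base_ms)
    return rng.uniform(low, high)


if __name__ == "__main__":
    # With the default base of 30 s, entries last between 15 s and 45 s.
    print(reachable_time_range_ms(30_000))
```

With the default base of 30 sec, an entry is considered reachable for
somewhere between 15 and 45 sec.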
One solution would be to increase the cache expiration time
(net.ipv4.neigh.[interface].base_reachable_time_ms on Linux, netsh int
ipv4 set interface [interface] basereachable on Windows), the maximum
value (on Windows) being 1 hour.
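
On Linux, that sysctl is also exposed as a file under /proc/sys, so it
can be read or changed from a script. The following is only a sketch:
the helper names and the proc_root parameter are hypothetical (the
latter is there so the code can be tried against a scratch directory),
and writing the real file requires root.

```python
from pathlib import Path


def base_reachable_path(interface, proc_root="/proc/sys"):
    """Build the path of base_reachable_time_ms for one interface."""
    return (Path(proc_root) / "net" / "ipv4" / "neigh"
            / interface / "base_reachable_time_ms")


def read_base_reachable_ms(interface, proc_root="/proc/sys"):
    """Read the current base reachable time, in milliseconds."""
    return int(base_reachable_path(interface, proc_root).read_text().strip())


def write_base_reachable_ms(interface, value_ms, proc_root="/proc/sys"):
    """Set a new base reachable time, in milliseconds (needs root)."""
    base_reachable_path(interface, proc_root).write_text(f"{value_ms}\n")
```

For example, write_base_reachable_ms("eth0", 3_600_000) would request a
base of 1 hour on an interface named eth0 (the interface name is just
an example). A value written this way does not survive a reboot.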
But since UDP is a connectionless protocol, using an ARP cache entry
does not extend its lifetime as it would with TCP (or even ICMP).
Whatever base_reachable_time you use, the entries will become stale
at some point (the kernel has no confirmation that they are still
valid) and trigger an ARP probe.
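
To see how many entries are in each state on a given node, the output
of "ip -4 neigh show" can be summarized. This is a rough sketch that
assumes the state is the last word on each line (which is what I see
with iproute2); the function names are hypothetical.

```python
import subprocess
from collections import Counter

# Neighbour states as printed by iproute2.
KNOWN_STATES = {
    "PERMANENT", "NOARP", "REACHABLE", "STALE", "NONE",
    "INCOMPLETE", "DELAY", "PROBE", "FAILED",
}


def count_neighbor_states(ip_neigh_output):
    """Count neighbour entries per state from 'ip neigh show' text."""
    counts = Counter()
    for line in ip_neigh_output.splitlines():
        fields = line.split()
        # Skip blank lines and lines that do not end with a known state.
        if fields and fields[-1] in KNOWN_STATES:
            counts[fields[-1]] += 1
    return counts


def current_neighbor_states():
    """Run 'ip -4 neigh show' on this host and summarize its output."""
    result = subprocess.run(
        ["ip", "-4", "neigh", "show"],
        capture_output=True, text=True, check=True,
    )
    return count_neighbor_states(result.stdout)
```

A high STALE count compared to REACHABLE would be consistent with the
behaviour described above.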
I increased the value anyway to see the effect, and the rate dropped to
~1,500 requests/sec (which is better, but far from the near zero that
I'd like).
Screenshot:
https://snapshot.raintank.io/dashboard/snapshot/lq3hx83m36C0l6o0Pru7g8MuVqNrfadG
As a side note, I also fixed the advertised address, which was being
picked randomly by Consul (some nodes had chosen an address with no
gateway, on the network connected to the load balancers), and the rate
dropped to 1,200 requests/sec.
Of course, this is probably the price of running a large layer 2
network, but I am interested to know whether other users have
encountered such issues and how they solved them.
--
Gregoire