We have 4 physical hosts (128 GB RAM, 1 Gbit/s Ethernet, 24 CPUs, Debian 7.5, kernel 3.14.4-1~bpo70+1), each running consul in server mode. 3 of them run about 15 LXC containers, each container with a consul agent.
The 4th host runs about 25 containers (each with one consul agent), and some of those containers run UDP services such as graphite, collectd, and nameservers.
We could successfully join about 50 containers/agents without any issues. But since then, every time we join a new container/agent or stop/start an existing one, all UDP services on the 4th host, including DNS queries, run into trouble. The consul logs on the 4th host (containers and physical host) show:
2014/06/07 22:57:20 [ERR] memberlist: Failed to send gossip to 172.16.227.164:8301: write udp: invalid argument
2014/06/07 22:57:20 [ERR] memberlist: Failed to send gossip to 172.16.227.216:8301: write udp: invalid argument
2014/06/07 22:57:21 [ERR] memberlist: Failed to send gossip to 172.16.227.138:8301: write udp: invalid argument
2014/06/07 22:57:21 [ERR] memberlist: Failed to send gossip to 172.16.227.156:8301: write udp: invalid argument
2014/06/07 22:57:21 [ERR] memberlist: Failed to send gossip to 172.16.227.242:8301: write udp: invalid argument
2014/06/07 22:57:22 [ERR] memberlist: Failed to send gossip to 172.16.227.138:8301: write udp: invalid argument
2014/06/07 22:57:23 [ERR] memberlist: Failed to send gossip to 172.16.227.175:8301: write udp: invalid argument
2014/06/07 22:57:23 [ERR] memberlist: Failed to send gossip to 172.16.227.173:8301: write udp: invalid argument
2014/06/07 22:57:23 [ERR] memberlist: Failed to send ack: write udp: invalid argument
2014/06/07 22:57:23 [ERR] memberlist: Failed to send gossip to 172.16.227.152:8301: write udp: invalid argument
2014/06/07 22:57:23 [ERR] memberlist: Failed to send gossip to 172.16.227.169:8301: write udp: invalid argument
2014/06/07 22:57:23 [ERR] memberlist: Failed to send ping: write udp: invalid argument
2014/06/07 22:57:23 [ERR] memberlist: Failed to send gossip to 172.16.227.186:8301: write udp: invalid argument
2014/06/07 22:57:23 [ERR] memberlist: Failed to send gossip to 172.16.227.152:8301: write udp: invalid argument
2014/06/07 22:57:24 [ERR] memberlist: Failed to send ack: write udp: invalid argument
2014/06/07 22:57:24 [ERR] memberlist: Failed to send gossip to 172.16.227.238:8301: write udp: invalid argument
2014/06/07 22:57:24 [ERR] memberlist: Failed to send gossip to 172.16.227.165:8301: write udp: invalid argument
2014/06/07 22:57:25 [ERR] memberlist: Failed to send gossip to 172.16.227.156:8301: write udp: invalid argument
2014/06/07 22:57:25 [ERR] memberlist: Failed to send gossip to 172.16.227.242:8301: write udp: invalid argument
2014/06/07 22:57:25 [ERR] memberlist: Failed to send gossip to 172.16.227.143:8301: write udp: invalid argument
2014/06/07 22:57:25 [ERR] memberlist: Failed to send gossip to 172.16.227.151:8301: write udp: invalid argument
2014/06/07 22:57:25 [ERR] memberlist: Failed to send ack: write udp: invalid argument
2014/06/07 22:57:30 [INFO] serf: EventMemberFailed: nodebar 172.16.227.159
2014/06/07 22:57:30 [ERR] memberlist: Failed to send gossip to 172.16.227.142:8301: write udp: invalid argument
2014/06/07 22:57:30 [ERR] memberlist: Failed to send gossip to 172.16.227.173:8301: write udp: invalid argument
2014/06/07 22:57:32 [INFO] serf: EventMemberFailed: nodefoo 172.16.227.216
2014/06/07 22:57:32 [ERR] memberlist: Failed to send gossip to 172.16.227.152:8301: write udp: invalid argument
2014/06/07 22:57:32 [ERR] memberlist: Failed to send gossip to 172.16.227.142:8301: write udp: invalid argument
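
To see whether the failed sends cluster on particular peers or are spread across all of them, the log excerpt above can be tallied per destination. The following is only a rough sketch and not part of consul: the function name `tally_gossip_failures` and the regular expression are our own assumptions, derived from the log format shown above.

```python
import re
from collections import Counter

# Matches "Failed to send gossip to <ip>:<port>: <reason>". The \s+ after "to"
# also tolerates a line break there, in case the log output has been wrapped.
GOSSIP_FAILURE = re.compile(
    r"Failed to send gossip to\s+(\d+\.\d+\.\d+\.\d+):(\d+): (.+)"
)


def tally_gossip_failures(log_text):
    """Return a Counter of failed gossip sends, keyed by destination IP.

    Only "Failed to send gossip to ..." lines are counted; the ack/ping
    failures carry no destination address in the log and are ignored.
    """
    return Counter(m.group(1) for m in GOSSIP_FAILURE.finditer(log_text))
```

Calling `tally_gossip_failures(log_text).most_common()` on the full consul log then lists the destinations ordered by number of failed sends.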
Members join, leave, and join again, and it takes about 5 minutes until the messages disappear, all members have rejoined and are 'alive', and DNS queries and other UDP services work as expected again. We experimented with very aggressive sysctl settings on the 4th host:
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.core.rmem_default = 20971520
net.core.wmem_default = 20971520
net.ipv4.udp_rmem_min = 131072
net.ipv4.udp_wmem_min = 131072
net.core.somaxconn = 32768
net.core.netdev_max_backlog = 32768
but that didn't change anything.
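
To rule out that the settings above simply did not take effect where the agents run, the live values under /proc/sys can be compared against the intended ones. This is a minimal sketch under our own assumptions: the helper names (`sysctl_path`, `parse_sysctl_lines`, `check_sysctls`) are hypothetical, and the simple dot-to-slash mapping only holds for keys like the ones listed above (it would not handle interface names that themselves contain dots).

```python
from pathlib import Path


def sysctl_path(key, root="/proc/sys"):
    """Map a sysctl key such as 'net.core.rmem_max' to its /proc/sys path."""
    return Path(root) / key.replace(".", "/")


def parse_sysctl_lines(text):
    """Parse 'key = value' lines into a dict, skipping blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def check_sysctls(expected, root="/proc/sys"):
    """Return {key: (expected, actual)} for every key whose live value differs.

    'actual' is None when the key cannot be read at all, for example because
    it is not exposed in the environment the check runs in.
    """
    mismatches = {}
    for key, want in expected.items():
        try:
            have = sysctl_path(key, root).read_text().split()
        except OSError:
            have = None
        # Compare token-wise so that whitespace differences do not matter.
        if have != want.split():
            mismatches[key] = (want, None if have is None else " ".join(have))
    return mismatches
```

Passing the settings listed above through `parse_sysctl_lines` and then `check_sysctls`, once on the physical host and once inside a container, yields an empty dict wherever all values are in effect.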