memberlist: Failed to send gossip to ip:8301: write udp: invalid argument

hol...@sauspiel.de

Jun 7, 2014, 5:42:49 PM
to consu...@googlegroups.com
Hi,

we're currently experimenting with Consul and have the following setup:

4 physical hosts (128 GB RAM, 1 Gbit/s Ethernet, 24 CPUs, Debian 7.5, kernel 3.14.4-1~bpo70+1), each running Consul in server mode. Three of them run about 15 LXC containers each, every container with its own Consul agent.
The 4th host runs about 25 containers (each with one Consul agent), and some of those containers run UDP services such as Graphite, collectd, nameservers, etc.

What's happening:

We could successfully join about 50 containers/agents without any issues. But now, every time we join a new container/agent or stop/start an existing one, all UDP services on the 4th host, including DNS queries, run into trouble. The Consul logs on the 4th host (containers and the physical host) show a lot of messages like

 2014/06/07 22:57:20 [ERR] memberlist: Failed to send gossip to 172.16.227.164:8301: write udp: invalid argument
    2014/06/07 22:57:20 [ERR] memberlist: Failed to send gossip to 172.16.227.216:8301: write udp: invalid argument
    2014/06/07 22:57:21 [ERR] memberlist: Failed to send gossip to 172.16.227.138:8301: write udp: invalid argument
    2014/06/07 22:57:21 [ERR] memberlist: Failed to send gossip to 172.16.227.156:8301: write udp: invalid argument
    2014/06/07 22:57:21 [ERR] memberlist: Failed to send gossip to 172.16.227.242:8301: write udp: invalid argument
    2014/06/07 22:57:22 [ERR] memberlist: Failed to send gossip to 172.16.227.138:8301: write udp: invalid argument
    2014/06/07 22:57:23 [ERR] memberlist: Failed to send gossip to 172.16.227.175:8301: write udp: invalid argument
    2014/06/07 22:57:23 [ERR] memberlist: Failed to send gossip to 172.16.227.173:8301: write udp: invalid argument
    2014/06/07 22:57:23 [ERR] memberlist: Failed to send ack: write udp: invalid argument
    2014/06/07 22:57:23 [ERR] memberlist: Failed to send gossip to 172.16.227.152:8301: write udp: invalid argument
    2014/06/07 22:57:23 [ERR] memberlist: Failed to send gossip to 172.16.227.169:8301: write udp: invalid argument
    2014/06/07 22:57:23 [ERR] memberlist: Failed to send ping: write udp: invalid argument
    2014/06/07 22:57:23 [ERR] memberlist: Failed to send gossip to 172.16.227.186:8301: write udp: invalid argument
    2014/06/07 22:57:23 [ERR] memberlist: Failed to send gossip to 172.16.227.152:8301: write udp: invalid argument
    2014/06/07 22:57:24 [ERR] memberlist: Failed to send ack: write udp: invalid argument
    2014/06/07 22:57:24 [ERR] memberlist: Failed to send gossip to 172.16.227.238:8301: write udp: invalid argument
    2014/06/07 22:57:24 [ERR] memberlist: Failed to send gossip to 172.16.227.165:8301: write udp: invalid argument
    2014/06/07 22:57:25 [ERR] memberlist: Failed to send gossip to 172.16.227.156:8301: write udp: invalid argument
    2014/06/07 22:57:25 [ERR] memberlist: Failed to send gossip to 172.16.227.242:8301: write udp: invalid argument
    2014/06/07 22:57:25 [ERR] memberlist: Failed to send gossip to 172.16.227.143:8301: write udp: invalid argument
    2014/06/07 22:57:25 [ERR] memberlist: Failed to send gossip to 172.16.227.151:8301: write udp: invalid argument
    2014/06/07 22:57:25 [ERR] memberlist: Failed to send ack: write udp: invalid argument
    2014/06/07 22:57:30 [INFO] serf: EventMemberFailed: nodebar 172.16.227.159
    2014/06/07 22:57:30 [ERR] memberlist: Failed to send gossip to 172.16.227.142:8301: write udp: invalid argument
    2014/06/07 22:57:30 [ERR] memberlist: Failed to send gossip to 172.16.227.173:8301: write udp: invalid argument
    2014/06/07 22:57:32 [INFO] serf: EventMemberFailed: nodefoo 172.16.227.216
    2014/06/07 22:57:32 [ERR] memberlist: Failed to send gossip to 172.16.227.152:8301: write udp: invalid argument
    2014/06/07 22:57:32 [ERR] memberlist: Failed to send gossip to 172.16.227.142:8301: write udp: invalid argument

Members join, leave, and rejoin, and it takes about 5 minutes until the messages disappear, all members have rejoined and are 'alive', and DNS queries and the other UDP services work as expected again. We tried very aggressive sysctl settings on the 4th host

net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.core.rmem_default = 20971520
net.core.wmem_default = 20971520
net.ipv4.udp_rmem_min = 131072
net.ipv4.udp_wmem_min = 131072
net.core.somaxconn = 32768
net.core.netdev_max_backlog = 32768

but they didn't change anything.
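
(For completeness: settings like these can be applied at runtime with sysctl and persisted in a drop-in file; the file name below is just an example, not anything Consul-specific.)

# apply a single value at runtime
sysctl -w net.core.rmem_max=67108864
# or persist the whole set in a file and reload just that file
sysctl -p /etc/sysctl.d/90-udp-tuning.conf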

Any hints? :/


Armon Dadgar

Jun 8, 2014, 4:50:30 PM
to hol...@sauspiel.de, consu...@googlegroups.com
In the past, when the "invalid argument" error has shown up, it has been a compatibility issue
between the Go version and the kernel being used, since that error comes from much closer to the
syscall layer than from any of the logic in memberlist.

Can you provide more information about your deployment? For example:

* Did you compile Consul, or are you using our distribution?
* What version of Go / compiler tool chain did you use?
* Any kernel modifications? Standard distribution?
* Any relevant syslog messages?

This is something we may have to bug the Golang mailing lists about as well.

As a more minimal test case, can you reproduce this with Serf as well? It also
uses the memberlist library under the hood.

Best Regards,
Armon Dadgar

hol...@sauspiel.de

Jun 10, 2014, 2:17:20 AM
to consu...@googlegroups.com
Hi Armon,

thanks for your answer!

We're using your distribution, version 0.2.1. The kernel is from wheezy-backports (3.14-0.bpo.1-amd64, 3.14.4-1~bpo70+1), without modifications. There are syslog messages that appear on a member list change, but I don't see anything useful in them

[1779304.245001] net_ratelimit: 432 callbacks suppressed
[1779315.637091] net_ratelimit: 28 callbacks suppressed
[1779320.761205] net_ratelimit: 586 callbacks suppressed
[1779335.462248] net_ratelimit: 879 callbacks suppressed
[1779340.484470] net_ratelimit: 804 callbacks suppressed
[1779345.610322] net_ratelimit: 853 callbacks suppressed
[1779350.748524] net_ratelimit: 666 callbacks suppressed

Do I have to stop the current Consul cluster and set up a new Serf cluster to test it with Serf, or can Serf be integrated into the existing Consul cluster?

Armon Dadgar

Jun 10, 2014, 2:27:01 PM
to hol...@sauspiel.de, consu...@googlegroups.com
From some googling, it seems that "net_ratelimit" means messages to syslog are being suppressed.

It also seems this comes up when sysctl values have been modified. It would be useful if you could
raise the kernel's logging rate limit so that syslog can provide useful feedback.
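
(As far as I know, the rate limit behind "net_ratelimit" is controlled by the net.core.message_cost / net.core.message_burst sysctls; I haven't verified this on your kernel, but something like the following should surface the suppressed messages:)

# 0 disables the kernel's net_ratelimit suppression (the default message_cost is 5)
sysctl -w net.core.message_cost=0
# remember to restore it once you've captured the messages
sysctl -w net.core.message_cost=5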

This does appear to be a system issue not directly related to Consul.

Serf and Consul can run alongside one another, no need to stop either.

Best Regards,
Armon Dadgar

Jonathan Camp

Sep 30, 2014, 7:57:32 AM
to consu...@googlegroups.com, hol...@sauspiel.de
Hi all, was there a resolution to this? I am seeing the same thing when launching 30-40 containers, each with a Consul agent (0.4) inside. I've tried various kernels, OSes, and hardware. I am currently using Docker 1.2.

Thanks! Jonathan

Armon Dadgar

Sep 30, 2014, 1:42:26 PM
to Jonathan Camp, consu...@googlegroups.com, hol...@sauspiel.de
Hey Jonathan,

As we discussed earlier, I believe this is not an issue with Serf/Consul directly but
rather something being triggered by the kernel configuration. Sadly, I am
unable to help further, since I don't see this in any of our environments.

Best Regards,
Armon Dadgar

Morten K

Oct 3, 2014, 9:15:59 AM
to consu...@googlegroups.com, jona...@yaresse.com, hol...@sauspiel.de
Hi guys,

I've been debugging a similar issue and tracked it down to the neighbour/ARP table overflowing, which triggers the garbage collector and, in the end, seems to cause timeouts in sendmsg().

To work around it, I adjusted the following values:

net.ipv4.neigh.default.gc_thresh1
net.ipv4.neigh.default.gc_thresh2
net.ipv4.neigh.default.gc_thresh3

and for ipv6:

net.ipv6.neigh.default.gc_thresh1
net.ipv6.neigh.default.gc_thresh2
net.ipv6.neigh.default.gc_thresh3

Using the value 1048576 (1024**2) for gc_thresh1 and 1048576 * 4 for gc_thresh2/gc_thresh3, I was able to run Serf in over 900 network namespaces, connected by veth pairs to a Linux bridge, without any flapping.
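
(Spelled out as sysctl settings, that works out to roughly the following; the exact numbers are simply what worked in my test setup, not a tuned recommendation:)

# neighbour-table GC thresholds used in the test setup described above
net.ipv4.neigh.default.gc_thresh1 = 1048576
net.ipv4.neigh.default.gc_thresh2 = 4194304
net.ipv4.neigh.default.gc_thresh3 = 4194304
net.ipv6.neigh.default.gc_thresh1 = 1048576
net.ipv6.neigh.default.gc_thresh2 = 4194304
net.ipv6.neigh.default.gc_thresh3 = 4194304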

Best regards,
Morten

Holger Amann

Oct 7, 2014, 9:55:18 AM
to Morten K, consu...@googlegroups.com, jona...@yaresse.com
Morten, thanks for sharing! I will try that in the next few days.
Is there a particular reason why you chose such high values?

Morten K

Oct 7, 2014, 12:12:49 PM
to consu...@googlegroups.com, mor...@krakvik.no, jona...@yaresse.com
Hi Holger,

The reason for the high values is that the system seems to need room in its neighbour/ARP tables for all namespaces. For example, I tested running Serf in 998 different network namespaces; when I inspected the neighbour tables after they had been running for a while, there were 2 * 998 entries in each namespace (IPv6 with a link-local address plus a ULA, all veth interfaces connected to a single bridge), which adds up to almost 2M entries across all namespaces. You can check the neighbour table by running "ip neigh sh" in the respective namespaces.
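
(For example, a quick way to see the per-namespace counts, assuming the namespaces are named and visible to "ip netns list", is something like:)

# count neighbour entries in every named network namespace
for ns in $(ip netns list | awk '{print $1}'); do
  printf '%s: ' "$ns"
  ip netns exec "$ns" ip neigh show | wc -l
done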

--
Morten