I did some experimentation.
I increased the RAM on the systems, raised the federation reconnect delay to 20 minutes, and let it run overnight.
Of 13 clusters, 1 seems to be exhibiting the issue this time (it seems random which ones have the issue).
It has 330000 connections and is using 28 GB of RAM, 16 GB of which is "other connections" in the attached image; I assume "other" means direct federated connections. 325000 of those connections are to its two upstream clusters (according to the API).
It is also running 2 million+ Erlang processes (which I assume is just a multiple of the number of connections), but only 7000 sockets.
Also, the two upstream clusters it has these 325000 direct connections with show only a reasonable 20000 connections.
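For anyone wanting to reproduce a per-peer connection count like the one above, here is a minimal Python sketch. It assumes the management plugin's /api/connections endpoint is reachable and that each returned object carries a peer_host field; whether federation's direct connections are listed there with a usable peer_host is an assumption, not something confirmed in this thread. The base URL and credentials are placeholders.

```python
import base64
import json
import urllib.request
from collections import Counter


def fetch_connections(base_url, user, password):
    """Fetch the connection list from the management HTTP API.

    base_url, user and password are placeholders (hypothetical values).
    With hundreds of thousands of connections the response will be large.
    """
    req = urllib.request.Request(base_url.rstrip("/") + "/api/connections")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)


def tally_by_peer(connections):
    """Count connections per peer host; entries without the field go to 'unknown'."""
    return Counter(c.get("peer_host", "unknown") for c in connections)


# Example use (not run here):
# conns = fetch_connections("http://localhost:15672", "guest", "guest")
# for host, n in tally_by_peer(conns).most_common(5):
#     print(host, n)
```

Comparing the tally on the affected cluster with the same tally taken on each upstream would show whether both sides agree on how many links are actually open.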
There are a lot of these errors in the log:
2018-04-29 21:09:35.024 [error] <0.29883.4> Supervisor {<0.29883.4>,rabbit_federation_link_sup} had child {upstream,[<<"amqp://IPa">>,<<"amqp://IPb">>,
<<"amqp://IPc">>],
<<"imagemanagement-www-aceuae-com_ION_Standard_IM-10470768">>,
<<"imagemanagement-www-aceuae-com_ION_Standard_IM-10470768">>,1000,
1,1200,none,none,false,'on-confirm',none,<<"26119">>,false} started with rabbit_federation_queue_link:start_link({{upstream,[<<"amqp://IPa">>,<<"amqp://IPb">>,<<"amqp://IPc">>],...},...}) at <0.7981.4716> exit with reason {timeout,{gen_server,call,[<0.13735.4716>,connect,60000]}} in context child_terminated
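To see how often these link timeouts occur, the log can be tallied per hour. The sketch below is only an illustration: it assumes every log entry starts with a timestamp in the format shown above and that continuation lines do not, and it matches on the literal substrings rabbit_federation_link_sup and {timeout,{gen_server,call taken from the entry above. The log file path in the usage comment is hypothetical.

```python
import re
from collections import Counter

# An entry begins with e.g. "2018-04-29 21:09:35.024 [error]".
ENTRY_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3} \[\w+\]")


def split_entries(lines):
    """Group raw lines into entries; lines without a leading timestamp
    are treated as continuations of the previous entry."""
    entry = []
    for line in lines:
        if ENTRY_START.match(line) and entry:
            yield "".join(entry)
            entry = []
        entry.append(line)
    if entry:
        yield "".join(entry)


def link_timeouts_per_hour(lines):
    """Count federation link supervisor timeouts, keyed by 'YYYY-MM-DD HH'."""
    counts = Counter()
    for entry in split_entries(lines):
        if ("rabbit_federation_link_sup" in entry
                and "{timeout,{gen_server,call" in entry):
            counts[entry[:13]] += 1
    return counts


# Example use (hypothetical path, not run here):
# with open("/var/log/rabbitmq/rabbit.log") as f:
#     for hour, n in sorted(link_timeouts_per_hour(f).items()):
#         print(hour, n)
```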
I'm going to try setting the heartbeat and connection timeout via query parameters on the federation AMQP URIs:
amqp://IPa?heartbeat=30&connection_timeout=10000
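Since there are three upstream URIs, a small helper can apply the same query parameters to each one. This is a sketch using only the Python standard library; the function name with_params is made up for illustration. As far as I know, RabbitMQ reads heartbeat in seconds and connection_timeout in milliseconds, so the values above would mean a 30 second heartbeat and a 10 second connect timeout, but check this against the URI query parameter documentation for your version.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_params(uri, **params):
    """Return uri with the given query parameters added or overridden."""
    parts = urlsplit(uri)
    query = dict(parse_qsl(parts.query))
    query.update({key: str(value) for key, value in params.items()})
    return urlunsplit(parts._replace(query=urlencode(query)))


upstream_uris = ["amqp://IPa", "amqp://IPb", "amqp://IPc"]
tuned = [with_params(u, heartbeat=30, connection_timeout=10000)
         for u in upstream_uris]
print(tuned[0])  # amqp://IPa?heartbeat=30&connection_timeout=10000
```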
The current Linux TCP keepalive time is the default 2 hours, so that could be part of the issue (I will get that changed as well).
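To put a number on the keepalive point: with the stock Linux settings (net.ipv4.tcp_keepalive_time=7200, net.ipv4.tcp_keepalive_intvl=75, net.ipv4.tcp_keepalive_probes=9), a dead peer on an idle connection is only detected after roughly the idle time plus all failed probes. The sketch below does that arithmetic; the "tuned" values are purely illustrative, not a recommendation, and keepalive only applies to sockets that have it enabled.

```python
def keepalive_detection_seconds(time_s=7200, intvl_s=75, probes=9):
    """Approximate seconds before an idle, dead TCP peer is detected:
    idle time before the first probe, plus one interval per failed probe."""
    return time_s + intvl_s * probes


default = keepalive_detection_seconds()
# Hypothetical tuned values, for comparison only.
tuned = keepalive_detection_seconds(time_s=60, intvl_s=10, probes=4)
print(default, tuned)  # 7875 100
```

So with the defaults, a half-open connection can linger for about 2 hours 11 minutes, which is far longer than the 20 minute reconnect delay.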
Do the errors in the log fit the issue hypothesis?
I'll update once I've completed the modifications and run some tests.
Thanks for all the help!!!