Hi,
We have run into an issue where the federation plugin creates a large number of "direct" connections, eventually triggering a memory alarm and causing connections to be blocked.
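For completeness: the alarm itself is easy to confirm from the command line (this assumes the 3.7-style CLI; rabbitmqctl status also lists active alarms):

rabbitmqctl eval 'rabbit_alarm:get_alarms().'

This returns the currently active alarms, including a resource_limit/memory entry for each affected node while the alarm is raised.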
We are in the process of upgrading from RabbitMQ 3.6.5 (Erlang 18.3) to RabbitMQ 3.7.7 (Erlang 20.3).
In our test environment we have two clusters (A and B), each with five nodes and fronted by a Keepalived/LVS load balancer. Cluster A is running 3.6.5 and cluster B is running 3.7.7.
We use federated queues and exchanges in both directions and have around 900 running federation links in each cluster.
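For reference, the federation is configured along these lines (the URI, credentials and names below are placeholders; the policy definition matches what appears in the crash dump further down):

rabbitmqctl set_parameter federation-upstream host '{"uri":"amqp://federation:password@host:5672/%2f"}'
rabbitmqctl set_policy default ".*" '{"federation-upstream-set":"all","ha-mode":"exactly","ha-params":2,"ha-sync-mode":"automatic"}' --apply-to queues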
Due to a misconfigured backup process, the load balancer for cluster A stopped responding entirely for about 20 minutes. During that time the number of direct (protocol: "Direct 0-9-1") connections on cluster B grew until the memory alarm was triggered and connections were blocked. After the load balancer for cluster A became available again, the federation links were restored, but the thousands of direct connections were never removed.
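The management UI shows the leaked connections with protocol "Direct 0-9-1"; from the command line, something like this (counting connections per protocol) shows thousands of such entries on cluster B:

rabbitmqctl list_connections protocol | sort | uniq -c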
The logs on cluster B (RabbitMQ 3.7.7) are filled with error messages like this:
2018-09-10 02:20:09.949 [error] <0.13662.124> ** Generic server <0.13662.124> terminating
** Last message in was {'$gen_cast',maybe_go}
** When Server state == {not_started,{amqqueue,{resource,<<"/">>,queue,<<"some.queue">>},true,false,none,[{<<"x-dead-letter-exchange">>,longstr,<<"some.dlx">>},{<<"x-dead-letter-routing-key">>,longstr,<<"some.key">>}],<0.11453.16>,[<5406.30539.140>],[],['rabbit@host-03'],[{vhost,<<"/">>},{name,<<"default">>},{pattern,<<".*">>},{'apply-to',<<"queues">>},{definition,[{<<"federation-upstream-set">>,<<"all">>},{<<"ha-mode">>,<<"exactly">>},{<<"ha-params">>,2},{<<"ha-sync-mode">>,<<"automatic">>}]},{priority,0}],undefined,[{<5406.30547.140>,<5406.30539.140>},{<0.11460.16>,<0.11453.16>}],[rabbit_federation_queue],live,0,[],<<"/">>,#{user => <<"admin">>}},false,{upstream,[<<"amqp://federation:password@host:5672/%2f">>],<<"some.queue">>,<<"some.queue">>,1000,1,5,none,none,false,'on-confirm',none,<<"host">>,false},{upstream_params,<<"amqp://federation:password@host:5672/%2f">>,{amqp_params_network,<<"federation">>,<<"password">>,<<"/">>,"host",5672,2047,0,10,60000,none,[#Fun<amqp_uri.12.90191702>,#Fun<amqp_uri.12.90191702>],[],[]},{amqqueue,{resource,<<"/">>,queue,<<"some.queue">>},true,false,none,[{<<"x-dead-letter-exchange">>,longstr,<<"some.dlx">>},{<<"x-dead-letter-routing-key">>,longstr,<<"some.key">>}],<0.11453.16>,[<5406.30539.140>],[],['rabbit@host-03'],[{vhost,<<"/">>},{name,<<"default">>},{pattern,<<".*">>},{'apply-to',<<"queues">>},{definition,[{<<"federation-upstream-set">>,<<"all">>},{<<"ha-mode">>,<<"exactly">>},{<<"ha-params">>,2},{<<"ha-sync-mode">>,<<"automatic">>}]},{priority,0}],undefined,[{<5406.30547.140>,<5406.30539.140>},{<0.11460.16>,<0.11453.16>}],[rabbit_federation_queue],live,0,[],<<"/">>,#{user => <<"admin">>}},...}}
** Reason for termination ==
** {timeout,{gen_server,call,[<0.13683.124>,connect,60000]}}
2018-09-10 02:20:09.949 [error] <0.13662.124> CRASH REPORT Process <0.13662.124> with 0 neighbours exited with reason: {timeout,{gen_server,call,[<0.13683.124>,connect,60000]}} in gen_server2:terminate/3 line 1166
2018-09-10 02:20:09.950 [error] <0.11478.16> Supervisor {<0.11478.16>,rabbit_federation_link_sup} had child {upstream,[<<"amqp://federation:password@host:5672/%2f">>],
<<"some.queue">>,
<<"some.queue">>,1000,1,5,none,none,
false,'on-confirm',none,<<"host">>,false} started with rabbit_federation_queue_link:start_link({{upstream,[<<"amqp://federation:password@host:5672/%2f">>],<<"xxxxx....">>,...},...}) at <0.13662.124> exit with reason {timeout,{gen_server,call,[<0.13683.124>,connect,60000]}} in context child_terminated
2018-09-10 02:20:09.950 [warning] <0.13677.124> Channel (<0.13677.124>): Unregistering confirm handler <0.13662.124> because it died. Reason: {timeout,{gen_server,call,[<0.13683.124>,connect,60000]}}
We can reproduce the problem with the following iptables rule, which drops all inbound traffic on port 5672:
iptables -A INPUT -i eth0 -p tcp --dport 5672 -j DROP
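Once the rule is removed again, the federation links recover, but the direct connections accumulated in the meantime stay around, just as in the original incident:

iptables -D INPUT -i eth0 -p tcp --dport 5672 -j DROP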
We would be grateful for any help with this problem; if more information is needed, I will do my best to provide it.
Thanks,
Olof