Federation - Too many Direct Connections

Charlie

Apr 27, 2018, 12:05:26 PM

to rabbitmq-users

Hi,

I'm going to try to give as much detail as I can.

The simplest scenario is two RabbitMQ clusters, A and B, linked via federation.

The upstream (B) has ~2500 queues which are being federated.

I expect that when A makes the federated direct links to B, it will pick a random machine in B (because federation doesn't seem to round-robin across the machines in the cluster it is connecting to) and make ~2500 connections (one for each queue).

What I am seeing is ~90k connections, with both A and B running out of memory (25+ GB of RAM used). I don't know whether memory ran out before or after the connection count reached ~90k, but I assume after.

All the connections have the following pattern:

<rabbit@a*IP*.1.10.898>

Federation link (upstream: 27064, policy: im)

Where *IP* is the ip of the machine in A making the connection.

Here are a few more examples:

<rabbit@a*IP*.1.10004.906>

Federation link (upstream: 27064, policy: im)

<rabbit@a*IP*.1.14512.818>

Federation link (upstream: 27064, policy: im)

<rabbit@a*IP*.1.14482.879>

Federation link (upstream: 27064, policy: im)

I believe the .1 after the IP is always the same. I'm not sure what any of those numbers represent.

Any clue what is being done incorrectly?

Thanks!

Charlie

Apr 27, 2018, 3:01:53 PM

to rabbitmq-users

Just wanted to add the versions for things:

RabbitMQ 3.7.4

Erlang 20.3

Michael Klishin

Apr 27, 2018, 9:13:32 PM

to rabbitm...@googlegroups.com

Federation will connect to the first URI from the provided list which succeeds. It does not know if the target node is a cluster member or not. You can make it connect to a load balancer.

You can reduce RAM consumption per network connection [1]. Direct connections are trickier because it's a completely separate code path in a couple of areas. I'd need to check what Erlang client and runtime settings are available w.r.t. TCP buffer sizes.

1. https://www.rabbitmq.com/networking.html

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

MK

Staff Software Engineer, Pivotal/RabbitMQ

Charlie

Apr 28, 2018, 8:40:21 AM

to rabbitmq-users

Thanks for the quick reply.

Is there any way to know how many connections would be needed? I assumed one per queue being federated (~2500 in my case) but I'm seeing many more (~90000).


Michael Klishin

Apr 28, 2018, 8:55:42 AM

to rabbitm...@googlegroups.com

Federation can apply to queues *and* exchanges.

A federation link opens two connections from a node (one upstream and one downstream), even if the link is hosted on the downstream node.

One known limitation of federation is that all links are colocated on a single node. That's the most likely limiting factor.


Charlie

Apr 28, 2018, 11:17:35 AM

to rabbitmq-users

I do not think that is our issue.

We only have 8 exchanges (we only create 1 of our own).

So queues plus exchanges times the 2 upstream and downstream connections is (2500 + 8) * 2 = 5016 total connections... this still does not explain the 90000 direct connections we're seeing.
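A minimal sketch of that arithmetic, assuming one federation link per federated queue or exchange and two connections per link as described earlier in the thread. The function name is made up for illustration and is not part of RabbitMQ.

```python
def expected_link_connections(queues: int, exchanges: int, connections_per_link: int = 2) -> int:
    """Expected connection count if every federated queue and exchange has exactly one link."""
    return (queues + exchanges) * connections_per_link


expected = expected_link_connections(queues=2500, exchanges=8)
observed = 90_000  # approximate figure reported above

print(expected)                       # 5016
print(round(observed / expected, 1))  # how many times more connections than expected
```

Comparing the observed count against this baseline gives a rough idea of how many connections each link has accumulated beyond the two it should need.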

We set the reconnect delay to 60s - is it possible that that limit should be longer?

Maybe a direct connection becomes unresponsive, federation plugin makes a new connection and the old one still hasn't closed? And this happens quickly enough that we reach 90k connections?

I'm literally guessing here as I cannot explain the 90k number of direct connections...

Michael Klishin

Apr 28, 2018, 11:21:59 PM

to rabbitm...@googlegroups.com

Then you have to inspect the connections.

We've seen a case where *outgoing* connections on a node were never succeeding due to firewall settings that silently dropped frames, so that TCP connections never completed the handshake but also did not error out. That led to a connection build-up.

Heartbeats (http://www.rabbitmq.com/heartbeats.html) and TCP keepalives are what was recommended in that case. 60s sounds like a reasonable value, but any value that's lower than the stale TCP connection detection time would lead to cumulative growth up to a certain number in scenarios such as the one above.
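To make the "cumulative growth up to a certain number" point concrete, here is a rough back-of-the-envelope model. It is a simplification, not a description of what the federation plugin actually does: it assumes every connection attempt hangs until stale-connection detection fires, and that a link starts a new attempt every reconnect delay. The function names and the figures passed in are illustrative assumptions.

```python
import math


def hanging_attempts_per_link(reconnect_delay_s: float, stale_detection_s: float) -> int:
    """Attempts one link can leave hanging before the oldest one is detected as dead."""
    if reconnect_delay_s <= 0:
        raise ValueError("reconnect_delay_s must be positive")
    return max(1, math.ceil(stale_detection_s / reconnect_delay_s))


def estimated_buildup(links: int, reconnect_delay_s: float, stale_detection_s: float) -> int:
    """Plateau of hanging connections across all links under the simplified model."""
    return links * hanging_attempts_per_link(reconnect_delay_s, stale_detection_s)


# Illustrative inputs: 2508 links, 60 s reconnect delay,
# 7200 s before a dead TCP connection is noticed.
print(estimated_buildup(2508, 60, 7200))    # 300960
# A reconnect delay longer than the detection time keeps it at one attempt per link.
print(estimated_buildup(2508, 1200, 600))   # 2508
```

Under this model, shortening the detection time (heartbeats, keepalives) or lengthening the reconnect delay both lower the plateau.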


Charlie

Apr 29, 2018, 5:58:10 PM

to rabbitmq-users

I did some experimentation.

I increased the ram on the systems, increased the federation reconnect delay to 20 minutes, and let it run overnight.

Of 13 clusters, 1 seems to be exhibiting the issue this time (it seems random which ones have the issue).

It has 330000 connections and is using 28 GB of RAM, 16 GB of which is shown as "other connections" in the attached image; I assume "other" means direct federated connections. 325000 of them are connections to its two upstream clusters (according to the API).

It is also using 2 million+ erlang processes (which I assume are just multiples of the number of connections), but only 7000 sockets.

Also, the two clusters it has these 325000 direct connections with show a reasonable 20000 connections only.
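For anyone wanting to reproduce this kind of breakdown, here is a hedged sketch of how a connection listing could be summarized. It assumes the input is a list of dicts shaped like the management HTTP API connection listing, with a `protocol` field and an optional `client_properties.connection_name`; treat those field names as assumptions to verify against your own broker's output. The sample data is made up to mirror the listings quoted in this thread.

```python
from collections import Counter


def summarize_connections(connections):
    """Count connections per protocol and, for direct ones, per client-provided name."""
    by_protocol = Counter()
    direct_by_name = Counter()
    for conn in connections:
        protocol = str(conn.get("protocol") or "unknown")
        by_protocol[protocol] += 1
        if protocol.startswith("Direct"):
            # Assumed location of the client-provided connection name.
            props = conn.get("client_properties") or {}
            direct_by_name[props.get("connection_name", "<unnamed>")] += 1
    return by_protocol, direct_by_name


# Hypothetical sample entries, not real broker output.
link_name = "Federation link (upstream: 27064, policy: im)"
sample = [
    {"protocol": "Direct 0-9-1", "client_properties": {"connection_name": link_name}},
    {"protocol": "Direct 0-9-1", "client_properties": {"connection_name": link_name}},
    {"protocol": "AMQP 0-9-1", "client_properties": {}},
]

by_protocol, direct_by_name = summarize_connections(sample)
print(by_protocol.most_common())
print(direct_by_name.most_common(5))
```

Grouping direct connections by name shows quickly whether a handful of links account for most of the build-up.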

There are a lot of these errors in the log:

2018-04-29 21:09:35.024 [error] <0.29883.4> Supervisor {<0.29883.4>,rabbit_federation_link_sup} had child {upstream,[<<"amqp://IPa">>,<<"amqp://IPb">>,
<<"amqp://IPc">>],
<<"imagemanagement-www-aceuae-com_ION_Standard_IM-10470768">>,
<<"imagemanagement-www-aceuae-com_ION_Standard_IM-10470768">>,1000,
1,1200,none,none,false,'on-confirm',none,<<"26119">>,false} started with rabbit_federation_queue_link:start_link({{upstream,[<<"amqp://IPa">>,<<"amqp://IPb">>,<<"amqp://IPc">>],...},...}) at <0.7981.4716> exit with reason {timeout,{gen_server,call,[<0.13735.4716>,connect,60000]}} in context child_terminated

I'm going to try setting the heartbeat and connection timeout by setting the query parameters in the federation amqp links:

amqp://IPa?heartbeat=30&connection_timeout=10000
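With several upstream URIs per upstream, appending the same query parameters by hand is error-prone. A small sketch using only the Python standard library; the helper name is made up, and `heartbeat` and `connection_timeout` are simply the parameters mentioned above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_uri_params(uri: str, **params) -> str:
    """Return the URI with the given query parameters added or overridden."""
    parts = urlsplit(uri)
    query = dict(parse_qsl(parts.query))  # keep any parameters already present
    query.update({key: str(value) for key, value in params.items()})
    return urlunsplit(parts._replace(query=urlencode(query)))


upstream_uris = ["amqp://IPa", "amqp://IPb", "amqp://IPc"]
tuned = [with_uri_params(u, heartbeat=30, connection_timeout=10000) for u in upstream_uris]
for uri in tuned:
    print(uri)
```

The resulting strings can then be used as the URI list of the upstream definition.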

The current linux tcp keepalive time is the default 2 hours, so that could be part of the issue (I will get that changed as well).

Do the errors in the log fit the issue hypothesis?

I'll update once I've completed the modifications and run some tests.

Thanks for all the help!!!

memoryUsage.jpeg

Michael Klishin

Apr 30, 2018, 12:26:10 AM

to rabbitm...@googlegroups.com

There is no "other connections" category. "Other" means "everything not covered by other categories".

You have 6.7 GB of RAM allocated by the runtime but not used by the broker. In other words, you have a memory fragmentation problem. This has been discussed before on this list: there are different runtime allocator flags to try, and some may be more efficient for your workload than others.


Jonas Falk

May 22, 2018, 8:25:28 AM

to rabbitmq-users

Hi, we're experiencing the same issue on RabbitMQ 3.7.4. Did you find a solution to this, Charlie?

We normally have around 1xx connections to our RabbitMQ server, with the federation plugin enabled (20 queues are being federated).

5 days ago, we had some problem in our hosting environment (we don't really know the reason for it or what actually happened :( ). The RabbitMQ logs show a lot of timeouts, though, like:

2018-05-17 06:51:54.973 [error] <0.16368.162> ** Generic server <0.16368.162> terminating
** Last message in was {'$gen_cast',maybe_go}
** When Server state == {not_started,{amqqueue,..........
** Reason for termination ==
** {timeout,{gen_server,call,[<0.16226.162>,connect,60000]}}

2018-05-17 06:51:54.973 [error] <0.16368.162> CRASH REPORT Process <0.16368.162> with 0 neighbours exited with reason: {timeout,{gen_server,call,[<0.16226.162>,connect,60000]}} in gen_server2:terminate/3 line 1166

The management GUI now (5 days later) lists over 15k connections, and 99% of them are "direct" connections created at the same date and time as the problems in the hosting environment.
The connections look like:

Client-provided name Federation link (upstream: xxx policy: xxx)
Username none
Protocol Direct 0-9-1

It seems like the federation plugin hasn't closed the connections it created during the period when we were having problems?

Michael Klishin

May 22, 2018, 9:45:29 AM

to rabbitm...@googlegroups.com

Federation links are only meant to be closed when an exchange/queue they are backing is gone. If you make claims, please back them up with evidence.

The exception simply says that a connection operation timed out. There can be all kinds of reasons for that.

Federating fewer exchanges (queues) will reduce the number of links and therefore connections.


Charlie

May 22, 2018, 1:30:31 PM

to rabbitmq-users

For reference, we just changed this on our linux boxes:

echo 600 > /proc/sys/net/ipv4/tcp_keepalive_time

added this to the amqp:// links for federation upstreams:

?heartbeat=30&connection_timeout=10000

And increased the ram.
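As a rough illustration of what the keepalive change buys, the sketch below estimates how long Linux takes to declare an idle, dead peer gone on a socket that has keepalives enabled: the idle time plus the probe interval times the probe count. The defaults used (7200 s idle, 75 s interval, 9 probes) are the usual Linux values for `tcp_keepalive_time`, `tcp_keepalive_intvl` and `tcp_keepalive_probes`; check your own sysctl settings, since this assumes they were not changed. The function name is illustrative.

```python
def keepalive_detection_seconds(idle_s: int = 7200, interval_s: int = 75, probes: int = 9) -> int:
    """Approximate time until an idle, dead peer is detected via TCP keepalives."""
    return idle_s + interval_s * probes


before = keepalive_detection_seconds()           # stock Linux defaults
after = keepalive_detection_seconds(idle_s=600)  # with tcp_keepalive_time set to 600

print(before)  # 7875
print(after)   # 1275
```

This only applies to sockets that actually enable keepalives; AMQP heartbeats set via the URI act independently of it.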

We expect lots of connections (we have a lot of queues).

Things seem to be pretty steady for us now.

Thanks!

Michael Klishin

May 23, 2018, 1:04:27 AM

to rabbitm...@googlegroups.com

Thanks for reporting back. Heartbeats or TCP keepalives [1] control network connection [in]activity timeouts while TCP keepalives or nettick value [2] cover direct connections. We will add a section to the Federation guide.

1. http://www.rabbitmq.com/heartbeats.html

2. http://www.rabbitmq.com/nettick.html


Michael Klishin

Jun 25, 2018, 2:03:26 PM

to rabbitmq-users

Thanks to some excellent investigative work by Ricardo Gonçalves we have identified one scenario in which direct connections can be left behind on the downstream node with exchange federation:

https://github.com/rabbitmq/rabbitmq-federation/issues/76

Ricardo also contributed a fix which we've verified and merged. It will be in 3.7.7 provided that a couple more rounds of QA (another round from one more member of our team and our standard pipeline) succeed.

Thank you, Charlie and Ricardo.


Pranjal Jain

Sep 19, 2018, 1:51:23 AM

to rabbitmq-users

Hi Michael,

I am using RabbitMQ server 3.7.7 and Erlang 20.3.4.

I can still see this issue occurring. In the RabbitMQ management UI I can see thousands of connections with entries like this:

Virtual host: /
Name: <nodeName.3.22867.0>
Node: nodeName
User name: <some username>
State: (blank)
SSL / TLS: (blank)
Protocol: Direct 0-9-1

I do not use federation/shovel plugin.

Which other RabbitMQ components can use the "Direct 0-9-1" protocol?

Is the issue known? I do not have steps to reproduce.

Michael Klishin

Sep 19, 2018, 3:53:28 AM

to rabbitm...@googlegroups.com

This list uses one thread per question. Please stop posting to existing threads.

Direct connections are a feature unique to the Erlang client. Federation links and Shovels use it internally (not all connections they use are direct, though).

There is a recent thread [1] where I cannot reproduce what was reported here but someone claims to have an example that does. I haven't gotten to work on that yet.


Pranjal Jain

Sep 19, 2018, 4:21:44 AM

to rabbitmq-users

Hi Michael,

Apologies, I will keep in mind to create a new thread next time.

Does the RabbitMQ management plugin also use direct connections somehow?

Can you share a link to thread [1] from your last comment? I can also try reproducing it once I have some initial thoughts on the circumstances.

Best Regards,

Pranjal


Michael Klishin

Sep 19, 2018, 11:20:58 AM

to rabbitm...@googlegroups.com

Management plugin uses direct connections in the aliveness test endpoint but that's it IIRC.


Pranjal Jain

Sep 21, 2018, 12:41:15 AM

to rabbitmq-users

Yes, in my deployment I have a monitoring job which calls the RabbitMQ management API's aliveness test every 5 minutes. Maybe some of these calls are misbehaving.

Michael Klishin

Sep 21, 2018, 3:45:33 AM

to rabbitm...@googlegroups.com

We are only aware of one such scenario with that endpoint: when the node is in an alarmed state. I don't remember if it was reproduced and/or addressed.

Pranjal Jain

Sep 21, 2018, 3:54:29 AM

to rabbitmq-users

Yes, the RabbitMQ cluster was continuously in a memory alarm state during the time when the count of direct connections was increasing.


Michael Klishin

Sep 21, 2018, 4:20:01 AM

to rabbitm...@googlegroups.com

OK, that explains it. We'll try to reproduce.


Andrey Lobachev

Jul 9, 2019, 4:33:16 PM

to rabbitmq-users

I ran into the same issue (RabbitMQ 3.7.14, Erlang 21.3.7). It appears when the downstream's connections to the upstream time out.

Steps to reproduce:

1. Start two independent nodes with the rabbitmq_management, rabbitmq_federation and rabbitmq_federation_management plugins enabled

2. Set up the upstream and policy for queue federation

3. Ensure that the federation status is 'running'

4. On the downstream server, create an iptables rule: iptables -A OUTPUT -d _ip_addr_upstream_server_ -j DROP

5. After the timeout events, a lot of Direct connections will appear

Note: If the downstream's packets are rejected instead of dropped (e.g. via: iptables -A OUTPUT -d _ip_addr_upstream_server_ -j REJECT), these stuck Direct connections do not appear
