Bizarre connection behaviour in Federation setup

Daniele Bartocci

Sep 29, 2015, 7:47:36 AM
to rabbitmq-users
Hi everyone,
we are experiencing a bizarre connection issue with some of our federated nodes. Maybe someone has been through this situation and could give some direction on how to troubleshoot it :)

The setup is quite simple: there is a master exchange which fans out every message to ~40 "slave" exchanges, each on its own geographically distributed RabbitMQ node, reached through a VPN tunnel over more or less shitty ADSL lines (Italy, yay!). Each "slave" then has its own replica set of queues, Symfony2 consumers, etc. Every server hosting a "slave" node is a mirror (both in hardware and software) of the others.

The load is not heavy: about 40k to 70k messages per day, with a high in the evenings and on weekends but no real peak, and almost no traffic in the late night / early morning.
About 35 slaves work smoothly and steadily as expected, while ~5 of them can't keep up. Messages start queuing up in their internally created federation queue on the master node, and while some messages do go through, in the long run the queue keeps growing. During the night (when there's no traffic) the queues decrease, but veeeeery slowly: not even 2k messages drained per queue.
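For reference, here's roughly how we keep an eye on this from the CLI on the master; a minimal sketch, assuming the default vhost and the default naming of the internal federation queues:

# depth and consumer count of the internally created federation queues
rabbitmqctl list_queues name messages consumers | grep 'federation'

# per-connection byte counters; on a healthy link send_oct keeps growing
rabbitmqctl list_connections peer_host state send_oct recv_oct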

We tried fiddling with prefetch counts, heartbeats and other config options. Our sysadmins checked network settings, tweaked MTUs, and watched netstat ad nauseam, to no avail. Note that while we have access to each server hosting a RabbitMQ node, we did not set up the VPN and don't know the whole network topology (firewalls and such) at each site. The IT person who manages the network swears it's the same in every location, including the ~5 failing nodes.

Enough with the prologue, here's the bizarre connection behaviour: watching the connection related to the federation queue of an unhealthy slave, the chart in the management panel shows sporadic bursts of outgoing traffic and then nothing else (apart from the heartbeat every 30 secs). Said heartbeat says the connection is alive and well... but idle the whole time. Meanwhile, there are ~90k messages in the queue.
In the attached picture you can see this on the left, while on the right there is one of the healthy slaves.

Any idea what could cause this? Could it just be down to a shitty network at the downstream node? Are some of the servers cursed?
Thanks in advance,
Daniele
rabbit_connections.png

Michael Klishin

Sep 29, 2015, 8:25:44 AM
to rabbitm...@googlegroups.com
What kind of configuration do you use? What's in RabbitMQ log files on both upstream and downstream nodes? What versions do you run?

Daniele Bartocci

Sep 29, 2015, 9:45:52 AM
to rabbitmq-users
Default configuration on every node. We started out in March with version 3.4.4, and progressively upgraded to 3.5.1 and later. We're currently using 3.5.4 on most nodes (including the master) and trying 3.5.5 on some of the faulty slaves, but we haven't seen any noticeable improvement (it has always run perfectly on 85% of the slave nodes, and faultily on the others, regardless of the RabbitMQ version used).

Federation upstream settings are as follows (the requirement is guaranteed delivery):
{"uri":"amqp://user:pa...@10.5.0.59?heartbeat=30","ack-mode":"on-confirm","trust-user-id":false}

Log files on the upstream node show a bunch of connections opened/closed; there are also shovels pushing messages from the slaves to the master, which work fine for every node. Here is an extract for a connection bound to the federation queue of a faulty node:
=INFO REPORT==== 29-Sep-2015::13:05:09 ===
accepting AMQP connection <0.15776.32> (10.5.118.70:49559 -> 10.5.0.59:5672)
...
=WARNING REPORT==== 29-Sep-2015::14:34:39 ===
closing AMQP connection <0.15776.32> (10.5.118.70:49559 -> 10.5.0.59:5672):
connection_closed_abruptly


These are the corresponding entries in the slave node's log:
=INFO REPORT==== 29-Sep-2015::13:05:39 ===
Federation exchange 'federated' in vhost '/' connected to exchange 'federated' in vhost '/' on amqp://10.5.0.59

=ERROR REPORT==== 29-Sep-2015::14:34:40 ===
** Generic server <0.337.0> terminating
** Last message in was socket_closed
** When Server state == {state,amqp_network_connection,
                            {state,#Port<0.7938>,
                                <<"client 10.5.118.70:49559 -> 10.5.0.59:5672">>,
                                30,<0.359.0>,131072,<0.336.0>,undefined,false},
                            <0.358.0>,
                            {amqp_params_network,<<"user">>,
                                <<"pass">>,<<"/">>,"10.5.0.59",
                                5672,0,0,30,infinity,none,
                                [#Fun<amqp_uri.9.9354953>,
                                 #Fun<amqp_uri.9.9354953>],
                                [],[]},
                            0,
                            [{<<"capabilities">>,table,
                              [{<<"publisher_confirms">>,bool,true},
                               {<<"exchange_exchange_bindings">>,bool,true},
                               {<<"basic.nack">>,bool,true},
                               {<<"consumer_cancel_notify">>,bool,true},
                               {<<"connection.blocked">>,bool,true},
                               {<<"consumer_priorities">>,bool,true},
                               {<<"authentication_failure_close">>,bool,true},
                               {<<"per_consumer_qos">>,bool,true}]},
                             {<<"cluster_name">>,longstr,
                              <<"rabbit@master-node">>},
                             {<<"copyright">>,longstr,
                              <<"Copyright (C) 2007-2015 Pivotal Software, Inc.">>},
                             {<<"information">>,longstr,
                              <<"Licensed under the MPL.  See http://www.rabbitmq.com/">>},
                             {<<"platform">>,longstr,<<"Erlang/OTP">>},
                             {<<"product">>,longstr,<<"RabbitMQ">>},
                             {<<"version">>,longstr,<<"3.5.4">>}],
                            none,false}
** Reason for termination ==
** socket_closed_unexpectedly

Sometimes it dies with a different message; here it is (with some exchange info stripped for readability), followed by the subsequent reconnection:
=ERROR REPORT==== 27-Sep-2015::05:57:19 ===
** Generic server <0.5027.73> terminating
** Last message in was {'DOWN',#Ref<0.0.1199.218267>,process,<0.5057.73>,
                               killed}
** When Server state == {state,
                         {upstream,
                          [<<"amqp://user:pa...@10.5.0.59">>],
                          <<"federated">>,<<"federated">>,1000,1,1,none,none,
                          false,'on-confirm',none,<<"master">>},
                         {upstream_params,
                          <<"amqp://user:pa...@10.5.0.59">>,
                          {amqp_params_network,<<"user">>,
                           <<"pass">>,<<"/">>,"10.5.0.59",
                           undefined,0,0,0,infinity,none,
                           [#Fun<amqp_uri.9.9354953>,#Fun<amqp_uri.9.9354953>],
                           [],[]},
                          {exchange,
                           {resource,<<"/">>,exchange,<<"federated">>},
                           topic,true,false,false,[],
                           [{federation,
                             [{{<<"master">>,<<"federated">>},<<"B">>}]}],
                           [{vhost,<<"/">>},
                            {name,<<"federate-to-master">>},
                            {pattern,<<"^federated$">>},
                            {'apply-to',<<"exchanges">>},
                            {definition,
                             [{<<"federation-upstream">>,<<"master">>}]},
                            {priority,0}],
                           {[],[rabbit_federation_exchange]}},
                          <<"amqp://10.5.0.59">>,
                          [{<<"uri">>,longstr,<<"amqp://10.5.0.59">>},
                           {<<"exchange">>,longstr,<<"federated">>}]},
                         <<"rabbit@master-node">>,<0.5039.73>,
                         <0.5057.73>,<<"amq.ctag-O-AAy_9iU5LCRy4krTlqTA">>,
                         <<"federation: federated -> rabbit@slave-node-name">>,
                         <<"federation: federated -> rabbit@slave-node-name A">>,
                         {0,nil},
                         6,
                         {dict,5,16,16,8,80,48,
                          {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                          {{-- EXCHANGE BINDINGS --,
                         <0.5023.73>,<0.5058.73>,
                         {resource,<<"/">>,exchange,<<"federated">>},
                         {98,
                          {906,906,
                           {903,903,nil,{905,905,{904,904,nil,nil},nil}},
                           {921,921,
                            {914,914,
                             {910,910,
                              {908,908,{907,907,nil,nil},{909,909,nil,nil}},
                              {912,912,{911,911,nil,nil},{913,913,nil,nil}}},
                             {918,918,
                              {916,916,{915,915,nil,nil},{917,917,nil,nil}},
                              {920,920,{919,919,nil,nil},nil}}},
                            {928,928,
                             {925,925,
                              {923,923,{922,922,nil,nil},{924,924,nil,nil}},
                              {927,927,{926,926,nil,nil},nil}},
                             {932,932,
                              {930,930,{929,929,nil,nil},{931,931,nil,nil}},
                              {963,963,
                               {948,948,
                                {940,940,
                                 {936,936,
                                  {934,934,
                                   {933,933,nil,nil},
                                   {935,935,nil,nil}},
                                  {938,938,
                                   {937,937,nil,nil},
                                   {939,939,nil,nil}}},
                                 {944,944,
                                  {942,942,
                                   {941,941,nil,nil},
                                   {943,943,nil,nil}},
                                  {946,946,
                                   {945,945,nil,nil},
                                   {947,947,nil,nil}}}},
                                {956,956,
                                 {952,952,
                                  {950,950,
                                   {949,949,nil,nil},
                                   {951,951,nil,nil}},
                                  {954,954,
                                   {953,953,nil,nil},
                                   {955,955,nil,nil}}},
                                 {960,960,
                                  {958,958,
                                   {957,957,nil,nil},
                                   {959,959,nil,nil}},
                                  {962,962,{961,961,nil,nil},nil}}}},
                               {978,978,
                                {971,971,
                                 {967,967,
                                  {965,965,
                                   {964,964,nil,nil},
                                   {966,966,nil,nil}},
                                  {969,969,
                                   {968,968,nil,nil},
                                   {970,970,nil,nil}}},
                                 {975,975,
                                  {973,973,
                                   {972,972,nil,nil},
                                   {974,974,nil,nil}},
                                  {977,977,{976,976,nil,nil},nil}}},
                                {986,986,
                                 {982,982,
                                  {980,980,
                                   {979,979,nil,nil},
                                   {981,981,nil,nil}},
                                  {984,984,
                                   {983,983,nil,nil},
                                   {985,985,nil,nil}}},
                                 {990,990,
                                  {988,988,
                                   {987,987,nil,nil},
                                   {989,989,nil,nil}},
                                  {992,992,
                                   {991,991,nil,nil},
                                   {993,993,nil,
                                    {997,997,
                                     {995,995,
                                      {994,994,nil,nil},
                                      {996,996,nil,nil}},
                                     {999,999,
                                      {998,998,nil,nil},
                                      {1000,1000,nil,nil}}}}}}}}}}}}}}}
** Reason for termination ==
** {upstream_channel_down,killed}

=INFO REPORT==== 27-Sep-2015::05:57:53 ===
Federation exchange 'federated' in vhost '/' connected to exchange 'federated' in vhost '/' on amqp://10.5.0.59


Not much to see here (to me), I'm afraid. It seems the connection hangs up when it finally detects it's stale, but there's no indication of why it isn't transmitting any data while it's up.

Dave Kaufman

Jun 27, 2016, 9:33:13 AM
to rabbitmq-users
Hello - I'm having a similar issue.  Did you ever resolve this?

Thanks,
Dave



Michael Klishin

Jun 27, 2016, 9:36:42 AM
to rabbitm...@googlegroups.com
`connection_closed_abruptly` means a client (including federation links)
abruptly closed its TCP connection without cleanly closing the AMQP 0-9-1 connection first.

If there are intermediaries (e.g. load balancers) involved, check their logs and inactivity timeouts.
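A quick way to gauge how often this happens on the upstream (the log path here is an assumption; adjust it to your install):

# count abrupt client closures seen by the upstream node
grep -c connection_closed_abruptly /var/log/rabbitmq/rabbit@master-node.log

# show the timestamps of the last few, to compare against any idle
# timeout configured on firewalls or load balancers in between
grep -B2 connection_closed_abruptly /var/log/rabbitmq/rabbit@master-node.log | tail -20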

--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Daniele Bartocci

Nov 8, 2016, 12:38:37 PM
to rabbitmq-users
Hello Dave,

After new developments highlighted long response times in services publishing messages to the local RabbitMQ queues on the faulty nodes, with no direct involvement of the federation plugin, we were finally able to pinpoint the problem.
Symptoms included slow responses even from RabbitMQ's own commands (time rabbitmqctl status took up to 8 seconds on faulty nodes, less than a second on healthy ones).

After a bit of debugging with strace we found that this was due to timeouts on poll calls on sockets while trying to contact DNS servers. Despite all the relevant hostnames being available in /etc/hosts, we guessed something at the OS level was still triggering lookups because of the VPN setup.
A quick look at /etc/resolv.conf and a quick ping check showed that one of the two DNS servers configured for the nodes (every node had two DNS servers on its own local network) was not reachable. Removing it from the list of resolvers fixed the issue, both locally and on the federation.
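Roughly, the checks that exposed it looked like this (a sketch, not the exact commands; the resolver IP is a placeholder):

# ~8 seconds on the faulty nodes, well under a second on the healthy ones
time rabbitmqctl status

# which resolvers is the box actually configured with?
cat /etc/resolv.conf

# the second resolver never answered on the faulty sites
ping -c 3 <resolver-ip>

# the stalls show up as poll() timeouts following connects to port 53
strace -f -e trace=poll,connect,sendto,recvfrom rabbitmqctl status 2>&1 | grep -B1 -A2 'htons(53)'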

I'm guessing again that every batch of messages from the upstream to the faulty node triggered a hostname resolution, which more often than not timed out and stalled the whole confirm mechanism.

Hope this helps!
Daniele

Michael Klishin

Nov 8, 2016, 12:41:23 PM
to rabbitm...@googlegroups.com, Daniele Bartocci
Thanks for sharing. DNS resolution slowdowns are something that's pretty universally confusing and painful to investigate.
