RabbitMQ federation timeout/failure

Dave Kaufman

Jun 27, 2016, 9:30:00 AM
to rabbitmq-users
Hi all! I could really use some help with this RabbitMQ federation timeout issue!

I'm using RabbitMQ broker servers that someone else originally set up. He had one exchange federated, and it is working just fine. I have added a new exchange and set up bi-directional federation between the two brokers. Here are my observations:

1) Every 4 hours and 21 minutes (~15660 seconds), the federation stops for about 20 minutes. The messages are durable, so they back up at the upstream broker and then all "flood" in at the end of that 20-minute outage.

2) When I run the command rabbitmqctl eval 'rabbit_federation_status:status().', the federation status stays "running" the entire time. The timestamp does not update until the end of one of those outage periods; it then updates once and stays the same for another 4 hours and 21 minutes.

3) Within the rabbitmq log, the following appears at both the beginning of the outage period and again at the end of the outage period:

=ERROR REPORT==== 27-Jun-2016::09:19:41 ===
** Generic server <0.8647.80> terminating 
** Last message in was heartbeat_timeout
** When Server state == {state,amqp_network_connection,
                            {state,#Port<0.1823790>,
                                <<"client [IP]:53556 -> [IP]:5672">>,
                                580,<0.8660.80>,131072,<0.8648.80>,undefined,
                                false},
                            <0.8659.80>,
                            {amqp_params_network,<<"federation_user">>,
                                <<"##########">>,<<"/">>,
                                "[UPSTREAM]",5672,0,0,0,infinity,
                                none,
                                [#Fun<amqp_uri.9.9354953>,
                                 #Fun<amqp_uri.9.9354953>],
                                [],[]},
                            0,
                            [{<<"capabilities">>,table,
                              [{<<"publisher_confirms">>,bool,true},
                               {<<"exchange_exchange_bindings">>,bool,true},
                               {<<"basic.nack">>,bool,true},
                               {<<"consumer_cancel_notify">>,bool,true},
                               {<<"connection.blocked">>,bool,true},
                               {<<"consumer_priorities">>,bool,true},
                               {<<"authentication_failure_close">>,bool,true},
                               {<<"per_consumer_qos">>,bool,true}]},
                             {<<"cluster_name">>,longstr,<<"\"[CLUSTER]\"">>},
                             {<<"copyright">>,longstr,
                              <<"Copyright (C) 2007-2014 GoPivotal, Inc.">>},
                             {<<"information">>,longstr,
                              <<"Licensed under the MPL.  See http://www.rabbitmq.com/">>},
                             {<<"platform">>,longstr,<<"Erlang/OTP">>},
                             {<<"product">>,longstr,<<"RabbitMQ">>},
                             {<<"version">>,longstr,<<"3.5.3">>}],
                            none,false}
** Reason for termination == 
** heartbeat_timeout

=ERROR REPORT==== 27-Jun-2016::09:19:41 ===
** Generic server <0.8633.80> terminating
** Last message in was {'DOWN',#Ref<0.0.428.138280>,process,<0.8663.80>,
                               killed}
** When Server state == {state,
                         {upstream,
                          [<<"amqp://federation_user:[##########]@[UPSTREAM]">>,
                           <<"amqp://federation_user:[##########]@[UPSTREAM 2]">>],
                          <<"[EXCHANGE]">>,<<"[EXCHANGE]">>,1000,1,1,3600000,none,
                          false,'on-confirm',none,<<"upstream-man">>},
                         {upstream_params,
                          <<"amqp://federation_user:[##########]@[UPSTREAM 2]">>,
                          {amqp_params_network,<<"federation_user">>,
                           <<"[##########]">>,<<"/">>,"[UPSTREAM 2]",
                           undefined,0,0,0,infinity,none,
                           [#Fun<amqp_uri.9.9354953>,#Fun<amqp_uri.9.9354953>],
                           [],[]},
                          {exchange,
                           {resource,<<"/">>,exchange,<<"[EXCHANGE]">>},
                           topic,true,false,false,[],
                           [{federation,
                             [{{<<"upstream-man">>,<<"[EXCHANGE]">>},<<"A">>}]}],
                           [{vhost,<<"/">>},
                            {name,<<"federation-scs">>},
                            {pattern,<<"^[EXCHANGE]$">>},
                            {'apply-to',<<"exchanges">>},
                            {definition,
                             [{<<"federation-upstream">>,<<"upstream-man">>}]},
                            {priority,0}],
                           {[],[rabbit_federation_exchange]}},
                          <<"[UPSTREAM]">>,
                          [{<<"uri">>,longstr,
                            <<"[UPSTREAM]">>},
                           {<<"exchange">>,longstr,<<"[EXCHANGE]">>}]},
                         <<"\"man\"">>,<0.8647.80>,<0.8663.80>,
                         <<"amq.ctag-Lm4wCtvhIudSdFc3hh5m1Q">>,
                         <<"federation: [EXCHANGE] -> [CLUSTER]">>,
                         <<"federation: [EXCHANGE] -> [CLUSTER B]">>,
                         {0,nil},
                         1152,
                         {dict,3,16,16,8,80,48,
                          {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                          {{[],[],[],[],[],[],[],[],[],[],
                            [[{<<"[KEY]">>,[]}|
                              {set,1,16,16,8,80,48,
                               {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
                                []},
                               {{[],[],[],[],[],
                                 [{resource,<<"/">>,queue,
                                   <<"[QUEUE]">>}],
                                 [],[],[],[],[],[],[],[],[],[]}}}]],
                            [],[],
                            [[{<<"[KEY]">>,[]}|
                              {set,1,16,16,8,80,48,
                               {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
                                []},
                               {{[],[],[],[],[],[],
                                 [{resource,<<"/">>,queue,
                                   <<"[QUEUE]">>}],
                                 [],[],[],[],[],[],[],[],[]}}}]],
                            [[{<<"[KEY]">>,[]}|
                              {set,1,16,16,8,80,48,
                               {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
                                []},
                               {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
                                 [{resource,<<"/">>,queue,
                                   <<"[QUEUE]">>}],
                                 []}}}]],
                            []}}},
                         <0.8652.80>,<0.8656.80>,
                         {resource,<<"/">>,exchange,<<"[EXCHANGE]">>},
                         {0,nil}}
** Reason for termination == 
** {upstream_channel_down,killed}

=INFO REPORT==== 27-Jun-2016::09:19:45 ===
Federation exchange '[EXCHANGE]' in vhost '/' connected to exchange '[EXCHANGE]' in vhost '/' on amqp://[UPSTREAM]


4) Note that I've tried setting up the federation parameter both WITH and WITHOUT the heartbeat in the URIs, and observed no difference in behavior.

5) If I change the federation policy or parameter, it resets the 4 hour countdown.

6) The good exchange with proper federation does not appear in the output of the rabbitmqctl eval 'rabbit_federation_status:status().' command.
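
The interval in observation 1 can be checked against the broker log rather than by watching the clock. Below is a minimal sketch, assuming the log uses the same report headers as the excerpt in observation 3. The function names and the sample lines are hypothetical; only the 09:19:41 timestamp comes from the log above, and the earlier one is made up to illustrate a 4 hour 21 minute gap.

```python
import re
from datetime import datetime

# Matches report headers such as "=ERROR REPORT==== 27-Jun-2016::09:19:41 ==="
HEADER = re.compile(
    r"^=(\w+) REPORT==== (\d{1,2}-\w{3}-\d{4}::\d{2}:\d{2}:\d{2}) ==="
)

def heartbeat_timeout_times(lines):
    """Return the timestamp of each report whose body mentions heartbeat_timeout."""
    times = []
    current = None   # timestamp of the report currently being read
    counted = False  # True once the current report has been recorded
    for line in lines:
        match = HEADER.match(line)
        if match:
            # Note: %b assumes English month abbreviations (C/English locale).
            current = datetime.strptime(match.group(2), "%d-%b-%Y::%H:%M:%S")
            counted = False
        elif current is not None and not counted and "heartbeat_timeout" in line:
            times.append(current)
            counted = True
    return times

def gaps_seconds(times):
    """Seconds between consecutive heartbeat_timeout reports."""
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]

# Hypothetical sample: the first timestamp is invented for illustration.
sample = [
    "=ERROR REPORT==== 27-Jun-2016::04:58:41 ===",
    "** Last message in was heartbeat_timeout",
    "=ERROR REPORT==== 27-Jun-2016::09:19:41 ===",
    "** Last message in was heartbeat_timeout",
    "** Reason for termination == ",
    "** heartbeat_timeout",
]
print(gaps_seconds(heartbeat_timeout_times(sample)))
```

Running this over the real log file (for example with open(path) as the lines argument) would show whether the gaps really are constant, and whether the errors at the start and end of each outage are about 20 minutes apart.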


Michael Klishin

Jun 27, 2016, 9:32:49 AM
to rabbitm...@googlegroups.com
There is a default heartbeat timeout (60s in recent versions). Not having heartbeats at all is a poor idea
(and excluding the parameter is not how you disable them). See http://www.rabbitmq.com/heartbeats.html.

Your federation link detects missing heartbeats from upstream. There probably is a reason for that.
There should be next to no delay in (attempted) link recovery. Have you checked the upstream server logs?

--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Dave Kaufman

Jun 27, 2016, 12:00:38 PM
to rabbitmq-users
I'm working on getting the upstream server logs. For some reason they are empty, so getting them may require a reboot. I'm not sure whether that's part of the problem or not!

I'm running version 3.5.3 on all nodes.

Any idea why the working federation is not showing any status when I run rabbitmqctl eval 'rabbit_federation_status:status().'?

Another data point: I tried adding connection_timeout and heartbeat to the URI in the federation parameter, but the connection_timeout made all of the connections fail, so I removed it. Is this format correct?

rabbitmqctl set_parameter federation-upstream upstream-man-hb '{"uri":["amqp://federation_user:##########@[UPSTREAM]?heartbeat=3&connection_timeout=10","amqp://federation_user:##########@[UPSTREAM]?heartbeat=3&connection_timeout=10"],"expires":3600000,"max-hops":1, "reconnect-delay":3}'
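
One way to rule out quoting or JSON mistakes in a command like the one above is to generate it instead of typing it by hand. This is only a sketch: the helper names, host names and password are hypothetical placeholders, and it checks only that the JSON is well formed, not that the broker accepts the values. The units and accepted values of heartbeat and connection_timeout should be checked against the RabbitMQ URI query parameter documentation.

```python
import json
import shlex
from urllib.parse import quote, urlencode

def upstream_uri(user, password, host, query=None):
    """Build an amqp:// URI, percent-encoding the credentials."""
    base = "amqp://{}:{}@{}".format(quote(user, safe=""), quote(password, safe=""), host)
    return base + "?" + urlencode(query) if query else base

def set_parameter_command(name, definition):
    """Return a shell-safe rabbitmqctl command for a federation upstream."""
    value = json.dumps(definition)
    json.loads(value)  # sanity check: the value round-trips as JSON
    return "rabbitmqctl set_parameter federation-upstream {} {}".format(
        shlex.quote(name), shlex.quote(value)
    )

# Placeholder hosts and password, standing in for the redacted values.
query = {"heartbeat": 3, "connection_timeout": 10}
uris = [
    upstream_uri("federation_user", "secret", "upstream-1.example", query),
    upstream_uri("federation_user", "secret", "upstream-2.example", query),
]
definition = {"uri": uris, "expires": 3600000, "max-hops": 1, "reconnect-delay": 3}
command = set_parameter_command("upstream-man-hb", definition)
print(command)
```

The printed command has the same shape as the hand-written one, with the JSON definition wrapped in single quotes for the shell.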

Michael Klishin

Jun 27, 2016, 5:46:40 PM
to rabbitm...@googlegroups.com
If a connection_timeout of 10 seconds makes all connections fail, I'd seriously check network connectivity between
nodes.