Issue with federation upstream links and cluster split autoheal

Tuukka Lahtela

Dec 22, 2016, 5:04:12 AM

to rabbitmq-users

I have found a strange issue with federation links and cluster split recovery. The upstream stops working until I recreate the federation policy.

The problem occurs when I create a short network split on the downstream cluster. After autoheal finishes, the federation link seems to break. The admin UI shows that all links are running, but on the upstream side I can see that the exchange for that particular federation is missing the binding from the local exchange to the federation exchange (the "from -> this exchange" part is missing). The link starts working again if I recreate the federation upstream policy in the downstream cluster. I am testing with two-server clusters on both sides. Recreating the policy is enough; I do not need to reboot any of the servers.
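One way to check for this symptom from the upstream side is to list the bindings of the federated exchange and look for one that points at a federation-internal destination. The following is a minimal sketch, not an official tool. It assumes the management plugin's `GET /api/exchanges/{vhost}/{name}/bindings/source` endpoint and assumes that federation-internal names start with `federation: ` (verify the naming on your own brokers). The base URL and credentials are placeholders.

```python
import base64
import json
import urllib.parse
import urllib.request

# Assumption: names created by the federation plugin start with this prefix.
FEDERATION_PREFIX = "federation: "


def bindings_url(base_url, vhost, exchange):
    """Build the management API URL listing bindings where `exchange` is the source."""
    return "{}/api/exchanges/{}/{}/bindings/source".format(
        base_url.rstrip("/"),
        urllib.parse.quote(vhost, safe=""),
        urllib.parse.quote(exchange, safe=""),
    )


def has_federation_binding(bindings):
    """Return True if any binding points at a federation-internal destination."""
    return any(
        b.get("destination", "").startswith(FEDERATION_PREFIX) for b in bindings
    )


def fetch_bindings(base_url, vhost, exchange, user, password):
    """Fetch the bindings from a live broker using HTTP basic authentication."""
    request = urllib.request.Request(bindings_url(base_url, vhost, exchange))
    token = base64.b64encode("{}:{}".format(user, password).encode()).decode()
    request.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

If `has_federation_binding(fetch_bindings(...))` is false on the upstream while the link is reported as running, that matches the state described above.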

KH

Dec 22, 2016, 10:21:35 PM

to rabbitmq-users

Tuukka,

Please refer to https://www.rabbitmq.com/heartbeats.html for more details on heartbeats, which may help resolve this issue. When the upstream connection is recreated, the link applies the federation policy, which in turn recreates the missing exchanges. You could also try modifying your upstream configuration like this:

amqp://user:user@yourserver?heartbeat=10

KH
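For anyone who manages several upstream URIs, the `heartbeat` query parameter suggested above can be set programmatically. This is only a sketch using the Python standard library; the function name `with_heartbeat` is made up for illustration, and whether a heartbeat helps in a given case is a separate question.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_heartbeat(uri, seconds):
    """Return `uri` with its heartbeat query parameter set to `seconds`.

    Any existing heartbeat value is replaced; other parameters are kept.
    """
    parts = urlsplit(uri)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "heartbeat"]
    query.append(("heartbeat", str(seconds)))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_heartbeat("amqp://user:user@yourserver", 10))
```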

Michael Klishin

Dec 23, 2016, 4:59:12 AM

to rabbitm...@googlegroups.com

Also relevant:

https://github.com/rabbitmq/rabbitmq-federation/issues/46

https://github.com/rabbitmq/rabbitmq-website/commit/422732177f744011c0705aaae871e6c1265353e2

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

MK

Staff Software Engineer, Pivotal/RabbitMQ

Tuukka Lahtela

Dec 27, 2016, 12:58:38 AM

to rabbitmq-users

We have been using the heartbeat value and it does not help with this issue.

Tuukka Lahtela

Dec 27, 2016, 5:57:20 AM

to rabbitmq-users

One thing to note about the issue is that the connection between the federated clusters is not broken at any point. The connection only breaks inside the cluster (between two clustered brokers), and the problem occurs once the connection is restored and autoheal finishes.

Tuukka Lahtela

Jan 5, 2017, 7:26:22 AM

to rabbitmq-users

I tried the latest milestone release of 3.6.7, which should have this fix. It did not help.

What I have found is that if the federation exchange (e.g. as shown in the federation status listing) is on the losing node (the loser during autoheal), the problem does not happen, but if the exchange is on the winning node, the problem does happen. I am not 100% sure about this but I have been able to reproduce the situation a few times both ways.

Any idea how to solve the issue? This is a critical issue for us.

I can reproduce the situation quite easily. Unfortunately we are running the system on a closed network, so I cannot include any logs. The logs, however, are not reporting anything strange: no relevant errors that I can see. If you have some way to get more detailed logging, or any settings that might help, I am willing to try them.

Michael Klishin

Jan 5, 2017, 7:53:41 AM

to rabbitm...@googlegroups.com

Federation links will voluntarily stop if the source exchange is not available. I suspect that may be the reason under some circumstances with partitions. We are going to add more logging when that happens.

Tuukka Lahtela

Jan 11, 2017, 2:16:27 AM

to rabbitmq-users

I did some more investigating on this issue and I was able to reproduce it quite easily. The link breaks almost every time I cause a network partition which results in autoheal. The link is restored if the federation policy is recreated or if I restart the app. If there is something that I could try, let me know.

I made a setup with two clusters, each with two brokers. I tried to keep all settings as close to the defaults as possible. The RabbitMQ version used was 3.6.6, but the problem also happened with the milestone release of 3.6.7. The problem occurs at least in Linux and Windows environments.

The config is the same for all brokers except for the ports and node names. I set the net tick time to a short value so that the partition is detected quickly.

[
  {rabbit,
    [
      {log_levels, [{connection, debug}, {channel, debug}, {federation, debug}, {mirroring, debug}]},
      {tcp_listeners, [5672]},
      {heartbeat, 5},
      %% recover from partitions automatically
      {cluster_partition_handling, autoheal},
      {cluster_nodes, {['rabbitOne@brokerOne', 'rabbitOne@brokerOne'], disc}}
    ]},
  %% short tick time so that a partition is detected quickly
  {kernel, [ {net_ticktime, 5} ]},
  {rabbitmq_management, [ {listener, [{port, 6672}, {ip, "127.0.0.1"}]}]}
].

Policies:

vhost    name         apply-to   pattern      definition                          priority
cluster  message-ttl  queues     .*           {"message-ttl":1000}                0
cluster  federation   exchanges  heartbeats   {"federation-upstream-set":"all"}   0
cluster  mirror       queues     federation*  {"ha-mode":"all"}                   1

I only federated the heartbeats exchange to keep unrelated logging etc. out of the picture. We are mirroring the federated queues in our setup, so I also used mirroring here, but I tried without it and got similar results.

The federation configuration:

{
  "uri": ["amqp://user:pass...@127.0.0.1:5674/cluster?heartbeat=10&connection_timeout=60", "amqp://user:pass...@127.0.0.1:5674/cluster?heartbeat=10&connection_timeout=60"],
  "ack-mode": "no-ack",
  "trust-user-id": true,
  "message-ttl": 1000,
  "max-hops": 3,
  "expires": 10000
}
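For reference, an upstream definition like the one above can be applied from a script. The sketch below only builds the `rabbitmqctl set_parameter` command line; the upstream name `upstream-one` and the example URI are hypothetical (the real URIs in this post are partly elided), and the default values simply mirror the definition shown above.

```python
import json


def upstream_definition(uris, ack_mode="no-ack", trust_user_id=True,
                        message_ttl=1000, max_hops=3, expires=10000):
    """Build a federation upstream definition with the keys used in this post."""
    return {
        "uri": list(uris),
        "ack-mode": ack_mode,
        "trust-user-id": trust_user_id,
        "message-ttl": message_ttl,
        "max-hops": max_hops,
        "expires": expires,
    }


def set_upstream_command(vhost, name, definition):
    """Return the argv for `rabbitmqctl set_parameter` that stores the upstream."""
    return ["rabbitmqctl", "set_parameter", "-p", vhost,
            "federation-upstream", name, json.dumps(definition)]


# Hypothetical values for illustration only.
definition = upstream_definition(["amqp://user:secret@127.0.0.1:5674/cluster?heartbeat=10"])
print(set_upstream_command("cluster", "upstream-one", definition))
```

The resulting list can be passed to `subprocess.check_call` on a host where `rabbitmqctl` is available.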

I made a very simple Java app which publishes messages to the heartbeats exchange and reads messages from it. The message is the pid of the Java app, so that I can detect when messages from a given app stop coming.
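The detection idea can be sketched independently of the messaging client. This is not the attached Java app, only an illustration in Python: it records when each sender id (for example a pid carried in the message body) was last seen and reports the senders that have gone quiet. The class name and the timeout are assumptions.

```python
import time


class SilenceDetector:
    """Track the last time each sender was heard from."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock  # injectable so the logic can be tested without waiting
        self._last_seen = {}

    def record(self, sender_id):
        """Call this for every message received, passing the sender's id."""
        self._last_seen[sender_id] = self._clock()

    def silent_senders(self):
        """Return the ids of senders not heard from within the timeout."""
        now = self._clock()
        return sorted(
            sender for sender, seen in self._last_seen.items()
            if now - seen > self._timeout
        )
```

A consumer callback would call `record(pid)` for each delivery, and a periodic check of `silent_senders()` would show when the messages from the other cluster have stopped.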

Broker details:

[{pid,7068},
 {running_applications,
     [{rabbitmq_federation,"RabbitMQ Federation","3.6.6"},
      {rabbitmq_federation_management,"RabbitMQ Federation Management",
          "3.6.6"},
      {rabbitmq_management,"RabbitMQ Management Console","3.6.6"},
      {rabbitmq_management_agent,"RabbitMQ Management Agent","3.6.6"},
      {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.6.6"},
      {rabbit,"RabbitMQ","3.6.6"},
      {amqp_client,"RabbitMQ AMQP Client","3.6.6"},
      {rabbit_common,[],"3.6.6"},
      {webmachine,"webmachine","1.10.3"},
      {mochiweb,"MochiMedia Web Server","2.13.1"},
      {ssl,"Erlang/OTP SSL application","8.0.2"},
      {public_key,"Public key infrastructure","1.2"},
      {crypto,"CRYPTO","3.7.1"},
      {os_mon,"CPO CXC 138 46","2.4.1"},
      {ranch,"Socket acceptor pool for TCP protocols.","1.2.1"},
      {compiler,"ERTS CXC 138 10","7.0.2"},
      {syntax_tools,"Syntax tools","2.1"},
      {xmerl,"XML parser","1.3.12"},
      {inets,"INETS CXC 138 49","6.3.3"},
      {asn1,"The Erlang ASN1 compiler version 4.0.4","4.0.4"},
      {mnesia,"MNESIA CXC 138 12","4.14.1"},
      {sasl,"SASL CXC 138 11","3.0.1"},
      {stdlib,"ERTS CXC 138 10","3.1"},
      {kernel,"ERTS CXC 138 10","5.1"}]},
 {os,{win32,nt}},
 {erlang_version,
     "Erlang/OTP 19 [erts-8.1] [64-bit] [smp:4:4] [async-threads:64]\n"},
 {memory,
     [{total,53622272},
      {connection_readers,0},
      {connection_writers,5472},
      {connection_channels,0},
      {connection_other,77624},
      {queue_procs,11024},
      {queue_slave_procs,14000},
      {plugins,1487760},
      {other_proc,12923432},
      {mnesia,90032},
      {mgmt_db,1595840},
      {msg_index,52128},
      {other_ets,1584776},
      {binary,216816},
      {code,24979196},
      {atom,1033401},
      {other_system,9550771}]},
 {alarms,[]},
 {listeners,[{clustering,7672,"::"},{amqp,5672,"::"},{amqp,5672,"0.0.0.0"}]},
 {vm_memory_high_watermark,0.4},
 {vm_memory_limit,3390111744},
 {disk_free_limit,50000000},
 {disk_free,56448684032},
 {file_descriptors,
     [{total_limit,8092},
      {total_used,3},
      {sockets_limit,7280},
      {sockets_used,1}]},
 {processes,[{limit,1048576},{used,285}]},
 {run_queue,0},
 {uptime,3317},
 {kernel,{net_ticktime,5}}]

ClusterTester.java

Michael Klishin

Jan 11, 2017, 3:40:35 AM

to rabbitm...@googlegroups.com

Do you have a script that you use to reproduce? It doesn't have to be completely self-contained or runnable by us but we should be able to understand what steps are taken.

Thanks.

Tuukka Lahtela

Jan 11, 2017, 4:04:53 AM

to rabbitmq-users

I do not have a script but the steps are quite simple.
Linux:
All brokers are on dedicated servers.
1. Start the app(s) so that you have at least one for each cluster.
2. Create a network partition in one of the clusters. I have used iptables to block the clustering traffic for a short period of time to simulate a network partition and, eventually, autoheal.
    Something like this, set PORT to whatever you are using:
    iptables -A OUTPUT -p tcp --dport $PORT -j DROP && iptables -A INPUT -p tcp --dport $PORT -j DROP
    sleep 30
    iptables -D OUTPUT -p tcp --dport $PORT -j DROP && iptables -D INPUT -p tcp --dport $PORT -j DROP
3. Observe the message flow to the apps: messages from the untouched cluster should stop appearing in the app connected to the other cluster.
4. Either recreate the federation policy or restart the app. Message flow resumes.
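The Linux steps above can be wrapped in a small script so that the block is always lifted again, even if the script is interrupted during the sleep. This is a sketch only: it issues the same iptables rules as in step 2, must run as root on one of the cluster nodes, and the port is whatever your inter-node (clustering) traffic uses. The function names are made up for illustration.

```python
import subprocess
import time


def partition_commands(port, action):
    """Return the iptables commands that add ("block") or delete ("unblock") the DROP rules."""
    flag = {"block": "-A", "unblock": "-D"}[action]
    return [
        ["iptables", flag, chain, "-p", "tcp", "--dport", str(port), "-j", "DROP"]
        for chain in ("OUTPUT", "INPUT")
    ]


def simulate_partition(port, seconds=30, run=subprocess.check_call, sleep=time.sleep):
    """Block clustering traffic for `seconds`, then remove the rules again."""
    for command in partition_commands(port, "block"):
        run(command)
    try:
        sleep(seconds)
    finally:
        # Always remove the rules, even on Ctrl-C during the sleep.
        for command in partition_commands(port, "unblock"):
            run(command)
```

Calling `simulate_partition(PORT)` on one node of the downstream cluster corresponds to step 2; steps 3 and 4 are then observed as described.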

Windows:
I only have one Windows PC, so I set up all of the brokers on a single machine and had to do some tricks to create the network partition.
1. Start the app(s) so that you have at least one for each cluster.
2. Create a network partition in one of the clusters. This is the nasty bit. I used Resource Monitor to suspend one of the brokers until the other one noticed it was gone, and then resumed it again, which caused autoheal.
3. Observe the message flow to the apps: messages from the untouched cluster should stop appearing in the app connected to the other cluster.
4. Either recreate the federation policy or restart the app. Message flow resumes.

We have seen this same behavior in our production environment with real network partitions, so it is not isolated to the test setups.

- Tuukka

Michael Klishin

Jan 11, 2017, 4:35:20 AM

to rabbitm...@googlegroups.com

Alright, thanks.

We could consider retrying N times in the "source not found" scenarios but it's too late to get this into 3.6.7. So, maybe for 3.7.0 or a future 3.6.x release (depending on how many of those the future holds in stock).

We will, however, improve logging in those areas some for 3.6.7.
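To make the "retry N times" idea concrete, here is a generic bounded-retry sketch. It is not the federation plugin's code (the plugin is written in Erlang, and no such retry existed at the time of this message); `SourceNotFound` and `start_link` are hypothetical stand-ins for "the source exchange could not be located" and "try to start the link".

```python
import time


class SourceNotFound(Exception):
    """Hypothetical error: the link could not locate its source exchange."""


def start_with_retries(start_link, max_attempts, delay_seconds, sleep=time.sleep):
    """Call `start_link` up to `max_attempts` times, waiting between attempts.

    Re-raises SourceNotFound if the last attempt also fails.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return start_link()
        except SourceNotFound:
            if attempt == max_attempts:
                raise
            sleep(delay_seconds)
```

The point of the bound is that a source that is only briefly unavailable during autoheal gets another chance, while a source that is really gone still makes the link give up.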

Tuukka Lahtela

Jan 11, 2017, 4:59:44 AM

to rabbitmq-users

Is there an issue that I can follow or should I create one?

- Tuukka

Michael Klishin

Jan 11, 2017, 5:19:57 AM

to rabbitm...@googlegroups.com

We will create one once we understand what's going on.

V Z

Jan 12, 2017, 11:33:12 AM

to rabbitmq-users

Tuukka, what do you mean by "upstream stops working"? Do you see the federation link on the Federation Status screen, or is it gone?

Tuukka Lahtela

Jan 12, 2017, 1:05:04 PM

to rabbitm...@googlegroups.com

The upstream is visible on the status screen and its status is shown as OK.

Tuukka Lahtela

Jan 16, 2017, 2:00:06 AM

to rabbitmq-users

By "upstream stops working" I mean that even though the status of the upstream is reported as OK, the client is not getting any of the messages published "on the other side" of the federation. The message flow resumes if the policy for the federation is recreated or if the client is restarted.

Tuukka Lahtela

Jan 16, 2017, 2:24:22 AM

to rabbitmq-users

One extra detail I just noticed is that the message flow also resumes if a new client is started on the "broken" side of the federation. So even the client which stopped getting messages starts getting them again once a new client is started.

- Tuukka

Tuukka Lahtela

Jan 30, 2017, 9:08:26 AM

to rabbitmq-users

Hi!

Any update on this issue? Were you able to reproduce the problem?

- Tuukka

Michael Klishin

Jan 30, 2017, 9:24:41 AM

to rabbitm...@googlegroups.com

We were, at least some variation thereof. We improved logging and discovered a case where federation links do not get a chance to start after a partial partition.

Without making federation links distributed across nodes, I'm not sure what can be done about that in the general case.

Tuukka Lahtela

Jan 31, 2017, 1:33:06 AM

to rabbitmq-users

Hi,

Thanks for the update.

Should I assume that there will be no fix for this? If so, do you have any suggestions on how to solve the issue? We need message flow to be uninterrupted in all situations, and even though network splits are (hopefully) rare, they are nonetheless possible. Is there any possibility of some type of workaround, given that the problem can be solved either by recreating the policies or simply by starting a new client which publishes to the same stream? I was thinking of some type of recreate mechanism that runs once the autoheal is completed.

- Tuukka
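The "recreate the policy" workaround mentioned in this thread can be scripted. The sketch below only clears and re-sets a policy with `rabbitmqctl clear_policy` and `rabbitmqctl set_policy`; it does not detect when autoheal has finished, so deciding when to run it is left to the operator. The policy values in the usage example are the ones listed earlier in this thread, and the function names are made up for illustration.

```python
import json
import subprocess


def recreate_policy_commands(vhost, name, pattern, definition,
                             apply_to="exchanges", priority=0):
    """Return the two rabbitmqctl commands that delete and re-create a policy."""
    clear = ["rabbitmqctl", "clear_policy", "-p", vhost, name]
    create = ["rabbitmqctl", "set_policy", "-p", vhost,
              "--priority", str(priority), "--apply-to", apply_to,
              name, pattern, json.dumps(definition)]
    return [clear, create]


def recreate_policy(vhost, name, pattern, definition,
                    apply_to="exchanges", priority=0, run=subprocess.check_call):
    """Run the clear and set commands in order; raises if either one fails."""
    for command in recreate_policy_commands(vhost, name, pattern, definition,
                                            apply_to, priority):
        run(command)
```

For the setup described earlier, the call would be `recreate_policy("cluster", "federation", "heartbeats", {"federation-upstream-set": "all"})`, run on a node of the downstream cluster.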

Michael Klishin

Jan 31, 2017, 6:34:15 AM

to rabbitm...@googlegroups.com

Someone is investigating if this can be an implementation issue.

Without distributing federation links across cluster nodes there are certain partial intra-cluster partition cases in which such issues cannot be solved.

We couldn't reproduce this without using partial intra-cluster partitions.

Diana Parra Corbacho

Jan 31, 2017, 11:48:15 AM

to rabbitm...@googlegroups.com

Hi Tuukka,

I've been investigating the issues with federation links for over a week, and couldn't find a way to consistently reproduce it. I suspect that there is a race condition that triggers it, but it only happened once in our environment.

I'm afraid that we cannot spend more time investigating this, unless we get a consistent way to reproduce it locally. If you find it, please let us know.

Kind regards,

Diana

Tuukka Lahtela

Jan 31, 2017, 2:02:15 PM

to rabbitmq-users

Hi!

Did you try it the same way I did? I posted earlier just about all of the details about my setup that I can think of. I can reproduce the issue almost every time. I am happy to try out any special builds in case you have any, and to send you the logs if that would help.

- Tuukka

Michael Klishin

Jan 31, 2017, 3:10:43 PM

to rabbitm...@googlegroups.com

Yes, we started with your recommendations.

Tuukka Lahtela

Mar 9, 2017, 6:31:48 AM

to rabbitmq-users

Hi!

Any update? I can try to create a script to create the problematic situation if you have problems reproducing the issue.

- Tuukka

Michael Klishin

Mar 9, 2017, 6:37:24 AM

to rabbitm...@googlegroups.com, Tuukka Lahtela

There are no updates. We are no longer working on this issue, at least for the time being. 3.6.7 will include whatever improvements were made to date.

Tuukka Lahtela

Mar 16, 2017, 4:06:54 AM

to rabbitmq-users, tuukka....@gmail.com

Hi!

I tried my setup with 3.6.7 and I was able to reproduce the problem very easily. Since you seem to have trouble reproducing the problem, would it help if I created some type of virtual image setup? Let me know if you have a preferred method of creating that type of setup. My idea is that, if possible, I could provide you with a setup that can reproduce the error 100% of the time.

- Tuukka

Diana Parra Corbacho

Mar 17, 2017, 4:19:43 AM

to rabbitm...@googlegroups.com

Hi Tuukka,

If you can provide an environment to reproduce 100% of the time, that would be useful. Otherwise, as Michael said, we are no longer working on this issue.

Kind regards,

Diana

Michael Klishin

Mar 17, 2017, 5:49:06 AM

to rabbitm...@googlegroups.com

To be more specific: we are no longer working on it for the time being. We are very interested in changing federation architecture to not host all links on a single cluster node. There's a good chance we will take a look at that for 3.8.0 but currently we have 3.7.0 to finish.

Tuukka Lahtela

Mar 17, 2017, 6:54:31 AM

to rabbitmq-users

Hi!

I will try to create a 100% reproducible setup anyway so that the issue can be verified as resolved once a fix (or a refactoring resulting in a fix) is available.

- Tuukka

charlie vuillemez

Mar 17, 2017, 10:18:01 AM

to rabbitmq-users

Hi all,
On a RabbitMQ v3.6.4 three-node cluster I experienced the same issue after a very short network outage (communication was lost between 2 nodes in the cluster). The problem has been mentioned in this closed GitHub issue.

I use exchange federation between 2 exchanges in 2 different vhosts *on the same broker* (so the AMQP URI begins with amqp://localhost/... ).
During the network split the federation link was broken, I think because the queues bound to the downstream exchange were not on the same host as the upstream queue "federation: .xxxx -> yyyy" (?)

After only 7 seconds, the network was up again:

=INFO REPORT==== 14-Mar-2017::22:46:15 ===
rabbit on node 'rab...@node01-vl3464.prod.msgq.b0.p.fti.net' down
--
=INFO REPORT==== 14-Mar-2017::22:46:22 ===
rabbit on node 'rab...@node01-vl3464.prod.msgq.b0.p.fti.net' up
--
=INFO REPORT==== 14-Mar-2017::22:46:22 ===
rabbit on node 'rab...@node03-vl3464.prod.msgq.b0.p.fti.net' up

... but the federation link was missing on the cluster (no status, no errors in the web UI) ... until I recreated the policy.

I understand that federation gives up when

Link no longer can locate its "source" queue or exchange.

But I don't understand exactly why this is a voluntary implementation choice ... why is the missing link in the cluster not re-created?

To be precise, RabbitMQ with mirrored queues/exchanges is fully resilient except for this, so it is a disadvantage for all of us.

Could you please explain a little bit about that?

Thanks a lot.

Michael Klishin

Mar 17, 2017, 11:12:15 AM

to rabbitm...@googlegroups.com

The voluntary stop is not after a partition but when a link detects that its source (e.g. source exchange) isn't found. What else can it do when it has nothing to work with on the other end?

We'd be happy to make it recover in that particular case and it's not entirely clear why that's happening.

However, with all links on a single node certain scenarios like that, in particular around partial partitions, cannot be avoided entirely. That's why the plan is to move away from all links residing on a single node.

charlie vuillemez

Mar 17, 2017, 12:02:20 PM

to rabbitmq-users

Hi Michael,
How does the plugin decide that a source exchange is lost?

In my case:

* node01 was the downstream federation exchange
* node02 was the upstream federation exchange
* node03 is another node of the cluster
* communication was lost for a few seconds between these 3 nodes

I have the relevant lines from the log files:

* On node01:

=ERROR REPORT==== 14-Mar-2017::22:46:16 ===
** Generic server <0.18725.3908> terminating
** Last message in was {'DOWN',#Ref<0.0.9699332.120384>,process,
                               <0.17500.3908>,killed}
** When Server state == {state,
                         {upstream,
                          [<<"amqp://user1:XXXX@localhost/service_tram">>],
                          <<"tram.publish.content">>,
                          <<"modservices.tram.publish.document">>,1000,1,5,
                          none,none,false,'on-confirm',none,
                          <<"upstream-pfs_engen-service_tram">>,false},
                         {upstream_params,
                          <<"amqp://user1:XXXXX@localhost/service_tram">>,
[...]


** Reason for termination ==
** {upstream_channel_down,killed}

* On node02:

=ERROR REPORT==== 14-Mar-2017::22:46:16 ===
** Generic server <0.18213.718> terminating
** Last message in was {'DOWN',#Ref<0.0.11796483.245683>,process,
                               <0.17807.718>,killed}
** When Server state == {state,
                         {upstream,
                          [<<"amqp://user1:XXXXXX@localhost/pfs_engen">>],
                          <<"modservices.contentes.publish.document">>,
                          <<"content.publish.tram">>,1000,1,5,none,none,false,
                          'on-confirm',none,
                          <<"upstream-service_tram-pfs_engen">>,false},
                         {upstream_params,
                          <<"amqp://user1:XXXXX@localhost/pfs_engen">>,
[...]

** Reason for termination ==
** {downstream_channel_down,killed}

So if the link is killed, the source exchange is considered lost?

What is surprising is that once all nodes could see each other again, the following log on node01 shows that the federation exchange recovered fine:

=INFO REPORT==== 14-Mar-2017::22:46:16 ===
node 'rab...@node03-vl3464.prod.msgq.b0.p.fti.net' up

=INFO REPORT==== 14-Mar-2017::22:46:16 ===
Federation exchange 'modservices.tram.publish.document' in vhost 'pfs_engen' connected to exchange 'tram.publish.content' in vhost 'service_tram' on amqp://localhost/service_tram

But in the morning, the link status was simply absent.
If I can easily create the link manually by deleting and recreating the policy, why isn't this link re-created automatically when the network connection is up again?

I am not sure I understand the real mechanism of the federated exchange link.

Thanks.

Message has been deleted

Tuukka Lahtela

Mar 23, 2017, 9:28:24 AM

to rabbitmq-users

Hi!

I made a Docker-based setup to test the issue.

https://github.com/tuukkala/docker-fedtester

- Tuukka

Michael Klishin

May 30, 2017, 2:29:57 PM

to rabbitm...@googlegroups.com

Hi Tuukka,

We have recently merged a change that seems highly relevant to us: https://github.com/rabbitmq/rabbitmq-common/pull/201.

While the problem wasn't federation-specific, we now can see how it could affect federation link in a very difficult to reproduce way.

It will be in 3.6.11 Milestone 1 later this week. We will try it with your Docker image, feel free to do some testing of your own (I'm happy to provide a one-off build, just let me know what package type you need).

Tuukka Lahtela

May 31, 2017, 12:51:30 AM

to rabbitm...@googlegroups.com

Hi!

Sounds good. I will certainly do some testing. We use the SUSE packages but I can wait for the milestone release if it is already coming this week.

- Tuukka

V Z

unread,

Jun 1, 2017, 2:13:07 PM6/1/17

to rabbitmq-users

I wonder what was in the 6 deleted messages :)

Tuukka Lahtela

unread,

Jun 12, 2017, 7:48:43 AM6/12/17

to rabbitmq-users

Hi Michael!

I did some testing with the 3.6.11 Milestone 1 release and, unfortunately, I was still able to reproduce the problem.

- Tuukka

Diana Corbacho

unread,

Jul 17, 2017, 11:35:05 AM7/17/17

to rabbitmq-users

Thanks Tuukka. I'm using the Docker image that you provided, and I can consistently reproduce the issue.

I will continue the investigation now that we can reproduce it. Thanks again!

Michael Klishin

unread,

Jul 26, 2017, 5:10:32 PM7/26/17

to rabbitmq-users

Hi Tuukka,

Can you please try 3.6.11.M5? We have a couple of improvements in the federation plugin that might be relevant for your case: https://groups.google.com/forum/#!topic/rabbitmq-users/9O52laLiPAQ.

Thank you.

Tuukka Lahtela

unread,

Aug 7, 2017, 8:18:53 AM8/7/17

to rabbitm...@googlegroups.com

Hi!

Sorry for the slow reply. I tried the latest RC release (3.6.11 RC2), which had the issue listed as one of the fixed items. I am happy to report that I was not able to reproduce the issue with the RC version, even though it reproduced quite consistently with older versions. Can't wait to get the release version.

- Tuukka


Diana Parra Corbacho

unread,

Aug 7, 2017, 8:56:10 AM8/7/17

to rabbitm...@googlegroups.com

That is great news! Thanks again for helping us and reporting back.


Michael Klishin

unread,

Aug 7, 2017, 9:00:09 AM8/7/17

to rabbitm...@googlegroups.com

Thanks, Tuukka.

3.6.11 GA is a few days away.


Tuukka Lahtela

unread,

Aug 17, 2017, 3:38:56 AM8/17/17

to rabbitmq-users

Hi!

3.6.11 has been released, but there seems to be a requirement for systemd. It is most likely related to https://github.com/rabbitmq/rabbitmq-server-release/pull/31

Any chance of getting packages for older SUSE versions, similar to CentOS? The systemd requirement blocks us from getting this fix into our production environments.

- Tuukka


Michael Klishin

unread,

Aug 17, 2017, 1:11:39 PM8/17/17

to rabbitm...@googlegroups.com

We have plans to provide several Debian packages eventually, but not for 3.6.11. You can switch to the generic UNIX package or a more recent Debian/Ubuntu version.

Staff Software Engineer, Pivotal/RabbitMQ

Michael Klishin

unread,

Aug 17, 2017, 1:16:48 PM8/17/17

to rabbitm...@googlegroups.com

The PR in question is merged only for 3.7.0.

Tuukka Lahtela

unread,

Aug 17, 2017, 1:36:13 PM8/17/17

to rabbitm...@googlegroups.com

Hi!

That was the only issue related to systemd that I could find. The 3.6.10 version of the SUSE RPM package had no dependency on systemd, but 3.6.11 does, and I could not find any mention of that in any of the release notes. I think it should at least be mentioned somewhere, as it is clearly a breaking change. One might even argue that it is no longer a maintenance release.

- Tuukka

Michael Klishin

unread,

Aug 18, 2017, 12:25:01 PM8/18/17

to rabbitm...@googlegroups.com

OpenSUSE package [in]compatibility is a known issue but I don't think it's new.

You can see package changes in 3.6.x on this branch:

https://github.com/rabbitmq/rabbitmq-server-release/commits/stable

I suggest that we start a separate thread about OpenSUSE as we'd like to produce two packages but there are downsides to doing that and we were under a fair amount of pressure to ship 3.6.11.

There's always a chance to release 3.6.12, 3.6.13 or at least address this for 3.7.0. OpenSUSE community feedback and suggestions would be most welcome as I don't think anyone on our team runs OpenSUSE day-to-day.
