Shovel loses messages on cluster startup


Алексей Просвирнин

Aug 1, 2018, 9:08:32 AM
to rabbitmq-users
Hello!
I have a problem with a cluster and shovels.

I have:
a clustered RabbitMQ (3 instances on CentOS),
a standalone RabbitMQ server (in the same subnet),
and a shovel which transfers messages from the standalone server to the cluster. All messages are persistent.

Everything works fine except the moment when I try to reboot the cluster. In that case all messages accumulated on the standalone server are simply dropped when the cluster starts and the shovel begins working. At the same time I can see IN traffic on the exchange, but nothing OUT.



Steps to reproduce:
1) Create a cluster (3 instances)
2) Create a standalone server
3) Create a shovel on the cluster to transfer messages from the standalone server to the cluster
4) Stop the whole cluster
5) Publish some messages to the standalone server (these messages stay in the queue because there are no consumers)
6) Start the cluster
7) When the cluster has started, the shovel fetches all messages from the standalone server. But these messages get lost and I can't find them in the cluster's queues.

If I publish messages when all servers are up and running, they are transferred from the standalone server and persisted in the cluster queue.
The problem occurs only when the whole cluster is rebooted.
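The steps above can be sketched roughly like this (a sketch only: container names and the standalone node's management port are assumptions, not taken from the attached script):

```shell
# Stop the whole cluster at once (simulating maintenance / power-off)
docker-compose stop rmq1 rmq2 rmq3

# Publish a persistent message to the standalone node via the management HTTP API
# while the cluster is down; it stays queued because there are no consumers.
curl -u guest:guest -H 'content-type: application/json' \
  -d '{"properties":{"delivery_mode":2},"routing_key":"curl-queue","payload":"m1","payload_encoding":"string"}' \
  'http://localhost:15672/api/exchanges/%2F/amq.default/publish'

# Bring the cluster back; the shovel should now drain the standalone queue.
docker-compose start rmq1 rmq2 rmq3
```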

I created a docker-compose setup for reproducing the issue (see the attachment); you can just execute issue.sh to see the problem.

P.S. Federations have the same problem.

Thanks
rabbit_drop_issue.zip

Алексей Просвирнин

Aug 1, 2018, 9:11:45 AM
to rabbitmq-users
Can somebody help me? Why does RabbitMQ lose messages in this case?

On Wednesday, August 1, 2018 at 16:08:32 UTC+3, Алексей Просвирнин wrote:

Michael Klishin

Aug 2, 2018, 4:04:36 PM
to rabbitm...@googlegroups.com
What does "drop out" mean? Are you sure that the queues used by the shovel(s) are not auto-delete or exclusive?
Those would be deleted [1] if the only connection/consumer on such a queue goes away.

A positive inbound (ingress) and zero outbound (egress) rate on an exchange suggests that messages published
to that exchange are not routed anywhere. Check the bindings on that exchange.

That's about as much as we can suggest with the amount of information provided. Check the queue properties, Shovel settings such as QoS/prefetch/acknowledgement mode [2],
and the server logs, and consider providing a more detailed description of what exactly your setup looks like.
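One quick way to inspect those bindings (a sketch; the exchange name matches the examples given later in this thread):

```shell
# List bindings in vhost "/": which exchange routes to which queue, with what key
rabbitmqctl list_bindings -p / source_name destination_name routing_key

# Or via the management HTTP API: all bindings whose source is a given exchange
curl -u guest:guest \
  'http://localhost:15672/api/exchanges/%2F/curl-exchange/bindings/source'
```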


--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Ilya Zonov

Aug 4, 2018, 4:42:20 AM
to rabbitmq-users
Hello, Michael!

I'm working with Alexey on this issue.

What does "drop out" mean? Are you sure that the queues used by the shovel(s) are not auto-delete or exclusive?
Those would be deleted [1] if the only connection/consumer on such a queue goes away.

It means that the messages accumulated on the remote side (the client node in our example) go missing after the cluster restart. The queue is durable and not auto-delete. Before the cluster restart, messages are transferred to the cluster without issues. The messages have delivery-mode=2. The shovel was tested with the default ack-mode (which one is the default? what does the question mark "?" mean in the ack-mode column of the web UI?) and with on-confirm. It seems that with on-confirm mode the reproduction rate is lower.
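For reference, the ack-mode can be set explicitly when declaring a dynamic shovel. A minimal sketch (the shovel name, URIs, and queue/exchange names are placeholders for the setup described here):

```shell
# Declare a dynamic shovel on the cluster that consumes from the client node
# and publishes to the cluster exchange, acking only after publisher confirms.
rabbitmqctl set_parameter shovel my-shovel \
  '{"src-uri": "amqp://client-host", "src-queue": "curl-queue",
    "dest-uri": "amqp://", "dest-exchange": "curl-exchange",
    "ack-mode": "on-confirm"}'
```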

A positive inbound (ingress) and zero outbound (egress) rate on an exchange suggests that messages published
to that exchange are not routed anywhere. Check the bindings on that exchange.

After the cluster restart the exchange has a binding (#) to the test queue. But we only see ingress messages in the statistics.

Alexey has prepared a reproduction script with docker containers. You can see it in the first mail. I have copied it to a gist: https://gist.github.com/puzan/d15d57dc4d8b9726e2bc1a19c03ea788. You can just run issue.sh to reproduce the issue and check the test queue (curl-queue) on http://localhost:15678. We expect that it should not be empty. I hope we are doing something wrong and it is a configuration issue.

Also, I have attached the rabbit logs captured during the cluster restart to the gist: https://gist.github.com/puzan/d15d57dc4d8b9726e2bc1a19c03ea788#file-startup-logs-log.

We can see the same behaviour with federation.
Is it possible that a binding is temporarily unavailable during cluster startup? Is it possible that the shovel/federation plugin starts before RabbitMQ restores all exchanges, bindings and queues?

This scenario reproduces for a shovel with an exchange destination, but it seems to work correctly with a queue destination.
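The queue-destination variant that seems to work can be sketched like this (again, the shovel name and URIs are placeholders):

```shell
# Same shovel, but publishing straight into the destination queue,
# bypassing exchange routing on the cluster side.
rabbitmqctl set_parameter shovel my-shovel \
  '{"src-uri": "amqp://client-host", "src-queue": "curl-queue",
    "dest-uri": "amqp://", "dest-queue": "curl-queue"}'
```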

Just to summarise what we see:
  • We have a 3-node cluster (rmq1, rmq2, rmq3) which receives messages from a single-node RabbitMQ (client) via a shovel, from a client queue to a cluster exchange. The shovel runs on the cluster side.
  • Stop the cluster (for example during maintenance or an unexpected power-off)
  • The client accumulates messages (some software produces data every second)
  • Start the cluster
  • We see that some messages are lost during cluster startup

Michael Klishin

Aug 6, 2018, 2:37:04 AM
to rabbitm...@googlegroups.com
Thanks, but before you continue any experiments please move away from 3.6.6. It is 9 releases behind even within the now EOL'ed 3.6.x series [1].
"Please upgrade" [2] is the only piece of advice we have for 3.6.6 users. No issues reported against 3.6.6 will be investigated, sorry.

You have 3 nodes, a queue, a topic exchange that's used as a fanout (bound with routing key = "#") and
a Shovel with a dummy source and destination. The Shovel is also set to auto-delete. I am still not sure what the end goal is here
and what is expected.

The only exception I see in the log is from a management stats collector, so I wouldn't be surprised if the stats were off,
but nothing else stands out.



Ilya Zonov

Aug 6, 2018, 5:05:24 PM
to rabbitm...@googlegroups.com
Hello, Michael!

I have updated rabbitmq in the gist to 3.7: https://gist.github.com/puzan/d15d57dc4d8b9726e2bc1a19c03ea788. Also I have added an HA policy, changed the shovel ack-mode to on-confirm and removed the delete-after flag.

Our goal is to ensure that data is transferred from the client node to the cluster. Our app collects metrics from video streams at many points (up to 3000). Every point has a RabbitMQ node; it is like the "client" in the gist example. On the other side we have an aggregation server used for user management and data visualisation. There we use a RabbitMQ cluster with 3 nodes. Now we see that during a full cluster restart some data is sometimes lost.

I have added overview screenshots captured during cluster startup (with the 3.7 version). The first (https://gist.githubusercontent.com/puzan/d15d57dc4d8b9726e2bc1a19c03ea788/raw/c88164d4f66b38e20f2ef6556cb5aa31141ac6f3/Screen%2520Shot%25202018-08-06%2520at%252023.04.32.png) is for the client. We see that 3 messages were added to the queue on the client side at 23:03:45. At this time the cluster is down, so the messages are stored on the client node. At about 23:04:05 the cluster came up and the messages were moved to the cluster. But on the screenshot with the overview from the cluster (https://gist.githubusercontent.com/puzan/d15d57dc4d8b9726e2bc1a19c03ea788/raw/c88164d4f66b38e20f2ef6556cb5aa31141ac6f3/Screen%2520Shot%25202018-08-06%2520at%252023.04.42.png) there is only publish traffic. And there are no messages in the queues.

I expect that these 3 messages should be on the cluster side after the restart. Also, after the restart I see the durable exchange (curl-exchange), the queue (curl-queue) and the fanout # binding between them. The fanout # binding is used just for the example. So it is very strange to me that curl-exchange has only ingress traffic. All messages sent after the restart are transferred correctly.
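For what it's worth, a binding key of "#" does match every routing key under AMQP 0-9-1 topic semantics ("*" = exactly one word, "#" = zero or more words). A small illustrative sketch of those matching rules (not the broker's actual implementation):

```python
def topic_matches(pattern: str, key: str) -> bool:
    """Return True if AMQP 0-9-1 topic binding `pattern` matches routing `key`."""
    p = pattern.split(".") if pattern else []
    k = key.split(".") if key else []

    def match(i: int, j: int) -> bool:
        if i == len(p):
            return j == len(k)          # pattern exhausted: key must be too
        if p[i] == "#":                 # '#' matches zero or more words
            return any(match(i + 1, jj) for jj in range(j, len(k) + 1))
        if j == len(k):
            return False                # key exhausted, but pattern is not
        return p[i] in ("*", k[j]) and match(i + 1, j + 1)

    return match(0, 0)

# "#" behaves like a fanout binding: it matches any routing key at all.
print(topic_matches("#", "metrics.point.42"))   # True
print(topic_matches("point.*", "point.42"))     # True
print(topic_matches("point.*", "point.42.x"))   # False
```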

Sometimes the described issue does not reproduce. For example, today it reproduced 3 times out of 6 attempts with our script.


I'm ready to collect any additional information. Also, if possible, I'm ready to discuss this issue in Skype, Slack, Telegram or another messenger and show you a live demo of the issue. It is urgent for us.

Thank you in advance.

Илья Зонов (Ilya Zonov) aka puzan
Нижний Новгород, Россия (Nizhny Novgorod, Russia)

Michael Klishin

Aug 6, 2018, 5:34:18 PM
to rabbitm...@googlegroups.com
So there are two nodes, A (an intermediate collector or source or client) and B (the final destination),
and you shovel messages from a queue in A to an exchange in B with confirms enabled and a durable topology.

What node is restarted, A or B?


Ilya Zonov

Aug 7, 2018, 12:45:31 AM
to rabbitm...@googlegroups.com
Just want to add:
  • A is a single RabbitMQ node.
  • B is a RabbitMQ cluster with 3 nodes.

All nodes of B are restarted.

This issue does not reproduce if B is a single RabbitMQ node.


Michael Klishin

Aug 7, 2018, 8:59:12 AM
to rabbitm...@googlegroups.com
Can logs and config files from all B nodes be posted?


Michael Klishin

Aug 7, 2018, 9:09:01 AM
to rabbitm...@googlegroups.com
Or better yet, effective configuration and policies [1] and the logs from A as well.

Can you reproduce the same behavior with e.g. PerfTest [2] or any other publisher?
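A PerfTest run against the client node might look like this (a sketch; the URI and queue name are placeholders, and the flags assume a reasonably recent PerfTest):

```shell
# Publish 100 persistent messages to the client node's queue, with no consumers,
# so they accumulate there just like in the reproduction scenario.
bin/runjava com.rabbitmq.perf.PerfTest \
  --uri amqp://client-host \
  --queue curl-queue \
  --producers 1 --consumers 0 \
  --pmessages 100 \
  --flag persistent
```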


Michael Klishin

Aug 7, 2018, 9:11:52 AM
to rabbitm...@googlegroups.com
Also, if B is a cluster then what node does the Shovel on A connect to? Is it the restarted node
and is it the same node every time?

Michael Klishin

Aug 7, 2018, 9:34:48 AM
to rabbitm...@googlegroups.com
Our team is leaning towards the "the issue is on the publishing end" hypothesis, but let's see the logs
and effective configuration first to avoid chasing the wrong goose.

A traffic capture [1] for the entire test between A and its target node in B would reveal a lot of very important information,
e.g.

 * What does the topology look like
 * What routing keys are used by the Shovel connection/channel running on A
 * Are publisher confirms actually enabled

and so on.

I'm trying to reproduce something similar with 2 separate machines on a LAN.

Ilya Zonov

Aug 7, 2018, 1:14:50 PM
to rabbitm...@googlegroups.com
Hello!

I will provide all the requested information later. A few comments on your questions.

> Can logs and config files from all B nodes be posted?
rmq* are the clustered nodes (B) and client_* is the client node (A).

> Or better yet, effective configuration and policies [1] and the logs from A as well.

> Can you reproduce the same behavior with e.g. PerfTest [2] or any other publisher?
Is it really needed? The issue reproduces with just a few messages; in our example, just 3 messages.

> Also, if B is a cluster then what node does the Shovel on A connect to? Is it the restarted node and is it the same node every time?
The shovel runs on one of the cluster nodes. It seems that after a full cluster restart (restarting all cluster nodes at the same time) the shovel migrates between cluster nodes. I will look for this information in the logs. As far as I understand, there is no way to control which node a shovel runs on in the cluster.


Michael Klishin

Aug 7, 2018, 1:57:04 PM
to rabbitm...@googlegroups.com
I have only spent some 5 minutes looking at the logs, but one thing that immediately stands out is that when nodes report a mirror going down,
coming up and then the synchronisation status, it always says

> Mirrored queue 'curl-queue' in vhost '/': Synchronising: 0 messages to synchronise

so the target queue during restarts is empty. This seems pretty suspicious and would be the first thing I'd investigate.

I am terribly sorry, but I still do not have a decent understanding of what exactly your test involves.
Previously I thought that only one node is restarted, but according to the log

> rmq1_1 | 2018-08-06 20:03:17.207 [info] <0.1098.0> RabbitMQ is asked to stop...
> rmq2_1 | 2018-08-06 20:03:15.577 [info] <0.694.0> RabbitMQ is asked to stop...
> rmq3_1 | 2018-08-06 20:03:16.999 [info] <0.824.0> RabbitMQ is asked to stop...

all 3 nodes were shut down within a 2-second period.

This leads me to only two hypotheses:

 * That your cluster ends up in the "no synchronised mirror left to promote" situation [1], but in that case there would be log entries about that
 * That the Shovel consumes all messages, keeps them in RAM, and then all nodes are restarted at the same time

In the latter case those messages in theory should be requeued if the confirm and acknowledgement modes are right, but this is greatly
complicated by the fact that Shovels also have to migrate at the same time while the cluster is being shut down, so there can
be the same fundamental problem of "no peer to promote/migrate to" at some point.

This is a non-trivial scenario for the plugin to handle, and even with the eventual adoption of a different consensus algorithm it would still be problematic, because most consensus
algorithms assume that a quorum (majority) of nodes is online, or at least that nodes shut down sequentially so that the state transition log
can be preserved.

Combined with your earlier note that this doesn't happen with a single node (no migration necessary/possible) and doesn't happen reliably
in a cluster, I conclude that this is a very plausible scenario. Try restarting nodes one by one and see
if it makes a difference.
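In the docker-compose reproduction, that one-by-one restart could be sketched as (the service names are assumptions matching the gist; the delay is arbitrary):

```shell
# Restart cluster nodes sequentially instead of stopping them all at once,
# giving each node time to rejoin and resynchronise before the next restart.
for node in rmq1 rmq2 rmq3; do
  docker-compose restart "$node"
  sleep 15
done
```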

If such events (full cluster shutdown within a few seconds) are common in your system then I suggest that you move Shovels to
a different node/cluster that is not shut down all at once (Shovels can run anywhere and connect to remote nodes for both source
and destination), or pick a different tool/design.

There are no errors in the log that I could spot, nothing else that stands out as suspicious either.



Ilya Zonov

Aug 7, 2018, 4:33:02 PM
to rabbitm...@googlegroups.com
Hello!

> I am terribly sorry but I still do not have a decent understanding of what exactly your test involves. Previously I thought that only one node is restarted but according to the log

We are restarting all three nodes. These are the lines in our reproduction script:

> #Stop rabbitmq cluster
> echo 'Stop rabbitmq cluster'
> docker-compose stop rmq1 rmq2 rmq3

Our QA team found this issue. They are testing our app's behaviour during restarts of different components. And it seems our customers cannot guarantee that the machines with the RabbitMQ cluster will be restarted one by one. We provide software for intranet setups without hardware support.

> That Shovel consumes all messages, keeps them in RAM and then all nodes are restarted at the same time

I think this hypothesis is incorrect. Our scenario is the following:
* Stop the cluster (B)
* Publish some messages to the client (A)
* Start the cluster

Data goes missing on cluster start, not when it is stopped.

> If such events (full cluster shutdown within a few seconds) are common in your system then I suggest that you move Shovels to a different node/cluster that is not shut down all at once (Shovels can run anywhere and connect to remote nodes for both source and destination), or pick a different tool/design.

Yes, it seems that is the best workaround for now. We will test this scenario with a shovel on the client (A) side. We have already tested changing the shovel destination from an exchange to a queue, and it seems to work correctly. That is why I thought the problem is related to bindings. Are there any debug logs produced during message routing?

Tomorrow I'm going to collect the following information:
* a tcp dump in the test network
* debug logs

for the following scenarios:
* standard reproduction with our script
* restarting cluster nodes one by one
* moving the shovel to the client (A) node
* testing the exchange with the alternate-exchange option. I have already checked this, and there was no traffic in the alternate exchange when messages were dropped.
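An alternate exchange can be attached via a policy; a sketch (the "unroutable" exchange/queue and policy names are hypothetical):

```shell
# Declare a fanout "unroutable" exchange plus a queue to catch stray messages,
# then point curl-exchange's alternate-exchange at it via a policy.
rabbitmqadmin declare exchange name=unroutable type=fanout durable=true
rabbitmqadmin declare queue name=unroutable-q durable=true
rabbitmqadmin declare binding source=unroutable destination=unroutable-q
rabbitmqctl set_policy AE '^curl-exchange$' \
  '{"alternate-exchange": "unroutable"}' --apply-to exchanges
```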

Also, I will provide:
* test topology
* effective configuration
* policies


Алексей Просвирнин

Aug 29, 2018, 10:42:43 AM
to rabbitmq-users
Hello!
Sorry for the late answer!

The debug logs are attached.
This is the standard reproduction with our script (all 3 nodes restarting).

On Tuesday, August 7, 2018 at 20:57:04 UTC+3, Michael Klishin wrote:
rmq1.log
rmq2.log
rmq3.log

Michael Klishin

Aug 30, 2018, 1:46:57 AM
to rabbitm...@googlegroups.com
I don't have much to add.

You restart all nodes at the same time, meaning the shovels have no opportunity to migrate and shut down cleanly. Stop them one by one.

Node 1:
> 2018-08-29 13:58:20.962 [info] <0.1104.0> RabbitMQ is asked to stop...

Node2:
> 2018-08-29 13:58:21.229 [info] <0.696.0> RabbitMQ is asked to stop...

Node 3:
> 2018-08-29 13:58:21.591 [info] <0.861.0> RabbitMQ is asked to stop...

Certain plugins and other stateful things may or may not tolerate simultaneous node shutdown well.
Future versions of RabbitMQ will use a consensus protocol that requires a quorum of nodes to be online
by design.


Алексей Просвирнин

Aug 30, 2018, 4:39:33 AM
to rabbitm...@googlegroups.com
Yes, but we have no problem on shutdown. We lose messages on startup.
I have the same issue if I stop the cluster nodes one by one (with a 15-second timeout).

To my mind it should work like this:
1) The node with the shovel is started
2) The shovel takes messages from the client (A) queue
3) The shovel puts the messages into the exchange on the cluster (B)
4) The exchange routes the messages to the queue

But I don't see the last step. The shovel takes a message from the client (A) and puts it into the exchange, but that's all. All queues are empty.
Is this the expected behaviour?


On Thu, Aug 30, 2018 at 8:46, Michael Klishin <mkli...@pivotal.io> wrote:

Michael Klishin

Aug 30, 2018, 5:35:57 AM
to rabbitm...@googlegroups.com
How do you observe that the Shovel "puts" (publishes) messages to an exchange in cluster B?

Questions about how to troubleshoot routing come up on this list every month or so.

See what the Shovel publishes, and what bindings exist where the target exchange is the source.
Shovels don't do anything particularly clever: a Shovel opens two connections (inbound and outbound) and two channels,
then consumes and republishes, handling failures and doing acks in the process.

According to the traffic dump there are a few Shovel connections that open channels, consume and acknowledge
but never publish anything (on that connection, anyway). And the Shovel's version is 3.6.6, which is 10 patches behind
even the now-EOL 3.6.x series.

I highly suggest moving to at least 3.6.16 before you spend any more time investigating.


Алексей Просвирнин

Sep 5, 2018, 9:09:46 AM
to rabbitmq-users

> How do you observe that the Shovel "puts" (publishes) messages to an exchange in cluster B?
I saw it in the exchange chart ("Publish In").

Maybe the Shovel publishes messages to the exchange at a time when the exchange has no bound queue yet?

Is it expected behaviour for RabbitMQ to lose messages after a cluster restart?
I mean that the messages accumulated on the remote side (client) go missing after the cluster restart.

On Thursday, August 30, 2018 at 12:35:57 UTC+3, Michael Klishin wrote:

Michael Klishin

Sep 5, 2018, 11:11:25 AM
to rabbitm...@googlegroups.com
Messages published as transient, as well as non-durable queues, won't be recovered after a node restart, per AMQP 0-9-1
requirements.

Again, you are making broad generalizations and providing few specifics to go on. This is not a productive approach, and this thread has
been going in circles for a long time.

I believe the idea of messages not being routable has been mentioned earlier in this thread several times.
A Shovel uses a regular client under the hood, so troubleshooting it is no different from troubleshooting any other publisher. That topic has been discussed many times before on this list.


Michael Klishin

Sep 5, 2018, 11:12:57 AM
to rabbitm...@googlegroups.com
I am not sure how this dump is different. I see a consumer that consumes N messages
and acknowledges them, then does nothing.

I'm sorry but unless you provide an automated way to reproduce this I'm afraid I'm out of ideas.


Ilya Zonov

Sep 6, 2018, 3:11:43 PM
to rabbitm...@googlegroups.com
Hello, Michael!

We have already provided a script for reproduction based on docker containers: https://gist.github.com/puzan/d15d57dc4d8b9726e2bc1a19c03ea788. If you have docker and docker-compose, you can just download all the gist files and run issue.sh (https://gist.githubusercontent.com/puzan/d15d57dc4d8b9726e2bc1a19c03ea788/raw/c88164d4f66b38e20f2ef6556cb5aa31141ac6f3/issue.sh). The whole scenario is described inside issue.sh. Is it enough for analysis? Is it clear?

Also, the issue reproduces with federation (from a one-node client to the cluster). With shovels we have a workaround: move them to the client side. In the federation case it seems the only option is to migrate to shovels too.

I fully agree that our scenario is rare. A full cluster restart is unusual. But we have no control over our customers' hardware. The issue could be triggered during maintenance or a disaster, when all hardware is restarted at the same time.

I think our issue can be related to restoring queues and bindings during cluster startup when all nodes are started at the same time (±10 seconds). And I hope the Raft implementation will help to fix it. But first of all we should show you that the issue really exists :)

---
Илья Зонов




Michael Klishin

Sep 6, 2018, 6:01:54 PM
to rabbitm...@googlegroups.com
Again, all nodes being restarted at the same time and booting in parallel is a problematic scenario in most distributed systems.

Raft is not a panacea, and it likely WILL NOT address most cluster-wide reboot issues, as Raft requires that a majority of nodes be online at any given moment
to guarantee log consistency.

I had a hypothesis yesterday as to what is going on:

 * A Shovel consumes and acknowledges all messages using a network connection
 * The Shovel then publishes those messages over a direct connection (that uses network distribution instead of a regular AMQP 0-9-1 connection)
 * Those messages are not routed anywhere
 * But they also won't show up in Wireshark when you filter for "amqp" [0-9-1]

Whether a Shovel uses a regular client or direct connection depends on the URI. URIs without a hostname will lead to local (to the node itself) direct connections.
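To illustrate the URI point, here is a sketch of a dynamic shovel definition whose source and destination URIs both include explicit hostnames, so the Shovel uses regular AMQP 0-9-1 client connections (which do show up in Wireshark) rather than a direct connection. Host, vhost, and queue names here are placeholders, not taken from the thread.

```shell
# Hypothetical dynamic shovel definition with explicit hostnames in both URIs.
# With a hostname present, the Shovel opens network client connections instead
# of a local direct connection. All names below are placeholders.
rabbitmqctl set_parameter shovel my-shovel \
  '{"src-uri": "amqp://guest:guest@standalone-host/%2f",
    "src-queue": "source-queue",
    "dest-uri": "amqp://guest:guest@cluster-node-1/%2f",
    "dest-queue": "target-queue"}'
```

This is an ops/config fragment that requires a running broker, not a standalone script.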

We are not aware of any scenario in which nodes would not sync their schema from peers after a restart, but that sync takes some time, and in the meantime another
node that has already booted might be trying to publish something to it. In theory this can happen even with just one node
being restarted. That's my best guess at the moment.

Shovels can use a user-provided reconnection delay [1].
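As a sketch of that option, the dynamic shovel definition accepts a `reconnect-delay` key (in seconds). A longer delay could keep the shovel from consuming until a restarted cluster has had time to restore its queues and bindings; all other names below are placeholders.

```shell
# Hypothetical shovel definition with a 30-second reconnection delay.
# "reconnect-delay" is specified in seconds.
rabbitmqctl set_parameter shovel my-shovel \
  '{"src-uri": "amqp://standalone-host",
    "src-queue": "source-queue",
    "dest-uri": "amqp://cluster-node-1",
    "dest-queue": "target-queue",
    "reconnect-delay": 30}'
```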

Firehose [2] can be used to trace all messages published in a virtual host on a given node.
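A minimal Firehose tracing session might look like the following; the queue name and vhost are placeholders, and the commands assume access to the cluster node the shovel publishes to.

```shell
# Enable the Firehose on the default vhost, then bind a queue to the
# amq.rabbitmq.trace topic exchange to capture every published message.
rabbitmqctl trace_on -p /
rabbitmqadmin declare queue name=firehose-trace
rabbitmqadmin declare binding source=amq.rabbitmq.trace \
  destination=firehose-trace routing_key="publish.#"
# ... reproduce the restart scenario, then inspect firehose-trace ...
rabbitmqctl trace_off -p /
```

Note that the Firehose adds overhead, so it should be switched off once the trace is collected.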



Ilya Zonov

unread,
Sep 7, 2018, 1:12:52 AM9/7/18
to rabbitm...@googlegroups.com
OK, I see, regarding Raft. Thank you for the clarification.

What about our reproduction script? Is it useful for you?

I think the issue currently reproduces only with a cluster because (these are just my assumptions):

* Timing: cluster startup takes more time.
* The shovel/federation plugin can run on any node of the cluster, which may be a node hosting a mirrored or master queue. That may matter during startup.

We will check the scenarios with a hostname in the URI.

--
Ilya Zonov

Michael Klishin

unread,
Sep 7, 2018, 12:38:39 PM9/7/18
to rabbitm...@googlegroups.com
We will get to try it as time permits. Before that we'll assume that it's sufficient.




--
MK

Staff Software Engineer, Pivotal/RabbitMQ
