Question about autoheal mode

Ilya Maltsev

May 10, 2018, 4:22:02 AM
to rabbitmq-users

Hi,

I have a little question.

Sometimes, after a network partition is recovered by the autoheal mechanism, some HA queues remain stuck. I can identify them in the Management plugin: their "messages ready", "messages unacked" and "messages total" columns show NaN. The only way to get rid of such queues is a special eval of rabbit_amqqueue:delete_crashed.
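For reference, the eval looks roughly like this (a sketch rather than a canonical recipe: the lookup-then-delete sequence is my assumption for this version, the vhost and queue names are just examples from my setup, and the quoting is Unix-style):

rabbitmqctl eval 'Q = rabbit_misc:r(<<"TestHost">>, queue, <<"ha.queue_autoDeleted_1">>), {ok, QR} = rabbit_amqqueue:lookup(Q), rabbit_amqqueue:delete_crashed(QR).'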

Is it normal behaviour for an RMQ cluster that a manual operation is needed to recover after a network failure?

Michael Klishin

May 10, 2018, 10:28:29 AM
to rabbitm...@googlegroups.com
"NaN columns" is no evidence of "stuck queues". It is evidence of missing stats for some queues, which temporarily can happen
after disconnection and in particular autoheal.
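(For context: autoheal here refers to the partition handling strategy, enabled with something like the following in rabbitmq.conf — a sketch of the new-style config format, not necessarily your exact setup:

cluster_partition_handling = autoheal

or {cluster_partition_handling, autoheal} in the classic config format.)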

There have been a few reports of such behaviour in the last year or so, regardless of whether autoheal was used. I can't tell whether all of them have been addressed, since you haven't provided any logs or version details.




--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Ilya Maltsev

May 11, 2018, 4:36:06 AM
to rabbitmq-users
"NaN columns" is no evidence of "stuck queues"
Yes, it is. But when QueueDeclare fails with a TimeoutException every time, the queue has to be removed.

I have attached two screenshots (before and after recovery) and a log to this message. The cluster has three nodes; Erlang 19.3, RabbitMQ 3.7.4, OS: Windows Server 2012 R2.
If you need more information, let me know; the situation is quite easy to reproduce.


On Thursday, May 10, 2018 at 17:28:29 UTC+3, Michael Klishin wrote:
"NaN columns" is no evidence of "stuck queues". It is evidence of missing stats for some queues, which temporarily can happen
after disconnection and in particular autoheal.

There were a few reports of such behavior regardless of whether autoheal was used in the last year or so. I can't know if all
of them were addressed since you haven't provided any logs or version details.
On Thu, May 10, 2018 at 3:22 AM, Ilya Maltsev <ivma...@gmail.com> wrote:

Hi,

I have a little question.

Sometimes after network partitions and recovering by autoheal mechanism remain stucked HA queues, which I define in the Management plugin, into a columns "messages ready", "messages unacked" and "messages total" was set NAN value. Get rid of such queues is obtained only through a special eval rabbit_amqqueue:delete_crashed.

Is it a normal behaviour of the RMQ cluster that after a network failure occurs it is necessary to use manual operation to restore?

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send email to rabbitm...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.
AfterRecovery.PNG
BeforeRecovery.png
logs.zip

Michael Klishin

May 11, 2018, 10:55:03 AM
to rabbitm...@googlegroups.com
If it is easily reproduced then please provide the steps. There's nothing suspicious in the logs, only some node restarts (or termination for any other reason).


Ilya Maltsev

May 11, 2018, 1:33:27 PM
to rabbitmq-users

The cluster consists of three nodes, with one virtual host in which an HA policy is configured ({ha-mode:all}). The test program connects and creates 30 queues: 10 durable, 10 auto-delete and 10 regular, and puts one message in each queue. Then I simulate a network partition between the nodes by simply closing the socket. That is all.
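The policy is applied with something like this (a sketch; the policy name is arbitrary, and the quoting is cmd.exe style since the nodes run on Windows):

rabbitmqctl set_policy -p TestHost ha-all "^ha\." "{""ha-mode"":""all""}"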


Maybe I can somehow help with the diagnosis; the cluster is currently in the same state I described in the previous message.


On Friday, May 11, 2018 at 17:55:03 UTC+3, Michael Klishin wrote:

Ilya Maltsev

May 15, 2018, 2:15:21 AM
to rabbitmq-users
Hi, have you been able to reproduce this case?

On Friday, May 11, 2018 at 20:33:27 UTC+3, Ilya Maltsev wrote:

Michael Klishin

May 17, 2018, 6:00:57 PM
to rabbitm...@googlegroups.com
We can reproduce. It's not specific to autoheal: restarting a 3.7.4 or 3.7.5 node has a similar effect.


Ilya Maltsev

May 18, 2018, 2:38:40 AM
to rabbitmq-users

Hmm... simply stopping the RMQ service or killing the RMQ process did not achieve the same effect. This problem occurs only when network failures are involved.

Do I understand correctly that the only way to recover is to restart the RMQ nodes one by one?

Is there some way to determine which node needs to be restarted to restore the cluster?


On Friday, May 18, 2018 at 1:00:57 UTC+3, Michael Klishin wrote:

Michael Klishin

May 18, 2018, 9:19:51 AM
to rabbitm...@googlegroups.com
We cannot know that since we don't know what the root cause is.

A queue's master node can be seen in the management UI or as the suffix of the "pid" column in

`rabbitmqctl list_queues name pid`
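For example, output along these lines (illustrative only; the queue-to-node mapping here is made up, with node names taken from your logs) means the master of the first queue lives on rabbit@otikk-cluster1:

ha.queue_durable_1    <rabbit@otikk-cluster1.3.1374.0>
ha.queue_durable_2    <rabbit@otikk-testboxr.3.2210.0>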


Michael Klishin

May 18, 2018, 11:20:17 AM
to rabbitm...@googlegroups.com

Is the Prometheus exporter involved in your case?

Ilya Maltsev

May 21, 2018, 2:25:01 AM
to rabbitmq-users

That case looks like mine, but I don't use the Prometheus exporter. And in my case the main problem is still the connection timeout errors on the stuck queues.


On Friday, May 18, 2018 at 18:20:17 UTC+3, Michael Klishin wrote:

Michael Klishin

May 23, 2018, 11:11:07 AM
to rabbitm...@googlegroups.com
We managed to reproduce a scenario where stats emission timers are not reinitialised, so
no stats are emitted.


Michael Klishin

May 23, 2018, 12:48:52 PM
to rabbitm...@googlegroups.com

Ilya Maltsev

May 24, 2018, 4:35:26 AM
to rabbitmq-users
Can this issue cause timeouts when I call QueueDeclare?

On Wednesday, May 23, 2018 at 19:48:52 UTC+3, Michael Klishin wrote:

Michael Klishin

May 24, 2018, 4:36:42 AM
to rabbitm...@googlegroups.com
I don't see how it would. It's merely a stats [non-]emission thing.


Ilya Maltsev

May 24, 2018, 4:57:03 AM
to rabbitmq-users

Well, earlier I wrote (in the third message from the top) that after the cluster recovered, I saw strange values in the statistics and got timeouts when calling QueueDeclare.

In the logs it looks like this:

2018-05-11 10:46:29.568 [error] <0.2850.0> Channel error on connection <0.2841.0> (10.81.129.71:14723 -> 10.81.139.207:5672, vhost: 'TestHost', user: 'guest'), channel 1:
operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'ha.queue_autoDeleted_1' in vhost 'TestHost' due to timeout

On Thursday, May 24, 2018 at 11:36:42 UTC+3, Michael Klishin wrote:

Michael Klishin

May 24, 2018, 5:14:37 AM
to rabbitm...@googlegroups.com
We use one thread for one topic on this list.

Timeouts are a fact of life in distributed systems. There is no evidence of a system issue in the log message.


Ilya Maltsev

May 24, 2018, 7:27:40 AM
to rabbitmq-users

Hmm, as I wrote above, I have one problem: after the cluster recovers, some queues hang for a long time, longer than a day, and they have to be deleted with special scripts (rabbit_amqqueue:delete_crashed). The symptoms I observe are two: strange values in the statistics, and timeouts.
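If it helps: such queues can usually be spotted from the CLI (a sketch; I am assuming the state column reports them as crashed on 3.7.x):

rabbitmqctl -p TestHost list_queues name state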



On Thursday, May 24, 2018 at 12:14:37 UTC+3, Michael Klishin wrote:

Michael Klishin

May 24, 2018, 9:21:13 AM
to rabbitm...@googlegroups.com
Hi Ilya,

We have recently identified an issue that affected queue stats emission; that was my understanding of the problem up to this point.
One of our team members will try to reproduce using the provided steps.

There is nothing suspicious in the logs provided except for a few lines:

2018-05-11 10:41:37.032 [warning] <0.1184.0> Received a 'DOWN' message from 'rabbit@otikk-cluster1' but still can communicate with it
2018-05-11 10:41:37.079 [warning] <0.1208.0> Received a 'DOWN' message from 'rabbit@otikk-testboxr' but still can communicate with it
2018-05-11 10:41:37.079 [error] <0.345.0> Partial partition disconnect from rabbit@otikk-cluster1

and there can be all kinds of issues that can lead to an asymmetric partial partition.

You are running a relatively old version of Erlang on Windows, which further complicates the investigation. So it may take some time.
You are welcome to inspect the rest of the logs, try deploying into a different environment and do your own research. All of RabbitMQ is open source.


Luke Bakken

May 31, 2018, 2:42:17 PM
to rabbitmq-users
Hello Ilya,

I am using RabbitMQ 3.7.5, Erlang 19.3 on an Arch Linux workstation to try to reproduce what you are seeing. I created the following repository that contains a README with details on how I set up my environment and ran my tests:


I don't see the issue that you report, but I may not be following exactly the same steps as you, and I'm not using Windows.

I think my next step will be to try using Windows. How exactly did you simulate your network partition? "just close the socket" doesn't tell me how. In the past I have used the Windows firewall to block traffic to/from the distributed Erlang port (25672).
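For reference, the firewall rules I mean look roughly like this (a sketch; the rule name is arbitrary):

netsh advfirewall firewall add rule name="block-erl-dist" dir=in action=block protocol=TCP localport=25672

and, to heal the partition afterwards:

netsh advfirewall firewall delete rule name="block-erl-dist"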

Thanks,
Luke

Ilya Maltsev

Jun 1, 2018, 3:41:49 AM
to rabbitmq-users

I simulate the network partition using CurrPorts (cports.exe); a screenshot is attached. Today I was able to reproduce the same situation on Windows Server with RabbitMQ 3.7.5 and Erlang 19.3. I will try to reproduce it with Erlang 20.3.


On Thursday, May 31, 2018 at 21:42:17 UTC+3, Luke Bakken wrote:
CurrPorts.png

Luke Bakken

Jun 1, 2018, 9:39:54 AM
to rabbitmq-users
Thanks Ilya, I will give that a try.

Ilya Maltsev

Jun 1, 2018, 9:46:13 AM
to rabbitmq-users

You're welcome. Luke, if I can do anything to help, please write. Right now I have a cluster in this state.


On Friday, June 1, 2018 at 16:39:54 UTC+3, Luke Bakken wrote:

Luke Bakken

Jun 1, 2018, 5:11:29 PM
to rabbitmq-users
Hi Ilya,

I tried to reproduce using three Windows 2012 R2 VMs (https://app.vagrantup.com/mwrock/boxes/Windows2012R2), RabbitMQ 3.7.5 and Erlang 20.3. I used 20.3 because that is the version that chocolatey defaults to (choco install erlang).

I used CurrPorts to kill the TCP connections to port 25672 and the autoheal function worked fine in all of my test cases. No messages were lost and the queue state looks correct in the management interface. If I re-run my script that declares queues and publishes messages, it works fine and the message count increases by 1 each time.

I'm assuming that the main difference between my environment and yours is that you are using servers connected via a real network, as opposed to the virtual network that VirtualBox provides. I don't see how this could make a difference, however.

Could you try testing in your environment using Erlang 20.3?

Thanks!
Luke

Ilya Maltsev

unread,
Jun 4, 2018, 2:01:09 AM6/4/18
to rabbitmq-users

Hi Luke,


Yes, I now have two real servers and one virtual. However, before that I tried everything on virtual servers.

I was able to reproduce the problem on Erlang 20.3 as well; see the screenshot.

After this, the client cannot complete QueueDeclare because it receives a timeout error.

2018-06-04 08:31:19.982 [error] <0.14396.2> Channel error on connection <0.14387.2> (10.81.129.71:23660 -> 10.81.128.57:5672, vhost: 'TestHost', user: 'guest'), channel 1: operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'ha.queue_autoDeleted_1' in vhost 'TestHost' due to timeout



On Saturday, June 2, 2018 at 0:11:29 UTC+3, Luke Bakken wrote:
AfterRecovery_1.PNG

Michael Klishin

Jun 4, 2018, 5:31:54 AM
to rabbitm...@googlegroups.com
The `ha.queue_autoDeleted_1` queue is not mirrored according to the screenshot. Plus it is, well, auto-delete.

Non-durable non-mirrored queues will move to a different node but it can take some time and any client operation during that
time window will fail with exactly that message.

A whole bunch of non-mirrored queues seem to be either uninitialised or to have emitted no stats at that particular moment in time.
Unfortunately, without a way to reproduce this, we cannot speculate as to why, or whether the condition is transient.
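Mirroring status can also be checked from the CLI (a sketch; these queue info items exist for classic mirrored queues in 3.7.x):

rabbitmqctl -p TestHost list_queues name policy pid slave_pids synchronised_slave_pids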


Ilya Maltsev

Jun 4, 2018, 6:14:37 AM
to rabbitmq-users
> The `ha.queue_autoDeleted_1` queue is not mirrored according to the screenshot. Plus it is, well, auto-delete.

Before the network partition all the queues were synchronised. The policy looks like this: "Pattern: ^ha\. Definition: {ha-mode:all,ha-sync-mode:automatic}".


> it can take some time and any client operation during that time window will fail with exactly that message.

I understand that. But it has been more than 4 hours since the network partition, and the client still receives errors.

2018-06-04 13:00:33.898 [error] <0.3140.2> Channel error on connection <0.3131.2> (10.81.129.71:43721 -> 10.81.129.184:5672, vhost: 'TestHost', user: 'guest'), channel 1: operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'ha.queue_autoDeleted_1' in vhost 'TestHost' due to timeout


> Unfortunately without a way to reproduce we cannot speculate as to why and whether the condition is transient.

I understand that too. So I ask: how else can I help?


On Monday, June 4, 2018 at 12:31:54 UTC+3, Michael Klishin wrote:

Luke Bakken

Jun 4, 2018, 9:21:45 AM
to rabbitmq-users
Hi Ilya,

This is the first time you've mentioned that you're using "ha-sync-mode: automatic". I did try with that setting and without it and still could not reproduce the issue.

At this point there's not much else I can suggest as I think I'm following your reproduction steps and have a similar environment. The only way for me to be 100% sure I'm following your steps would be for you to document everything from installation on clean servers to running your tests, or take a screen capture video of reproducing the issue.

I did completely disable the Windows firewall on all three of my virtual machines rather than try to open up ports. Do you have the firewall enabled?

Thanks -
Luke


On Monday, June 4, 2018 at 3:14:37 AM UTC-7, Ilya Maltsev wrote:


Luke Bakken

Jun 4, 2018, 4:50:21 PM
to rabbitmq-users
Hi Ilya,

Could you please share the code you're running to declare queues? Based on the logs you've provided I assume you're starting up an application that is running the entire time the partition takes place, and does some sort of automatic connection recovery and (perhaps) topology recovery.

Thanks -
Luke

Ilya Maltsev

Jun 5, 2018, 7:01:53 AM
to rabbitmq-users
Hi Luke.


The code is very simple; the video shows how I use it.

using System;
using RabbitMQ.Client;

namespace RMQTest
{
    class Program
    {
        static void Main(string[] args)
        {
            // Double the default connection timeout and enable
            // automatic connection and topology recovery.
            var cf = new ConnectionFactory();
            cf.RequestedConnectionTimeout *= 2;
            cf.AutomaticRecoveryEnabled = true;
            cf.TopologyRecoveryEnabled = true;
            cf.AuthMechanisms = new AuthMechanismFactory[] { new PlainMechanismFactory() };
            cf.UserName = "guest";
            cf.Password = "guest";
            cf.VirtualHost = "TestHost";

            // Each iteration declares an auto-delete, a durable and a regular
            // queue, and publishes one message to each (30 queues in total).
            for (var i = 1; i <= 10; i++)
            {
                using (var cn = cf.CreateConnection(new[] { "tfs-build1-ptg", "tfs-build2-ptg", "tfs-test02" }))
                using (var model = cn.CreateModel())
                {
                    var z = model.QueueDeclare($"ha.queue_autoDeleted_{i}", durable: false, exclusive: false, autoDelete: true);
                    model.BasicPublish("", z.QueueName, false, model.CreateBasicProperties(), new byte[] { 1, 2, 3 });

                    z = model.QueueDeclare($"ha.queue_durable_{i}", durable: true, exclusive: false, autoDelete: false);
                    model.BasicPublish("", z.QueueName, false, model.CreateBasicProperties(), new byte[] { 1, 2, 3 });

                    z = model.QueueDeclare($"ha.queue{i}", durable: false, exclusive: false, autoDelete: false);
                    model.BasicPublish("", z.QueueName, false, model.CreateBasicProperties(), new byte[] { 1, 2, 3 });
                }
                Console.WriteLine(i);
            }
            Console.WriteLine("Press any key");
            Console.ReadKey();
        }
    }
}



On Monday, June 4, 2018 at 23:50:21 UTC+3, Luke Bakken wrote:

Luke Bakken

Jun 5, 2018, 3:44:28 PM
to rabbitmq-users
Hi Ilya,

That video was helpful; it showed exactly how you are closing the ports. Unfortunately, even after using your program to create the queues and following your reproduction steps, I don't see the same issue in my 3-node Windows 2012 R2 environment. The autoheal process works correctly every time.

I don't have any other suggestions at this point, but I'll keep this in the back of my mind in case I think of something.

Thanks -
Luke