Hi,
I have a little question.
Sometimes, after a network partition is recovered by the autoheal mechanism, some HA queues remain stuck: in the Management plugin, their "messages ready", "messages unacked" and "messages total" columns show NaN. The only way I have found to get rid of such queues is a manual eval of rabbit_amqqueue:delete_crashed.
Is it normal behaviour for a RabbitMQ cluster that a manual operation is required to recover after a network failure?
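For reference, this is the kind of eval I mean. A sketch only: the vhost and queue name below are placeholders, and `rabbit_amqqueue:lookup/1` / `rabbit_misc:r/3` are internal APIs that may change between versions.

```shell
# Look up the crashed queue's record and delete it.
# Placeholders: vhost "/" and queue name "my-queue".
rabbitmqctl eval '
  {ok, Q} = rabbit_amqqueue:lookup(rabbit_misc:r(<<"/">>, queue, <<"my-queue">>)),
  rabbit_amqqueue:delete_crashed(Q).'
```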
--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
"NaN columns" is no evidence of "stuck queues"
"NaN columns" is no evidence of "stuck queues". It is evidence of missing stats for some queues, which temporarily can happenafter disconnection and in particular autoheal.There were a few reports of such behavior regardless of whether autoheal was used in the last year or so. I can't know if allof them were addressed since you haven't provided any logs or version details.
On Thu, May 10, 2018 at 3:22 AM, Ilya Maltsev <ivma...@gmail.com> wrote:
The cluster consists of three nodes, with one virtual host where an HA policy is configured ({ha-mode:all}). The test program connects and creates 30 queues: 10 durable, 10 auto-delete and 10 regular, and puts one message in each queue. Then I simulate a network partition between the nodes by simply closing the socket. That is all.
Perhaps I can help with diagnosing this somehow; the cluster is currently in the same state as I described in the previous message.
Hmm ... simply stopping the RMQ service or killing the RMQ process does not produce the same effect. The problem occurs only after certain network failures.
Do I understand correctly that the only way to recover is to restart the RMQ nodes one by one?
Is there some way to determine which node needs to be restarted to restore the cluster?
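Not an authoritative recipe, but one way to narrow it down from the CLI (assuming the vhost is `TestHost`, as elsewhere in this thread): the `state` column should report `crashed` or `down` for the affected queues, and the `pid` shows which node hosts each queue's master process.

```shell
# Show each queue's state and the Erlang pid (which encodes the hosting node).
rabbitmqctl list_queues -p TestHost name state pid
# Show running nodes and any recorded partitions.
rabbitmqctl cluster_status
```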
This case looks like mine, but I don't use the Prometheus exporter. And in my case the main problem is still the connection timeout errors on the stuck queues.
Well, as I wrote earlier (the third message from the top), after the cluster recovered I see strange values in the statistics and get timeouts when calling QueueDeclare.
In the logs it looks like this:
Hmm, as I wrote above, I have one problem: after the cluster recovers, some queues hang for a long time (longer than a day) and have to be deleted with a special eval (rabbit_amqqueue:delete_crashed). The symptoms I observe are strange values in the statistics and timeouts.
I simulate the network partition using CurrPorts (cports.exe); a screenshot is attached. Today I was able to reproduce the same situation on Windows Server with RMQ 3.7.5 and Erlang 19.3. I will try to reproduce it with Erlang 20.3.
You're welcome. Luke, if I can do anything to help, please write. The cluster is currently in this state.
Hi Luke,
Yes, I now have two physical servers and one virtual one. Before that, however, I tried everything on virtual servers.
I was able to reproduce the problem on Erlang 20.3 as well, see the screenshot.
After this, the client cannot call QueueDeclare because it receives a timeout error.
2018-06-04 08:31:19.982 [error] <0.14396.2> Channel error on connection <0.14387.2> (10.81.129.71:23660 -> 10.81.128.57:5672, vhost: 'TestHost', user: 'guest'), channel 1: operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'ha.queue_autoDeleted_1' in vhost 'TestHost' due to timeout
The `ha.queue_autoDeleted_1` queue is not mirrored according to the screenshot. Plus it is, well, auto-delete.
Before the network partition, all the queues were synchronized. The policy looks like this: “Pattern: ^ha\. Definition: {ha-mode:all,ha-sync-mode:automatic}”.
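For completeness, the equivalent CLI command to create that policy (a sketch; the policy name `ha-all` is arbitrary, and `TestHost` is the vhost used in this thread):

```shell
# Mirror all queues matching ^ha\. to all nodes, with automatic synchronization.
rabbitmqctl set_policy -p TestHost ha-all "^ha\." '{"ha-mode":"all","ha-sync-mode":"automatic"}'
```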
… it can take some time and any client operation during that time window will fail with exactly that message.
2018-06-04 13:00:33.898 [error] <0.3140.2> Channel error on connection <0.3131.2> (10.81.129.71:43721 -> 10.81.129.184:5672, vhost: 'TestHost', user: 'guest'), channel 1: operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'ha.queue_autoDeleted_1' in vhost 'TestHost' due to timeout
Unfortunately without a way to reproduce we cannot speculate as to why and whether the condition is transient.
I understand that too, which is why I am asking how else I can help.
using System;
using RabbitMQ.Client;

namespace RMQTest
{
    class Program
    {
        static void Main(string[] args)
        {
            var cf = new ConnectionFactory();
            // Double the default connection timeout and enable automatic
            // connection and topology recovery.
            cf.RequestedConnectionTimeout *= 2;
            cf.AutomaticRecoveryEnabled = true;
            cf.TopologyRecoveryEnabled = true;
            cf.AuthMechanisms = new AuthMechanismFactory[] { new PlainMechanismFactory() };
            cf.UserName = "guest";
            cf.Password = "guest";
            cf.VirtualHost = "TestHost";

            for (var i = 1; i <= 10; i++)
            {
                using (var cn = cf.CreateConnection(new[] { "tfs-build1-ptg", "tfs-build2-ptg", "tfs-test02" }))
                using (var model = cn.CreateModel())
                {
                    // Declare an auto-delete queue (durable: false, exclusive: false,
                    // autoDelete: true) and publish one message to it.
                    var z = model.QueueDeclare($"ha.queue_autoDeleted_{i}", false, false, true);
                    model.BasicPublish("", z.QueueName, false, model.CreateBasicProperties(), new byte[] { 1, 2, 3 });

                    // Same for a durable queue ...
                    z = model.QueueDeclare($"ha.queue_durable_{i}", true, false, false);
                    model.BasicPublish("", z.QueueName, false, model.CreateBasicProperties(), new byte[] { 1, 2, 3 });

                    // ... and a regular (non-durable, non-auto-delete) queue.
                    z = model.QueueDeclare($"ha.queue{i}", false, false, false);
                    model.BasicPublish("", z.QueueName, false, model.CreateBasicProperties(), new byte[] { 1, 2, 3 });
                }
                Console.WriteLine(i);
            }

            Console.WriteLine("Press any key");
            Console.ReadKey();
        }
    }
}