RabbitMQ crash and freeze


Ilya Maltsev

Mar 27, 2018, 9:22:20 AM
to rabbitmq-users
Hi,
From time to time I see errors in the logs of our RabbitMQ servers, such as the lines below. The worst part is that the broker then appears to freeze, and the service has to be restarted. What could be the problem?

2018-03-23 18:43:15.675 [error] <0.22463.0> ** Generic server <0.22463.0> terminating
** Last message in was 
{'$gen_cast',
{method,
{'basic.publish',0,<<"Exchange">>,<<"routing.key">>,false,false},
{content,60,none,<<128,24,1,49,42,111,100,117,99,122,92,115,118,99,45,97,105,112,45,111,100,117,99,122,59,119,115,45,114,111,100,99,104,101,110,107,111,118,46,111,100,117,99,122,46,115,111,10,77,97,108,83,101,114,118,105,99,101>>,rabbit_framing_amqp_0_9_1,[<<80,252,255,95,0,0,0,0,176,3,0,0,0,0,0,0,0,0,0,96,0,0,0,0>>]
},flow
}
}
** When Server state == 
{ch,running,rabbit_framing_amqp_0_9_1,1,<0.22442.0>,<0.22461.0>,<0.22442.0>,<<"1.1.1.217:63246 -> 1.1.1.85:5672">>,
{lstate,<0.22462.0>,false},none,1,{[],[]},
{user,<<"user">>,[],[{rabbit_auth_backend_internal,none}]},<<"vhost">>,<<>>,#{},
{state,{dict,0,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},erlang},#{},#{},
{set,0,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},<0.22455.0>,
{state,fine,5000,undefined},false,1,{{0,nil},{0,nil}},[],[],{{0,nil},{0,nil}},
[{<<"publisher_confirms">>,bool,true},
{<<"exchange_exchange_bindings">>,bool,true},
{<<"basic.nack">>,bool,true},
{<<"consumer_cancel_notify">>,bool,true},
{<<"connection.blocked">>,bool,true},
{<<"authentication_failure_close">>,bool,true}],
{exchange,{resource,<<"vhost">>,exchange,<<"amq.rabbitmq.trace">>},topic,true,false,true,[],undefined,undefined,undefined,{[],[]},#{user => <<"user2">>}},0,none,flow,[]}
** Reason for termination == 
** 
{{normal,{gen_server,call,[<0.22442.0>,
{info,[amqp_params]},15000]}},
[{gen_server,call,3,[{file,"gen_server.erl"},{line,212}]},
{rabbit_channel,check_topic_authorisation,5,
[{file,"src/rabbit_channel.erl"},{line,855}]},
{rabbit_channel,handle_method,3,
[{file,"src/rabbit_channel.erl"},{line,1081}]},
{rabbit_channel,handle_cast,2,[{file,"src/rabbit_channel.erl"},{line,523}]},
{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1047}]},
{proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,257}]}]}

Server info: Windows Server, RabbitMQ 3.7.3.

Luke Bakken

Mar 27, 2018, 9:55:02 AM
to rabbitmq-users
Hi Ilya

In a previous message you said you are using Erlang 19.3, is that true?

What version of Windows Server?

Can you provide the complete logs instead of just one error from them?

When you say "freezing" does this mean that your applications can't publish or consume messages? What does "freezing" mean?

Thanks,
Luke

Ilya Maltsev

Mar 28, 2018, 3:31:20 AM
to rabbitmq-users

> In a previous message you said you are using Erlang 19.3, is that true?
Yes.

> What version of Windows Server?
Windows Server 2012

> Can you provide the complete logs instead of just one error from them?
Log files from node A and node B are attached.

> When you say "freezing" does this mean that your applications can't publish or consume messages? What does "freezing" mean?
Yes, some clients stop receiving messages, and some get errors like this:
Unhandled error System.TimeoutException: The operation has timed out.
   at RabbitMQ.Util.BlockingCell.GetValue(TimeSpan timeout)
   at RabbitMQ.Client.Impl.SimpleBlockingRpcContinuation.GetReply(TimeSpan timeout)
   at RabbitMQ.Client.Impl.ModelBase.QueueDeclare(String queue, Boolean passive, Boolean durable, Boolean exclusive, Boolean autoDelete, IDictionary`2 arguments)
   at RabbitMQ.Client.Impl.AutorecoveringModel.QueueDeclare(String queue, Boolean durable, Boolean exclusive, Boolean autoDelete, IDictionary`2 arguments)

On Tuesday, March 27, 2018 at 16:55:02 UTC+3, Luke Bakken wrote:
NodeA.log
NodeB.log

Michael Klishin

Mar 28, 2018, 7:33:46 AM
to rabbitm...@googlegroups.com
The C# client error says that a queue.declare operation timed out.

In NodeA.log we immediately find a few red flags:

> {inconsistent_database, running_partitioned_network, 'rabbit@nodeB'}

which is covered in http://www.rabbitmq.com/partitions.html

> Mirrored queue 'ha.queue4.Commands' in vhost 'mainVhost': Stopping all nodes on master shutdown since no synchronised slave is available

which is a scenario that has a dedicated section: http://www.rabbitmq.com/ha.html#unsynchronised-mirrors.

In NodeB.log we see

> 2018-03-23 18:43:15.325 [error] <0.808.0> ** Node 'rabbit@nodeA' not responding **

as well as

> 2018-03-23 18:58:45.227 [error] <0.6376.1> Channel error on connection <0.6367.1> (10.1.1.3:61788 -> 10.1.1.1:5672, vhost: 'mainVhost', user: 'user3'), channel 1:
> operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'ha.queue4.Commands' in vhost 'mainVhost' due to timeout

So it's pretty clear what had happened in your system:

 * NodeA lost connectivity to NodeB or the link experienced a severe slowdown
 * Some queues did not have an in-sync mirror to promote to new master
 * Clients connected to NodeB that were consuming or otherwise using a queue with master hosted on NodeA observed that their operations timed out (NodeB logs as much)

There is no evidence of a "RabbitMQ crash" or "freeze" otherwise. What's called "crashes" in the logs are unhandled exceptions
that are always logged.
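
The red flags quoted above can also be searched for mechanically. Below is a minimal sketch in Python, assuming log lines worded like the excerpts in this message. The function name `find_red_flags` and the labels are hypothetical, and the patterns are taken only from those excerpts, so other RabbitMQ versions may word these messages differently.

```python
import re
from collections import defaultdict

# Patterns copied from the log excerpts quoted in this thread.
# The labels (dictionary keys) are hypothetical names chosen for this sketch.
RED_FLAGS = {
    "partition": re.compile(
        r"inconsistent_database,\s*running_partitioned_network"
    ),
    "no_synchronised_mirror": re.compile(
        r"no synchronised slave is available"
    ),
    "peer_not_responding": re.compile(r"Node \S+ not responding"),
    "queue_operation_timeout": re.compile(
        r"failed to perform operation on queue .* due to timeout"
    ),
}


def find_red_flags(lines):
    """Return {label: [matching lines]} for each red flag found in `lines`."""
    hits = defaultdict(list)
    for line in lines:
        for label, pattern in RED_FLAGS.items():
            if pattern.search(line):
                hits[label].append(line.rstrip("\n"))
    return dict(hits)
```

It could be run over each node's log, for example `find_red_flags(open("NodeA.log", encoding="utf-8"))`, and the labels found on each node compared side by side.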



--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Ilya Maltsev

Mar 28, 2018, 9:04:33 AM
to rabbitmq-users
But nodeA didn't restart after autoheal.
The log contains the following record:
2018-03-23 18:43:23.639 [warning] <0.383.0> Autoheal: we were selected to restart; winner is 'rabbit@nodeB'.
Normal operation was restored only after I restarted the RabbitMQ service manually.

Michael Klishin

Mar 28, 2018, 4:30:29 PM
to rabbitm...@googlegroups.com
Ilya,

We cannot suggest much without having full logs. If you have a way to reproduce the issue at least some of the time, we’d be interested in hearing about it.

MK
On Wednesday, March 28, 2018 at 14:33:46 UTC+3, Michael Klishin wrote:

Ilya Maltsev

Mar 29, 2018, 1:25:21 AM
to rabbitmq-users
Unfortunately, these are all the logs there are from nodeA. If I get more information, I will post it.

Luke Bakken

Feb 21, 2019, 7:45:36 PM
to rabbitmq-users
Hello Ilya,

Please test the upcoming RabbitMQ 3.7.13 release candidates which contain a fix for the issue you report:


If you are using topic exchanges you are probably experiencing this issue.
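
One way to check whether a node's log contains the same kind of crash report as the one that opened this thread is to look for the `rabbit_channel:check_topic_authorisation` stack frame. The sketch below is in Python and the helper name `mentions_topic_authorisation` is hypothetical. Whether that frame is a reliable signature of the issue fixed in 3.7.13 is an assumption; the thread only says the fix is relevant to topic exchanges.

```python
import re

# Matches an Erlang stack frame written as
#   {rabbit_channel,check_topic_authorisation,5,
# which is how it appears in the crash report at the top of this thread.
FRAME = re.compile(r"\{rabbit_channel,\s*check_topic_authorisation,\s*\d+,")


def mentions_topic_authorisation(log_text):
    """Return True if log_text contains a check_topic_authorisation frame."""
    return FRAME.search(log_text) is not None
```

A match only shows that a channel terminated while in that function; it does not by itself confirm the cause.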

Thanks,
Luke

On Tuesday, March 27, 2018 at 6:22:20 AM UTC-7, Ilya Maltsev wrote:

Luke Bakken

Feb 22, 2019, 11:16:10 AM
to rabbitmq-users
FTR, a beta build is available from [1] and [2]. Note that the apt repo also includes up to 3 most recent alpha builds, so version pinning is probably a good idea.

seabir...@gmail.com

Mar 15, 2019, 1:42:01 AM
to rabbitmq-users
How can the queue be recovered when the following is logged?

2019-03-08 07:33:50.700 [warning] <0.2528.0> Mirrored queue 'vitrage.evaluator.info' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
2019-03-08 08:07:38.637 [error] <0.4884.5> Channel error on connection <0.4875.5> (10.10.10.4:37746 -> 10.10.10.14:5673, vhost: '/', user: 'cbnfvormq'), channel 1:
operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'vitrage.evaluator.info' in vhost '/' due to timeout

Should I manually sync the queue or restart the service? The queue can't be declared now.

On Saturday, February 23, 2019 at 12:16:10 AM UTC+8, Luke Bakken wrote:

Michael Klishin

Mar 15, 2019, 3:23:18 AM
to rabbitmq-users
Please start new threads for new questions.

If you can restart the node hosting the queue master, try it. It could be one of the scenarios with non-mirrored
queues described in [1], but that would report a different channel exception.



seabir...@gmail.com

Mar 18, 2019, 4:05:38 AM
to rabbitmq-users
Thanks!
I have created a new thread for this.

On Friday, March 15, 2019 at 3:23:18 PM UTC+8, Michael Klishin wrote:

Michael Klishin

Mar 18, 2019, 10:35:17 AM
to rabbitmq-users
FTR, a queue in said thread has no eligible (in sync) mirrors for promotion [1].
So it doesn't freeze, it has no master and thus no activity. [1] describes the range of available options.
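
Queues in this state can be spotted before a master goes down. The Python sketch below assumes a list of queue descriptions already parsed from JSON, for example from the management plugin's queue listing. The field names `slave_nodes` and `synchronised_slave_nodes` are assumptions about that output and should be verified against the RabbitMQ version in use; the function name is hypothetical.

```python
def queues_without_synced_mirror(queues):
    """Return (vhost, name) for mirrored queues that have no in-sync mirror.

    `queues` is a list of dicts. The keys read here ("slave_nodes",
    "synchronised_slave_nodes", "vhost", "name") are assumed field names.
    """
    at_risk = []
    for queue in queues:
        mirrors = queue.get("slave_nodes") or []
        synced = queue.get("synchronised_slave_nodes") or []
        # Mirrored, but no mirror is eligible for promotion.
        if mirrors and not synced:
            at_risk.append((queue.get("vhost"), queue.get("name")))
    return at_risk
```

Non-mirrored queues are skipped on purpose, since they have no mirror to promote in the first place.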
