Queue state down after network partition crash


Isman Hakim

May 12, 2020, 10:41:17 PM
to rabbitmq-users


Hi All,

We have been running RabbitMQ on 3 clustered nodes, all on version 3.7.8 with Erlang 21.1, on Linux VMs. Some of our queues use the default (no mirroring), and the cluster uses the autoheal strategy for handling partitions.
All nodes started and messaging ran normally. After the VMs were upgraded, some instances, including the RabbitMQ node servers, had network issues while the VM backup was running.
We then ran into a network partition. From the crash logs:

<< coremqprd4 >>
2020-05-11 18:04:05 =ERROR REPORT====
Mnesia(rabbit@coremqprd4): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit@coremqprd5}

<< coremqprd5 >>
2020-05-11 18:04:18 =ERROR REPORT====
** Node rabbit@coremqprd4 not responding **
** Removing (timedout) connection **
2020-05-11 18:04:55 =ERROR REPORT====
Mnesia(rabbit@coremqprd5): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit@coremqprd6}
2020-05-11 18:04:55 =ERROR REPORT====
** gen_event handler lager_exchange_backend crashed.
** Was installed in lager_event
** Last event was: {log,{lager_msg,[],[{pid,<0.42.0>}],info,{["2020",45,"05",45,"11"],["18",58,"04",58,"55",46,"616"]},{1589,195095,616397},[65,112,112,108,105,99,97,116,105,111,110,32,"mnesia",32,101,120,105,116,101,100,32,119,105,116,104,32,114,101,97,115,111,110,58,32,"stopped"]}}
** When handler state == {state,{mask,127},lager_default_formatter,[date," ",time," ",color,"[",severity,"] ",{pid,[]}," ",message,"\n"],-573485176,{resource,<<"/">>,exchange,<<"amq.rabbitmq.log">>}}
** Reason == {badarg,[{rabbit_misc,dirty_read,1,[]},{rabbit_basic,publish,1,[]}]}
2020-05-11 18:04:55 =ERROR REPORT====
Mnesia(rabbit@coremqprd5): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, rabbit@coremqprd4}
2020-05-11 18:04:55 =ERROR REPORT====
Mnesia(rabbit@coremqprd5): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, rabbit@coremqprd6}

<< coremqprd6 >>
2020-05-11 18:04:47 =SUPERVISOR REPORT====
     Supervisor: {<0.30081.13>,rabbit_channel_sup}
     Context:    shutdown_error
     Reason:     noproc
     Offender:   [{pid,<0.30082.13>},{name,writer},{mfargs,{rabbit_writer,start_link,[#Port<0.439721>,1,4096,rabbit_framing_amqp_0_9_1,<0.30075.13>,{<<"172.16.3.199:24530 -> 172.16.3.214:5672">>,1},true]}},{restart_type,intrinsic},{shutdown,70000},{child_type,worker}]

2020-05-11 18:04:49 =ERROR REPORT====
Mnesia(rabbit@coremqprd6): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit@coremqprd5}
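
For anyone trying to work out which node saw which peer as partitioned, here is a minimal Python sketch that pulls the `inconsistent_database` reports out of crash-log text like the excerpts above. The function names (`partition_events`, `peers_by_reporter`) and the `SAMPLE_LOG` constant are made up for illustration, and the regular expression only assumes the line format shown in these excerpts; other RabbitMQ or Erlang versions may word the report differently.

```python
import re
from collections import defaultdict

# Matches Mnesia partition reports in the format seen in the crash logs above.
# Assumption: one report per line, worded exactly as in this thread.
PARTITION_RE = re.compile(
    r"Mnesia\((?P<reporter>[^)]+)\): \*\* ERROR \*\* mnesia_event got "
    r"\{inconsistent_database, (?P<context>\w+), (?P<peer>[^}\s]+)\}"
)

# Two lines copied from the logs in this thread, used as sample input.
SAMPLE_LOG = """\
Mnesia(rabbit@coremqprd4): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit@coremqprd5}
Mnesia(rabbit@coremqprd5): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, rabbit@coremqprd4}
"""


def partition_events(log_text):
    """Return a (reporter, context, peer) tuple for each partition report found."""
    return [
        (m["reporter"], m["context"], m["peer"])
        for m in PARTITION_RE.finditer(log_text)
    ]


def peers_by_reporter(events):
    """Group the peers each reporting node flagged as partitioned."""
    seen = defaultdict(set)
    for reporter, _context, peer in events:
        seen[reporter].add(peer)
    return {reporter: sorted(peers) for reporter, peers in seen.items()}


if __name__ == "__main__":
    for reporter, peers in peers_by_reporter(partition_events(SAMPLE_LOG)).items():
        print(reporter, "reported a partition with", ", ".join(peers))
```

Running it over the full crash log from each node gives a quick view of which pairs of nodes disagreed, which can help line the partition up with the backup window.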

The next problem is that, since the network partition event, queues intermittently show state "down". The interval is quite long: it happened the next day, when many applications were accessing the queues.


state_down.png

queue_details.png


What causes a queue without mirroring (not synchronised with other nodes) to be in the down state?


Thanks & Regards,
Isman Hakim

Michael Klishin

May 14, 2020, 11:17:28 AM
to rabbitmq-users
[1] explains how non-mirrored classic queues behave in the presence of node failures.

The exception says that an internal table read fails for that queue. Restarting the node might help.

As a general recommendation, please upgrade to RabbitMQ 3.7.26 (will require Erlang 21.3) or 3.8.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ