cluster crashes badmatch error misread old_state


Juraj Gospodarić

Apr 26, 2016, 9:56:24 AM
to rabbitmq-users
Hi,

we have three nodes configured in a cluster (RabbitMQ 3.5.5, Erlang 17.5.3, with keepalived in front, Linux 2.6.32-504.30.3.el6.x86_64); all queues are durable with persistent messages, and all three nodes are synced. What happens is that when one of the nodes dies, after the cluster has been running for a few days, the whole cluster starts to fall apart, beginning with a crash on one node (the "master", the one the virtual IP points to):

** Generic server <0.576.0> terminating
** Last message in was {'$gen_cast',{ack,[2433590],<0.14113.6387>}}
** When Server state == {q,
......

** Reason for termination ==
** {{badmatch,
        {error,
            {misread,
                [{old_state,
                     {client_msstate,msg_store_persistent,


After that, the two other nodes start to sync, which causes another node to crash, leaving only one node running.

The second node dies during the sync with:

** Last message in was sync_mirrors
** When Server state == {q,
                         {amqqueue,
                          {resource,<<"/">>,queue

...

** Reason for termination ==
** {{badmatch,
        {error,
            {misread,
                [{old_state,
                     {client_msstate,msg_store_persistent,

Our usual procedure for bringing the cluster back up is to turn the crashed nodes back on one by one. What we noticed is that this sometimes results in message loss.
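(Spelled out, that recovery looks roughly like the sequence below; a sketch only, assuming the stock init script and rabbitmqctl, with the queue name used purely as an example. The explicit sync step should not be needed with ha-sync-mode set to automatic.)

  # on each crashed node, one at a time
  service rabbitmq-server start
  rabbitmqctl cluster_status

  # wait until mirrors have caught up before starting the next node
  rabbitmqctl list_queues name messages synchronised_slave_pids

  # if a mirror stays unsynchronised, a sync can be triggered by hand
  rabbitmqctl sync_queue production_status_queue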

This happens whether we have 1 M or 4 M messages in the queue. The only thing we noticed is that it does not happen when the queues are empty, or when there have been no messages in the queue for a longer period (days). We also observed that when messages have sat in the queue for a while and consumers then start fetching them, so the queue depth starts to drop, we usually see the behavior described above.

Do you have any similar experience?

Thanks in advance for your help.

BR, Jura.




Juraj Gospodarić

Apr 26, 2016, 10:15:01 AM
to rabbitmq-users
Also, when bringing a node back up, during the sync there is this error:

=CRASH REPORT==== **-***-****::15:56:44 ===
  crasher:
    initial call: gen:init_it/6
    pid: <0.1595.0>
    registered_name: []
    exception exit: {{badmatch,-1},
                     [{rabbit_mirror_queue_slave,update_delta,2,[]},
                      {rabbit_mirror_queue_slave,process_instruction,2,[]},
                      {rabbit_mirror_queue_slave,handle_cast,2,[]},
                      {gen_server2,handle_msg,2,[]},
                      {proc_lib,wake_up,3,
                                [{file,"proc_lib.erl"},{line,247}]}]}
      in function  gen_server2:terminate/3
    ancestors: [<0.1594.0>,rabbit_amqqueue_sup_sup,rabbit_sup,<0.129.0>]
    messages: [{'$gen_cast',
                      {gm,{ack,
                              [<<116,27,205,85,192,146,183,208,45,94,245,
                                 19,196,94,184,184>>]}}},
                  {'$gen_cast',{gm,{drop,136036,1,true}}},

....

  trap_exit: true
    status: running
    heap_size: 1727361
    stack_size: 27
    reductions: 56519255
  neighbours:
    neighbour: [{pid,<0.1600.0>},
                  {registered_name,[]},
                  {initial_call,
                      {gen,init_it,
                          ['Argument__1','Argument__2','Argument__3',
                           'Argument__4','Argument__5','Argument__6']}},
                  {current_function,{gen_server2,process_next_msg,1}},
                  {ancestors,
                      [<0.1595.0>,<0.1594.0>,rabbit_amqqueue_sup_sup,
                       rabbit_sup,<0.129.0>]},
                  {messages,[]},
                  {links,[<0.1595.0>]},
                  {dictionary,
                      [{random_seed,{7850,24353,14826}},
                       {process_name,
                           {gm,{resource,<<"/">>,queue,
                                   <<"production_status_queue">>}}}]},
                  {trap_exit,false},
                  {status,waiting},
                  {heap_size,1598},
                  {stack_size,7},
                  {reductions,7282821}]

=SUPERVISOR REPORT==== **-***-****::15:56:44 ===
     Supervisor: {<0.1594.0>,rabbit_amqqueue_sup}
     Context:    child_terminated
     Reason:     {{badmatch,-1},
                  [{rabbit_mirror_queue_slave,update_delta,2,[]},
                   {rabbit_mirror_queue_slave,process_instruction,2,[]},
                   {rabbit_mirror_queue_slave,handle_cast,2,[]},
                   {gen_server2,handle_msg,2,[]},
                   {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,247}]}]}
     Offender:   [{pid,<0.1595.0>},
                  {name,rabbit_amqqueue},
                  {mfargs,
                      {rabbit_prequeue,start_link,
                          [{amqqueue,
                               {resource,<<"/">>,queue,
                                   <<"production_status_queue">>},
                               true,false,none,[],<3067.21582.863>,[],[],[],
                               [{vhost,<<"/">>},
                                {name,<<"ha-all">>},
                                {pattern,<<".">>},
                                {'apply-to',<<"all">>},
                                {definition,
                                    [{<<"ha-mode">>,<<"all">>},
                                     {<<"ha-sync-mode">>,<<"automatic">>}]},
                                {priority,0}],
                               [{<3067.16757.866>,<3067.21582.863>}],
                               [],live},
                           slave,<0.1593.0>]}},
                  {restart_type,intrinsic},
                  {shutdown,4294967295},
                  {child_type,worker}]



=CRASH REPORT==== **-***-****::16:05:12 ===
  crasher:
    initial call: gen:init_it/6
    pid: <0.21670.8>
    registered_name: []
    exception exit: {{badmatch,-1},
                     [{rabbit_mirror_queue_slave,update_delta,2,[]},
                      {rabbit_mirror_queue_slave,process_instruction,2,[]},
                      {rabbit_mirror_queue_slave,handle_cast,2,[]},
                      {gen_server2,handle_msg,2,[]},
                      {proc_lib,init_p_do_apply,3,
                                [{file,"proc_lib.erl"},{line,237}]}]}
      in function  gen_server2:terminate/3
    ancestors: [<0.1594.0>,rabbit_amqqueue_sup_sup,rabbit_sup,<0.129.0>]
    messages: [{'$gen_cast',{gm,{drop,135954,1,true}}},

....

=SUPERVISOR REPORT==== **-***-****::16:05:12 ===
     Supervisor: {<0.1594.0>,rabbit_amqqueue_sup}
     Context:    child_terminated
     Reason:     {{badmatch,-1},
                  [{rabbit_mirror_queue_slave,update_delta,2,[]},
                   {rabbit_mirror_queue_slave,process_instruction,2,[]},
                   {rabbit_mirror_queue_slave,handle_cast,2,[]},
                   {gen_server2,handle_msg,2,[]},
                   {proc_lib,init_p_do_apply,3,
                             [{file,"proc_lib.erl"},{line,237}]}]}
     Offender:   [{pid,<0.21670.8>},
                  {name,rabbit_amqqueue},
                  {mfargs,
                      {rabbit_prequeue,start_link,
                          [{amqqueue,
                               {resource,<<"/">>,queue,
                                   <<"production_status_queue">>},
                               true,false,none,[],<3067.21582.863>,[],[],[],
                               [{vhost,<<"/">>},
                                {name,<<"ha-all">>},
                                {pattern,<<".">>},
                                {'apply-to',<<"all">>},
                                {definition,
                                    [{<<"ha-mode">>,<<"all">>},
                                     {<<"ha-sync-mode">>,<<"automatic">>}]},
                                {priority,0}],
                               [{<3067.16757.866>,<3067.21582.863>}],
                               [],live},
                           slave,<0.1593.0>]}},
                  {restart_type,intrinsic},
                  {shutdown,4294967295},
                  {child_type,worker}]
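
(For reference, the ha-all policy embedded in the amqqueue record above, i.e. ha-mode: all, ha-sync-mode: automatic, pattern "." applied to all on the "/" vhost, would have been set with something along these lines; a sketch of the usual rabbitmqctl invocation:)

  rabbitmqctl set_policy ha-all "." \
    '{"ha-mode":"all","ha-sync-mode":"automatic"}' \
    --apply-to all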

Michael Klishin

Apr 26, 2016, 10:27:23 AM
to rabbitm...@googlegroups.com
The exception comes from the message store. I don't remember seeing this particular exception,
but if you need to store millions of messages without consumers online for a while, the right thing to do is to upgrade to 3.6.1
(or 3.6.2 once it is out) and use lazy queues [1].
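
On 3.6.x, lazy mode is enabled per queue via a policy. Since only one policy applies to a given queue, the queue-mode key would go into the same definition as the existing ha-* keys; roughly, as a sketch that reuses the policy name and pattern from the reports above:

  rabbitmqctl set_policy ha-all "." \
    '{"ha-mode":"all","ha-sync-mode":"automatic","queue-mode":"lazy"}' \
    --apply-to all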

If you can't move to 3.6, then upgrade to at least 3.5.7.





--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Michael Klishin

Apr 26, 2016, 10:28:52 AM
to rabbitm...@googlegroups.com
The second one is pretty much expected after the original one. It only affects a single queue [mirror],
while the message store is node-wide.


Juraj Gospodarić

Apr 26, 2016, 10:49:50 AM
to rabbitmq-users
Thanks MK, yes, we have seen lazy queues - a nice feature that we look forward to using, but for now we need to stay on a 3.5.x version.

Maybe I described it wrong: what happens is that we have production bursts where messages come in faster than our consumers can consume them, so messages are always being consumed from the queue, just at a slower rate. Would an upgrade to 3.5.7 possibly fix this behavior?

BR, Jura.

Michael Klishin

Apr 26, 2016, 10:55:22 AM
to rabbitm...@googlegroups.com
There is no evidence that this exception has anything to do with message rates. 3.5.6 and 3.5.7 contain
enough bug fixes and improvements in paging messages out to disk that I wouldn't think twice
about upgrading if I were you:

That said, ultimately this kind of workload is why lazy queues were introduced.

Juraj Gospodarić

Apr 26, 2016, 11:00:24 AM
to rabbitmq-users
OK, thanks Michael, I will let you know whether an upgrade, and which one, solved this behavior.

Thanks once more for the prompt help!

BR, Jura.