Server and queues having issues when upgrading to 3.10.6

108 views
Skip to first unread message

Michael Solberg

unread,
Aug 24, 2022, 2:35:54 PM8/24/22
to rabbitmq-users
I've been upgrading our rabbitmq servers from 3.8.11 to 3.10.6 and expierenced a similar issue twice during the upgrades.

For the upgrade, we're taking the following hops to keep rabbit happy with versions + erling versions

3.8.11
3.8.15
3.9.14
3.10.6

We've done this in 5 different environments. 

In 3 of the 5 environments the upgrade has been very smooth. No issues, consumers and connections come back without restarts and everyhing is good.

In 2 of the 5 so far, we see consumers drop and never come back. Then when we're finished upgrading and restarting services to bring back consumers, something happens and nodes start restarting. 

Still working on that something happens and trying to replicate with perf test, but I'll put the crash log below. Has anyone seen anything similar?

Another thing to note, we seem to see this when we're upgrading environments with a large number of messages on queues. These are in dlqs and just sitting.

Crash log:

2022-08-03 18:48:49.529671+00:00 [warning] <0.516.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": rebuilding indices from scratch
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> ** Generic server <0.374.0> terminating
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> ** Last message in was {'$gen_cast',
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0>                            {submit_async,
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0>                                #Fun<rabbit_classic_queue_index_v2.11.133535598>,
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0>                                <0.369.0>}}
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> ** When Server state == undefined
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> ** Reason for termination ==
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> ** {function_clause,
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0>        [{rabbit_queue_index,journal_minus_segment1,
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0>             [{no_pub,no_del,ack},
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0>              {{true,
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0>                   <<86,29,26,171,113,108,212,249,46,163,253,85,120,251,207,21,
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0>                     0,0,0,0,0,0,0,0,0,0,220,99>>,
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0>                   <<>>},
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0>               no_del,no_ack}],
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0>             [{file,"rabbit_queue_index.erl"},{line,1390}]},
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0>         {rabbit_queue_index,'-journal_minus_segment/3-fun-0-',4,
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0>             [{file,"rabbit_queue_index.erl"},{line,1367}]},
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0>         {array,sparse_foldl_3,7,[{file,"array.erl"},{line,1684}]},
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0>         {array,sparse_foldl_2,9,[{file,"array.erl"},{line,1678}]},
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0>         {rabbit_queue_index,'-recover_journal/1-fun-0-',1,
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0>             [{file,"rabbit_queue_index.erl"},{line,1054}]},
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0>         {lists,map,2,[{file,"lists.erl"},{line,1243}]},
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0>         {rabbit_queue_index,segment_map,2,
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0>             [{file,"rabbit_queue_index.erl"},{line,1178}]},
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0>         {rabbit_queue_index,recover_journal,1,
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0>             [{file,"rabbit_queue_index.erl"},{line,1045}]}]}
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0>
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>   crasher:
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>     initial call: worker_pool_worker:init/1
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>     pid: <0.374.0>
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>     registered_name: []
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>     exception exit: {function_clause,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                         [{rabbit_queue_index,journal_minus_segment1,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                              [{no_pub,no_del,ack},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                               {{true,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                                    <<86,29,26,171,113,108,212,249,46,163,253,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                                      85,120,251,207,21,0,0,0,0,0,0,0,0,0,0,220,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                                      99>>,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                                    <<>>},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                                no_del,no_ack}],
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                              [{file,"rabbit_queue_index.erl"},{line,1390}]},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                          {rabbit_queue_index,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                              '-journal_minus_segment/3-fun-0-',4,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                              [{file,"rabbit_queue_index.erl"},{line,1367}]},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                          {array,sparse_foldl_3,7,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                              [{file,"array.erl"},{line,1684}]},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                          {array,sparse_foldl_2,9,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                              [{file,"array.erl"},{line,1678}]},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                          {rabbit_queue_index,'-recover_journal/1-fun-0-',1,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                              [{file,"rabbit_queue_index.erl"},{line,1054}]},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                          {lists,map,2,[{file,"lists.erl"},{line,1243}]},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                          {rabbit_queue_index,segment_map,2,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                              [{file,"rabbit_queue_index.erl"},{line,1178}]},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                          {rabbit_queue_index,recover_journal,1,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                              [{file,"rabbit_queue_index.erl"},{line,1045}]}]}
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>       in function  gen_server2:terminate/3 (gen_server2.erl, line 1183)
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>     ancestors: [worker_pool_sup,rabbit_sup,<0.221.0>]
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>     message_queue_len: 0
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>     messages: []
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>     links: [<0.368.0>]
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>     dictionary: [{worker_pool_worker,true},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                   {{#Ref<0.817629484.2394947586.238686>,fhc_handle},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                    {handle,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                        {file_descriptor,prim_file,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                            #{handle => #Ref<0.817629484.2395078677.233764>,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                              owner => <0.374.0>,r_ahead_size => 0,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                              r_buffer => #Ref<0.817629484.2395078658.238688>}},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                        #Ref<0.817629484.2394947586.238686>,11752,false,0,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                        infinity,[],<<>>,0,0,0,0,0,false,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                        "/bitnami/rabbitmq/mnesia/rab...@rabbit-rabbitmq-0.rabbit-rabbitmq-headless.devciny-default.svc.cluster.local/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/queues/7AFI7AV4T4MB7Y178Z0P1L8A9/journal.jif",
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                        [write,binary,raw,read],
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                        [{write_buffer,infinity}],
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                        true,true,-576460742998785996}},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                   {rand_seed,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                       {#{jump => #Fun<rand.3.92093067>,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                          max => 288230376151711743,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                          next => #Fun<rand.5.92093067>,type => exsplus},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                        [155047331894241093|277277552724105423]}},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                   {segment_entry_count,16384},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                   {fhc_age_tree,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                       {1,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                        {{-576460742998785996,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                          #Ref<0.817629484.2394947586.238686>},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                         true,nil,nil}}},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                   {{"/bitnami/rabbitmq/mnesia/rab...@rabbit-rabbitmq-0.rabbit-rabbitmq-headless.devciny-default.svc.cluster.local/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/queues/7AFI7AV4T4MB7Y178Z0P1L8A9/journal.jif",
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                     fhc_file},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                    {file,1,true}},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>                   {worker_pool_name,worker_pool}]
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>     trap_exit: false
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>     status: running
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>     heap_size: 10958
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>     stack_size: 29
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>     reductions: 871502
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>   neighbours:
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0>     supervisor: {local,worker_pool_sup}
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0>     errorContext: child_terminated
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0>     reason: {function_clause,
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0>                 [{rabbit_queue_index,journal_minus_segment1,
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0>                      [{no_pub,no_del,ack},
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0>                       {{true,
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0>                            <<86,29,26,171,113,108,212,249,46,163,253,85,120,
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0>                              251,207,21,0,0,0,0,0,0,0,0,0,0,220,99>>,
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0>                            <<>>},
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0>                        no_del,no_ack}],
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0>                      [{file,"rabbit_queue_index.erl"},{line,1390}]},
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0>                  {rabbit_queue_index,'-journal_minus_segment/3-fun-0-',4,
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0>                      [{file,"rabbit_queue_index.erl"},{line,1367}]},
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0>                  {array,sparse_foldl_3,7,[{file,"array.erl"},{line,1684}]},
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0>                  {array,sparse_foldl_2,9,[{file,"array.erl"},{line,1678}]},
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0>                  {rabbit_queue_index,'-recover_journal/1-fun-0-',1,
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0>                      [{file,"rabbit_queue_index.erl"},{line,1054}]},
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0>                  {lists,map,2,[{file,"lists.erl"},{line,1243}]},
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0>                  {rabbit_queue_index,segment_map,2,
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0>                      [{file,"rabbit_queue_index.erl"},{line,1178}]},
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0>                  {rabbit_queue_index,recover_journal,1,
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0>                      [{file,"rabbit_queue_index.erl"},{line,1045}]}]}
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0>     offender: [{pid,<0.374.0>},
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0>                {id,5},
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0>                {mfargs,{worker_pool_worker,start_link,[worker_pool]}},
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0>                {restart_type,transient},
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0>                {significant,false},


Michael Solberg

unread,
Aug 24, 2022, 2:37:00 PM8/24/22
to rabbitmq-users

Also I forgot to note, the node is able to be restarted once we delete the bad queue from the mnesia tables. So we've been deleting  msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/queues/7AFI7AV4T4MB7Y178Z0P1L8A9 and then the node comes back to life. 

jo...@cloudamqp.com

unread,
Aug 24, 2022, 2:44:31 PM8/24/22
to rabbitmq-users
The bad news is that there is no single fix for this (yet), the good news is you only need to delete the .jif-file and not the whole directory.

/Johan

Michael Solberg

unread,
Aug 24, 2022, 3:06:44 PM8/24/22
to rabbitmq-users
Thank you!!!
Reply all
Reply to author
Forward
0 new messages