I've been upgrading our rabbitmq servers from 3.8.11 to 3.10.6 and expierenced a similar issue twice during the upgrades.
For the upgrade, we're taking the following hops to keep rabbit happy with versions + erling versions
3.8.11
3.8.15
3.9.14
3.10.6
We've done this in 5 different environments.
In 3 of the 5 environments the upgrade has been very smooth. No issues, consumers and connections come back without restarts and everyhing is good.
In 2 of the 5 so far, we see consumers drop and never come back. Then when we're finished upgrading and restarting services to bring back consumers, something happens and nodes start restarting.
Still working on that something happens and trying to replicate with perf test, but I'll put the crash log below. Has anyone seen anything similar?
Another thing to note, we seem to see this when we're upgrading environments with a large number of messages on queues. These are in dlqs and just sitting.
Crash log:
2022-08-03 18:48:49.529671+00:00 [warning] <0.516.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": rebuilding indices from scratch
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> ** Generic server <0.374.0> terminating
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> ** Last message in was {'$gen_cast',
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> {submit_async,
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> #Fun<rabbit_classic_queue_index_v2.11.133535598>,
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> <0.369.0>}}
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> ** When Server state == undefined
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> ** Reason for termination ==
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> ** {function_clause,
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> [{rabbit_queue_index,journal_minus_segment1,
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> [{no_pub,no_del,ack},
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> {{true,
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> <<86,29,26,171,113,108,212,249,46,163,253,85,120,251,207,21,
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> 0,0,0,0,0,0,0,0,0,0,220,99>>,
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> <<>>},
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> no_del,no_ack}],
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> [{file,"rabbit_queue_index.erl"},{line,1390}]},
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> {rabbit_queue_index,'-journal_minus_segment/3-fun-0-',4,
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> [{file,"rabbit_queue_index.erl"},{line,1367}]},
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> {array,sparse_foldl_3,7,[{file,"array.erl"},{line,1684}]},
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> {array,sparse_foldl_2,9,[{file,"array.erl"},{line,1678}]},
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> {rabbit_queue_index,'-recover_journal/1-fun-0-',1,
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> [{file,"rabbit_queue_index.erl"},{line,1054}]},
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> {lists,map,2,[{file,"lists.erl"},{line,1243}]},
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> {rabbit_queue_index,segment_map,2,
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> [{file,"rabbit_queue_index.erl"},{line,1178}]},
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> {rabbit_queue_index,recover_journal,1,
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0> [{file,"rabbit_queue_index.erl"},{line,1045}]}]}
2022-08-03 18:48:49.909840+00:00 [error] <0.374.0>
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> crasher:
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> initial call: worker_pool_worker:init/1
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> pid: <0.374.0>
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> registered_name: []
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> exception exit: {function_clause,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> [{rabbit_queue_index,journal_minus_segment1,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> [{no_pub,no_del,ack},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> {{true,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> <<86,29,26,171,113,108,212,249,46,163,253,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> 85,120,251,207,21,0,0,0,0,0,0,0,0,0,0,220,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> 99>>,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> <<>>},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> no_del,no_ack}],
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> [{file,"rabbit_queue_index.erl"},{line,1390}]},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> {rabbit_queue_index,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> '-journal_minus_segment/3-fun-0-',4,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> [{file,"rabbit_queue_index.erl"},{line,1367}]},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> {array,sparse_foldl_3,7,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> [{file,"array.erl"},{line,1684}]},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> {array,sparse_foldl_2,9,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> [{file,"array.erl"},{line,1678}]},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> {rabbit_queue_index,'-recover_journal/1-fun-0-',1,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> [{file,"rabbit_queue_index.erl"},{line,1054}]},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> {lists,map,2,[{file,"lists.erl"},{line,1243}]},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> {rabbit_queue_index,segment_map,2,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> [{file,"rabbit_queue_index.erl"},{line,1178}]},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> {rabbit_queue_index,recover_journal,1,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> [{file,"rabbit_queue_index.erl"},{line,1045}]}]}
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> in function gen_server2:terminate/3 (gen_server2.erl, line 1183)
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> ancestors: [worker_pool_sup,rabbit_sup,<0.221.0>]
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> message_queue_len: 0
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> messages: []
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> links: [<0.368.0>]
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> dictionary: [{worker_pool_worker,true},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> {{#Ref<0.817629484.2394947586.238686>,fhc_handle},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> {handle,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> {file_descriptor,prim_file,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> #{handle => #Ref<0.817629484.2395078677.233764>,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> owner => <0.374.0>,r_ahead_size => 0,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> r_buffer => #Ref<0.817629484.2395078658.238688>}},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> #Ref<0.817629484.2394947586.238686>,11752,false,0,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> infinity,[],<<>>,0,0,0,0,0,false,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> "/bitnami/rabbitmq/mnesia/rab...@rabbit-rabbitmq-0.rabbit-rabbitmq-headless.devciny-default.svc.cluster.local/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/queues/7AFI7AV4T4MB7Y178Z0P1L8A9/journal.jif",
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> [write,binary,raw,read],
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> [{write_buffer,infinity}],
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> true,true,-576460742998785996}},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> {rand_seed,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> {#{jump => #Fun<rand.3.92093067>,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> max => 288230376151711743,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> next => #Fun<rand.5.92093067>,type => exsplus},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> [155047331894241093|277277552724105423]}},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> {segment_entry_count,16384},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> {fhc_age_tree,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> {1,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> {{-576460742998785996,
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> #Ref<0.817629484.2394947586.238686>},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> true,nil,nil}}},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> {{"/bitnami/rabbitmq/mnesia/rab...@rabbit-rabbitmq-0.rabbit-rabbitmq-headless.devciny-default.svc.cluster.local/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/queues/7AFI7AV4T4MB7Y178Z0P1L8A9/journal.jif",
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> fhc_file},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> {file,1,true}},
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> {worker_pool_name,worker_pool}]
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> trap_exit: false
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> status: running
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> heap_size: 10958
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> stack_size: 29
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> reductions: 871502
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0> neighbours:
2022-08-03 18:48:49.910657+00:00 [error] <0.374.0>
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0> supervisor: {local,worker_pool_sup}
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0> errorContext: child_terminated
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0> reason: {function_clause,
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0> [{rabbit_queue_index,journal_minus_segment1,
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0> [{no_pub,no_del,ack},
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0> {{true,
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0> <<86,29,26,171,113,108,212,249,46,163,253,85,120,
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0> 251,207,21,0,0,0,0,0,0,0,0,0,0,220,99>>,
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0> <<>>},
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0> no_del,no_ack}],
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0> [{file,"rabbit_queue_index.erl"},{line,1390}]},
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0> {rabbit_queue_index,'-journal_minus_segment/3-fun-0-',4,
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0> [{file,"rabbit_queue_index.erl"},{line,1367}]},
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0> {array,sparse_foldl_3,7,[{file,"array.erl"},{line,1684}]},
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0> {array,sparse_foldl_2,9,[{file,"array.erl"},{line,1678}]},
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0> {rabbit_queue_index,'-recover_journal/1-fun-0-',1,
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0> [{file,"rabbit_queue_index.erl"},{line,1054}]},
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0> {lists,map,2,[{file,"lists.erl"},{line,1243}]},
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0> {rabbit_queue_index,segment_map,2,
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0> [{file,"rabbit_queue_index.erl"},{line,1178}]},
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0> {rabbit_queue_index,recover_journal,1,
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0> [{file,"rabbit_queue_index.erl"},{line,1045}]}]}
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0> offender: [{pid,<0.374.0>},
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0> {id,5},
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0> {mfargs,{worker_pool_worker,start_link,[worker_pool]}},
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0> {restart_type,transient},
2022-08-03 18:48:49.912917+00:00 [error] <0.368.0> {significant,false},