Stream marked as "Down" after network partition


Nicolas Piguet

Oct 7, 2022, 4:44:54 AM
to rabbitmq-users
Hello,

I have a bit of a strange case. We have a 3-node RabbitMQ cluster (with nodes named 04, 05, 06) with 4 streams defined on it. We had a network partition recently, where node 05 became partitioned from 04 and 06. The partition has since been resolved, but only 3 of our 4 streams managed to recover.

From the management interface, 3 streams appear as "running" and 1 appears as "down" with the message "the queue is located on a cluster node or nodes that are down".

However by now all of our cluster nodes are up, and they can all see each other.

In the management interface, when I click on the stream that is down, I get an error page with the message "The object you clicked on was not found; it may have been deleted on the server."

When my application tries to connect to this stream using the java stream client lib, I get the following error message: "java.lang.IllegalStateException: Could not get stream metadata, response code: 6 (STREAM_NOT_AVAILABLE)    at com.rabbitmq.stream.impl.ConsumersCoordinator.findBrokersForStream(ConsumersCoordinator.java:155) ~[stream-client-0.6.0.jar:0.6.0]"

I've tried looking at the logs of each of the 3 cluster nodes, but there are no error or warning messages regarding this particular stream (named s-sr240-359). The log messages look like:

Node 05:
2022-10-06 15:12:25.341007+02:00 [info] <0.264.0> Stream: s-sr240-359_q_1663163778293078542 will use /var/lib/rabbitmq/mnesia/rabbit@pre-bpubsub05glc/stream/s-sr240-359_q_1663163778293078542 for osiris log data directory
2022-10-06 15:12:25.342467+02:00 [info] <0.264.0> s-sr240-359_q_1663163778293078542 [osiris_replica:init/1] next offset 10, tail info {10,{11,9,1664786023389}}


Node 04:
2022-10-06 15:12:25.344788+02:00 [info] <0.25739.99> starting osiris replica reader s-sr240-359_q_1663163778293078542 at offset 10
2022-10-06 15:12:25.347016+02:00 [info] <0.25789.99> rabbit_stream_coordinator: s-sr240-359_q_1663163778293078542: replica started on rabbit@pre-bpubsub05glc in 19 pid <13046.264.0>


Node 06:
2022-10-05 12:00:52.899827+02:00 [info] <0.260.0> Stream: s-sr240-359_q_1663163778293078542 will use /var/lib/rabbitmq/mnesia/rabbit@pre-bpubsub06glc/stream/scs_securities-brokerage_oems_trigger_events_s-sr240-359_q_1663163778293078542 for osiris log data directory
2022-10-05 12:00:52.901226+02:00 [info] <0.260.0> s-sr240-359_q_1663163778293078542 [osiris_replica:init/1] next offset 10, tail info {10,{11,9,1664786023389}}


Does anyone have an idea of what might have happened here?



Nicolas Piguet

Oct 7, 2022, 4:51:34 AM
to rabbitmq-users
Here are the version numbers of the software running on the servers:

RabbitMQ 3.10.8
Erlang 25.0.4
Ubuntu 20.04.2

kjnilsson

Oct 7, 2022, 11:22:19 AM
to rabbitmq-users
Hi,

Could you run the following command and share the output?

rabbitmqctl  eval 'sys:get_status(whereis(rabbit_stream_coordinator)).'


Cheers
Karl

Nicolas Piguet

Oct 13, 2022, 4:07:38 AM
to rabbitmq-users
Hi,

It took me a while to obtain the rights to run the command you suggested, but I managed to get it in the end. Here is the output:

{status,<12163.140.18>,
 {module,gen_statem},
 [[{'$rabbit_vm_category',rabbit_stream_coordinator},
   {'$initial_call',{ra_server_proc,init,1}},
   {'$ancestors',
    [<12163.139.18>,ra_coordination_server_sup_sup,<12163.32662.17>,
     ra_systems_sup,ra_sup,<12163.32519.17>]},
   {rand_seed,
    {#{bits => 58,jump => #Fun<rand.3.34006561>,next => #Fun<rand.0.34006561>,
       type => exsss,uniform => #Fun<rand.1.34006561>,
       uniform_n => #Fun<rand.2.34006561>},
     [38346938440337510|136188484175195874]}}],
  running,<12163.139.18>,[],
  [{header,"Status for state machine rabbit_stream_coordinator"},
   {data,
    [{"Status",running},
     {"Parent",<12163.139.18>},
     {"Modules",[ra_server_proc]},
     {"Time-outs",{1,[{{timeout,tick},tick_timeout}]}},
     {"Logged Events",[]},
     {"Postponed",[]}]},
   {id,{rabbit_stream_coordinator,'rabbit@pre-bpubsub06glc'}},
   {opt,normal},
   {raft_state,follower},
   {leader_last_seen,1665645204325},
   {num_pending_commands,0},
   {num_delayed_commands,0},
   {num_pending_applied_notifications,0},
   {election_timeout_set,false},
   {ra_server_state,
    #{aux => {aux,#{},undefined},
      cluster =>
       #{{rabbit_stream_coordinator,'rabbit@pre-bpubsub04glc'} =>
          #{commit_index_sent => 0,match_index => 0,next_index => 1,
            query_index => 0,status => normal},
         {rabbit_stream_coordinator,'rabbit@pre-bpubsub05glc'} =>
          #{commit_index_sent => 0,match_index => 0,next_index => 1,
            query_index => 0,status => normal},
         {rabbit_stream_coordinator,'rabbit@pre-bpubsub06glc'} =>
          #{commit_index_sent => 0,match_index => 0,next_index => 1,
            query_index => 0,status => normal}},
      commit_index => 2981,
      counter => {write_concurrency,#Ref<12163.2697593085.3256745985.223672>},
      current_term => 31,
      effective_machine_module => rabbit_stream_coordinator,
      effective_machine_version => 2,
      id => {rabbit_stream_coordinator,'rabbit@pre-bpubsub06glc'},
      last_applied => 2981,
      leader_id => {rabbit_stream_coordinator,'rabbit@pre-bpubsub04glc'},
      log =>
       #{cache_size => 0,first_index => 0,last_index => 2981,
         last_written_index_term => {2981,31},
         num_segments => 1,open_segments => 1,snapshot_index => undefined,
         type => ra_log},
      log_id => "rabbit_stream_coordinator",
      machine =>
       {rabbit_stream_coordinator,
        #{"perftest_stream_1663186697432696293" =>
           {stream,"perftest_stream_1663186697432696293",26,
            {resource,<<"perftest">>,queue,<<"stream">>},
            #{epoch => 1,
              event_formatter =>
               {rabbit_stream_queue,format_osiris_event,
                [{resource,<<"perftest">>,queue,<<"stream">>}]},
              leader_locator_strategy => <<"least-leaders">>,
              leader_node => 'rabbit@pre-bpubsub05glc',
              max_segment_size_bytes => 500000000,
              name => "perftest_stream_1663186697432696293",
              nodes =>
               ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
                'rabbit@pre-bpubsub06glc'],
              reference => {resource,<<"perftest">>,queue,<<"stream">>},
              replica_nodes =>
               ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub06glc'],
              retention => [{max_bytes,100000000}]},
            ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
             'rabbit@pre-bpubsub06glc'],
            #{'rabbit@pre-bpubsub04glc' =>
               {member,
                {running,26,<12388.8794.198>},
                {writer,26},
                'rabbit@pre-bpubsub04glc',undefined,
                #{epoch => 26,
                  event_formatter =>
                   {rabbit_stream_queue,format_osiris_event,
                    [{resource,<<"perftest">>,queue,<<"stream">>}]},
                  leader_locator_strategy => <<"least-leaders">>,
                  leader_node => 'rabbit@pre-bpubsub04glc',
                  max_segment_size_bytes => 500000000,
                  name => "perftest_stream_1663186697432696293",
                  nodes =>
                   ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
                    'rabbit@pre-bpubsub06glc'],
                  reference => {resource,<<"perftest">>,queue,<<"stream">>},
                  replica_nodes =>
                   ['rabbit@pre-bpubsub05glc','rabbit@pre-bpubsub06glc'],
                  retention => [{max_bytes,100000000}]},
                running},
              'rabbit@pre-bpubsub05glc' =>
               {member,
                {running,26,<12389.3658.196>},
                {replica,26},
                'rabbit@pre-bpubsub05glc',undefined,
                #{epoch => 26,
                  event_formatter =>
                   {rabbit_stream_queue,format_osiris_event,
                    [{resource,<<"perftest">>,queue,<<"stream">>}]},
                  leader_locator_strategy => <<"least-leaders">>,
                  leader_node => 'rabbit@pre-bpubsub04glc',
                  leader_pid => <12388.8794.198>,
                  max_segment_size_bytes => 500000000,
                  name => "perftest_stream_1663186697432696293",
                  nodes =>
                   ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
                    'rabbit@pre-bpubsub06glc'],
                  reference => {resource,<<"perftest">>,queue,<<"stream">>},
                  replica_nodes =>
                   ['rabbit@pre-bpubsub05glc','rabbit@pre-bpubsub06glc'],
                  retention => [{max_bytes,100000000}]},
                running},
              'rabbit@pre-bpubsub06glc' =>
               {member,
                {running,26,<12163.32565.17>},
                {replica,26},
                'rabbit@pre-bpubsub06glc',undefined,
                #{epoch => 26,
                  event_formatter =>
                   {rabbit_stream_queue,format_osiris_event,
                    [{resource,<<"perftest">>,queue,<<"stream">>}]},
                  leader_locator_strategy => <<"least-leaders">>,
                  leader_node => 'rabbit@pre-bpubsub04glc',
                  leader_pid => <12388.8794.198>,
                  max_segment_size_bytes => 500000000,
                  name => "perftest_stream_1663186697432696293",
                  nodes =>
                   ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
                    'rabbit@pre-bpubsub06glc'],
                  reference => {resource,<<"perftest">>,queue,<<"stream">>},
                  replica_nodes =>
                   ['rabbit@pre-bpubsub05glc','rabbit@pre-bpubsub06glc'],
                  retention => [{max_bytes,100000000}]},
                running}},
            #{},undefined,
            {updated,26},
            running},
          "s-sr0-119_q_1663163778291262666" =>
           {stream,
            "s-sr0-119_q_1663163778291262666",
            24,
            {resource,<<"scs">>,queue,
             <<"s-sr0-119.q">>},
            #{epoch => 1,
              event_formatter =>
               {rabbit_stream_queue,format_osiris_event,
                [{resource,<<"scs">>,queue,
                  <<"s-sr0-119.q">>}]},
              leader_locator_strategy => <<"client-local">>,
              leader_node => 'rabbit@pre-bpubsub04glc',
              name =>
               "s-sr0-119_q_1663163778291262666",
              nodes =>
               ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
                'rabbit@pre-bpubsub06glc'],
              reference =>
               {resource,<<"scs">>,queue,
                <<"s-sr0-119.q">>},
              replica_nodes =>
               ['rabbit@pre-bpubsub05glc','rabbit@pre-bpubsub06glc'],
              retention => [{max_bytes,1000000000}]},
            ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
             'rabbit@pre-bpubsub06glc'],
            #{'rabbit@pre-bpubsub04glc' =>
               {member,
                {running,24,<12388.7094.198>},
                {writer,24},
                'rabbit@pre-bpubsub04glc',undefined,
                #{epoch => 24,
                  event_formatter =>
                   {rabbit_stream_queue,format_osiris_event,
                    [{resource,<<"scs">>,queue,
                      <<"s-sr0-119.q">>}]},
                  leader_locator_strategy => <<"client-local">>,
                  leader_node => 'rabbit@pre-bpubsub04glc',
                  name =>
                   "s-sr0-119_q_1663163778291262666",
                  nodes =>
                   ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
                    'rabbit@pre-bpubsub06glc'],
                  reference =>
                   {resource,<<"scs">>,queue,
                    <<"s-sr0-119.q">>},
                  replica_nodes =>
                   ['rabbit@pre-bpubsub05glc','rabbit@pre-bpubsub06glc'],
                  retention => [{max_bytes,1000000000}]},
                running},
              'rabbit@pre-bpubsub05glc' =>
               {member,
                {running,24,<12389.23378.195>},
                {replica,24},
                'rabbit@pre-bpubsub05glc',undefined,
                #{epoch => 24,
                  event_formatter =>
                   {rabbit_stream_queue,format_osiris_event,
                    [{resource,<<"scs">>,queue,
                      <<"s-sr0-119.q">>}]},
                  leader_locator_strategy => <<"client-local">>,
                  leader_node => 'rabbit@pre-bpubsub04glc',
                  leader_pid => <12388.7094.198>,
                  name =>
                   "s-sr0-119_q_1663163778291262666",
                  nodes =>
                   ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
                    'rabbit@pre-bpubsub06glc'],
                  reference =>
                   {resource,<<"scs">>,queue,
                    <<"s-sr0-119.q">>},
                  replica_nodes =>
                   ['rabbit@pre-bpubsub05glc','rabbit@pre-bpubsub06glc'],
                  retention => [{max_bytes,1000000000}]},
                running},
              'rabbit@pre-bpubsub06glc' =>
               {member,
                {running,24,<12163.32563.17>},
                {replica,24},
                'rabbit@pre-bpubsub06glc',undefined,
                #{epoch => 24,
                  event_formatter =>
                   {rabbit_stream_queue,format_osiris_event,
                    [{resource,<<"scs">>,queue,
                      <<"s-sr0-119.q">>}]},
                  leader_locator_strategy => <<"client-local">>,
                  leader_node => 'rabbit@pre-bpubsub04glc',
                  leader_pid => <12388.7094.198>,
                  name =>
                   "s-sr0-119_q_1663163778291262666",
                  nodes =>
                   ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
                    'rabbit@pre-bpubsub06glc'],
                  reference =>
                   {resource,<<"scs">>,queue,
                    <<"s-sr0-119.q">>},
                  replica_nodes =>
                   ['rabbit@pre-bpubsub05glc','rabbit@pre-bpubsub06glc'],
                  retention => [{max_bytes,1000000000}]},
                running}},
            #{},undefined,
            {updated,24},
            running},
          "s-sr120-239_q_1663163778292681865" =>
           {stream,
            "s-sr120-239_q_1663163778292681865",
            26,
            {resource,<<"scs">>,queue,
             <<"s-sr120-239.q">>},
            #{epoch => 1,
              event_formatter =>
               {rabbit_stream_queue,format_osiris_event,
                [{resource,<<"scs">>,queue,
                  <<"s-sr120-239.q">>}]},
              leader_locator_strategy => <<"client-local">>,
              leader_node => 'rabbit@pre-bpubsub04glc',
              name =>
               "s-sr120-239_q_1663163778292681865",
              nodes =>
               ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
                'rabbit@pre-bpubsub06glc'],
              reference =>
               {resource,<<"scs">>,queue,
                <<"s-sr120-239.q">>},
              replica_nodes =>
               ['rabbit@pre-bpubsub05glc','rabbit@pre-bpubsub06glc'],
              retention => [{max_bytes,1000000000}]},
            ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
             'rabbit@pre-bpubsub06glc'],
            #{'rabbit@pre-bpubsub04glc' =>
               {member,
                {running,26,<12388.7093.198>},
                {writer,26},
                'rabbit@pre-bpubsub04glc',undefined,
                #{epoch => 26,
                  event_formatter =>
                   {rabbit_stream_queue,format_osiris_event,
                    [{resource,<<"scs">>,queue,
                      <<"s-sr120-239.q">>}]},
                  leader_locator_strategy => <<"client-local">>,
                  leader_node => 'rabbit@pre-bpubsub04glc',
                  name =>
                   "s-sr120-239_q_1663163778292681865",
                  nodes =>
                   ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
                    'rabbit@pre-bpubsub06glc'],
                  reference =>
                   {resource,<<"scs">>,queue,
                    <<"s-sr120-239.q">>},
                  replica_nodes =>
                   ['rabbit@pre-bpubsub05glc','rabbit@pre-bpubsub06glc'],
                  retention => [{max_bytes,1000000000}]},
                running},
              'rabbit@pre-bpubsub05glc' =>
               {member,
                {running,26,<12389.24576.195>},
                {replica,26},
                'rabbit@pre-bpubsub05glc',undefined,
                #{epoch => 26,
                  event_formatter =>
                   {rabbit_stream_queue,format_osiris_event,
                    [{resource,<<"scs">>,queue,
                      <<"s-sr120-239.q">>}]},
                  leader_locator_strategy => <<"client-local">>,
                  leader_node => 'rabbit@pre-bpubsub04glc',
                  leader_pid => <12388.7093.198>,
                  name =>
                   "s-sr120-239_q_1663163778292681865",
                  nodes =>
                   ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
                    'rabbit@pre-bpubsub06glc'],
                  reference =>
                   {resource,<<"scs">>,queue,
                    <<"s-sr120-239.q">>},
                  replica_nodes =>
                   ['rabbit@pre-bpubsub05glc','rabbit@pre-bpubsub06glc'],
                  retention => [{max_bytes,1000000000}]},
                running},
              'rabbit@pre-bpubsub06glc' =>
               {member,
                {running,26,<12163.32561.17>},
                {replica,26},
                'rabbit@pre-bpubsub06glc',undefined,
                #{epoch => 26,
                  event_formatter =>
                   {rabbit_stream_queue,format_osiris_event,
                    [{resource,<<"scs">>,queue,
                      <<"s-sr120-239.q">>}]},
                  leader_locator_strategy => <<"client-local">>,
                  leader_node => 'rabbit@pre-bpubsub04glc',
                  leader_pid => <12388.7093.198>,
                  name =>
                   "s-sr120-239_q_1663163778292681865",
                  nodes =>
                   ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
                    'rabbit@pre-bpubsub06glc'],
                  reference =>
                   {resource,<<"scs">>,queue,
                    <<"s-sr120-239.q">>},
                  replica_nodes =>
                   ['rabbit@pre-bpubsub05glc','rabbit@pre-bpubsub06glc'],
                  retention => [{max_bytes,1000000000}]},
                running}},
            #{},undefined,
            {updated,26},
            running},
          "s-sr240-359_q_1663163778293078542" =>
           {stream,
            "s-sr240-359_q_1663163778293078542",
            25,
            {resource,<<"scs">>,queue,
             <<"s-sr240-359.q">>},
            #{epoch => 1,
              event_formatter =>
               {rabbit_stream_queue,format_osiris_event,
                [{resource,<<"scs">>,queue,
                  <<"s-sr240-359.q">>}]},
              leader_locator_strategy => <<"client-local">>,
              leader_node => 'rabbit@pre-bpubsub04glc',
              name =>
               "s-sr240-359_q_1663163778293078542",
              nodes =>
               ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
                'rabbit@pre-bpubsub06glc'],
              reference =>
               {resource,<<"scs">>,queue,
                <<"s-sr240-359.q">>},
              replica_nodes =>
               ['rabbit@pre-bpubsub05glc','rabbit@pre-bpubsub06glc'],
              retention => [{max_bytes,1000000000}]},
            ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
             'rabbit@pre-bpubsub06glc'],
            #{'rabbit@pre-bpubsub04glc' =>
               {member,
                {running,25,<12388.7129.198>},
                {writer,25},
                'rabbit@pre-bpubsub04glc',undefined,
                #{epoch => 25,
                  event_formatter =>
                   {rabbit_stream_queue,format_osiris_event,
                    [{resource,<<"scs">>,queue,
                      <<"s-sr240-359.q">>}]},
                  leader_locator_strategy => <<"client-local">>,
                  leader_node => 'rabbit@pre-bpubsub04glc',
                  name =>
                   "s-sr240-359_q_1663163778293078542",
                  nodes =>
                   ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
                    'rabbit@pre-bpubsub06glc'],
                  reference =>
                   {resource,<<"scs">>,queue,
                    <<"s-sr240-359.q">>},
                  replica_nodes =>
                   ['rabbit@pre-bpubsub05glc','rabbit@pre-bpubsub06glc'],
                  retention => [{max_bytes,1000000000}]},
                running},
              'rabbit@pre-bpubsub05glc' =>
               {member,
                {running,25,<12389.24602.195>},
                {replica,25},
                'rabbit@pre-bpubsub05glc',undefined,
                #{epoch => 25,
                  event_formatter =>
                   {rabbit_stream_queue,format_osiris_event,
                    [{resource,<<"scs">>,queue,
                      <<"s-sr240-359.q">>}]},
                  leader_locator_strategy => <<"client-local">>,
                  leader_node => 'rabbit@pre-bpubsub04glc',
                  leader_pid => <12388.7129.198>,
                  name =>
                   "s-sr240-359_q_1663163778293078542",
                  nodes =>
                   ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
                    'rabbit@pre-bpubsub06glc'],
                  reference =>
                   {resource,<<"scs">>,queue,
                    <<"s-sr240-359.q">>},
                  replica_nodes =>
                   ['rabbit@pre-bpubsub05glc','rabbit@pre-bpubsub06glc'],
                  retention => [{max_bytes,1000000000}]},
                running},
              'rabbit@pre-bpubsub06glc' =>
               {member,
                {running,25,<12163.32567.17>},
                {replica,25},
                'rabbit@pre-bpubsub06glc',undefined,
                #{epoch => 25,
                  event_formatter =>
                   {rabbit_stream_queue,format_osiris_event,
                    [{resource,<<"scs">>,queue,
                      <<"s-sr240-359.q">>}]},
                  leader_locator_strategy => <<"client-local">>,
                  leader_node => 'rabbit@pre-bpubsub04glc',
                  leader_pid => <12388.7129.198>,
                  name =>
                   "s-sr240-359_q_1663163778293078542",
                  nodes =>
                   ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
                    'rabbit@pre-bpubsub06glc'],
                  reference =>
                   {resource,<<"scs">>,queue,
                    <<"s-sr240-359.q">>},
                  replica_nodes =>
                   ['rabbit@pre-bpubsub05glc','rabbit@pre-bpubsub06glc'],
                  retention => [{max_bytes,1000000000}]},
                running}},
            #{},undefined,
            {updating,12},
            running},
          "s-sr360-479_q_1663163778291849769" =>
           {stream,
            "s-sr360-479_q_1663163778291849769",
            24,
            {resource,<<"scs">>,queue,
             <<"s-sr360-479.q">>},
            #{epoch => 1,
              event_formatter =>
               {rabbit_stream_queue,format_osiris_event,
                [{resource,<<"scs">>,queue,
                  <<"s-sr360-479.q">>}]},
              leader_locator_strategy => <<"client-local">>,
              leader_node => 'rabbit@pre-bpubsub04glc',
              name =>
               "s-sr360-479_q_1663163778291849769",
              nodes =>
               ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
                'rabbit@pre-bpubsub06glc'],
              reference =>
               {resource,<<"scs">>,queue,
                <<"s-sr360-479.q">>},
              replica_nodes =>
               ['rabbit@pre-bpubsub05glc','rabbit@pre-bpubsub06glc'],
              retention => [{max_bytes,1000000000}]},
            ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
             'rabbit@pre-bpubsub06glc'],
            #{'rabbit@pre-bpubsub04glc' =>
               {member,
                {running,24,<12388.7128.198>},
                {writer,24},
                'rabbit@pre-bpubsub04glc',undefined,
                #{epoch => 24,
                  event_formatter =>
                   {rabbit_stream_queue,format_osiris_event,
                    [{resource,<<"scs">>,queue,
                      <<"s-sr360-479.q">>}]},
                  leader_locator_strategy => <<"client-local">>,
                  leader_node => 'rabbit@pre-bpubsub04glc',
                  name =>
                   "s-sr360-479_q_1663163778291849769",
                  nodes =>
                   ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
                    'rabbit@pre-bpubsub06glc'],
                  reference =>
                   {resource,<<"scs">>,queue,
                    <<"s-sr360-479.q">>},
                  replica_nodes =>
                   ['rabbit@pre-bpubsub05glc','rabbit@pre-bpubsub06glc'],
                  retention => [{max_bytes,1000000000}]},
                running},
              'rabbit@pre-bpubsub05glc' =>
               {member,
                {running,24,<12389.23515.195>},
                {replica,24},
                'rabbit@pre-bpubsub05glc',undefined,
                #{epoch => 24,
                  event_formatter =>
                   {rabbit_stream_queue,format_osiris_event,
                    [{resource,<<"scs">>,queue,
                      <<"s-sr360-479.q">>}]},
                  leader_locator_strategy => <<"client-local">>,
                  leader_node => 'rabbit@pre-bpubsub04glc',
                  leader_pid => <12388.7128.198>,
                  name =>
                   "s-sr360-479_q_1663163778291849769",
                  nodes =>
                   ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
                    'rabbit@pre-bpubsub06glc'],
                  reference =>
                   {resource,<<"scs">>,queue,
                    <<"s-sr360-479.q">>},
                  replica_nodes =>
                   ['rabbit@pre-bpubsub05glc','rabbit@pre-bpubsub06glc'],
                  retention => [{max_bytes,1000000000}]},
                running},
              'rabbit@pre-bpubsub06glc' =>
               {member,
                {running,24,<12163.32547.17>},
                {replica,24},
                'rabbit@pre-bpubsub06glc',undefined,
                #{epoch => 24,
                  event_formatter =>
                   {rabbit_stream_queue,format_osiris_event,
                    [{resource,<<"scs">>,queue,
                      <<"s-sr360-479.q">>}]},
                  leader_locator_strategy => <<"client-local">>,
                  leader_node => 'rabbit@pre-bpubsub04glc',
                  leader_pid => <12388.7128.198>,
                  name =>
                   "s-sr360-479_q_1663163778291849769",
                  nodes =>
                   ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
                    'rabbit@pre-bpubsub06glc'],
                  reference =>
                   {resource,<<"scs">>,queue,
                    <<"s-sr360-479.q">>},
                  replica_nodes =>
                   ['rabbit@pre-bpubsub05glc','rabbit@pre-bpubsub06glc'],
                  retention => [{max_bytes,1000000000}]},
                running}},
            #{},undefined,
            {updated,24},
            running}},
        #{<12163.32547.17> =>
           {"s-sr360-479_q_1663163778291849769",
            member},
          <12163.32561.17> =>
           {"s-sr120-239_q_1663163778292681865",
            member},
          <12163.32563.17> =>
           {"s-sr0-119_q_1663163778291262666",
            member},
          <12163.32565.17> => {"perftest_stream_1663186697432696293",member},
          <12163.32567.17> =>
           {"s-sr240-359_q_1663163778293078542",
            member},
          <12389.23378.195> =>
           {"s-sr0-119_q_1663163778291262666",
            member},
          <12389.23515.195> =>
           {"s-sr360-479_q_1663163778291849769",
            member},
          <12389.24576.195> =>
           {"s-sr120-239_q_1663163778292681865",
            member},
          <12389.24602.195> =>
           {"s-sr240-359_q_1663163778293078542",
            member},
          <12389.3658.196> => {"perftest_stream_1663186697432696293",member},
          <12388.7093.198> =>
           {"s-sr120-239_q_1663163778292681865",
            member},
          <12388.7094.198> =>
           {"s-sr0-119_q_1663163778291262666",
            member},
          <12388.7128.198> =>
           {"s-sr360-479_q_1663163778291849769",
            member},
          <12388.7129.198> =>
           {"s-sr240-359_q_1663163778293078542",
            member},
          <12388.8794.198> => {"perftest_stream_1663186697432696293",member}},
        undefined,undefined,undefined},
      machine_version => 2,
      machine_versions => [{1,2},{0,0}],
      max_pipeline_count => 4096,metrics_key => rabbit_stream_coordinator,
      system_config =>
       #{data_dir =>
          "/var/lib/rabbitmq/mnesia/rabbit@pre-bpubsub06glc/coordination/rabbit@pre-bpubsub06glc",
         name => coordination,
         names =>
          #{closed_mem_tbls => ra_coordination_log_closed_mem_tables,
            directory => ra_coordination_directory,
            directory_rev => ra_coordination_directory_reverse,
            log_ets => ra_coordination_log_ets,
            log_meta => ra_coordination_log_meta,
            log_sup => ra_coordination_log_sup,
            open_mem_tbls => ra_coordination_log_open_mem_tables,
            segment_writer => ra_coordination_segment_writer,
            server_sup => ra_coordination_server_sup_sup,
            wal => ra_coordination_log_wal,
            wal_sup => ra_coordination_log_wal_sup},
         segment_max_entries => 4096,wal_compute_checksums => true,
         wal_data_dir =>
          "/var/lib/rabbitmq/mnesia/rabbit@pre-bpubsub06glc/coordination/rabbit@pre-bpubsub06glc",
         wal_garbage_collect => false,wal_max_batch_size => 4096,
         wal_max_entries => undefined,wal_max_size_bytes => 64000000,
         wal_pre_allocate => false,wal_sync_method => datasync,
         wal_write_strategy => default},
      uid => <<"RABBITHPUV3JOV2APU">>,
      voted_for => {rabbit_stream_coordinator,'rabbit@pre-bpubsub06glc'}}}]]}

Nicolas Piguet

Oct 13, 2022, 5:45:05 AM
to rabbitmq-users
So I tried understanding the differences between the Stream that is marked "down" and the other streams that are marked as "running", and the only significant difference appears to be at the end of their respective blocks:

The down stream "s-sr240-359" shows
        {updating, 12}

while the running streams show variations of
        {updated, 24}

We did do an upgrade of the cluster from 3.9 to 3.10.8 recently. Is it possible that something went wrong during the upgrade?

Nicolas Piguet

Oct 13, 2022, 5:49:46 AM
to rabbitmq-users
This issue looks like it could be related to https://github.com/rabbitmq/rabbitmq-server/discussions/5120

Nicolas Piguet

Oct 14, 2022, 5:40:06 AM
to rabbitmq-users
`rabbitmqctl report` returns this, which shows the difference between the 3 streams that are running and the one stream that is down:

name    durable auto_delete     arguments       policy  pid     owner_pid       exclusive       exclusive_consumer_pid  exclusive_consumer_tag  messages_ready  messages_unacknowledged messages        messages_ready_ram      messages_unacknowledged_ram     messages_ram    messages_persistent     message_bytes   message_bytes_ready     message_bytes_unacknowledged    message_bytes_ram    message_bytes_persistent        head_message_timestamp  disk_reads      disk_writes     consumers       consumer_utilisation    consumer_capacity       memory  slave_pids      synchronised_slave_pids state   type    leader  members online
s-sr240-359.q       true    false   [{"x-max-length-bytes",1000000000},{"x-queue-type","stream"}]           <rab...@pre-bpubsub05glc.1663057949.18868.193>          down    rabbit_stream_queue
s-sr0-119.q         true    false   [{"x-max-length-bytes",1000000000},{"x-queue-type","stream"}]                                                   3       0       3                                                                                                       0                       14104                   running stream  rabbit@pre-bpubsub04glc [rabbit@pre-bpubsub04glc, rabbit@pre-bpubsub05glc, rabbit@pre-bpubsub06glc]  [rabbit@pre-bpubsub06glc, rabbit@pre-bpubsub05glc, rabbit@pre-bpubsub04glc]
s-sr360-479.q       true    false   [{"x-max-length-bytes",1000000000},{"x-queue-type","stream"}]                                                   11      0       11                                                                                                      0                       14104                   running stream  rabbit@pre-bpubsub04glc [rabbit@pre-bpubsub04glc, rabbit@pre-bpubsub05glc, rabbit@pre-bpubsub06glc]  [rabbit@pre-bpubsub06glc, rabbit@pre-bpubsub05glc, rabbit@pre-bpubsub04glc]
s-sr120-239.q       true    false   [{"x-max-length-bytes",1000000000},{"x-queue-type","stream"}]                                                   14      0       14                                                                                                      0                       14104                   running stream  rabbit@pre-bpubsub04glc [rabbit@pre-bpubsub04glc, rabbit@pre-bpubsub05glc, rabbit@pre-bpubsub06glc]  [rabbit@pre-bpubsub06glc, rabbit@pre-bpubsub05glc, rabbit@pre-bpubsub04glc]
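
I believe roughly the same view can be obtained directly with list_queues (the column names are taken from the report header above):

rabbitmqctl list_queues name type state leader members online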

Nicolas Piguet

Oct 17, 2022, 5:11:23 AM
to rabbitmq-users
Digging further into the logs reveals that a crash occurred after the restart of the RabbitMQ server immediately following the upgrade to 3.10.8. 2 or 3 similar crashes happened in the next hour (as the server was restarted multiple times, I suppose), with the same error message in rabbit_stream_manager: "no case clause matching {badrpc,nodedown}". The content of the log follows.

Any idea what could be causing this? Can my Stream be recovered, or will I have to delete it and re-create it?



2022-10-03 11:52:19.410533+02:00 [error] <0.3870.1> ** Generic server rabbit_stream_manager terminating
2022-10-03 11:52:19.410533+02:00 [error] <0.3870.1> ** Last message in was {lookup_leader,<<"scs">>,
2022-10-03 11:52:19.410533+02:00 [error] <0.3870.1>                                       <<"s-sr240-359.q">>}
2022-10-03 11:52:19.410533+02:00 [error] <0.3870.1> ** When Server state == {state,#{nodes =>
2022-10-03 11:52:19.410533+02:00 [error] <0.3870.1>                                      ['rabbit@pre-bpubsub05glc',
2022-10-03 11:52:19.410533+02:00 [error] <0.3870.1>                                       'rabbit@pre-bpubsub06glc',
2022-10-03 11:52:19.410533+02:00 [error] <0.3870.1>                                       'rabbit@pre-bpubsub04glc']}}
2022-10-03 11:52:19.410533+02:00 [error] <0.3870.1> ** Reason for termination ==
2022-10-03 11:52:19.410533+02:00 [error] <0.3870.1> ** {{case_clause,{badrpc,nodedown}},
2022-10-03 11:52:19.410533+02:00 [error] <0.3870.1>     [{rabbit_stream_manager,handle_call,3,
2022-10-03 11:52:19.410533+02:00 [error] <0.3870.1>                             [{file,"rabbit_stream_manager.erl"},{line,297}]},
2022-10-03 11:52:19.410533+02:00 [error] <0.3870.1>      {gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,1146}]},
2022-10-03 11:52:19.410533+02:00 [error] <0.3870.1>      {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1175}]},
2022-10-03 11:52:19.410533+02:00 [error] <0.3870.1>      {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
2022-10-03 11:52:19.410533+02:00 [error] <0.3870.1> ** Client <0.6758.2> stacktrace
2022-10-03 11:52:19.410533+02:00 [error] <0.3870.1> ** [{gen,do_call,4,[{file,"gen.erl"},{line,256}]},
2022-10-03 11:52:19.410533+02:00 [error] <0.3870.1>     {gen_server,call,2,[{file,"gen_server.erl"},{line,363}]},
2022-10-03 11:52:19.410533+02:00 [error] <0.3870.1>     {rabbit_stream_reader,lookup_leader,2,
2022-10-03 11:52:19.410533+02:00 [error] <0.3870.1>                           [{file,"rabbit_stream_reader.erl"},{line,2542}]},
2022-10-03 11:52:19.410533+02:00 [error] <0.3870.1>     {rabbit_stream_reader,handle_frame_post_auth,4,
2022-10-03 11:52:19.410533+02:00 [error] <0.3870.1>                           [{file,"rabbit_stream_reader.erl"},{line,1493}]},
2022-10-03 11:52:19.410533+02:00 [error] <0.3870.1>     {lists,foldl,3,[{file,"lists.erl"},{line,1350}]},
2022-10-03 11:52:19.410533+02:00 [error] <0.3870.1>     {rabbit_stream_reader,open,3,
2022-10-03 11:52:19.410533+02:00 [error] <0.3870.1>                           [{file,"rabbit_stream_reader.erl"},{line,721}]},
2022-10-03 11:52:19.410533+02:00 [error] <0.3870.1>     {gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1419}]},
2022-10-03 11:52:19.410533+02:00 [error] <0.3870.1>     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]
2022-10-03 11:52:19.410533+02:00 [error] <0.3870.1>
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>   crasher:
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>     initial call: rabbit_stream_manager:init/1
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>     pid: <0.3870.1>
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>     registered_name: rabbit_stream_manager
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>     exception error: no case clause matching {badrpc,nodedown}
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>       in function  rabbit_stream_manager:handle_call/3 (rabbit_stream_manager.erl, line 297)
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>       in call from gen_server:try_handle_call/4 (gen_server.erl, line 1146)
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>       in call from gen_server:handle_msg/6 (gen_server.erl, line 1175)
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>     ancestors: [rabbit_stream_sup,<0.3868.1>]
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>     message_queue_len: 3
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>     messages: [{'$gen_call',
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>                       {<0.4011.1>,[alias|#Ref<0.2905471650.1159266306.23889>]},
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>                       {topology,<<"perftest">>,<<"stream">>}},
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>                   {'$gen_call',
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>                       {<0.6767.2>,[alias|#Ref<0.2905471650.1159266306.23891>]},
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>                       {lookup_leader,<<"perftest">>,<<"stream">>}},
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>                   {'$gen_call',
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>                       {<0.4020.1>,[alias|#Ref<0.2905471650.1159528449.36139>]},
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>                       {topology,<<"scs">>,
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>                           <<"s-sr360-479.q">>}}]
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>     links: [<0.3869.1>]
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>     dictionary: []
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>     trap_exit: false
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>     status: running
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>     heap_size: 10958
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>     stack_size: 28
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>     reductions: 60431
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>   neighbours:
2022-10-03 11:52:19.410953+02:00 [error] <0.3870.1>
2022-10-03 11:52:19.411360+02:00 [error] <0.3869.1>     supervisor: {local,rabbit_stream_sup}
2022-10-03 11:52:19.411360+02:00 [error] <0.3869.1>     errorContext: child_terminated
2022-10-03 11:52:19.411360+02:00 [error] <0.3869.1>     reason: {{case_clause,{badrpc,nodedown}},
2022-10-03 11:52:19.411360+02:00 [error] <0.3869.1>              [{rabbit_stream_manager,handle_call,3,
2022-10-03 11:52:19.411360+02:00 [error] <0.3869.1>                                      [{file,"rabbit_stream_manager.erl"},
2022-10-03 11:52:19.411360+02:00 [error] <0.3869.1>                                       {line,297}]},
2022-10-03 11:52:19.411360+02:00 [error] <0.3869.1>               {gen_server,try_handle_call,4,
2022-10-03 11:52:19.411360+02:00 [error] <0.3869.1>                           [{file,"gen_server.erl"},{line,1146}]},
2022-10-03 11:52:19.411360+02:00 [error] <0.3869.1>               {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1175}]},
2022-10-03 11:52:19.411360+02:00 [error] <0.3869.1>               {proc_lib,init_p_do_apply,3,
2022-10-03 11:52:19.411360+02:00 [error] <0.3869.1>                         [{file,"proc_lib.erl"},{line,240}]}]}
2022-10-03 11:52:19.411360+02:00 [error] <0.3869.1>     offender: [{pid,<0.3870.1>},
2022-10-03 11:52:19.411360+02:00 [error] <0.3869.1>                {id,rabbit_stream_manager},
2022-10-03 11:52:19.411360+02:00 [error] <0.3869.1>                {mfargs,
2022-10-03 11:52:19.411360+02:00 [error] <0.3869.1>                    {rabbit_stream_manager,start_link,
2022-10-03 11:52:19.411360+02:00 [error] <0.3869.1>                        [#{nodes =>
2022-10-03 11:52:19.411360+02:00 [error] <0.3869.1>                               ['rabbit@pre-bpubsub05glc',
2022-10-03 11:52:19.411360+02:00 [error] <0.3869.1>                                'rabbit@pre-bpubsub06glc',
2022-10-03 11:52:19.411360+02:00 [error] <0.3869.1>                                'rabbit@pre-bpubsub04glc']}]}},
2022-10-03 11:52:19.411360+02:00 [error] <0.3869.1>                {restart_type,permanent},
2022-10-03 11:52:19.411360+02:00 [error] <0.3869.1>                {significant,false},
2022-10-03 11:52:19.411360+02:00 [error] <0.3869.1>                {shutdown,5000},
2022-10-03 11:52:19.411360+02:00 [error] <0.3869.1>                {child_type,worker}]

kjnilsson

Oct 17, 2022, 6:20:41 AM
to rabbitmq-users
I am wondering if this is due to mnesia corruption: the stream is actually running, but mnesia hasn't been updated correctly for some reason, and thus the stream isn't discoverable everywhere. The {updating, 12} refers to the mnesia metadata store update status.

Can you try the following command:

rabbitmqctl eval 'rabbit_amqqueue:lookup([{resource,<<"scs">>,queue, <<"s-sr240-359.q">>}]).'

Cheers
Karl

Nicolas Piguet

Oct 17, 2022, 8:17:25 AM
to rabbitmq-users
This command just returns [] on all 3 nodes.
It does return a bunch of data for the other 3 similar streams.

kjnilsson

Oct 17, 2022, 12:11:43 PM
to rabbitmq-users
OK, so it looks like we have a stream and it is working, but the record of the queue (stream) is missing from the mnesia database, which means most bits of functionality won't be able to find it. This is something I suspect can happen depending on the partition handling mode. Which one do you use?
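
If you're not sure which mode is configured, it can be checked on any node with something like this (cluster_partition_handling is the standard rabbit config key):

rabbitmqctl eval 'application:get_env(rabbit, cluster_partition_handling).'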

Now, how would we fix it? One option would be to re-declare the stream, but of course you'd lose the data in that case, and internally the stream would still be running.

We _may_ be able to patch up the mnesia database with another eval command, but it would be a best-effort thing with no guarantees or promises of success.

Nicolas Piguet

Oct 18, 2022, 4:04:50 AM
to rabbitmq-users
Our cluster is running in `pause_minority` mode with 3 nodes.
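
For reference, that corresponds to this line in our rabbitmq.conf (assuming the standard key name):

cluster_partition_handling = pause_minority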

If you know of a command to patch up the mnesia database (even if the success rate is not guaranteed), I would be happy to take it. If it doesn't work or screws something up we can always delete and re-create the stream, which we would have had to do anyway. So there is no real downside to trying :-)

kjnilsson

Oct 18, 2022, 4:14:05 AM
to rabbitmq-users
OK, can you share the output of one of the previous lookup commands for one of the other streams? I can use this to create a record with the missing name, I think.

Nicolas Piguet

Oct 18, 2022, 5:14:46 AM
to rabbitmq-users
Here's what I get for one of the other streams:

pre-bpubsub04glc:~$ sudo rabbitmqctl eval 'rabbit_amqqueue:lookup([{resource,<<"scs">>,queue, <<"s-sr0-119.q">>}]).'
[{amqqueue,

     {resource,<<"scs">>,queue,
         <<"s-sr0-119.q">>},
     true,false,none,
     [{<<"x-max-length-bytes">>,long,1000000000},
      {<<"x-queue-type">>,longstr,<<"stream">>}],
     <12359.20754.57>,[],[],[],undefined,undefined,[],[],live,0,[],
     <<"scs">>,
     #{user => <<"t_itpa">>},
     rabbit_stream_queue,
     #{epoch => 28,

       event_formatter =>
           {rabbit_stream_queue,format_osiris_event,
               [{resource,<<"scs">>,queue,
                    <<"s-sr0-119.q">>}]},
       leader_locator_strategy => <<"client-local">>,
       leader_node => 'rabbit@pre-bpubsub05glc',
       leader_pid => <12359.20754.57>,
       name =>
           "scs_s-sr0-119_q_1663163778291262666",

       nodes =>
           ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
            'rabbit@pre-bpubsub06glc'],
       reference =>
           {resource,<<"scs">>,queue,
               <<"s-sr0-119.q">>},
       replica_nodes => ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub06glc'],
       retention => [{max_bytes,1000000000}]}}]

kjnilsson

Oct 18, 2022, 6:10:29 AM
to rabbitmq-users
cat the attached file into `rabbitmqctl eval` and it should create an mnesia record that the stream coordinator _should_ eventually repair with the correct details.

NB: I haven't tested anything but the insert itself.

You should have periodic debug logs on one of your nodes to the effect of: "rabbit_stream_coordinator: running mnesia update for s-sr240-359_q_1663163778293078542"

followed by something like "rabbit_stream_coordinator: resource for stream id s-sr240-359_q_1663163778293078542 not found, recovering from rabbit_durable_queue"

This is the mnesia update process that hopefully should repair stuff.
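
If debug logging isn't currently enabled, you should be able to raise the level at runtime with:

rabbitmqctl set_log_level debug

and set it back with rabbitmqctl set_log_level info once done.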

Attachment: recover-stream-command

kjnilsson

Oct 18, 2022, 6:11:51 AM
to rabbitmq-users
Also please check the retention is correct for the given stream.

Nicolas Piguet

Oct 18, 2022, 8:35:29 AM
to rabbitmq-users
Hi,

This appears to have killed the "Queues" tab of the admin interface. The page now crashes with the error below. Is there a way to fix it? If not, how can I just remove the faulty entry from mnesia?

2022-10-18 14:32:53.689101+02:00 [error] <0.22942.61>   crasher:
2022-10-18 14:32:53.689101+02:00 [error] <0.22942.61>     initial call: cowboy_stream_h:request_process/3
2022-10-18 14:32:53.689101+02:00 [error] <0.22942.61>     pid: <0.22942.61>
2022-10-18 14:32:53.689101+02:00 [error] <0.22942.61>     registered_name: []
2022-10-18 14:32:53.689101+02:00 [error] <0.22942.61>     exception error: bad argument
2022-10-18 14:32:53.689101+02:00 [error] <0.22942.61>       in function  node/1
2022-10-18 14:32:53.689101+02:00 [error] <0.22942.61>          called as node(undefined)
2022-10-18 14:32:53.689101+02:00 [error] <0.22942.61>          *** argument 1: not a pid
2022-10-18 14:32:53.689101+02:00 [error] <0.22942.61>       in call from rabbit_amqqueue_process:format/1 (rabbit_amqqueue_process.erl, line 1816)
2022-10-18 14:32:53.689101+02:00 [error] <0.22942.61>       in call from rabbit_mgmt_format:queue/1 (rabbit_mgmt_format.erl, line 358)
2022-10-18 14:32:53.689101+02:00 [error] <0.22942.61>       in call from rabbit_mgmt_wm_queues:'-basic/1-lc$^0/1-0-'/1 (rabbit_mgmt_wm_queues.erl, line 64)
2022-10-18 14:32:53.689101+02:00 [error] <0.22942.61>       in call from rabbit_mgmt_wm_queues:basic/1 (rabbit_mgmt_wm_queues.erl, line 65)
2022-10-18 14:32:53.689101+02:00 [error] <0.22942.61>       in call from rabbit_mgmt_wm_queues:basic_vhost_filtered/2 (rabbit_mgmt_wm_queues.erl, line 95)
2022-10-18 14:32:53.689101+02:00 [error] <0.22942.61>       in call from rabbit_mgmt_wm_queues:to_json/2 (rabbit_mgmt_wm_queues.erl, line 43)
2022-10-18 14:32:53.689101+02:00 [error] <0.22942.61>       in call from cowboy_rest:call/3 (src/cowboy_rest.erl, line 1575)
2022-10-18 14:32:53.689101+02:00 [error] <0.22942.61>     ancestors: [<0.22941.61>,<0.3793.1>,<0.3787.1>,<0.3786.1>,<0.3784.1>,
2022-10-18 14:32:53.689101+02:00 [error] <0.22942.61>                   rabbit_web_dispatch_sup,<0.3773.1>]
2022-10-18 14:32:53.689101+02:00 [error] <0.22942.61>     message_queue_len: 0
2022-10-18 14:32:53.689101+02:00 [error] <0.22942.61>     messages: []
2022-10-18 14:32:53.689101+02:00 [error] <0.22942.61>     links: [<0.22941.61>]
2022-10-18 14:32:53.689101+02:00 [error] <0.22942.61>     dictionary: []
2022-10-18 14:32:53.689101+02:00 [error] <0.22942.61>     trap_exit: false
2022-10-18 14:32:53.689101+02:00 [error] <0.22942.61>     status: running
2022-10-18 14:32:53.689101+02:00 [error] <0.22942.61>     heap_size: 28690
2022-10-18 14:32:53.689101+02:00 [error] <0.22942.61>     stack_size: 28
2022-10-18 14:32:53.689101+02:00 [error] <0.22942.61>     reductions: 149027
2022-10-18 14:32:53.689101+02:00 [error] <0.22942.61>   neighbours:



Nicolas Piguet

Oct 18, 2022, 9:00:45 AM
to rabbitmq-users
Also, it doesn't seem to have fixed the Stream. My applications still get response code: 6 (STREAM_NOT_AVAILABLE) when trying to connect to the broken one.

kjnilsson

Oct 18, 2022, 9:02:45 AM
to rabbitmq-users
eval with 'mnesia:dirty_delete(rabbit_durable_queue, {resource,<<"scs">>,queue, <<"s-sr240-359.q">>}).'

You can also try the attached updated file with the previous command, but only if you get the debug logs I mentioned before.

Attachment: recover-stream-command

Nicolas Piguet

Oct 18, 2022, 9:23:47 AM
to rabbitmq-users
Ok, so replacing undefined with self() as you suggested fixed the admin interface.

However, the stream is still marked as down and rabbit_amqqueue:lookup still returns [].

I unfortunately do not have debug logs available. The logs only contain info and higher level entries, and there is no mention of the dead stream.

Is there maybe a command to read the mnesia record so I can compare the one that ended up being inserted with the ones from the working streams?
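I'm guessing at something like the following, reusing the table name from your dirty_delete command, but I'm not sure it's the right table:

rabbitmqctl eval 'mnesia:dirty_read(rabbit_durable_queue, {resource,<<"scs">>,queue, <<"s-sr240-359.q">>}).'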

Nicolas Piguet

Oct 18, 2022, 9:34:27 AM
to rabbitmq-users
Also, the status command that you provided in your first response now says:

"scs_s-sr240-359_q_1663163778293078542" =>
           {stream,
            "scs_s-sr240-359_q_1663163778293078542",
            29,

            {resource,<<"scs">>,queue,
             <<"s-sr240-359.q">>},
            #{epoch => 1,
              event_formatter =>
               {rabbit_stream_queue,format_osiris_event,
                [{resource,<<"scs">>,queue,
                  <<"s-sr240-359.q">>}]},
              leader_locator_strategy => <<"client-local">>,
              leader_node => 'rabbit@pre-bpubsub04glc',
              name =>
               "scs_s-sr240-359_q_1663163778293078542",

              nodes =>
               ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
                'rabbit@pre-bpubsub06glc'],
              reference =>
               {resource,<<"scs">>,queue,
                <<"s-sr240-359.q">>},
              replica_nodes =>
               ['rabbit@pre-bpubsub05glc','rabbit@pre-bpubsub06glc'],
              retention => [{max_bytes,1000000000}]},
            ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
             'rabbit@pre-bpubsub06glc'],
            #{'rabbit@pre-bpubsub04glc' =>
               {member,
                {running,29,<12163.4035.1>},
                {replica,29},
                'rabbit@pre-bpubsub04glc',undefined,
                #{epoch => 29,

                  event_formatter =>
                   {rabbit_stream_queue,format_osiris_event,
                    [{resource,<<"scs">>,queue,
                      <<"s-sr240-359.q">>}]},
                  leader_locator_strategy => <<"client-local">>,
                  leader_node => 'rabbit@pre-bpubsub05glc',
                  leader_pid => <12388.20747.57>,
                  name =>
                   "scs_s-sr240-359_q_1663163778293078542",

                  nodes =>
                   ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
                    'rabbit@pre-bpubsub06glc'],
                  reference =>
                   {resource,<<"scs">>,queue,
                    <<"s-sr240-359.q">>},
                  replica_nodes =>
                   ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub06glc'],

                  retention => [{max_bytes,1000000000}]},
                running},
              'rabbit@pre-bpubsub05glc' =>
               {member,
                {running,29,<12388.20747.57>},
                {writer,29},
                'rabbit@pre-bpubsub05glc',undefined,
                #{epoch => 29,

                  event_formatter =>
                   {rabbit_stream_queue,format_osiris_event,
                    [{resource,<<"scs">>,queue,
                      <<"s-sr240-359.q">>}]},
                  leader_locator_strategy => <<"client-local">>,
                  leader_node => 'rabbit@pre-bpubsub05glc',
                  name =>
                   "scs_s-sr240-359_q_1663163778293078542",

                  nodes =>
                   ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
                    'rabbit@pre-bpubsub06glc'],
                  reference =>
                   {resource,<<"scs">>,queue,
                    <<"s-sr240-359.q">>},
                  replica_nodes =>
                   ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub06glc'],

                  retention => [{max_bytes,1000000000}]},
                running},
              'rabbit@pre-bpubsub06glc' =>
               {member,
                {running,29,<12389.27966.54>},
                {replica,29},
                'rabbit@pre-bpubsub06glc',undefined,
                #{epoch => 29,

                  event_formatter =>
                   {rabbit_stream_queue,format_osiris_event,
                    [{resource,<<"scs">>,queue,
                      <<"s-sr240-359.q">>}]},
                  leader_locator_strategy => <<"client-local">>,
                  leader_node => 'rabbit@pre-bpubsub05glc',
                  leader_pid => <12388.20747.57>,
                  name =>
                   "scs_s-sr240-359_q_1663163778293078542",

                  nodes =>
                   ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub05glc',
                    'rabbit@pre-bpubsub06glc'],
                  reference =>
                   {resource,<<"scs">>,queue,
                    <<"s-sr240-359.q">>},
                  replica_nodes =>
                   ['rabbit@pre-bpubsub04glc','rabbit@pre-bpubsub06glc'],

                  retention => [{max_bytes,1000000000}]},
                running}},
            #{},undefined,
            {updating,12},
            running},

Nicolas Piguet

unread,
Oct 18, 2022, 9:40:09 AM10/18/22
to rabbitmq-users
I notice that the epoch property is different between the first record (which I'm not sure what it corresponds to) and the 3 records that appear to be linked to each cluster node.
This looks like some kind of optimistic locking or other concurrency control mechanism. Should I maybe set epoch => 30 so that all nodes treat this update to the key as newer than the value they last saw?

kjnilsson

Oct 18, 2022, 10:02:43 AM
to rabbitmq-users
The thinking was that the stream coordinator should update the mnesia record with the correct details, but we don't know if it is still trying to do so (which is indicated by the {updating, _} status for this stream). The debug logs would tell us if the stream was trying to repair the mnesia record or not.

Another thing to try, if you don't mind, is to kill the current stream leader to trigger a restart. This should do it (the -n flag is important):

rabbitmqctl -n rabbit@pre-bpubsub05glc  eval '{ok, Pid} = rabbit_stream_coordinator:local_pid("scs_s-sr240-359_q_1663163778293078542"), exit(Pid, kill).'
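
Afterwards the member pid on that node should have changed, which you can check with the same local_pid call (without the kill), e.g.:

rabbitmqctl -n rabbit@pre-bpubsub05glc eval 'rabbit_stream_coordinator:local_pid("scs_s-sr240-359_q_1663163778293078542").'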

kjnilsson

Oct 18, 2022, 10:41:09 AM
to rabbitmq-users
OK, failing that, this eval may work, as it should reset the mnesia update task, which _should_ trigger the stream coordinator to repair the mnesia record (assuming it is still there from before).

rabbitmqctl eval 'ra:pipeline_command({rabbit_stream_coordinator, node()}, {action_failed, "scs_s-sr240-359_q_1663163778293078542", #{action => updating_mnesia}}).'
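
If that goes through, re-running the earlier lookup should eventually return the queue record again:

rabbitmqctl eval 'rabbit_amqqueue:lookup([{resource,<<"scs">>,queue, <<"s-sr240-359.q">>}]).'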

Again - all commands are provided without warranty :)

Nicolas Piguet

Oct 18, 2022, 11:31:53 AM
to rabbitmq-users
You're the man!

Killing the stream on the leader node didn't really have any effect. The stream was still marked down. The only observable difference in the output of 'sys:get_status(whereis(rabbit_stream_coordinator)).' is that the "epoch" for the stream was incremented.

However, resetting the mnesia update task worked. The stream is now shown as "running" in the admin interface, and applications can connect to it.

Thanks a lot for the help.


OK, now I need to write some sort of report about this, so that if it happens again, we will know what to do. Do you have any idea of what might have put the mnesia record and the actual status of the stream in this kind of inconsistent state? You mentioned something about the partition handling mode in a previous message; do you know what kind of specific situation could cause this problem? Is it likely to also affect quorum queues, or is it stream specific?

kjnilsson

Oct 19, 2022, 4:00:15 AM
to rabbitmq-users
Great to hear we finally got there!

In this case I suspect something went wrong during the mnesia update that would have happened during your partition event, which meant the stream coordinator thought the mnesia update was still ongoing when in fact it was not. That, combined with the record being clobbered, made for a slightly tricky situation.

I don't know why the queue record was lost; mnesia isn't very good at handling network partitions, which is why we are working on replacing it with our own Raft-based metadata store.

For now I wonder if pause_minority is the best mode for you, or if it would be better to use something like autoheal. Do you still have any classic mirrored queues in the Rabbit cluster?
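
Switching would just be a config change, something like this in rabbitmq.conf:

cluster_partition_handling = autoheal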

The issue with the mnesia update won't affect quorum queues. I will create an issue on our end to force a retry of all pending mnesia updates whenever a new stream coordinator leader is elected.

The issue with the lost mnesia record could affect any system, but it is very rare and I honestly don't even know how it was possible in your case. Full debug logs might have told us more, but we don't have them, so I can't say.

Cheers
Karl

Nicolas Piguet

Oct 19, 2022, 4:33:11 AM
to rabbitmq-users
We have several hundred classic mirrored queues in the cluster. Those 4 streams are actually part of the very few objects that use the quorum mechanics.

If you are going to create an issue on your side, would it be possible to share a public link to it so I can include it in my report?

Thanks again for your help.

kjnilsson

Oct 19, 2022, 4:42:26 AM
to rabbitmq-users
Of course here it is: https://github.com/rabbitmq/rabbitmq-server/issues/6179

Are you planning to move your mirrored queues to quorum queues?

Nicolas Piguet

Oct 19, 2022, 4:59:34 AM
to rabbitmq-users
I guess we're going to have to eventually. Unfortunately this decision is not entirely in my hands, but if you have any good references that explain the differences between classic queues and quorum queues, I'll take them.

It usually takes a significant amount of effort and a good while to convince the infra/operations people to move to another implementation that they haven't used before.

kjnilsson

Oct 19, 2022, 7:01:52 AM
to rabbitmq-users
Here are a couple of links that they need to consider.

Firstly, classic mirrored queues are deprecated and will be removed in RabbitMQ 4.0.
The mirroring feature has significant drawbacks and data safety issues that cannot be fixed, which is why quorum queues were developed.
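
For reference, quorum queues are opted into per queue at declare time via the x-queue-type argument, the same mechanism as the streams visible in your report output, roughly:

[{"x-queue-type","quorum"}]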



Cheers
Karl
