High memory consumption on two of three nodes in a RabbitMQ cluster


Ivan Eryshov

Jul 9, 2020, 6:12:17 AM
to rabbitmq-users
We are using RabbitMQ as an MQTT broker for our IoT devices, currently about 3200 of them. They are silent most of the time: on average we see about 30-40 messages per second, and at peaks that might double, but nothing big. Queues are always empty; consumers pick up messages as soon as they arrive.

We were using RabbitMQ on Windows (for historical reasons). It was a single node with regular definition backups. Now we want to move to a Linux cluster, and for the migration period we would like to keep the Windows node running in the cluster as well.

Yesterday we configured the cluster (simply by using rabbitmqctl join_cluster), and since then memory on the Windows node and on one of the Linux nodes has been growing constantly. The third node runs stably and without noticeable issues.
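For reference, joining was nothing more than the standard sequence, roughly like this on each Linux node (this is a rough recollection, not a transcript):

*************************************
# on the Linux node that joins the existing Windows node
rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@Win2012
rabbitmqctl start_app
*************************************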

According to rabbitmqctl status (see below), on both nodes a lot of memory is allocated for quorum_ets (according to the docs this belongs to the Raft consensus subsystem, which should not take that much).
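For what it's worth, I was watching the growth with the commands below; the dedicated diagnostics command gives the same per-category breakdown without the rest of the status output:

*************************************
rabbitmqctl status
# or only the memory part
rabbitmq-diagnostics memory_breakdown
*************************************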

Both problematic nodes have a lot of errors in their logs (also listed below). Unfortunately, they do not give me any hints about the cause. I can see that the leader is somehow lost again and again, and each new election ends with mqtt_node terminating, but I have no idea why. Strangely, this never happens on the third node, where there is only a short notice about an election right after the node starts.

If somebody can give me a hint, I would appreciate it.



******************************************************************************************
*   Output of the "rabbitmqctl status" on the Windows node:
******************************************************************************************

C:\Program Files\RabbitMQ Server\rabbitmq_server-3.8.5\sbin>rabbitmqctl status
Status of node rabbit@Win2012 ...
Runtime

OS PID: 11940
OS: Windows
Uptime (seconds): 298556
RabbitMQ version: 3.8.5
Node name: rabbit@Win2012
Erlang configuration: Erlang/OTP 23 [erts-11.0.2] [source] [64-bit] [smp:6:6] [ds:6:6:10] [async-threads:64]
Erlang processes: 77123 used, 1048576 limit
Scheduler run queue: 0
Cluster heartbeat timeout (net_ticktime): 60

Plugins

Enabled plugin file: c:/Users/root/AppData/Roaming/RabbitMQ/enabled_plugins
Enabled plugins:

 * rabbitmq_event_exchange
 * rabbitmq_mqtt
 * rabbitmq_management
 * amqp_client
 * rabbitmq_web_dispatch
 * cowboy
 * cowlib
 * rabbitmq_management_agent

Data directory

Node data directory: c:/Users/root/AppData/Roaming/RabbitMQ/db/rabbit@Win2012-mnesia
Raft data directory: c:/Users/root/AppData/Roaming/RabbitMQ/db/rabbit@Win2012-mnesia/quorum/rabbit@Win2012

Config files

 * c:/Users/root/AppData/Roaming/RabbitMQ/advanced.config
 * c:/Users/root/AppData/Roaming/RabbitMQ/rabbitmq.conf

Log file(s)

 * c:/Users/root/AppData/Roaming/RabbitMQ/log/rabbit@Win2012.log
 * c:/Users/root/AppData/Roaming/RabbitMQ/log/rabbit@Win2012_upgrade.log

Alarms

(none)

Memory

Calculation strategy: rss
Memory high watermark setting: 0.4 of available memory, computed to: 5.3853 gb
quorum_ets: 1.8512 gb (38.01 %)
allocated_unused: 0.8833 gb (18.14 %)
connection_other: 0.6941 gb (14.25 %)
other_proc: 0.653 gb (13.41 %)
queue_procs: 0.224 gb (4.6 %)
mgmt_db: 0.2061 gb (4.23 %)
other_ets: 0.1117 gb (2.29 %)
plugins: 0.0858 gb (1.76 %)
other_system: 0.0346 gb (0.71 %)
code: 0.0326 gb (0.67 %)
binary: 0.0298 gb (0.61 %)
metrics: 0.0257 gb (0.53 %)
quorum_queue_procs: 0.0165 gb (0.34 %)
mnesia: 0.0161 gb (0.33 %)
connection_channels: 0.0036 gb (0.07 %)
atom: 0.0014 gb (0.03 %)
msg_index: 0.0004 gb (0.01 %)
connection_writers: 0.0002 gb (0.0 %)
connection_readers: 0.0002 gb (0.0 %)
queue_slave_procs: 0.0 gb (0.0 %)
reserved_unallocated: 0.0 gb (0.0 %)

File Descriptors

Total: 3098, limit: 65439
Sockets: 3092, limit: 58893

Free Disk Space

Low free disk space watermark: 0.05 gb
Free disk space: 117.5893 gb

Totals

Connection count: 6
Queue count: 3100
Virtual host count: 1

Listeners

Interface: [::], port: 55956, protocol: clustering, purpose: inter-node and CLI tool communication
Interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Interface: 0.0.0.0, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0

Interface: [::], port: 5671, protocol: amqp/ssl, purpose: AMQP 0-9-1 and AMQP 1.0 over TLS
Interface: 0.0.0.0, port: 5671, protocol: amqp/ssl, purpose: AMQP 0-9-1 and AMQP 1.0 over TLS
Interface: [::], port: 15672, protocol: http, purpose: HTTP API
Interface: 0.0.0.0, port: 15672, protocol: http, purpose: HTTP API
Interface: [::], port: 1883, protocol: mqtt, purpose: MQTT
Interface: 0.0.0.0, port: 1883, protocol: mqtt, purpose: MQTT
Interface: [::], port: 8883, protocol: mqtt/ssl, purpose: MQTT over TLS
Interface: 0.0.0.0, port: 8883, protocol: mqtt/ssl, purpose: MQTT over TLS






******************************************************************************************
*   Output of the "rabbitmqctl status" on the first Linux node:
******************************************************************************************
root@deb10_node01:/etc/rabbitmq# /sbin/rabbitmqctl status
Status of node rabbit@deb10_node01 ...
Runtime

OS PID: 31829
OS: Linux
Uptime (seconds): 65859
RabbitMQ version: 3.8.5
Node name: rabbit@deb10_node01
Erlang configuration: Erlang/OTP 23 [erts-11.0.2] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:64]
Erlang processes: 540 used, 1048576 limit
Scheduler run queue: 3
Cluster heartbeat timeout (net_ticktime): 60

Plugins

Enabled plugin file: /etc/rabbitmq/enabled_plugins
Enabled plugins:

 * rabbitmq_top
 * rabbitmq_event_exchange
 * rabbitmq_mqtt
 * rabbitmq_management
 * amqp_client
 * rabbitmq_web_dispatch
 * cowboy
 * cowlib
 * rabbitmq_management_agent

Data directory

Node data directory: /var/lib/rabbitmq/mnesia/rabbit@deb10_node01
Raft data directory: /var/lib/rabbitmq/mnesia/rabbit@deb10_node01/quorum/rabbit@deb10_node01

Config files

 * /etc/rabbitmq/rabbitmq.conf

Log file(s)

 * /var/log/rabbitmq/rabbit@deb10_node01.log
 * /var/log/rabbitmq/rabbit@deb10_node01_upgrade.log

Alarms

(none)

Memory

Calculation strategy: rss
Memory high watermark setting: 0.4 of available memory, computed to: 5.0375 gb
quorum_ets: 2.0149 gb (84.12 %)
other_proc: 0.0804 gb (3.36 %)
other_ets: 0.0773 gb (3.23 %)
reserved_unallocated: 0.0662 gb (2.76 %)
allocated_unused: 0.0486 gb (2.03 %)
code: 0.0325 gb (1.36 %)
plugins: 0.0251 gb (1.05 %)
mnesia: 0.0152 gb (0.63 %)
quorum_queue_procs: 0.0144 gb (0.6 %)
other_system: 0.0142 gb (0.59 %)
binary: 0.0035 gb (0.15 %)
atom: 0.0014 gb (0.06 %)
mgmt_db: 0.0008 gb (0.03 %)
connection_other: 0.0003 gb (0.01 %)
metrics: 0.0002 gb (0.01 %)
connection_channels: 0.0001 gb (0.01 %)
connection_writers: 0.0 gb (0.0 %)
connection_readers: 0.0 gb (0.0 %)
msg_index: 0.0 gb (0.0 %)
queue_procs: 0.0 gb (0.0 %)
queue_slave_procs: 0.0 gb (0.0 %)

File Descriptors

Total: 3, limit: 32671
Sockets: 1, limit: 29401

Free Disk Space

Low free disk space watermark: 0.05 gb
Free disk space: 39.4397 gb

Totals

Connection count: 1
Queue count: 3105
Virtual host count: 1

Listeners

Interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Interface: [::], port: 5671, protocol: amqp/ssl, purpose: AMQP 0-9-1 and AMQP 1.0 over TLS
Interface: [::], port: 1883, protocol: mqtt, purpose: MQTT
Interface: [::], port: 8883, protocol: mqtt/ssl, purpose: MQTT over TLS
Interface: [::], port: 15671, protocol: https, purpose: HTTP API over TLS (HTTPS)






********************************************************************************
* Logs from first Linux node
********************************************************************************
2020-07-09 10:15:56.667 [warning] <0.30340.6> mqtt_node: leader cast - leader not known. Command is dropped.
2020-07-09 10:15:56.709 [info] <0.30340.6> mqtt_node: granting vote for {mqtt_node,'rabbit@Win2012'} with last indexterm {9017585,20143} for term 20144 previous term was 20143
2020-07-09 10:15:56.716 [info] <0.30340.6> mqtt_node: detected a new leader {mqtt_node,'rabbit@Win2012'} in term 20144
2020-07-09 10:15:56.723 [info] <0.30340.6> mqtt_node: follower received leader command from <17633.29821.451>. Rejecting to {mqtt_node,'rabbit@Win2012'}
2020-07-09 10:15:56.724 [info] <0.30340.6> mqtt_node: follower received leader command from <17633.25202.322>. Rejecting to {mqtt_node,'rabbit@Win2012'}
2020-07-09 10:15:56.794 [info] <0.30340.6> mqtt_node: Leader monitor down with {'EXIT',{{case_clause,{ok,<17633.10615.240>}},[{mqtt_machine,apply,...},{ra_server,...},{...}|...]}}, setting election timeout
2020-07-09 10:15:56.803 [info] <0.30340.6> mqtt_node: detected a new leader {mqtt_node,'rabbit@Win2012'} in term 20144
2020-07-09 10:16:17.107 [notice] <0.30340.6> mqtt_node: candidate -> leader in term: 20145
2020-07-09 10:16:17.175 [error] <0.30340.6> handle_leader err {'EXIT',{{case_clause,{ok,<17633.10615.240>}},[{mqtt_machine,apply,3,[{file,"src/mqtt_machine.erl"},{line,45}]},{ra_server,apply_with,2,[{file,"src/ra_server.erl"},{line,2059}]},{ra_server,'-apply_to/5-lists^foldl/2-0-',3,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,apply_to,5,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,handle_leader,2,[{file,"src/ra_server.erl"},{line,327}]},{ra_server_proc,handle_leader,2,[{file,"src/ra_server_proc.erl"},{line,911}]},{ra_server_proc,leader,3,[{file,"src/ra_server_proc.erl"},{line,426}]},{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1166}]}]}}
2020-07-09 10:16:17.175 [info] <0.30340.6> mqtt_node: terminating with {'EXIT',{{case_clause,{ok,<17633.10615.240>}},[{mqtt_machine,apply,3,[{file,[115,114,99,47,109,113,116,116,95,109,97,99,104,105,110,101,46,101,114,108]},{line,45}]},{ra_server,apply_with,2,[{file,[115,114,99,47,114,97,95,115,101,114,118,101,114,46,101,114,108]},{line,2059}]},{ra_server,'-apply_to/5-lists^foldl/2-0-',3,[{file,[115,114,99,47,114,97,95,115,101,114,118,101,114,46,101,114,108]},{line,2017}]},{ra_server,apply_to,5,[{file,[115,114,99,47,114,97,95,115,101,114,118,101,114,46,101,114,108]},{line,2017}]},{ra_server,handle_leader,2,[{file,[115,114,99,47,114,97,95,115,101,114,118,101,114,46,101,114,108]},{line,327}]},{ra_server_proc,handle_leader,2,[{file,[115,114,99,47,114,97,95,115,101,114,118,101,114,95,112,114,111,99,46,101,114,108]},{line,911}]},{ra_server_proc,leader,3,[{file,[115,114,99,47,114,97,95,115,101,114,118,101,114,95,112,114,111,99,46,101,114,108]},{line,426}]},{gen_statem,loop_state_callback,11,[{file,[103,101,110,95,115,116,97,116,101,109,46,101,114,108]},{line,1166}]}]}} in state leader
2020-07-09 10:16:17.182 [error] <0.30340.6> ** State machine mqtt_node terminating
** Last event = {cast,{{mqtt_node,'rabbit@Win2012'},{append_entries_reply,20145,true,9019063,9019015,20145}}}
** When server state  = [{id,{mqtt_node,rabbit@deb10_node01}},{opt,terminate},{raft_state,leader},{leader_last_seen,undefined},{num_pending_commands,0},{num_delayed_commands,0},{election_timeout_set,false},{ra_server_state,#{aux => undefined,cluster => #{{mqtt_node,'rabbit@Win2012'} => #{commit_index_sent => 221295,match_index => 9019014,next_index => 9020451,query_index => 0},{mqtt_node,rabbit@deb10_node01} => #{commit_index_sent => 0,match_index => 0,next_index => 1,query_index => 0}},commit_index => 221295,current_term => 20145,effective_machine_module => ...,...}}]
** Reason for termination = exit:{'EXIT',{{case_clause,{ok,<17633.10615.240>}},[{mqtt_machine,apply,3,[{file,"src/mqtt_machine.erl"},{line,45}]},{ra_server,apply_with,2,[{file,"src/ra_server.erl"},{line,2059}]},{ra_server,'-apply_to/5-lists^foldl/2-0-',3,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,apply_to,5,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,handle_leader,2,[{file,"src/ra_server.erl"},{line,327}]},{ra_server_proc,handle_leader,2,[{file,"src/ra_server_proc.erl"},{line,911}]},{ra_server_proc,leader,3,[{file,"src/ra_server_proc.erl"},{line,426}]},{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1166}]}]}}
** Callback modules = [ra_server_proc]
** Callback mode = [state_functions,state_enter]
** Stacktrace =
**  [{ra_server_proc,handle_leader,2,[{file,"src/ra_server_proc.erl"},{line,918}]},{ra_server_proc,leader,3,[{file,"src/ra_server_proc.erl"},{line,426}]},{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1166}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]
** Time-outs: {1,[{{timeout,tick},tick_timeout}]}
2020-07-09 10:16:17.185 [error] <0.30340.6> CRASH REPORT Process mqtt_node with 0 neighbours exited with reason: {'EXIT',{{case_clause,{ok,<17633.10615.240>}},[{mqtt_machine,apply,3,[{file,"src/mqtt_machine.erl"},{line,45}]},{ra_server,apply_with,2,[{file,"src/ra_server.erl"},{line,2059}]},{ra_server,'-apply_to/5-lists^foldl/2-0-',3,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,apply_to,5,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,handle_leader,2,[{file,"src/ra_server.erl"},{line,327}]},{ra_server_proc,handle_leader,2,[{file,"src/ra_server_proc.erl"},{line,911}]},{ra_server_proc,leader,...},...]}} in ra_server_proc:handle_leader/2 line 918
2020-07-09 10:16:17.185 [error] <0.6716.0> Supervisor {<0.6716.0>,ra_server_sup} had child mqtt_node started with ra_server_proc:start_link(#{cluster_name => mqtt_node,friendly_name => mqtt_node,id => {mqtt_node,rabbit@deb10_node01},initial_members => ...,...}) at <0.30340.6> exit with reason {'EXIT',{{case_clause,{ok,<17633.10615.240>}},[{mqtt_machine,apply,3,[{file,"src/mqtt_machine.erl"},{line,45}]},{ra_server,apply_with,2,[{file,"src/ra_server.erl"},{line,2059}]},{ra_server,'-apply_to/5-lists^foldl/2-0-',3,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,apply_to,5,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,handle_leader,2,[{file,"src/ra_server.erl"},{line,327}]},{ra_server_proc,handle_leader,2,[{file,"src/ra_server_proc.erl"},{line,911}]},{ra_server_proc,leader,...},...]}} in context child_terminated
2020-07-09 10:16:27.154 [warning] <0.30455.6> mqtt_node: leader cast - leader not known. Command is dropped.
2020-07-09 10:16:27.158 [warning] <0.30455.6> mqtt_node: leader cast - leader not known. Command is dropped.





********************************************************************************
* Logs from Windows node
********************************************************************************
2020-07-08 20:11:37.693 [warning] <0.19403.388> mqtt_node: leader cast - leader not known. Command is dropped.
2020-07-08 20:11:37.693 [warning] <0.19403.388> mqtt_node: leader cast - leader not known. Command is dropped.
2020-07-08 20:11:37.693 [warning] <0.19403.388> mqtt_node: leader cast - leader not known. Command is dropped.
2020-07-08 20:11:37.896 [notice] <0.19403.388> mqtt_node: candidate -> leader in term: 15638
2020-07-08 20:11:37.974 [error] <0.19403.388> handle_leader err {'EXIT',{{case_clause,{ok,<0.10615.240>}},[{mqtt_machine,apply,3,[{file,"src/mqtt_machine.erl"},{line,45}]},{ra_server,apply_with,2,[{file,"src/ra_server.erl"},{line,2059}]},{ra_server,'-apply_to/5-lists^foldl/2-0-',3,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,apply_to,5,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,handle_leader,2,[{file,"src/ra_server.erl"},{line,327}]},{ra_server_proc,handle_leader,2,[{file,"src/ra_server_proc.erl"},{line,911}]},{ra_server_proc,leader,3,[{file,"src/ra_server_proc.erl"},{line,426}]},{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1166}]}]}}
2020-07-08 20:11:37.974 [info] <0.19403.388> mqtt_node: terminating with {'EXIT',{{case_clause,{ok,<0.10615.240>}},[{mqtt_machine,apply,3,[{file,[115,114,99,47,109,113,116,116,95,109,97,99,104,105,110,101,46,101,114,108]},{line,45}]},{ra_server,apply_with,2,[{file,[115,114,99,47,114,97,95,115,101,114,118,101,114,46,101,114,108]},{line,2059}]},{ra_server,'-apply_to/5-lists^foldl/2-0-',3,[{file,[115,114,99,47,114,97,95,115,101,114,118,101,114,46,101,114,108]},{line,2017}]},{ra_server,apply_to,5,[{file,[115,114,99,47,114,97,95,115,101,114,118,101,114,46,101,114,108]},{line,2017}]},{ra_server,handle_leader,2,[{file,[115,114,99,47,114,97,95,115,101,114,118,101,114,46,101,114,108]},{line,327}]},{ra_server_proc,handle_leader,2,[{file,[115,114,99,47,114,97,95,115,101,114,118,101,114,95,112,114,111,99,46,101,114,108]},{line,911}]},{ra_server_proc,leader,3,[{file,[115,114,99,47,114,97,95,115,101,114,118,101,114,95,112,114,111,99,46,101,114,108]},{line,426}]},{gen_statem,loop_state_callback,11,[{file,[103,101,110,95,115,116,97,116,101,109,46,101,114,108]},{line,1166}]}]}} in state leader
2020-07-08 20:11:37.990 [info] <0.26078.388> MQTT vhost picked using plugin configuration or default
2020-07-08 20:11:37.990 [error] <0.19403.388> ** State machine mqtt_node terminating
** Last event = {cast,{{mqtt_node,rabbit@deb10_node01},{append_entries_reply,15638,true,2952249,2952150,15638}}}
** When server state  = [{id,{mqtt_node,'rabbit@Win2012'}},{opt,terminate},{raft_state,leader},{leader_last_seen,undefined},{num_pending_commands,0},{num_delayed_commands,0},{election_timeout_set,false},{ra_server_state,#{aux => undefined,cluster => #{{mqtt_node,'rabbit@Win2012'} => #{commit_index_sent => 0,match_index => 0,next_index => 1,query_index => 0},{mqtt_node,rabbit@deb10_node01} => #{commit_index_sent => 221295,match_index => 2952125,next_index => 2953400,query_index => 0}},commit_index => 221295,current_term => 15638,effective_machine_module => ...,...}}]
** Reason for termination = exit:{'EXIT',{{case_clause,{ok,<0.10615.240>}},[{mqtt_machine,apply,3,[{file,"src/mqtt_machine.erl"},{line,45}]},{ra_server,apply_with,2,[{file,"src/ra_server.erl"},{line,2059}]},{ra_server,'-apply_to/5-lists^foldl/2-0-',3,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,apply_to,5,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,handle_leader,2,[{file,"src/ra_server.erl"},{line,327}]},{ra_server_proc,handle_leader,2,[{file,"src/ra_server_proc.erl"},{line,911}]},{ra_server_proc,leader,3,[{file,"src/ra_server_proc.erl"},{line,426}]},{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1166}]}]}}
** Callback modules = [ra_server_proc]
** Callback mode = [state_functions,state_enter]
** Stacktrace =
**  [{ra_server_proc,handle_leader,2,[{file,"src/ra_server_proc.erl"},{line,918}]},{ra_server_proc,leader,3,[{file,"src/ra_server_proc.erl"},{line,426}]},{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1166}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]
** Time-outs: {1,[{{timeout,tick},tick_timeout}]}
2020-07-08 20:11:38.005 [error] <0.19403.388> CRASH REPORT Process mqtt_node with 0 neighbours exited with reason: {'EXIT',{{case_clause,{ok,<0.10615.240>}},[{mqtt_machine,apply,3,[{file,"src/mqtt_machine.erl"},{line,45}]},{ra_server,apply_with,2,[{file,"src/ra_server.erl"},{line,2059}]},{ra_server,'-apply_to/5-lists^foldl/2-0-',3,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,apply_to,5,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,handle_leader,2,[{file,"src/ra_server.erl"},{line,327}]},{ra_server_proc,handle_leader,2,[{file,"src/ra_server_proc.erl"},{line,911}]},{ra_server_proc,leader,3,[...]},...]}} in ra_server_proc:handle_leader/2 line 918
2020-07-08 20:11:38.005 [error] <0.878.0> Supervisor {<0.878.0>,ra_server_sup} had child mqtt_node started with ra_server_proc:start_link(#{await_condition_timeout => 30000,broadcast_time => 100,cluster_name => mqtt_node,friendly_name => ...,...}) at <0.19403.388> exit with reason {'EXIT',{{case_clause,{ok,<0.10615.240>}},[{mqtt_machine,apply,3,[{file,"src/mqtt_machine.erl"},{line,45}]},{ra_server,apply_with,2,[{file,"src/ra_server.erl"},{line,2059}]},{ra_server,'-apply_to/5-lists^foldl/2-0-',3,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,apply_to,5,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,handle_leader,2,[{file,"src/ra_server.erl"},{line,327}]},{ra_server_proc,handle_leader,2,[{file,"src/ra_server_proc.erl"},{line,911}]},{ra_server_proc,leader,3,[...]},...]}} in context child_terminated






********************************************************************************
* Logs from second Linux node
********************************************************************************
2020-07-08 16:49:26.753 [info] <0.3842.0> Server startup complete; 5 plugins started.
 * rabbitmq_event_exchange
 * rabbitmq_mqtt
 * rabbitmq_management
 * rabbitmq_web_dispatch
 * rabbitmq_management_agent
2020-07-08 16:49:34.586 [info] <0.4063.0> mqtt_node: Leader monitor down with {'EXIT',{{case_clause,{ok,<42941.10615.240>}},[{mqtt_machine,apply,...},{ra_server,...},{...}|...]}}, setting election timeout
2020-07-08 16:49:34.586 [info] <0.4063.0> mqtt_node: detected a new leader {mqtt_node,rabbit@deb10_node01} in term 14134
2020-07-08 16:49:34.589 [info] <0.4063.0> mqtt_node: granting vote for {mqtt_node,'rabbit@Win2012'} with last indexterm {1084039,14134} for term 14135 previous term was 14134
2020-07-08 16:49:34.591 [info] <0.4063.0> mqtt_node: follower did not have entry at 1084039 in 14134. Requesting {mqtt_node,'rabbit@Win2012'} from 221980
2020-07-08 16:49:34.591 [info] <0.4063.0> mqtt_node: detected a new leader {mqtt_node,'rabbit@Win2012'} in term 14135


Michael Klishin

Jul 9, 2020, 8:27:02 AM
to rabbitmq-users
Thanks for providing the logs. They suggest that the Raft state machine that implements client ID tracking fails with an exception,
so my guess is that this results in log entries never being truncated because, well, the failing log entry is never successfully applied.

If this affects just one node, you can try disabling and re-enabling the plugin on the problematic node. However, it would be great to dump
the table in question to see what exactly is in the log. We will provide a `rabbitmqctl eval` snippet to run in a bit.
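Something along these lines on the affected node should do for the disable/re-enable step (note that MQTT connections on that node will be dropped while the plugin is disabled):

*************************************
rabbitmq-plugins disable rabbitmq_mqtt
rabbitmq-plugins enable rabbitmq_mqtt
*************************************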

While very unlikely to be relevant here, note that we do not support such mixed-OS clusters.

Ivan A Eryshov

Jul 9, 2020, 9:11:52 AM
to rabbitmq-users
Hello Michael,

Thanks for the promising reply. 

Unfortunately, we had to stop the Linux node completely for now; otherwise we had a chance of running into memory issues on the Windows machine as well. So if you provide a script, I will be able to run it today on Windows Server only. Over the weekend I can simulate the same cluster configuration again and run the provided script on both Windows and Linux.

For now I'm going to set up another three-node cluster that includes the "problematic" Linux node plus two other Linux nodes.
Another test I plan is replacing the "problematic" Linux node with a different one in my mixed-OS cluster.

I'll let you know how it goes.


Michael Klishin

Jul 9, 2020, 9:35:50 AM
to rabbitmq-users
We believe we know which cases the plugin does not take into account, but we are not sure how to reproduce them in practice.
There is no need to dump the tracker memtables.

Once we have a fix we can upload a new build here. There is no workaround besides upgrading to the future 3.8.6 version or replacing
the plugin with a temporary build on all nodes.

The issue almost certainly has nothing to do with the OS (at least it's not obvious to us how the registration event sequence would differ from Windows to Linux).

Ivan A Eryshov

Jul 10, 2020, 6:37:58 AM
to rabbitmq-users
I was playing with clusters and MQTT queues today and ran into another issue. I'm not sure how it is related to what we discussed before. It is probably the same problem, but it might also be a different one. So I'll just describe my steps here.

(1) Created 3 new VMs, all Debian buster, and installed RabbitMQ 3.8.5 on Erlang 23.0.2.
(2) Enabled the following plugins on all nodes: rabbitmq_event_exchange, rabbitmq_management, rabbitmq_mqtt.
(3) Clustered all 3 - no problems here.
(4) Connected 2-3 MQTT devices and did some simple messaging via the web management plugin.
(5) Stopped one of the nodes (/sbin/rabbitmqctl stop_app).
(6) Tried to start it again (command sequence below). At this point it fails.
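For clarity, steps (5) and (6) were nothing more than the standard app stop/start on that node:

*************************************
/sbin/rabbitmqctl stop_app
/sbin/rabbitmqctl start_app    # this is the call that fails
*************************************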
First I got the message:

************************************* 
Starting node rabbit@test20071001 ...
Error: {:rabbitmq_mqtt, {:bad_return, {{:rabbit_mqtt, :start, [:normal, []]}, {:EXIT, {:shutdown, {:gen_statem, :call, [{:mqtt_node, :rabbit@test20071001}, {:leader_call, {:state_query, :members}}, 100000]}}}}}}
************************************* 
 
Nothing helpful in the logs; the last records relate to the app stop:

*************************************
2020-07-10 11:34:48.185 [info] <0.24529.3> Closing all connections in vhost '/' on node 'rabbit@test20071001' because the vhost is stopping
2020-07-10 11:34:48.185 [info] <0.24545.3> Stopping message store for directory '/var/lib/rabbitmq/mnesia/rabbit@test20071001/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent'
2020-07-10 11:34:48.188 [info] <0.24545.3> Message store for directory '/var/lib/rabbitmq/mnesia/rabbit@test20071001/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent' is stopped
2020-07-10 11:34:48.188 [info] <0.24541.3> Stopping message store for directory '/var/lib/rabbitmq/mnesia/rabbit@test20071001/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_transient'
2020-07-10 11:34:48.191 [info] <0.24541.3> Message store for directory '/var/lib/rabbitmq/mnesia/rabbit@test20071001/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_transient' is stopped
2020-07-10 11:34:48.196 [info] <0.44.0> Application rabbit exited with reason: stopped
*************************************

There is some info in /var/log/rabbitmq/log/crash.log, which is pretty much the same as the console output:

*************************************
2020-07-10 11:34:47 =CRASH REPORT====
  crasher:
    initial call: application_master:init/4
    pid: <0.24890.3>
    registered_name: []
    exception exit: {{bad_return,{{rabbit_mqtt,start,[normal,[]]},{'EXIT',{shutdown,{gen_statem,call,[{mqtt_node,rabbit@test20071001},{leader_call,{state_query,members}},100000]}}}}},[{application_master,init,4,[{file,"application_master.erl"},{line,138}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}
    ancestors: [<0.24889.3>]
    message_queue_len: 1
    messages: [{'EXIT',<0.24891.3>,normal}]
    links: [<0.24889.3>,<0.44.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 376
    stack_size: 28
    reductions: 212
  neighbours:
*************************************

An interesting thing happens if I try start_app once again:

*************************************
2020-07-10 12:10:09 =CRASH REPORT====
  crasher:
    initial call: application_master:init/4
    pid: <0.28007.3>
    registered_name: []
    exception exit: {{{assert,[{module,rabbit_prelaunch},{line,119},{expression,"lists : keymember ( mnesia , 1 , application : which_applications ( ))"},{expected,false},{value,true}]},{rabbit,start,[normal,[]]}},[{application_master,init,4,[{file,"application_master.erl"},{line,138}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}
    ancestors: [<0.28006.3>]
    message_queue_len: 1
    messages: [{'EXIT',<0.28008.3>,normal}]
    links: [<0.28006.3>,<0.44.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 376
    stack_size: 28
    reductions: 266
  neighbours:
*************************************

It looks like RabbitMQ can't start again until I disable the MQTT plugin. Without the plugin it works well, but if the plugin is re-enabled, the node crashes on start again with {:rabbitmq_mqtt, {:bad_return, ...}.
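Roughly, the sequence that gets the node back up is the one below (sketched from memory, the exact invocations may differ slightly):

*************************************
rabbitmq-plugins disable rabbitmq_mqtt   # after this, start_app succeeds
/sbin/rabbitmqctl start_app
rabbitmq-plugins enable rabbitmq_mqtt    # brings the {:bad_return, ...} crash back
*************************************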




Michael Klishin

Jul 10, 2020, 9:37:34 AM
to rabbitmq-users
It's not the same problem. The client ID tracker's Raft cluster seemingly has no leader. We'll see if we can reproduce it, but it would help a lot if you shared the logs from all nodes.
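If you want to check whether that Ra cluster currently has a leader, something like the following should report the members and the leader (assuming ra:members/1 is available in the Ra version bundled with 3.8.5; mqtt_node is the cluster name the plugin uses):

*************************************
rabbitmqctl eval 'ra:members({mqtt_node, node()}).'
*************************************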

Michael Klishin

Jul 10, 2020, 9:43:21 AM
to rabbitmq-users
Below is a build of the plugin that contains [1]. You can *replace* the rabbitmq_mqtt .ez file that ships with RabbitMQ
under the plugins directory [2] with it and see whether you can still reproduce the original problem of memtable (log) memory growth.
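On a Linux node the swap would look roughly like this (a sketch only: the plugins directory and the exact name of the original archive depend on the installation, see [2]; adjust accordingly on Windows):

*************************************
# find the "Plugin archives directory" first
rabbitmq-plugins directories --offline

# assuming $PLUGINS_DIR is the directory reported above; stop the node before swapping
systemctl stop rabbitmq-server
mv "$PLUGINS_DIR"/rabbitmq_mqtt-3.8.5.ez /tmp/        # original archive, name may differ
cp rabbitmq_mqtt-3.8.5+4.g8557c0e.ez "$PLUGINS_DIR"/
systemctl start rabbitmq-server
*************************************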

I'm also attaching a .zip archive in case Google Groups rejects .ez file uploads.

We tentatively plan on shipping 3.8.6 sometime next week. Any feedback on [1] would be great.

2. See "Plugin archives directory" in `rabbitmq-plugins directories --offline` output


Attachments:
 * rabbitmq_mqtt-3.8.5+4.g8557c0e.ez
 * rabbitmq_mqtt-3.8.5+4.g8557c0e.ez.zip

Ivan A Eryshov

Jul 12, 2020, 6:09:50 PM
to rabbitmq-users
I replaced the MQTT plugin. The exact same cluster does not produce any errors anymore. It seems the problem is gone.

I will let it run overnight, and if something happens, I will let you know.

Michael Klishin

Jul 14, 2020, 6:16:45 PM
to rabbitmq-users
Thank you for the confirmation. So does that mean the behavior in the single-node restart case is also what you'd expect?

Ivan A Eryshov

Jul 16, 2020, 4:41:55 PM
to rabbitmq-users
Yes, I can now stop and start nodes without any issues.

I still actively monitor the cluster's health and see some anomalies, but I'm not ready to provide details yet. I'll start another thread as soon as I have some.
