We are using RabbitMQ as MQTT Broker for our IoT devices. Right now about 3200 devices. They are silent most of the time. In the mean we have about 30-40 messages in a second, in peaks it might doubles, but nothing big. Queues are always empty, consumers pick up messages on their arrival.
We were using RabbitMQ on Windows (because of historical reasons). It was a single node with regular definition backups. Now we want to have a Linux cluster. For migration period we would like to keep Windows node running in the cluster as well.
Yesterday we configured the cluster (by simply using rabbitmqctl join_cluster), and since then we see that memory on Windows node and on one of Linux nodes constantly grows. The third node works stable and without noticeable issues.
According to the RabbitMQ status (see below) on both nodes a log of memory allocated for quorum_ets (according to docs, it is raft consensus, which should not take that much).
Both problematic nodes have a lot of errors in their logs (also listed below). Unfortunately, they do not give me any hints about the reason. I see that somehow leader is lost again and again, and new election leads to mqtt_node termintation, but no idea why. Strangely it never happens with third node, where we see just a short notice about an election right after the node starts.
If somebody can give me a hint, I will be appreciate.
******************************************************************************************
******************************************************************************************
Status of node rabbit@Win2012 ...
Runtime
OS PID: 11940
OS: Windows
Uptime (seconds): 298556
RabbitMQ version: 3.8.5
Node name: rabbit@Win2012
Erlang configuration: Erlang/OTP 23 [erts-11.0.2] [source] [64-bit] [smp:6:6] [ds:6:6:10] [async-threads:64]
Erlang processes: 77123 used, 1048576 limit
Scheduler run queue: 0
Cluster heartbeat timeout (net_ticktime): 60
Plugins
Enabled plugin file: c:/Users/root/AppData/Roaming/RabbitMQ/enabled_plugins
Enabled plugins:
* rabbitmq_event_exchange
* rabbitmq_mqtt
* rabbitmq_management
* amqp_client
* rabbitmq_web_dispatch
* cowboy
* cowlib
* rabbitmq_management_agent
Data directory
Node data directory: c:/Users/root/AppData/Roaming/RabbitMQ/db/rabbit@Win2012-mnesia
Raft data directory: c:/Users/root/AppData/Roaming/RabbitMQ/db/rabbit@Win2012-mnesia/quorum/rabbit@Win2012
Config files
* c:/Users/root/AppData/Roaming/RabbitMQ/advanced.config
* c:/Users/root/AppData/Roaming/RabbitMQ/rabbitmq.conf
Log file(s)
* c:/Users/root/AppData/Roaming/RabbitMQ/log/rab...@Win2012.log
* c:/Users/root/AppData/Roaming/RabbitMQ/log/rabbit@Win2012_upgrade.log
Alarms
(none)
Memory
Calculation strategy: rss
Memory high watermark setting: 0.4 of available memory, computed to: 5.3853 gb
quorum_ets: 1.8512 gb (38.01 %)
allocated_unused: 0.8833 gb (18.14 %)
connection_other: 0.6941 gb (14.25 %)
other_proc: 0.653 gb (13.41 %)
queue_procs: 0.224 gb (4.6 %)
mgmt_db: 0.2061 gb (4.23 %)
other_ets: 0.1117 gb (2.29 %)
plugins: 0.0858 gb (1.76 %)
other_system: 0.0346 gb (0.71 %)
code: 0.0326 gb (0.67 %)
binary: 0.0298 gb (0.61 %)
metrics: 0.0257 gb (0.53 %)
quorum_queue_procs: 0.0165 gb (0.34 %)
mnesia: 0.0161 gb (0.33 %)
connection_channels: 0.0036 gb (0.07 %)
atom: 0.0014 gb (0.03 %)
msg_index: 0.0004 gb (0.01 %)
connection_writers: 0.0002 gb (0.0 %)
connection_readers: 0.0002 gb (0.0 %)
queue_slave_procs: 0.0 gb (0.0 %)
reserved_unallocated: 0.0 gb (0.0 %)
File Descriptors
Total: 3098, limit: 65439
Sockets: 3092, limit: 58893
Free Disk Space
Low free disk space watermark: 0.05 gb
Free disk space: 117.5893 gb
Totals
Connection count: 6
Queue count: 3100
Virtual host count: 1
Listeners
Interface: [::], port: 55956, protocol: clustering, purpose: inter-node and CLI tool communication
Interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Interface: 0.0.0.0, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Interface: [::], port: 5671, protocol: amqp/ssl, purpose: AMQP 0-9-1 and AMQP 1.0 over TLS
Interface: 0.0.0.0, port: 5671, protocol: amqp/ssl, purpose: AMQP 0-9-1 and AMQP 1.0 over TLS
Interface: [::], port: 15672, protocol: http, purpose: HTTP API
Interface: 0.0.0.0, port: 15672, protocol: http, purpose: HTTP API
Interface: [::], port: 1883, protocol: mqtt, purpose: MQTT
Interface: 0.0.0.0, port: 1883, protocol: mqtt, purpose: MQTT
Interface: [::], port: 8883, protocol: mqtt/ssl, purpose: MQTT over TLS
Interface: 0.0.0.0, port: 8883, protocol: mqtt/ssl, purpose: MQTT over TLS
2020-07-09 10:15:56.667 [warning] <0.30340.6> mqtt_node: leader cast - leader not known. Command is dropped.
2020-07-09 10:15:56.709 [info] <0.30340.6> mqtt_node: granting vote for {mqtt_node,'rabbit@Win2012'} with last indexterm {9017585,20143} for term 20144 previous term was 20143
2020-07-09 10:15:56.716 [info] <0.30340.6> mqtt_node: detected a new leader {mqtt_node,'rabbit@Win2012'} in term 20144
2020-07-09 10:15:56.723 [info] <0.30340.6> mqtt_node: follower received leader command from <17633.29821.451>. Rejecting to {mqtt_node,'rabbit@Win2012'}
2020-07-09 10:15:56.724 [info] <0.30340.6> mqtt_node: follower received leader command from <17633.25202.322>. Rejecting to {mqtt_node,'rabbit@Win2012'}
2020-07-09 10:15:56.794 [info] <0.30340.6> mqtt_node: Leader monitor down with {'EXIT',{{case_clause,{ok,<17633.10615.240>}},[{mqtt_machine,apply,...},{ra_server,...},{...}|...]}}, setting election timeout
2020-07-09 10:15:56.803 [info] <0.30340.6> mqtt_node: detected a new leader {mqtt_node,'rabbit@Win2012'} in term 20144
2020-07-09 10:16:17.107 [notice] <0.30340.6> mqtt_node: candidate -> leader in term: 20145
2020-07-09 10:16:17.175 [error] <0.30340.6> handle_leader err {'EXIT',{{case_clause,{ok,<17633.10615.240>}},[{mqtt_machine,apply,3,[{file,"src/mqtt_machine.erl"},{line,45}]},{ra_server,apply_with,2,[{file,"src/ra_server.erl"},{line,2059}]},{ra_server,'-apply_to/5-lists^foldl/2-0-',3,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,apply_to,5,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,handle_leader,2,[{file,"src/ra_server.erl"},{line,327}]},{ra_server_proc,handle_leader,2,[{file,"src/ra_server_proc.erl"},{line,911}]},{ra_server_proc,leader,3,[{file,"src/ra_server_proc.erl"},{line,426}]},{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1166}]}]}}
2020-07-09 10:16:17.175 [info] <0.30340.6> mqtt_node: terminating with {'EXIT',{{case_clause,{ok,<17633.10615.240>}},[{mqtt_machine,apply,3,[{file,[115,114,99,47,109,113,116,116,95,109,97,99,104,105,110,101,46,101,114,108]},{line,45}]},{ra_server,apply_with,2,[{file,[115,114,99,47,114,97,95,115,101,114,118,101,114,46,101,114,108]},{line,2059}]},{ra_server,'-apply_to/5-lists^foldl/2-0-',3,[{file,[115,114,99,47,114,97,95,115,101,114,118,101,114,46,101,114,108]},{line,2017}]},{ra_server,apply_to,5,[{file,[115,114,99,47,114,97,95,115,101,114,118,101,114,46,101,114,108]},{line,2017}]},{ra_server,handle_leader,2,[{file,[115,114,99,47,114,97,95,115,101,114,118,101,114,46,101,114,108]},{line,327}]},{ra_server_proc,handle_leader,2,[{file,[115,114,99,47,114,97,95,115,101,114,118,101,114,95,112,114,111,99,46,101,114,108]},{line,911}]},{ra_server_proc,leader,3,[{file,[115,114,99,47,114,97,95,115,101,114,118,101,114,95,112,114,111,99,46,101,114,108]},{line,426}]},{gen_statem,loop_state_callback,11,[{file,[103,101,110,95,115,116,97,116,101,109,46,101,114,108]},{line,1166}]}]}} in state leader
2020-07-09 10:16:17.182 [error] <0.30340.6> ** State machine mqtt_node terminating
** Last event = {cast,{{mqtt_node,'rabbit@Win2012'},{append_entries_reply,20145,true,9019063,9019015,20145}}}
** When server state = [{id,{mqtt_node,rabbit@deb10_node01}},{opt,terminate},{raft_state,leader},{leader_last_seen,undefined},{num_pending_commands,0},{num_delayed_commands,0},{election_timeout_set,false},{ra_server_state,#{aux => undefined,cluster => #{{mqtt_node,'rabbit@Win2012'} => #{commit_index_sent => 221295,match_index => 9019014,next_index => 9020451,query_index => 0},{mqtt_node,rabbit@deb10_node01} => #{commit_index_sent => 0,match_index => 0,next_index => 1,query_index => 0}},commit_index => 221295,current_term => 20145,effective_machine_module => ...,...}}]
** Reason for termination = exit:{'EXIT',{{case_clause,{ok,<17633.10615.240>}},[{mqtt_machine,apply,3,[{file,"src/mqtt_machine.erl"},{line,45}]},{ra_server,apply_with,2,[{file,"src/ra_server.erl"},{line,2059}]},{ra_server,'-apply_to/5-lists^foldl/2-0-',3,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,apply_to,5,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,handle_leader,2,[{file,"src/ra_server.erl"},{line,327}]},{ra_server_proc,handle_leader,2,[{file,"src/ra_server_proc.erl"},{line,911}]},{ra_server_proc,leader,3,[{file,"src/ra_server_proc.erl"},{line,426}]},{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1166}]}]}}
** Callback modules = [ra_server_proc]
** Callback mode = [state_functions,state_enter]
** Stacktrace =
** [{ra_server_proc,handle_leader,2,[{file,"src/ra_server_proc.erl"},{line,918}]},{ra_server_proc,leader,3,[{file,"src/ra_server_proc.erl"},{line,426}]},{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1166}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]
** Time-outs: {1,[{{timeout,tick},tick_timeout}]}
2020-07-09 10:16:17.185 [error] <0.30340.6> CRASH REPORT Process mqtt_node with 0 neighbours exited with reason: {'EXIT',{{case_clause,{ok,<17633.10615.240>}},[{mqtt_machine,apply,3,[{file,"src/mqtt_machine.erl"},{line,45}]},{ra_server,apply_with,2,[{file,"src/ra_server.erl"},{line,2059}]},{ra_server,'-apply_to/5-lists^foldl/2-0-',3,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,apply_to,5,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,handle_leader,2,[{file,"src/ra_server.erl"},{line,327}]},{ra_server_proc,handle_leader,2,[{file,"src/ra_server_proc.erl"},{line,911}]},{ra_server_proc,leader,...},...]}} in ra_server_proc:handle_leader/2 line 918
2020-07-09 10:16:17.185 [error] <0.6716.0> Supervisor {<0.6716.0>,ra_server_sup} had child mqtt_node started with ra_server_proc:start_link(#{cluster_name => mqtt_node,friendly_name => mqtt_node,id => {mqtt_node,rabbit@deb10_node01},initial_members => ...,...}) at <0.30340.6> exit with reason {'EXIT',{{case_clause,{ok,<17633.10615.240>}},[{mqtt_machine,apply,3,[{file,"src/mqtt_machine.erl"},{line,45}]},{ra_server,apply_with,2,[{file,"src/ra_server.erl"},{line,2059}]},{ra_server,'-apply_to/5-lists^foldl/2-0-',3,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,apply_to,5,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,handle_leader,2,[{file,"src/ra_server.erl"},{line,327}]},{ra_server_proc,handle_leader,2,[{file,"src/ra_server_proc.erl"},{line,911}]},{ra_server_proc,leader,...},...]}} in context child_terminated
2020-07-09 10:16:27.154 [warning] <0.30455.6> mqtt_node: leader cast - leader not known. Command is dropped.
2020-07-09 10:16:27.158 [warning] <0.30455.6> mqtt_node: leader cast - leader not known. Command is dropped.
* Logs from Windows node
********************************************************************************
2020-07-08 20:11:37.693 [warning] <0.19403.388> mqtt_node: leader cast - leader not known. Command is dropped.
2020-07-08 20:11:37.693 [warning] <0.19403.388> mqtt_node: leader cast - leader not known. Command is dropped.
2020-07-08 20:11:37.693 [warning] <0.19403.388> mqtt_node: leader cast - leader not known. Command is dropped.
2020-07-08 20:11:37.896 [notice] <0.19403.388> mqtt_node: candidate -> leader in term: 15638
2020-07-08 20:11:37.974 [error] <0.19403.388> handle_leader err {'EXIT',{{case_clause,{ok,<0.10615.240>}},[{mqtt_machine,apply,3,[{file,"src/mqtt_machine.erl"},{line,45}]},{ra_server,apply_with,2,[{file,"src/ra_server.erl"},{line,2059}]},{ra_server,'-apply_to/5-lists^foldl/2-0-',3,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,apply_to,5,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,handle_leader,2,[{file,"src/ra_server.erl"},{line,327}]},{ra_server_proc,handle_leader,2,[{file,"src/ra_server_proc.erl"},{line,911}]},{ra_server_proc,leader,3,[{file,"src/ra_server_proc.erl"},{line,426}]},{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1166}]}]}}
2020-07-08 20:11:37.974 [info] <0.19403.388> mqtt_node: terminating with {'EXIT',{{case_clause,{ok,<0.10615.240>}},[{mqtt_machine,apply,3,[{file,[115,114,99,47,109,113,116,116,95,109,97,99,104,105,110,101,46,101,114,108]},{line,45}]},{ra_server,apply_with,2,[{file,[115,114,99,47,114,97,95,115,101,114,118,101,114,46,101,114,108]},{line,2059}]},{ra_server,'-apply_to/5-lists^foldl/2-0-',3,[{file,[115,114,99,47,114,97,95,115,101,114,118,101,114,46,101,114,108]},{line,2017}]},{ra_server,apply_to,5,[{file,[115,114,99,47,114,97,95,115,101,114,118,101,114,46,101,114,108]},{line,2017}]},{ra_server,handle_leader,2,[{file,[115,114,99,47,114,97,95,115,101,114,118,101,114,46,101,114,108]},{line,327}]},{ra_server_proc,handle_leader,2,[{file,[115,114,99,47,114,97,95,115,101,114,118,101,114,95,112,114,111,99,46,101,114,108]},{line,911}]},{ra_server_proc,leader,3,[{file,[115,114,99,47,114,97,95,115,101,114,118,101,114,95,112,114,111,99,46,101,114,108]},{line,426}]},{gen_statem,loop_state_callback,11,[{file,[103,101,110,95,115,116,97,116,101,109,46,101,114,108]},{line,1166}]}]}} in state leader
2020-07-08 20:11:37.990 [info] <0.26078.388> MQTT vhost picked using plugin configuration or default
2020-07-08 20:11:37.990 [error] <0.19403.388> ** State machine mqtt_node terminating
** Last event = {cast,{{mqtt_node,rabbit@deb10_node01},{append_entries_reply,15638,true,2952249,2952150,15638}}}
** When server state = [{id,{mqtt_node,'rabbit@Win2012'}},{opt,terminate},{raft_state,leader},{leader_last_seen,undefined},{num_pending_commands,0},{num_delayed_commands,0},{election_timeout_set,false},{ra_server_state,#{aux => undefined,cluster => #{{mqtt_node,'rabbit@Win2012'} => #{commit_index_sent => 0,match_index => 0,next_index => 1,query_index => 0},{mqtt_node,rabbit@deb10_node01} => #{commit_index_sent => 221295,match_index => 2952125,next_index => 2953400,query_index => 0}},commit_index => 221295,current_term => 15638,effective_machine_module => ...,...}}]
** Reason for termination = exit:{'EXIT',{{case_clause,{ok,<0.10615.240>}},[{mqtt_machine,apply,3,[{file,"src/mqtt_machine.erl"},{line,45}]},{ra_server,apply_with,2,[{file,"src/ra_server.erl"},{line,2059}]},{ra_server,'-apply_to/5-lists^foldl/2-0-',3,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,apply_to,5,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,handle_leader,2,[{file,"src/ra_server.erl"},{line,327}]},{ra_server_proc,handle_leader,2,[{file,"src/ra_server_proc.erl"},{line,911}]},{ra_server_proc,leader,3,[{file,"src/ra_server_proc.erl"},{line,426}]},{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1166}]}]}}
** Callback modules = [ra_server_proc]
** Callback mode = [state_functions,state_enter]
** Stacktrace =
** [{ra_server_proc,handle_leader,2,[{file,"src/ra_server_proc.erl"},{line,918}]},{ra_server_proc,leader,3,[{file,"src/ra_server_proc.erl"},{line,426}]},{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1166}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]
** Time-outs: {1,[{{timeout,tick},tick_timeout}]}
2020-07-08 20:11:38.005 [error] <0.19403.388> CRASH REPORT Process mqtt_node with 0 neighbours exited with reason: {'EXIT',{{case_clause,{ok,<0.10615.240>}},[{mqtt_machine,apply,3,[{file,"src/mqtt_machine.erl"},{line,45}]},{ra_server,apply_with,2,[{file,"src/ra_server.erl"},{line,2059}]},{ra_server,'-apply_to/5-lists^foldl/2-0-',3,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,apply_to,5,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,handle_leader,2,[{file,"src/ra_server.erl"},{line,327}]},{ra_server_proc,handle_leader,2,[{file,"src/ra_server_proc.erl"},{line,911}]},{ra_server_proc,leader,3,[...]},...]}} in ra_server_proc:handle_leader/2 line 918
2020-07-08 20:11:38.005 [error] <0.878.0> Supervisor {<0.878.0>,ra_server_sup} had child mqtt_node started with ra_server_proc:start_link(#{await_condition_timeout => 30000,broadcast_time => 100,cluster_name => mqtt_node,friendly_name => ...,...}) at <0.19403.388> exit with reason {'EXIT',{{case_clause,{ok,<0.10615.240>}},[{mqtt_machine,apply,3,[{file,"src/mqtt_machine.erl"},{line,45}]},{ra_server,apply_with,2,[{file,"src/ra_server.erl"},{line,2059}]},{ra_server,'-apply_to/5-lists^foldl/2-0-',3,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,apply_to,5,[{file,"src/ra_server.erl"},{line,2017}]},{ra_server,handle_leader,2,[{file,"src/ra_server.erl"},{line,327}]},{ra_server_proc,handle_leader,2,[{file,"src/ra_server_proc.erl"},{line,911}]},{ra_server_proc,leader,3,[...]},...]}} in context child_terminated
********************************************************************************
* Logs from second Linux node
********************************************************************************
2020-07-08 16:49:26.753 [info] <0.3842.0> Server startup complete; 5 plugins started.
* rabbitmq_event_exchange
* rabbitmq_mqtt
* rabbitmq_management
* rabbitmq_web_dispatch
* rabbitmq_management_agent
2020-07-08 16:49:34.586 [info] <0.4063.0> mqtt_node: Leader monitor down with {'EXIT',{{case_clause,{ok,<42941.10615.240>}},[{mqtt_machine,apply,...},{ra_server,...},{...}|...]}}, setting election timeout
2020-07-08 16:49:34.586 [info] <0.4063.0> mqtt_node: detected a new leader {mqtt_node,rabbit@deb10_node01} in term 14134
2020-07-08 16:49:34.589 [info] <0.4063.0> mqtt_node: granting vote for {mqtt_node,'rabbit@Win2012'} with last indexterm {1084039,14134} for term 14135 previous term was 14134
2020-07-08 16:49:34.591 [info] <0.4063.0> mqtt_node: follower did not have entry at 1084039 in 14134. Requesting {mqtt_node,'rabbit@Win2012'} from 221980
2020-07-08 16:49:34.591 [info] <0.4063.0> mqtt_node: detected a new leader {mqtt_node,'rabbit@Win2012'} in term 14135