MK, thank you for taking a look at this issue.
I set up an entirely new cluster of 3 data nodes with the latest RabbitMQ (3.7.8) and Erlang (Erlang/OTP 21) versions.
I had initially read the documentation on tuning for a high number of concurrent connections and updated my configuration as suggested there. Specifically, I use the following settings on each node:
fs.file-max = 200000
net.ipv4.tcp_fin_timeout = 10
net.core.somaxconn = 8192
net.ipv4.tcp_max_syn_backlog = 50000
net.ipv4.tcp_keepalive_time=30
net.ipv4.tcp_keepalive_intvl=10
net.ipv4.tcp_keepalive_probes=4
SERVER_ADDITIONAL_ERL_ARGS="+A 128 +Q 200000"
{tcp_listen_options,
[{backlog, 50000},
{nodelay, true},
{linger, {true,0}},
{sndbuf, 16384},
{recbuf, 16384},
{exit_on_close, false}
]
}
The open file descriptor limit is set to 200000.
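For completeness, the settings above live in three different places; the paths below assume a standard Linux package install (adjust as needed for your layout):

```shell
# /etc/sysctl.d/99-rabbitmq.conf -- kernel/TCP settings (fs.file-max,
#   net.ipv4.tcp_*, net.core.somaxconn); reload with: sudo sysctl --system
#
# /etc/rabbitmq/rabbitmq-env.conf -- extra Erlang VM flags:
#   SERVER_ADDITIONAL_ERL_ARGS="+A 128 +Q 200000"
#
# /etc/rabbitmq/rabbitmq.config -- tcp_listen_options, inside the
#   {rabbit, [...]} section of the classic config file format.
```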
I added a TCP load balancer (in GCP) in front of these 3 nodes. I created a policy that matches all MQTT subscription queues and sets a mirror count of 2 plus a message TTL and queue expiry of 1 minute: '{"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"manual","message-ttl":60000,"expires":60000}'
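For reference, a policy like that can be applied with rabbitmqctl along these lines; the policy name and queue-name pattern below are placeholders rather than the exact ones I used (MQTT subscription queues are typically named with an "mqtt-subscription-" prefix):

```shell
# Placeholder policy definition matching the one described above.
POLICY='{"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"manual","message-ttl":60000,"expires":60000}'

# Sanity-check that the definition is valid JSON before applying it:
echo "$POLICY" | python3 -m json.tool > /dev/null && echo "policy JSON ok"

# Apply it to all matching queues (requires a running node):
# rabbitmqctl set_policy --apply-to queues mqtt-ha '^mqtt-subscription-' "$POLICY"
```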
When I replaced my old single server with this 3-node cluster, the setup ran fine for a few minutes (maybe 5), and then 2 nodes went down. I restarted those two nodes, and since then the cluster has been running fine for about an hour. The load has certainly dropped compared to an hour ago, so I am afraid the crash might resurface on Monday morning when the load increases.
I looked through the log files for hints about why the 2 nodes went down, and I found a bunch of errors in the crash log:
2018-09-28 13:38:44 =ERROR REPORT====
** Generic server <0.4556.7> terminating
** Last message in was {inet_async,#Port<0.16028>,0,{ok,<<16,29,0,6,77,81,73,115,100,112,3,2,0,30,0,15,108,56,48,53,48,53,53,56,56,51,49,52,52,53,51>>}}
** When Server state == {state,#Port<0.16028>,"1-39-158-170.live.vodafone.in:49505 -> 247.242.200.35.bc.googleusercontent.com:1883",true,undefined,false,running,{none,none},<0.4555.7>,false,none,{proc_state,#Port<0.16028>,#{},{undefined,undefined},{0,nil},{0,nil},undefined,1,undefined,undefined,undefined,{undefined,undefined},undefined,<<"amq.topic">>,{amqp_adapter_info,unknown,unknown,unknown,unknown,<<"(unknown)">>,{'MQTT',"N/A"},[{channels,1},{channel_max,1},{frame_max,0},{client_properties,[{<<"product">>,longstr,<<"MQTT client">>}]},{ssl,false}]},none,undefined,undefined,#Fun<rabbit_mqtt_processor.0.11905929>},undefined,{state,fine,5000,undefined}}
** Reason for termination ==
** {{badmatch,{error,enotconn}},[{rabbit_mqtt_processor,process_login,4,[{file,"src/rabbit_mqtt_processor.erl"},{line,482}]},{rabbit_mqtt_processor,process_request,3,[{file,"src/rabbit_mqtt_processor.erl"},{line,112}]},{rabbit_mqtt_processor,process_frame,2,[{file,"src/rabbit_mqtt_processor.erl"},{line,69}]},{rabbit_mqtt_reader,process_received_bytes,2,[{file,"src/rabbit_mqtt_reader.erl"},{line,266}]},{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1050}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
2018-09-28 13:38:44 =CRASH REPORT====
crasher:
initial call: rabbit_mqtt_reader:init/1
pid: <0.4556.7>
registered_name: []
exception exit: {{{badmatch,{error,enotconn}},[{rabbit_mqtt_processor,process_login,4,[{file,"src/rabbit_mqtt_processor.erl"},{line,482}]},{rabbit_mqtt_processor,process_request,3,[{file,"src/rabbit_mqtt_processor.erl"},{line,112}]},{rabbit_mqtt_processor,process_frame,2,[{file,"src/rabbit_mqtt_processor.erl"},{line,69}]},{rabbit_mqtt_reader,process_received_bytes,2,[{file,"src/rabbit_mqtt_reader.erl"},{line,266}]},{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1050}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]},[{gen_server2,terminate,3,[{file,"src/gen_server2.erl"},{line,1166}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
ancestors: [<0.4554.7>,<0.674.0>,<0.673.0>,<0.672.0>,rabbit_mqtt_sup,<0.665.0>]
message_queue_len: 1
messages: [{inet_async,#Port<0.16028>,1,{error,closed}}]
links: [<0.4554.7>,#Port<0.16028>]
dictionary: [{rand_seed,{#{jump => #Fun<rand.16.10897371>,max => 288230376151711743,next => #Fun<rand.15.10897371>,type => exsplus},[68840492309434389|3013970438321444]}}]
trap_exit: true
status: running
heap_size: 1598
stack_size: 27
reductions: 2562
neighbours:
2018-09-28 13:38:44 =SUPERVISOR REPORT====
Supervisor: {<0.4554.7>,rabbit_mqtt_connection_sup}
Context: child_terminated
Reason: {{badmatch,{error,enotconn}},[{rabbit_mqtt_processor,process_login,4,[{file,"src/rabbit_mqtt_processor.erl"},{line,482}]},{rabbit_mqtt_processor,process_request,3,[{file,"src/rabbit_mqtt_processor.erl"},{line,112}]},{rabbit_mqtt_processor,process_frame,2,[{file,"src/rabbit_mqtt_processor.erl"},{line,69}]},{rabbit_mqtt_reader,process_received_bytes,2,[{file,"src/rabbit_mqtt_reader.erl"},{line,266}]},{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1050}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
Offender: [{pid,<0.4556.7>},{name,rabbit_mqtt_reader},{mfargs,{rabbit_mqtt_reader,start_link,[<0.4555.7>,{acceptor,{0,0,0,0,0,0,0,0},1883},#Port<0.16028>]}},{restart_type,intrinsic},{shutdown,30000},{child_type,worker}]
2018-09-28 13:38:44 =SUPERVISOR REPORT====
Supervisor: {<0.4554.7>,rabbit_mqtt_connection_sup}
Context: shutdown
Reason: reached_max_restart_intensity
Offender: [{pid,<0.4556.7>},{name,rabbit_mqtt_reader},{mfargs,{rabbit_mqtt_reader,start_link,[<0.4555.7>,{acceptor,{0,0,0,0,0,0,0,0},1883},#Port<0.16028>]}},{restart_type,intrinsic},{shutdown,30000},{child_type,worker}]
What does this indicate, and how do I get rid of it?