RabbitMQ server keeps shutting down continuously


Shobhana Sriram

Sep 27, 2018, 1:56:51 AM
to rabbitmq-users
Hi,

We have a RabbitMQ server that had been working fine for the past 9+ months. We use it to exchange only non-persistent MQTT messages on topics between our server and apps running on Android and iOS devices. We have tens of thousands of users active every day, and the number of topics is roughly twice the number of users. The server runs on a 16 vCPU, 60 GB RAM node. Since the users' mobile devices may not always be connected to the internet, messages can pile up at the server; I have often seen more than 800K messages ready to be delivered. So we have set a TTL and expiry policy of 2 hours.
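A policy like this can be applied with rabbitmqctl. The following is only a sketch: the policy name `mqtt-ttl` and the `^mqtt-subscription-` pattern (the default name prefix for MQTT subscription queues) are illustrative, not necessarily our exact policy:

```shell
# Sketch: apply a 2-hour (7200000 ms) message TTL and queue expiry to MQTT
# subscription queues. Policy name and queue-name pattern are illustrative.
sudo rabbitmqctl set_policy --apply-to queues mqtt-ttl "^mqtt-subscription-" \
  '{"message-ttl":7200000,"expires":7200000}'
```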

For the last 4 days, this server has crashed at least once every day. I wanted to try reducing the load on it, so I set up another server (say Server2) to handle one specific category of messages; all other categories continue to be exchanged via the original server.

Server2 has the same configuration as the original server, except for the following differences:
a) Server1 runs version 3.6.14-1 and Server2 runs version 3.6.15-1
b) The TTL and expiry is set to 2 hours on Server1, but to only 15 minutes on Server2

But Server2 keeps crashing continuously. I tried resetting the broker to its virgin state and then restarted the VM instance several times. The server starts up fine but stops working within 1-2 minutes. There is no crash dump file anywhere. I found a bunch of error messages in both the .log and sasl logs, but couldn't understand what is causing them. I have seen some of these error messages on Server1 as well, but Server1 has been running fine for the past 12 hours.

When the broker is down, the app has the intelligence to queue up messages to be published and retry when the connection to the broker succeeds. Maybe this sudden increase in the number of messages posted to the server is causing some issue?

I have attached the error log and sasl logs for reference. How do I identify the root cause of this?

Regards,
Shobhana
node2@a.b.c.d.log
node2@a.b.c.d-sasl.log

Shobhana Sriram

Sep 27, 2018, 6:55:24 AM
to rabbitmq-users
Running the cluster_status command gives the following output:

ubuntu@rabbitmq-n2:/var/lib/rabbitmq$ sudo rabbitmqctl cluster_status
Cluster status of node 'no...@a.b.c.d'
Error: unable to connect to node 'node2@a.b.c.d': nodedown
DIAGNOSTICS
===========
attempted to contact: ['node2@a.b.c.d']
node2@a.b.c.d:
  * connected to epmd (port 4369) on a.b.c.d
  * epmd reports node 'node2' running on port 25672
  * TCP connection succeeded but Erlang distribution failed
  * suggestion: hostname mismatch?
  * suggestion: is the cookie set correctly?
  * suggestion: is the Erlang distribution using TLS?
current node details:
- node name: 'rabbitm...@rabbitmq-n2.c.production.internal'
- home dir: /var/lib/rabbitmq
- cookie hash: /Dv1qISQipM1EIKMiYSfRA==

Michael Klishin

Sep 27, 2018, 7:49:36 AM
to rabbitm...@googlegroups.com
You are running on Erlang/OTP 18.3 which is known to have bugs that are catastrophic to RabbitMQ [1].
On 3.6.x (which has been out of support for a few months now, by the way) you should be able to move to
20.3.x, or at least 19.3.6.5+ which contains fixes for the bugs listed in [1].
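To confirm which Erlang/OTP release a node is actually running, something along these lines should work (a sketch; assumes rabbitmqctl can reach the node):

```shell
# Ask the running broker node for its Erlang/OTP release (e.g. "18").
sudo rabbitmqctl eval 'erlang:system_info(otp_release).'

# Or ask the local Erlang installation directly, without a running node:
erl -noshell -eval 'io:format("~s~n", [erlang:system_info(otp_release)]), halt().'
```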

There are a lot of MQTT connection errors saying that a socket is no longer in a connected state. That can be just
another manifestation of those bugs (typically the node simply refuses to accept connections, including from CLI tools)
or an indication of high connection churn that makes the problem a lot more visible. In either case, see [2].




--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Shobhana Sriram

Sep 28, 2018, 10:53:03 AM
to rabbitmq-users
MK, thank you for taking a look at this issue.

I set up an entirely new cluster of 3 data nodes with the latest RabbitMQ (3.7.8) and Erlang (Erlang/OTP 21) versions.

I had initially read how to tune for a high number of concurrent connections and updated my configuration as per the suggestions given there. Specifically, I use the following config on each node:

fs.file-max = 200000
net.ipv4.tcp_fin_timeout = 10
net.core.somaxconn = 8192
net.ipv4.tcp_max_syn_backlog = 50000
net.ipv4.tcp_keepalive_time=30
net.ipv4.tcp_keepalive_intvl=10
net.ipv4.tcp_keepalive_probes=4

SERVER_ADDITIONAL_ERL_ARGS="+A 128 +Q 200000"

{tcp_listen_options,
     [{backlog,       50000},
      {nodelay,       true},
      {linger,        {true,0}},
      {sndbuf, 16384},
      {recbuf, 16384},
      {exit_on_close, false}
     ]
    }

The open file descriptor limit is set to 200000.

I added a TCP LB (in GCP) in front of these 3 nodes. I created a policy that matches all MQTT topics to set an HA count of 2 and a TTL and expiry of 1 minute: '{"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"manual","message-ttl":60000,"expires":60000}'
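The full set_policy invocation for a definition like this looks roughly as follows (a sketch; the policy name `mqtt-ha` and the queue-name pattern are illustrative):

```shell
# Sketch: mirror MQTT subscription queues to exactly 2 nodes, with a 1-minute
# message TTL and queue expiry. Policy name and pattern are illustrative.
sudo rabbitmqctl set_policy --apply-to queues mqtt-ha "^mqtt-subscription-" \
  '{"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"manual","message-ttl":60000,"expires":60000}'
```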

When I replaced my old server node with this 3-node cluster, the setup ran fine for a few minutes (maybe 5), and then 2 nodes went down. I restarted those two nodes, and since then the cluster has been running fine for about 1 hour. The load has surely come down compared to an hour ago, and I am afraid this crash might resurface on Monday morning when the load increases.

I looked for hints in the log files to understand why the 2 nodes went down, and found a bunch of errors in the crash log:


2018-09-28 13:38:44 =ERROR REPORT====
** Generic server <0.4556.7> terminating
** Last message in was {inet_async,#Port<0.16028>,0,{ok,<<16,29,0,6,77,81,73,115,100,112,3,2,0,30,0,15,108,56,48,53,48,53,53,56,56,51,49,52,52,53,51>>}}
** When Server state == {state,#Port<0.16028>,"1-39-158-170.live.vodafone.in:49505 -> 247.242.200.35.bc.googleusercontent.com:1883",true,undefined,false,running,{none,none},<0.4555.7>,false,none,{proc_state,#Port<0.16028>,#{},{undefined,undefined},{0,nil},{0,nil},undefined,1,undefined,undefined,undefined,{undefined,undefined},undefined,<<"amq.topic">>,{amqp_adapter_info,unknown,unknown,unknown,unknown,<<"(unknown)">>,{'MQTT',"N/A"},[{channels,1},{channel_max,1},{frame_max,0},{client_properties,[{<<"product">>,longstr,<<"MQTT client">>}]},{ssl,false}]},none,undefined,undefined,#Fun<rabbit_mqtt_processor.0.11905929>},undefined,{state,fine,5000,undefined}}
** Reason for termination ==
** {{badmatch,{error,enotconn}},[{rabbit_mqtt_processor,process_login,4,[{file,"src/rabbit_mqtt_processor.erl"},{line,482}]},{rabbit_mqtt_processor,process_request,3,[{file,"src/rabbit_mqtt_processor.erl"},{line,112}]},{rabbit_mqtt_processor,process_frame,2,[{file,"src/rabbit_mqtt_processor.erl"},{line,69}]},{rabbit_mqtt_reader,process_received_bytes,2,[{file,"src/rabbit_mqtt_reader.erl"},{line,266}]},{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1050}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
2018-09-28 13:38:44 =CRASH REPORT====
  crasher:
    initial call: rabbit_mqtt_reader:init/1
    pid: <0.4556.7>
    registered_name: []
    exception exit: {{{badmatch,{error,enotconn}},[{rabbit_mqtt_processor,process_login,4,[{file,"src/rabbit_mqtt_processor.erl"},{line,482}]},{rabbit_mqtt_processor,process_request,3,[{file,"src/rabbit_mqtt_processor.erl"},{line,112}]},{rabbit_mqtt_processor,process_frame,2,[{file,"src/rabbit_mqtt_processor.erl"},{line,69}]},{rabbit_mqtt_reader,process_received_bytes,2,[{file,"src/rabbit_mqtt_reader.erl"},{line,266}]},{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1050}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]},[{gen_server2,terminate,3,[{file,"src/gen_server2.erl"},{line,1166}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
    ancestors: [<0.4554.7>,<0.674.0>,<0.673.0>,<0.672.0>,rabbit_mqtt_sup,<0.665.0>]
    message_queue_len: 1
    messages: [{inet_async,#Port<0.16028>,1,{error,closed}}]
    links: [<0.4554.7>,#Port<0.16028>]
    dictionary: [{rand_seed,{#{jump => #Fun<rand.16.10897371>,max => 288230376151711743,next => #Fun<rand.15.10897371>,type => exsplus},[68840492309434389|3013970438321444]}}]
    trap_exit: true
    status: running
    heap_size: 1598
    stack_size: 27
    reductions: 2562
  neighbours:
2018-09-28 13:38:44 =SUPERVISOR REPORT====
     Supervisor: {<0.4554.7>,rabbit_mqtt_connection_sup}
     Context:    child_terminated
     Reason:     {{badmatch,{error,enotconn}},[{rabbit_mqtt_processor,process_login,4,[{file,"src/rabbit_mqtt_processor.erl"},{line,482}]},{rabbit_mqtt_processor,process_request,3,[{file,"src/rabbit_mqtt_processor.erl"},{line,112}]},{rabbit_mqtt_processor,process_frame,2,[{file,"src/rabbit_mqtt_processor.erl"},{line,69}]},{rabbit_mqtt_reader,process_received_bytes,2,[{file,"src/rabbit_mqtt_reader.erl"},{line,266}]},{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1050}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
     Offender:   [{pid,<0.4556.7>},{name,rabbit_mqtt_reader},{mfargs,{rabbit_mqtt_reader,start_link,[<0.4555.7>,{acceptor,{0,0,0,0,0,0,0,0},1883},#Port<0.16028>]}},{restart_type,intrinsic},{shutdown,30000},{child_type,worker}]
2018-09-28 13:38:44 =SUPERVISOR REPORT====
     Supervisor: {<0.4554.7>,rabbit_mqtt_connection_sup}
     Context:    shutdown
     Reason:     reached_max_restart_intensity
     Offender:   [{pid,<0.4556.7>},{name,rabbit_mqtt_reader},{mfargs,{rabbit_mqtt_reader,start_link,[<0.4555.7>,{acceptor,{0,0,0,0,0,0,0,0},1883},#Port<0.16028>]}},{restart_type,intrinsic},{shutdown,30000},{child_type,worker}]



What does this indicate, and how do I get rid of it?

Michael Klishin

Sep 28, 2018, 11:53:28 AM
to rabbitm...@googlegroups.com
"enotconn" means that a socket is no longer in connected state. In other words, the connection is closed.

It can be that the client closes it abruptly [1] instead of closing it the way the MQTT 3.1.1 spec dictates (with a connection closure frame).
A traffic capture, plus checking your application code and the way it closes connections/shuts down, should be able to confirm this.
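For the traffic capture, something like this sketch would do (the interface and the plaintext MQTT port 1883 are assumptions based on the logs above):

```shell
# Capture MQTT traffic on the broker so the final packets of a closing
# connection can be inspected later (e.g. in Wireshark):
sudo tcpdump -i any -w mqtt-close.pcap 'port 1883'

# A clean MQTT 3.1.1 shutdown sends a DISCONNECT packet (first byte 0xE0)
# before the TCP FIN; an abrupt close shows only a FIN/RST with no DISCONNECT.
```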

Those messages are very likely just log noise: the connection is gone, and there would be no further activity on it anyway.
Of course, log noise is something we'd like to avoid when it's safe to do so; if you have a way to reproduce this, please share it.

[3][4] are highly relevant here.
