RabbitMQ Crashes intermittently

133 views
Skip to first unread message

Sameer Soni

unread,
Jan 24, 2019, 1:41:48 AM1/24/19
to rabbitmq-users
Hi folks,
We are using RabbitMQ in our web applications and there are multiple Android devices and server clients connected to it for exchanging messages.

We recently upgraded RabbitMQ from version to version 3.7.8
Erlang Version: 20.3.4
RabbitMQ runs inside a docker container.
All logs are send to AWS cloudwatch.

Configurations :
1. rabbitqm_auth_backend_http
2. rabbitmq_auth_backend_cache
    {cache_ttl, 60000}
3. {vm_memory_high_watermark, 0.7}
4. {heartbeat, 90}

List of enabled plugins:
[E*] rabbitmq_auth_backend_cache       3.7.8
[E*] rabbitmq_auth_backend_http        3.7.8
[  ] rabbitmq_auth_backend_ldap        3.7.8
[  ] rabbitmq_auth_mechanism_ssl       3.7.8
[  ] rabbitmq_consistent_hash_exchange 3.7.8
[  ] rabbitmq_event_exchange           3.7.8
[  ] rabbitmq_federation               3.7.8
[  ] rabbitmq_federation_management    3.7.8
[  ] rabbitmq_jms_topic_exchange       3.7.8
[E*] rabbitmq_management               3.7.8
[e*] rabbitmq_management_agent         3.7.8
[E*] rabbitmq_mqtt                     3.7.8
[  ] rabbitmq_peer_discovery_aws       3.7.8
[  ] rabbitmq_peer_discovery_common    3.7.8
[  ] rabbitmq_peer_discovery_consul    3.7.8
[  ] rabbitmq_peer_discovery_etcd      3.7.8
[  ] rabbitmq_peer_discovery_k8s       3.7.8
[  ] rabbitmq_random_exchange          3.7.8
[  ] rabbitmq_recent_history_exchange  3.7.8
[  ] rabbitmq_sharding                 3.7.8
[  ] rabbitmq_shovel                   3.7.8
[  ] rabbitmq_shovel_management        3.7.8
[  ] rabbitmq_stomp                    3.7.8
[  ] rabbitmq_top                      3.7.8
[  ] rabbitmq_tracing                  3.7.8
[  ] rabbitmq_trust_store              3.7.8
[e*] rabbitmq_web_dispatch             3.7.8
[E*] rabbitmq_web_mqtt                 3.7.8
[E*] rabbitmq_web_mqtt_examples        3.7.8



After the upgrade, we are getting below errors at an average interval of 12-18hrs and all connections get broken from rabbitmq server.

Below is the error stack trace:


** Last message in was {inet_async,{sslsocket,{gen_tcp,#Port<0.1637478>,tls_connection,<0.865.0>},<0.16201.210>},#Ref<0.2517789809.3853778946.1543>,{ok,<<16,243,2,0,6,77,81,73,115,100,112,3,236,0,60,0,16,55,48,98,102,56,99,102,56,56,51,98,49,101,48,102,99,0,24,107,111,107,111,47,115,101,114,118,101,114,47,107,112,47,52,50,47,115,116,97,116,117,115,0,7,111,102,102,108,105,110,101,0,5,107,112,95,52,50,1,41,101,121,74,48,101,88,65,105,79,105,74,75,86,49,81,105,76,67,74,104,98,71,99,105,79,105,74,73,85,122,73,49,78,105,74,57,46,101,121,74,49,99,50,86,121,88,50,108,107,73,106,111,105,78,68,73,105,76,67,74,49,99,50,86,121,88,51,66,104,99,51,78,51,98,51,74,107,73,106,111,105,77,122,103,52,90,71,89,50,90,68,85,53,78,106,70,109,79,87,73,48,89,106,65,53,78,84,107,119,77,71,81,119,78,68,90,109,90,68,82,109,77,50,85,105,76,67,74,115,98,50,100,112,98,108,57,116,89,87,53,104,90,50,86,121,88,50,53,104,98,87,85,105,79,105,74,109,100,87,86,115,88,51,66,118,97,87,53,48,88,121,73,115,73,109,70,119,99,70,57,48,101,88,66,108,73,106,111,105,97,51,65,105,76,67,74,110,90,87,53,108,99,109,70,48,90,87,82,102,89,88,81,105,79,105,73,121,77,68,69,52,76,84,69,121,76,84,65,122,73,68,69,121,79,106,69,119,79,106,85,50,76,106,73,49,77,122,69,53,77,121,74,57,46,66,109,65,95,112,49,112,99,73,76,76,53,52,100,99,67,81,53,...>>}}

** When Server state == {state,{sslsocket,{gen_tcp,#Port<0.1637478>,tls_connection,<0.865.0>},<0.16201.210>},"<server_ip>:49171 -> <server_ip>:8883",true,undefined,true,running,{none,<0.16316.210>},<0.16233.210>,false,none,{proc_state,{sslsocket,{gen_tcp,#Port<0.1637478>,tls_connection,<0.865.0>},<0.16201.210>},#{},{undefined,<<"amq.ctag-P1SlWWXXGq0PBD7g081_EQ">>},{0,nil},{0,nil},undefined,1,"android_device_id",false,{mqtt_msg,true,1,"topic_name",false,undefined,<<"offline">>},{<0.16268.210>,undefined},<0.16273.210>,<<"amq.topic">>,{amqp_adapter_info,{0,0,0,0,0,65535,2815,38},8883,{0,0,0,0,0,65535,2815,2},49171,<<"10.255.0.2:49171 -> <server_ip>:8883">>,{'MQTT',"N/A"},[{variable_map,#{<<"client_id">> => <<"android_device_id">>}},{channels,1},{channel_max,1},{frame_max,0},{client_properties,[{<<"product">>,longstr,<<"MQTT client">>}]},{ssl,true},{ssl_protocol,'tlsv1.2'},{ssl_key_exchange,ecdhe_rsa},{ssl_cipher,aes_128_gcm},{ssl_hash,aead}]},none,<0.843.0>,{auth_state,<<"kp_42">>,{user,<<"kp_42">>,[],[{rabbit_auth_backend_cache,none}]},<<"/">>},#Fun<rabbit_mqtt_processor.0.11905929>},<0.16273.210>,{state,fine,5000,#Ref<0.2517789809.3860332545.149892>}}

** Reason for termination ==

** {function_clause,[{rabbit_mqtt_processor,process_login,[<<"kp_42">>,<<"eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyX2lkIjoiNDIiLCJ1c2VyX3Bhc3N3b3JkIjoiMzg4ZGY2ZDU5NjFmOWI0YjA5NTkwMGQwNDZmZDRmM2UiLCJsb2dpbl9tYW5hZ2VyX25hbWUiOiJmdWVsX3BvaW50XyIsImFwcF90eXBlIjoia3AiLCJnZW5lcmF0ZWRfYXQiOiIyMDE4LTEyLTAzIDEyOjEwOjU2LjI1MzE5MyJ9.BmA_p1pcILL54dcCQ59hKJo0U9M0XVKQhCXP5u1fXno">>,3,{proc_state,{sslsocket,{gen_tcp,#Port<0.1637478>,tls_connection,<0.865.0>},<0.16201.210>},#{},{undefined,<<"amq.ctag-P1SlWWXXGq0PBD7g081_EQ">>},{0,nil},{0,nil},undefined,1,"android_device_id",false,{mqtt_msg,true,1,"topic_name",false,undefined,<<"offline">>},{<0.16268.210>,undefined},<0.16273.210>,<<"amq.topic">>,{amqp_adapter_info,{0,0,0,0,0,65535,2815,38},8883,{0,0,0,0,0,65535,2815,2},49171,<<"<server_ip>:49171 -> <server_ip>:8883">>,{'MQTT',"N/A"},[{variable_map,#{<<"client_id">> => <<"android_device_id">>}},{variable_map,#{<<"client_id">> => <<"android_device_id">>}},{channels,1},{channel_max,1},{frame_max,0},{client_properties,[{<<"product">>,longstr,<<"MQTT client">>}]},{ssl,true},{ssl_protocol,'tlsv1.2'},{ssl_key_exchange,ecdhe_rsa},{ssl_cipher,aes_128_gcm},{ssl_hash,aead}]},none,<0.843.0>,{auth_state,<<"kp_42">>,{user,<<"kp_42">>,[],[{rabbit_auth_backend_cache,none}]},<<"/">>},#Fun<rabbit_mqtt_processor.0.11905929>}],[{file,"src/rabbit_mqtt_processor.erl"},{line,477}]},{rabbit_mqtt_processor,process_request,...},...]}

2019-01-24 05:03:03.292 [error] <0.16109.210> CRASH REPORT Process <0.16109.210> with 0 neighbours exited with reason: no function clause matching rabbit_mqtt_processor:process_login(<<"kp_42">>, <<"eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyX2lkIjoiNDIiLCJ1c2VyX3Bhc3N3b3JkIjoiMzg4ZGY2ZDU5Nj...">>, 3, {proc_state,{sslsocket,{gen_tcp,#Port<0.1637478>,tls_connection,<0.865.0>},<0.16201.210>},#{},{undefined,...},...}) line 477 in gen_server2:terminate/3 line 1166

2019-01-24 05:03:03.292 [error] <0.16276.210> Supervisor {<0.16276.210>,rabbit_mqtt_connection_sup} had child rabbit_mqtt_reader started with rabbit_mqtt_reader:start_link(<0.16233.210>, {acceptor,{0,0,0,0,0,0,0,0},8883}, {sslsocket,{gen_tcp,#Port<0.1637478>,tls_connection,<0.865.0>},<0.16201.210>}) at <0.16109.210> exit with reason no function clause matching rabbit_mqtt_processor:process_login(<<"kp_42">>, <<"eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyX2lkIjoiNDIiLCJ1c2VyX3Bhc3N3b3JkIjoiMzg4ZGY2ZDU5Nj...">>, 3, {proc_state,{sslsocket,{gen_tcp,#Port<0.1637478>,tls_connection,<0.865.0>},<0.16201.210>},#{},{undefined,...},...}) line 477 in context child_terminated

2019-01-24 05:03:03.293 [error] <0.16276.210> Supervisor {<0.16276.210>,rabbit_mqtt_connection_sup} had child rabbit_mqtt_reader started with rabbit_mqtt_reader:start_link(<0.16233.210>, {acceptor,{0,0,0,0,0,0,0,0},8883}, {sslsocket,{gen_tcp,#Port<0.1637478>,tls_connection,<0.865.0>},<0.16201.210>}) at <0.16109.210> exit with reason reached_max_restart_intensity in context shutdown


After this error, all devices get disconnected for sometime and they have to reconnect again.
Please assist incase if anyone has faced similar situation or if there's any solution which you could recommend.

Thanks

Sameer Soni

unread,
Jan 24, 2019, 1:50:39 AM1/24/19
to rabbitmq-users
One other observation on RabbitMQ UI:
Please see the attachment:

There are a lot of inactive connections from the android client.
We have roughly 20 active connections based on number of devices, but there are total 430+ connections in RabbitMQ connection list.

Not sure if this is related to issue mentioned above, but might be helpful to investigate the overall issue.

Thanks
Screenshot 2019-01-24 at 12.15.56 PM.png

Michael Klishin

unread,
Feb 5, 2019, 2:07:07 PM2/5/19
to rabbitm...@googlegroups.com
There is no evidence in the log that RabbitMQ crashes. The word "crash" in the log refers
to an unhandled exception. RabbitMQ almost certainly stays up and available for future connections.

The exception happened in a function that authenticates the client [1] but nothing in the log stands out.

My best guess is that the client tries to "reauthenticate" on that connection unintentionally, possibly because it detects
connection failure before the server does but TCP connection is still active [2]. Without a way to reproduce
that's about as much as I can say.

Transient failures from mobile client are generally expected and there are known tuning recommendations for such environments [3][4].
[5] explains how to identify high connection churn. They won't make this exception go away but they will improve the overall availability
of the system (which this exception does not affect since connections are isolated from each other).

HTH.



--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Michael Klishin

unread,
Feb 5, 2019, 2:57:41 PM2/5/19
to rabbitm...@googlegroups.com
The exception has enough information to investigate.

Also note that the cause and effect here are the opposite:

 * The exception DOES NOT cause other connections to go down
 * The exception is almost certainly an indication of something else that makes all connections fail
 * It is expected that TCP connection loss is not detected immediately, FWIW that's a TCP feature with known workarounds [1][2]


--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Grigory Starinkin

unread,
Feb 6, 2019, 10:31:01 AM2/6/19
to rabbitmq-users
Hi Sameer,

As Michael, guessed, this is caused by client trying to "reauthenticate" using the same connection. It can be easily reproduced:

** {function_clause,[{rabbit_mqtt_processor,process_login,[<<"foo">>,<<"bar">>,4,{proc_state,#Port<0.38965>,#{},{undefined,undefined},{0,nil},{0,nil},undefined,1,"08VMjjLXL7TwqD9X7hs5cg",true,undefined,{<0.1941.0>,undefined},<0.1934.0>,<<"amq.topic">>,{amqp_adapter_info,{0,0,0,0,0,65535,44050,2},1883,{0,0,0,0,0,65535,44050,1},60686,<<"172.18.0.1:60686 -> 172.18.0.2:1883">>,{'MQTT',"N/A"},[{variable_map,#{<<"client_id">> => <<"KH9Fs3yoa-igGtqyMatIiw">>}},{variable_map,#{<<"client_id">> => <<"08VMjjLXL7TwqD9X7hs5cg">>}},{channels,1},{channel_max,1},{frame_max,0},{client_properties,[{<<"product">>,longstr,<<"MQTT client">>}]},{ssl,false}]},none,<0.938.0>,{auth_state,<<"guest">>,{user,<<"guest">>,[administrator],[{rabbit_auth_backend_internal,none}]},<<"/">>},#Fun<rabbit_mqtt_processor.0.11905929>}],[{file,"src/rabbit_mqtt_processor.erl"},{line,477}]},{rabbit_mqtt_processor,process_request,3,[{file,"src/rabbit_mqtt_processor.erl"},{line,112}]},{rabbit_mqtt_processor,process_frame,2,[{file,"src/rabbit_mqtt_processor.erl"},{line,69}]},{rabbit_mqtt_reader,process_received_bytes,2,[{file,"src/rabbit_mqtt_reader.erl"},{line,266}]},{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1050}]},{proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,257}]}]}

Taking into account number of opened connections from RMQ UI, I would guess there's an issue with connections management.

Thank you,

Michael Klishin

unread,
Feb 6, 2019, 11:44:24 AM2/6/19
to rabbitm...@googlegroups.com
Hi Grigory,

You say it can be easily reproduced. Can you share how you'd do that? I don't think the MQTT 3.1 spec explicitly prohibits
clients from reauthenticating but in general it should be a really rare thing to see.

We can make the code more defensive but need a way to reproduce first.

Grigory Starinkin

unread,
Feb 6, 2019, 11:59:29 AM2/6/19
to rabbitm...@googlegroups.com
Hi Michael,

I used mqtt library to talk to RabbitMQ with mqtt plugin enabled:

$ rebar3 shell
1> rr(mqtt_transport).
[ssl,tcp]
2> rr(mqtt_client).
[data,mqtt_connack,mqtt_connect,mqtt_disconnect,
 mqtt_last_will,mqtt_pingreq,mqtt_pingresp,mqtt_puback,
 mqtt_pubcomp,mqtt_publish,mqtt_pubrec,mqtt_pubrel,
 mqtt_suback,mqtt_subscribe,mqtt_unsuback,mqtt_unsubscribe]
3> {ok, Connection, _} = mqtt_client:connect(#{transport => {tcp, #{port => 32903}}}).
{ok,<0.142.0>,false}
4> {State, #data{transport = Transport}} = sys:get_state(Connection).
{connected,#data{owner = <0.138.0>,
                 owner_monitor = #Ref<0.2266047736.1255931907.153478>,
                 transport = #tcp{socket = #Port<0.12>},
                 transport_tags = {tcp,tcp_closed,tcp_error,#Port<0.12>},
                 buffer = <<>>,next_id = 0,pending_subscribe = #{},
                 pending_unsubscribe = #{},
                 keep_alive_ping_timeout = infinity,
                 keep_alive_exit_timeout = infinity}}
5> mqtt_transport:send(Transport, mqtt_packet:encode(#mqtt_connect{protocol_name = <<"MQTT">>, protocol_level = 4, clean_session = true, keep_alive = 0, client_id = <<>>, username = <<"foo">>, password = <<"bar">> })).
ok

Once command #5 is executed you should be able to see backtrace in RabbitMQ similar to what I posted earlier.

Thank you,

You received this message because you are subscribed to a topic in the Google Groups "rabbitmq-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/rabbitmq-users/1MBdymMAZzw/unsubscribe.
To unsubscribe from this group and all its topics, send an email to rabbitmq-user...@googlegroups.com.

Michael Klishin

unread,
Feb 6, 2019, 12:17:05 PM2/6/19
to rabbitmq-users
So you used a flexible client that allows you to send any frame at any time :) Perfect,
that helps a lot!

Luke Bakken

unread,
Feb 7, 2019, 2:36:35 PM2/7/19
to rabbitmq-users
Hi Grigory,

Thank you very much for these reproduction steps. They allowed me to quickly address this issue:


Thanks,
Luke
Reply all
Reply to author
Forward
0 new messages