Connections failed with {handshake_timeout,frame_header}, {handshake_timeout,handshake}

2,296 views
Skip to first unread message

David Noete

unread,
Apr 3, 2020, 12:50:57 PM4/3/20
to rabbitmq-users

Hi,

We've been running a single node deployment of RabbitMQ for years without any major incidents.  We started to see an issue come up that we've never seen before, and we're a bit lost on how to diagnose it, why it started now, and what steps to take.

RabbitMQ Broker details:

Old version: 3.6.15 (I didn't save which Erlang)
New version: 3.8.2(Erlang/OTP 22)
OS: Ubuntu 14.04
Environment: Pasted at the bottom of this message

Client details:
Client: php-amqp 1.6.1 (wrapper on top of rabbitmq-c)
rabbitmq-c: 0.7.1
(These are both quite old, but worked well for years!)


We were running RabbitMQ 3.6.15 for a long time.  Unrelated to this issue, we ended up upgrading to 3.8.2 so we could benefit from even distribution across queues with the improvements in the consistent hash exchange plugin.  We saw our current problem on both versions.  We upgraded because we want to add more queues/consumers to our consistent hash exchange.  We wanted to do that because our consumers couldn't keep up with our producers during our peak hours.

The error we are seeing in version 3.8.2 is:

2020-03-29 07:33:36.131 [info] <0.31176.435> accepting AMQP connection <0.31176.435> (127.0.0.1:60636 -> 127.0.0.1:5672)

...

2020-03-29 07:33:49.896 [error] <0.31176.435> closing AMQP connection <0.31176.435> (127.0.0.1:60636 -> 127.0.0.1:5672):
{handshake_timeout,frame_header}

And in 3.6.15:


=INFO REPORT==== 28-Feb-2020::05:37:06 ===
accepting AMQP connection
<0.26436.1> (127.0.0.1:42128 -> 127.0.0.1:5672)

...

=ERROR REPORT==== 28-Feb-2020::05:37:24 ===
closing AMQP connection
<0.26436.1> (127.0.0.1:42128 -> 127.0.0.1:5672):
{handshake_timeout,frame_header}

Our handshake_timeout configuration is the default 10000.  We aren't using SSL connections:

$ sudo rabbitmqctl environment | grep handshake_timeout
     
{handshake_timeout,10000},
     
{ssl_handshake_timeout,5000},



One thing I notice is that the time between the 'accepting AMQP connection...' and the 'closing AMQP connection..' messages are quite a bit more than 10 seconds apart.  Is this a clue in to what is happening?

You'll notice those connections are coming from the same instance as the broker, which is a pretty beefy EC2 instance (c5.9xlarge).  I am not seeing any evidence that this instance is resource constrained in terms of CPU, memory, I/O, etc.  My understanding of the handshake_timeout error is that we should only be seeing it with 'constrained clients and networks'.  The instance size has actually been increased after we started seeing this issue (again, for what we think are unrelated reasons).  It doesn't seem like that should be the case here, but I guess there could be or is something weird going on with this instance that lived across the isntance upgrade.

In a steady state we usually have around 350 connections open to this rabbit node.  ~250 of those are from Shovel connections on ~50 other servers.  We do not have scenarios where a large influx of connections occur, aside from a full system restart, and that isn't the case when we see this issue arise. 

We are only seeing this happen intermittently, a couple of times a day.  We have a steady flow of succesful connections being opened and closed.  Since it is only happening intermittently, I'm at a bit of a loss on how to monitor and catch this issue in action.

Note that we do sometimes see the {handshake_timeout, xxx} errors from connections that aren't coming from the same instance as the broker.  We'd expect that to happen sometimes since those servers are distributed around the world and are known to operate within spotty or slower networks.

One obvious thing to do next would be to update our client libraries.  This is planned work since there might be other benefits to doing this, but I'm not convinced it will fix the current problem since we were running with these client libraries for years without incident.  Similarily, we are also quite behind on updates on this system.  It is running Ubuntu 14.04, but again, it has been running fine for years.

I am hoping for some guidance or tips on how to diagnose what is going on before we expedite some of our planned upgrades.  A suggestion of configuring more logging for RabbitMQ, tools for monitoring these failed TCP connections, etc are all welcome. If you need more information from me so you can help, just let me know.

Thanks in advance for any help that can be provided.  If we do get to the bottom of it I will post back here with the solution.

Cheers,
Dave

[{amqp_client,[{prefer_ipv6,false},{ssl_options,[]}]},
 
{asn1,[]},
 
{aten,
     
[{detection_threshold,0.99},
     
{heartbeat_interval,100},
     
{poll_interval,1000}]},
 
{compiler,[]},
 
{cowboy,[]},
 
{cowlib,[]},
 
{credentials_obfuscation,
     
[{enabled,true},{ets_table_name,credentials_obfuscation}]},
 
{crypto,[{fips_mode,false},{rand_cache_size,896}]},
 
{gen_batch_server,[]},
 
{goldrush,[]},
 
{inets,[]},
 
{jsx,[]},
 
{kernel,
     
[{inet_default_connect_options,[{nodelay,true}]},
     
{inet_dist_listen_max,25672},
     
{inet_dist_listen_min,25672},
     
{logger,
         
[{handler,default,logger_std_h,
               
#{config => #{type => standard_io},
                 formatter
=>
                     
{logger_formatter,
                         
#{legacy_header => true,single_line => false}}}}]},
     
{logger_level,notice},
     
{logger_sasl_compatible,false}]},
 
{lager,
     
[{async_threshold,20},
     
{async_threshold_window,5},
     
{colored,false},
     
{colors,
         
[{debug,"\e[0;38m"},
           
{info,"\e[1;37m"},
           
{notice,"\e[1;36m"},
           
{warning,"\e[1;33m"},
           
{error,"\e[1;31m"},
           
{critical,"\e[1;35m"},
           
{alert,"\e[1;44m"},
           
{emergency,"\e[1;41m"}]},
     
{crash_log,"log/crash.log"},
     
{crash_log_count,5},
     
{crash_log_date,"$D0"},
     
{crash_log_msg_size,65536},
     
{crash_log_rotator,lager_rotator_default},
     
{crash_log_size,10485760},
     
{error_logger_format_raw,true},
     
{error_logger_hwm,50},
     
{error_logger_hwm_original,50},
     
{error_logger_redirect,true},
     
{extra_sinks,
         
[{error_logger_lager_event,
               
[{handlers,[{lager_forwarder_backend,[lager_event,inherit]}]},
               
{rabbit_handlers,
                   
[{lager_forwarder_backend,[lager_event,inherit]}]}]},
           
{rabbit_log_lager_event,
               
[{handlers,[{lager_forwarder_backend,[lager_event,inherit]}]},
               
{rabbit_handlers,
                   
[{lager_forwarder_backend,[lager_event,inherit]}]}]},
           
{rabbit_log_channel_lager_event,
               
[{handlers,[{lager_forwarder_backend,[lager_event,inherit]}]},
               
{rabbit_handlers,
                   
[{lager_forwarder_backend,[lager_event,inherit]}]}]},
           
{rabbit_log_connection_lager_event,
               
[{handlers,[{lager_forwarder_backend,[lager_event,inherit]}]},
               
{rabbit_handlers,
                   
[{lager_forwarder_backend,[lager_event,inherit]}]}]},
           
{rabbit_log_ldap_lager_event,
               
[{handlers,[{lager_forwarder_backend,[lager_event,inherit]}]},
               
{rabbit_handlers,
                   
[{lager_forwarder_backend,[lager_event,inherit]}]}]},
           
{rabbit_log_mirroring_lager_event,
               
[{handlers,[{lager_forwarder_backend,[lager_event,inherit]}]},
               
{rabbit_handlers,
                   
[{lager_forwarder_backend,[lager_event,inherit]}]}]},
           
{rabbit_log_queue_lager_event,
               
[{handlers,[{lager_forwarder_backend,[lager_event,inherit]}]},
               
{rabbit_handlers,
                   
[{lager_forwarder_backend,[lager_event,inherit]}]}]},
           
{rabbit_log_ra_lager_event,
               
[{handlers,[{lager_forwarder_backend,[lager_event,inherit]}]},
               
{rabbit_handlers,
                   
[{lager_forwarder_backend,[lager_event,inherit]}]}]},
           
{rabbit_log_federation_lager_event,
               
[{handlers,[{lager_forwarder_backend,[lager_event,inherit]}]},
               
{rabbit_handlers,
                   
[{lager_forwarder_backend,[lager_event,inherit]}]}]},
           
{rabbit_log_shovel_lager_event,
               
[{handlers,[{lager_forwarder_backend,[lager_event,inherit]}]},
               
{rabbit_handlers,
                   
[{lager_forwarder_backend,[lager_event,inherit]}]}]},
           
{rabbit_log_upgrade_lager_event,
               
[{handlers,
                   
[{lager_file_backend,
                         
[{date,[]},
                         
{file,"/var/log/rabbitmq/rabbit_upgrade.log"},
                         
{formatter_config,
                             
[date," ",time," ",color,"[",severity,"] ",
                               
{pid,[]},
                               
" ",message,"\n"]},
                         
{level,info},
                         
{size,0}]}]},
               
{rabbit_handlers,
                   
[{lager_file_backend,
                         
[{date,[]},
                         
{file,"/var/log/rabbitmq/rabbit_upgrade.log"},
                         
{formatter_config,
                             
[date," ",time," ",color,"[",severity,"] ",
                               
{pid,[]},
                               
" ",message,"\n"]},
                         
{level,info},
                         
{size,0}]}]}]}]},
     
{handlers,
         
[{lager_file_backend,
               
[{date,[]},
               
{file,"/var/log/rabbitmq/rabbit.log"},
               
{formatter_config,
                   
[date," ",time," ",color,"[",severity,"] ",
                     
{pid,[]},
                     
" ",message,"\n"]},
               
{level,info},
               
{size,0}]}]},
     
{log_root,"/var/log/rabbitmq"},
     
{rabbit_handlers,
         
[{lager_file_backend,
               
[{date,[]},
               
{file,"/var/log/rabbitmq/rabbit.log"},
               
{formatter_config,
                   
[date," ",time," ",color,"[",severity,"] ",
                     
{pid,[]},
                     
" ",message,"\n"]},
               
{level,info},
               
{size,0}]}]}]},
 
{mnesia,[{dir,"/var/lib/rabbitmq/mnesia/rabbit"}]},
 
{observer_cli,[{plugins,[]}]},
 
{os_mon,
     
[{start_cpu_sup,false},
     
{start_disksup,false},
     
{start_memsup,false},
     
{start_os_sup,false}]},
 
{public_key,[]},
 
{ra,[{data_dir,"/var/lib/rabbitmq/mnesia/rabbit/quorum"},
     
{logger_module,rabbit_log_ra_shim},
     
{segment_max_entries,32768},
     
{wal_max_size_bytes,536870912}]},
 
{rabbit,
     
[{auth_backends,[rabbit_auth_backend_internal]},
     
{auth_mechanisms,['PLAIN','AMQPLAIN']},
     
{autocluster,
         
[{peer_discovery_backend,rabbit_peer_discovery_classic_config}]},
     
{background_gc_enabled,false},
     
{background_gc_target_interval,60000},
     
{backing_queue_module,rabbit_priority_queue},
     
{channel_max,2047},
     
{channel_operation_timeout,15000},
     
{channel_tick_interval,60000},
     
{cluster_keepalive_interval,10000},
     
{cluster_nodes,{[],disc}},
     
{cluster_partition_handling,ignore},
     
{collect_statistics,fine},
     
{collect_statistics_interval,5000},
     
{config_entry_decoder,[{passphrase,undefined}]},
     
{connection_max,infinity},
     
{credit_flow_default_credit,{400,200}},
     
{default_consumer_prefetch,{false,0}},
     
{default_permissions,[<<".*">>,<<".*">>,<<".*">>]},
     
{default_user,<<"guest">>},
     
{default_user_tags,[administrator]},
     
{default_vhost,<<"/">>},
     
{delegate_count,16},
     
{disk_free_limit,50000000},
     
{disk_monitor_failure_retries,10},
     
{disk_monitor_failure_retry_interval,120000},
     
{enabled_plugins_file,"/etc/rabbitmq/enabled_plugins"},
     
{feature_flags_file,"/var/lib/rabbitmq/mnesia/rabbit-feature_flags"},
     
{fhc_read_buffering,false},
     
{fhc_write_buffering,true},
     
{frame_max,131072},
     
{halt_on_upgrade_failure,true},
     
{handshake_timeout,10000},
     
{heartbeat,60},
     
{hipe_compile,false},
     
{hipe_modules,
         
[rabbit_reader,rabbit_channel,gen_server2,rabbit_exchange,
           rabbit_command_assembler
,rabbit_framing_amqp_0_9_1,rabbit_basic,
           rabbit_event
,lists,queue,priority_queue,rabbit_router,rabbit_trace,
           rabbit_misc
,rabbit_binary_parser,rabbit_exchange_type_direct,
           rabbit_guid
,rabbit_net,rabbit_amqqueue_process,
           rabbit_variable_queue
,rabbit_binary_generator,rabbit_writer,
           
delegate,gb_sets,lqueue,sets,orddict,rabbit_amqqueue,
           rabbit_limiter
,gb_trees,rabbit_queue_index,
           rabbit_exchange_decorator
,gen,dict,ordsets,file_handle_cache,
           rabbit_msg_store
,array,rabbit_msg_store_ets_index,rabbit_msg_file,
           rabbit_exchange_type_fanout
,rabbit_exchange_type_topic,mnesia,
           mnesia_lib
,rpc,mnesia_tm,qlc,sofs,proplists,credit_flow,pmon,
           ssl_connection
,tls_connection,ssl_record,tls_record,gen_fsm,ssl]},
     
{lager_default_file,"/var/log/rabbitmq/rabbit.log"},
     
{lager_extra_sinks,
         
[rabbit_log_lager_event,rabbit_log_channel_lager_event,
           rabbit_log_connection_lager_event
,rabbit_log_ldap_lager_event,
           rabbit_log_mirroring_lager_event
,rabbit_log_queue_lager_event,
           rabbit_log_ra_lager_event
,rabbit_log_federation_lager_event,
           rabbit_log_shovel_lager_event
,rabbit_log_upgrade_lager_event]},
     
{lager_log_root,"/var/log/rabbitmq"},
     
{lager_upgrade_file,"/var/log/rabbitmq/rabbit_upgrade.log"},
     
{lazy_queue_explicit_gc_run_operation_threshold,1000},
     
{log,
         
[{file,[{file,"/var/log/rabbitmq/rabbit.log"}]},
           
{categories,
               
[{upgrade,[{file,"/var/log/rabbitmq/rabbit_upgrade.log"}]}]}]},
     
{loopback_users,[<<"guest">>]},
     
{max_message_size,134217728},
     
{memory_monitor_interval,2500},
     
{mirroring_flow_control,true},
     
{mirroring_sync_batch_size,4096},
     
{mnesia_table_loading_retry_limit,10},
     
{mnesia_table_loading_retry_timeout,30000},
     
{msg_store_credit_disc_bound,{4000,800}},
     
{msg_store_file_size_limit,16777216},
     
{msg_store_index_module,rabbit_msg_store_ets_index},
     
{msg_store_io_batch_size,4096},
     
{num_ssl_acceptors,10},
     
{num_tcp_acceptors,10},
     
{password_hashing_module,rabbit_password_hashing_sha256},
     
{plugins_dir,
         
"/usr/lib/rabbitmq/plugins:/usr/lib/rabbitmq/lib/rabbitmq_server-3.8.2/plugins"},
     
{plugins_expand_dir,"/var/lib/rabbitmq/mnesia/rabbit-plugins-expand"},
     
{proxy_protocol,false},
     
{queue_explicit_gc_run_operation_threshold,1000},
     
{queue_index_embed_msgs_below,4096},
     
{queue_index_max_journal_entries,32768},
     
{quorum_cluster_size,5},
     
{quorum_commands_soft_limit,256},
     
{reverse_dns_lookups,false},
     
{server_properties,[]},
     
{ssl_allow_poodle_attack,false},
     
{ssl_apps,[asn1,crypto,public_key,ssl]},
     
{ssl_cert_login_from,distinguished_name},
     
{ssl_handshake_timeout,5000},
     
{ssl_listeners,[]},
     
{ssl_options,[]},
     
{tcp_listen_options,[{backlog,4096}]},
     
{tcp_listeners,[5672]},
     
{trace_vhosts,[]},
     
{vhost_restart_strategy,continue},
     
{vm_memory_calculation_strategy,rss},
     
{vm_memory_high_watermark,0.4},
     
{vm_memory_high_watermark_paging_ratio,0.5}]},
 
{rabbit_common,[]},
 
{rabbitmq_consistent_hash_exchange,[]},
 
{rabbitmq_management,
     
[{content_security_policy,"default-src 'self'"},
     
{cors_allow_origins,[]},
     
{cors_max_age,1800},
     
{http_log_dir,none},
     
{load_definitions,none},
     
{management_db_cache_multiplier,5},
     
{process_stats_gc_timeout,300000},
     
{stats_event_max_backlog,250}]},
 
{rabbitmq_management_agent,
     
[{rates_mode,basic},
     
{sample_retention_policies,
         
[{global,[{605,5},{3660,60},{29400,600},{86400,1800}]},
           
{basic,[{605,5},{3600,60}]},
           
{detailed,[{605,5}]}]}]},
 
{rabbitmq_web_dispatch,[]},
 
{ranch,[]},
 
{recon,[]},
 
{sasl,[{errlog_type,error},{sasl_error_logger,false}]},
 
{ssl,[]},
 
{stdlib,[]},
 
{stdout_formatter,[]},
 
{syntax_tools,[]},
 
{sysmon_handler,
     
[{busy_dist_port,true},
     
{busy_port,false},
     
{gc_ms_limit,0},
     
{heap_word_limit,0},
     
{port_limit,100},
     
{process_limit,100},
     
{schedule_ms_limit,0}]},
 
{tools,[{file_util_search_methods,[{[],[]},{"ebin","esrc"},{"ebin","src"}]}]},
 
{xmerl,[]}]


Luke Bakken

unread,
Apr 3, 2020, 1:08:39 PM4/3/20
to rabbitmq-users
Hi David,

Thank you for the extremely comprehensive set of information.

As you noted, the behavior is seen no matter the RabbitMQ and Erlang version. So my questions are:
  • When did these unusual timeout errors start happening?
  • What else changed around that time? Application code changes, environment changes, OS upgrades (even in seemingly non-related libraries), PHP upgrade ... ???
It is odd to be running your client applications on the same instance as RabbitMQ itself, because they will compete with Erlang for system resources. Can you move them to another server?

The fact that the handshake timeout is 10 seconds but you're seeing longer intervals in the log suggests resource contention between your apps and RabbitMQ.

Thanks,
Luke

David Noete

unread,
Apr 3, 2020, 1:30:54 PM4/3/20
to rabbitmq-users
Thanks for the reply, Luke.


When did these unusual timeout errors start happening?

They started on February 25th of this year.  That is shortly after we started to notice our system becoming busier, thus require the RabbitMQ upgrade to add more queues and consumers.

What else changed around that time? Application code changes, environment changes, OS upgrades (even in seemingly non-related libraries), PHP upgrade ... ???

Aside from us seeming to hit the tipping point of the load we could handle, nothing changed.


It is odd to be running your client applications on the same instance as RabbitMQ itself, because they will compete with Erlang for system resources. Can you move them to another server?
The fact that the handshake timeout is 10 seconds but you're seeing longer intervals in the log suggests resource contention between your apps and RabbitMQ.

You're right.  I had more or less come to that conclusion as I was writing my previous message.  Taking the time to write out the details cleared things up for me!  It hadn't registered that the > 10 second delay between the connection starting and failing was pointing to the issue.

Splitting the broker and our clients was one of our planned upgrades for scaling this system anyway. I'll post back with results once that is done.

If anyone else has anything to add that might help in the very short term, feel free.  

Cheers,
Dave
Reply all
Reply to author
Forward
0 new messages