RMQ java client thread blocked/lost permanently, server partially "stuck"


John Doe

Jun 8, 2016, 8:12:29 AM
to rabbitmq-users
Hi All,

We have observed a very strange issue where an RMQ Java client thread is stuck on a TCP/AMQP call permanently, as if the server will never reply, which effectively renders the system inoperable.

Multiple samples of jstack over a long time period confirm that the thread is waiting on a reply that never arrives (please see attachment). The workload is really low (around 100 messages per second on a beefy box). Non-clustered setup.

While troubleshooting, we found that "rabbitmqctl list_channels" also gets stuck, even though "rabbitmqctl list_connections" works fine (shown in attachment). Killing that specific connection via "rabbitmqctl close_connection ..." recovers the lost thread.

I'm attaching all relevant logs.

Any tips appreciated.

Thanks

RMQ server version: 3.5.6
Erlang: 18.1
Java client: 3.5.6


rmq-status.txt
rmq-log.txt
rmq-log-sasl.txt
rmq-client-jstack.txt

Michael Klishin

Jun 8, 2016, 4:21:35 PM
to rabbitm...@googlegroups.com
rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().' will display processes that appear to be blocked
(according to their stack traces over a sampling period).

I recall a few issues that could lead to blocked channels that were fixed after 3.5.6, so upgrade
to 3.6.1 or later.

Closing a connection terminates all channels on that connection, cleanly or forcefully, and
starts the shutdown sequence in the Java client, which unblocks all pending continuations.

There are versions of most protocol methods that don't wait for a response (those methods end in "NoWait"),
in case you'd rather do that.
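To make the difference concrete, here is a minimal sketch of the two styles using queue deletion as an example (hypothetical queue name; assumes a broker reachable on localhost and a 3.6.x-era Java client, so it is illustrative rather than runnable standalone):

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class NoWaitSketch {
    public static void main(String[] args) throws Exception {
        Connection conn = new ConnectionFactory().newConnection();
        Channel ch = conn.createChannel();

        // Blocking variant: sends Queue.Delete, then the calling thread parks in
        // BlockingCell.get() until Queue.DeleteOk arrives from the server.
        ch.queueDelete("work.queue");

        // "NoWait" variant: sets the no-wait flag, so the server sends no reply
        // and the calling thread returns immediately. The trade-off: errors are
        // not reported synchronously; they surface later as a channel shutdown.
        ch.queueDeleteNoWait("work.queue", false, false);

        ch.close();
        conn.close();
    }
}
```

The blocking variant is exactly what shows up in the attached jstack captures as a thread parked in `BlockingCell.get()`.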


--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

John Doe

Jun 9, 2016, 11:45:03 AM
to rabbitmq-users
Thanks for the quick reply. We have switched delete ops to NoWait, but the issue now impacts even queue binds (see jstack capture below).

Will try with new version.

Thanks!

"msglib-rabbitmq2_5672" #19 daemon prio=5 os_prio=0 tid=0x00007fcf0c814800 nid=0x25 in Object.wait() [0x00007fcea849b000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at com.rabbitmq.utility.BlockingCell.get(BlockingCell.java:50)
        - eliminated <0x00000006cad6d1d8> (a com.rabbitmq.utility.BlockingValueOrException)
        at com.rabbitmq.utility.BlockingCell.uninterruptibleGet(BlockingCell.java:89)
        - locked <0x00000006cad6d1d8> (a com.rabbitmq.utility.BlockingValueOrException)
        at com.rabbitmq.utility.BlockingValueOrException.uninterruptibleGetValue(BlockingValueOrException.java:33)
        at com.rabbitmq.client.impl.AMQChannel$BlockingRpcContinuation.getReply(AMQChannel.java:361)
        at com.rabbitmq.client.impl.AMQChannel.privateRpc(AMQChannel.java:226)
        at com.rabbitmq.client.impl.AMQChannel.exnWrappingRpc(AMQChannel.java:118)
        at com.rabbitmq.client.impl.ChannelN.queueBind(ChannelN.java:933)
        at com.rabbitmq.client.impl.ChannelN.queueBind(ChannelN.java:947)
        at com.rabbitmq.client.impl.ChannelN.queueBind(ChannelN.java:61)


John Doe

Jun 9, 2016, 12:04:08 PM
to rabbitmq-users
Have run rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().' and indeed, it looks like some processes are stuck:

[root@rabbitmq2 /]# rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'
There are 2602 processes.
Investigated 3 processes this round, 5000ms to go.
Investigated 3 processes this round, 4500ms to go.
Investigated 3 processes this round, 4000ms to go.
Investigated 3 processes this round, 3500ms to go.
Investigated 3 processes this round, 3000ms to go.
Investigated 3 processes this round, 2500ms to go.
Investigated 3 processes this round, 2000ms to go.
Investigated 3 processes this round, 1500ms to go.
Investigated 3 processes this round, 1000ms to go.
Investigated 3 processes this round, 500ms to go.
Found 3 suspicious processes.
[{pid,<5509.95.0>},
 {registered_name,[]},
 {current_stacktrace,[{worker_pool_worker,run,2,[]},
                      {worker_pool_worker,handle_call,3,[]},
                      {gen_server2,handle_msg,2,[]},
                      {proc_lib,wake_up,3,
                                [{file,"proc_lib.erl"},{line,250}]}]},
 {initial_call,{proc_lib,init_p,5}},
 {dictionary,[{'$ancestors',[worker_pool_sup,rabbit_sup,<5509.76.0>]},
              {worker_pool_worker,true},
              {'$initial_call',{gen,init_it,6}}]},
 {message_queue_len,0},
 {links,[<5509.91.0>,<5509.24962.789>]},
 {monitors,[]},
 {monitored_by,[<5509.89.0>,<5509.92.0>,<5509.24669.789>]},
 {heap_size,610}]
[{pid,<5509.24669.789>},
 {registered_name,[]},
 {current_stacktrace,
     [{gen,do_call,4,[{file,"gen.erl"},{line,168}]},
      {gen_server2,call,3,[]},
      {rabbit_misc,execute_mnesia_transaction,1,[]},
      {rabbit_misc,execute_mnesia_tx_with_tail,1,[]},
      {rabbit_channel,binding_action,9,[]},
      {rabbit_channel,handle_cast,2,[]},
      {gen_server2,handle_msg,2,[]},
      {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]},
 {initial_call,{proc_lib,init_p,5}},
 {dictionary,
     [{'$ancestors',
          [<5509.24926.789>,<5509.4074.0>,<5509.3910.0>,<5509.3908.0>,
           rabbit_tcp_client_sup,rabbit_sup,<5509.76.0>]},
      {process_name,
          {rabbit_channel,{<<"172.18.0.14:43190 -> 172.18.0.3:5672">>,3795}}},
      {permission_cache,
          [{{resource,<<"/">>,exchange,
                <<"GetDynamic_6ac36807-dad2-4b8f-9a99-a99e6c5b1b9a.tenant_79bb5a24-7688-404a-a4c6-f1ed0aca9f9a.dev">>},
            read},
           {{resource,<<"/">>,queue,
                <<"GetDynamic_6ac36807-dad2-4b8f-9a99-a99e6c5b1b9a.tenant_79bb5a24-7688-404a-a4c6-f1ed0aca9f9a.dev">>},
            write}]},
      {'$initial_call',{gen,init_it,6}}]},
 {message_queue_len,1},
 {links,[<5509.24926.789>]},
 {monitors,[{process,<5509.95.0>}]},
 {monitored_by,[<5509.3911.0>,<5509.371.0>]},
 {heap_size,1598}]
[{pid,<5509.24962.789>},
 {registered_name,[]},
 {current_stacktrace,
     [{timer,sleep,1,[{file,"timer.erl"},{line,153}]},
      {mnesia_tm,restart,9,[{file,"mnesia_tm.erl"},{line,914}]},
      {rabbit_misc,'-execute_mnesia_transaction/1-fun-0-',1,[]},
      {worker_pool_worker,'-run/2-fun-0-',3,[]}]},
 {initial_call,{erlang,apply,2}},
 {dictionary,
     [{{xtype_to_module,direct},rabbit_exchange_type_direct},
      {mnesia_activity_state,
          {mnesia,{tid,2595519,<5509.24962.789>},{tidstore,15793487974,[],1}}},
      {random_seed,{10186,19477,3807}},
      {worker_pool_worker,true}]},
 {message_queue_len,2},
 {links,[<5509.95.0>,<5509.123.0>]},
 {monitors,[]},
 {monitored_by,[]},
 {heap_size,610}]
ok
[root@rabbitmq2 /]#


Michael Klishin

Jun 9, 2016, 6:39:23 PM
to rabbitm...@googlegroups.com
According to the stack trace, you are still using com.rabbitmq.client.impl.ChannelN.queueBind,
not its "no wait" version.


Michael Klishin

Jun 9, 2016, 6:42:16 PM
to rabbitm...@googlegroups.com
A Mnesia transaction is blocked. This looks like a bug that we identified a few months ago;
it was fixed in Erlang/OTP 18.3.1:



John Doe

Jun 10, 2016, 4:56:17 AM
to rabbitmq-users
Regarding NoWait: we have switched deletes to NoWait, but for other AMQP ops, including queueBind, we semantically need to know when an error happens, so a blocking call (i.e. a reply from the server) is required. Apologies for the confusion.

I have upgraded to erlang-18.3-1.el7.centos.x86_64.rpm and rabbitmq-server-3.6.2-1.noarch.rpm, and with an identical workload/setup we are seeing some warnings in the logs; the RMQ Docker container also crashes after some time. The only difference is the version of Erlang & RMQ; the previous setup was working 24/7 without issues, also on Docker. We made no major changes to the RMQ 3.5.6 config, i.e. mostly defaults.

We are sending around 100 non-persistent messages per second on a box with 8 cores, 32 GB of RAM, and a SAN with low utilization as per "iostat -xd 2" (around 30% IO utilized). Hardware utilization is low across the board.

Any tips in case we are doing something wrong? Do we just need to retune the default RMQ config? Thanks!

From the RMQ log before the crash:
=WARNING REPORT==== 10-Jun-2016::07:36:21 ===
Mnesia(rabbit@rabbitmq1): ** WARNING ** Mnesia is overloaded: {dump_log,write_threshold}
=WARNING REPORT==== 10-Jun-2016::07:42:44 ===
Mnesia(rabbit@rabbitmq1): ** WARNING ** Mnesia is overloaded: {dump_log,write_threshold}
=WARNING REPORT==== 10-Jun-2016::07:42:44 ===
Mnesia(rabbit@rabbitmq1): ** WARNING ** Mnesia is overloaded: {dump_log,write_threshold}



Process exit error:


eheap_alloc: Cannot allocate 6801972448 bytes of memory (of type "old_heap").

John Doe

Jun 10, 2016, 6:52:00 AM
to rabbitmq-users
I think there might be a memory leak in RMQ 3.6.2 or erlang-18.3-1. We have measured the metrics and can see that memory usage of beam.smp keeps growing in an unbounded fashion. Our workload/test is not unbounded: the message rate is less than 200 per second, and messages are consumed by clients. Also, there is no unbounded growth in the number of exchanges, queues, connections, or channels that we create. One dynamic use case we have is: create queue, publish message, do basic.get, delete queue; but even here we perform cleanup and there are no leftover queues.
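For context, the dynamic use case described above amounts to roughly the following per request (a sketch with a hypothetical payload; the queue name is server-generated, and a broker on localhost is assumed):

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.GetResponse;

public class DynamicQueueCycle {
    public static void main(String[] args) throws Exception {
        Connection conn = new ConnectionFactory().newConnection();
        Channel ch = conn.createChannel();

        // create queue (server-named, exclusive, auto-delete)
        String q = ch.queueDeclare().getQueue();
        // publish a message to it via the default exchange
        ch.basicPublish("", q, null, "ping".getBytes("UTF-8"));
        // do basic.get (autoAck = true)
        GetResponse resp = ch.basicGet(q, true);
        // delete the queue explicitly, so nothing is left behind
        ch.queueDelete(q);

        ch.close();
        conn.close();
    }
}
```

Even with explicit cleanup like this, every cycle creates and destroys a queue and its stats entries, which is exactly the kind of churn the management stats database has to absorb.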

The first attachment shows hardware utilization metrics (multiple samples), and the second attachment contains the output of "rabbitmqctl report" taken slightly before the error happened. As can be seen in the latter, metadata size is low and bounded (queue, exchange, connection, channel, and consumer counts are low); the same holds for data, i.e. there are almost no unconsumed messages. We have taken multiple samples of all logs to ensure consistency.

Any tips appreciated, just trying to reduce search space.

It seems that mgmt_db is using ~7 GB:

{memory,
     [{total,7709730440},
      {connection_readers,369632},
      {connection_writers,892296},
      {connection_channels,2358552},
      {connection_other,1518352},
      {queue_procs,2957512},
      {queue_slave_procs,0},
      {plugins,7457911728},
      {other_proc,0},
      {mnesia,752720},
      {mgmt_db,7472868376},
      {msg_index,5421272},
      {other_ets,3940560},
      {binary,41541240},
      {code,27743992},
      {atom,1000601},
      {other_system,106046183}]},

hardware-metrics.txt
report6.txt

Michael Klishin

Jun 10, 2016, 8:51:23 AM
to rabbitm...@googlegroups.com, John Doe
Please see 

https://github.com/rabbitmq/rabbitmq-management/issues/214
https://groups.google.com/forum/#!searchin/rabbitmq-users/3.6.3/rabbitmq-users/zEIAuNQ2rnQ/xP7qT3nKBgAJ

John Doe

Jun 13, 2016, 9:53:47 AM
to rabbitmq-users, rmqus...@gmail.com
Thanks for the quick & awesome help!

We managed to add a temporary workaround on the client side. When we notice the TCP thread getting stuck for a long time, we close the Socket itself from another thread, getting a handle to it via a custom com.rabbitmq.client.SocketConfigurator. We first tried closing com.rabbitmq.client.Connection in a similar manner, but that didn't work (i.e. the connection was not killable from a separate thread).
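The idea of that workaround in miniature (in the real client the Socket reference is captured from the custom SocketConfigurator callback; here a loopback socket pair that never replies stands in for the broker, so the sketch is self-contained and runs on a plain JDK):

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SocketWatchdog {

    static String runOnce() throws Exception {
        // Stand-in "broker" that accepts the connection but never replies.
        ServerSocket server = new ServerSocket(0);
        Socket captured = new Socket("127.0.0.1", server.getLocalPort());

        // Watchdog thread: after a timeout, close the captured socket.
        ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();
        watchdog.schedule(() -> {
            try {
                captured.close(); // closing the socket aborts the blocked read
            } catch (IOException ignored) {
            }
        }, 500, TimeUnit.MILLISECONDS);

        String outcome;
        try {
            captured.getInputStream().read(); // blocks "forever", like the stuck AMQP reply
            outcome = "reply";
        } catch (SocketException e) {
            outcome = "unblocked"; // read aborted by the watchdog's close()
        }
        watchdog.shutdown();
        server.close();
        return outcome;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runOnce()); // prints "unblocked"
    }
}
```

This also explains why closing the Connection object from another thread didn't help: the Connection's own close handshake is an AMQP operation and can block on the same dead socket, whereas Socket#close is a plain OS-level operation.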


Under current plans, when can we expect 3.6.3 to be formally released/GA'd to the official download repo? https://www.rabbitmq.com/releases/rabbitmq-server/


Cheers

Michael Klishin

Jun 13, 2016, 9:56:00 AM
to rabbitm...@googlegroups.com, John Doe
Yes, this is a known workaround for another known issue, which cannot be resolved until we move the client to NIO(2):
Socket#write blocks when the peer becomes unavailable, and there is no way to provide a timeout for that specific operation.
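(For later readers: the NIO transport did eventually land in the Java client; in amqp-client 4.0 and newer it can be enabled on the connection factory. A minimal fragment, under that version assumption:)

```java
ConnectionFactory factory = new ConnectionFactory();
factory.useNio(); // switch from blocking socket I/O to the NIO-based transport
```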

3.6.3 will be released in the next couple of weeks, with an RC hopefully this week.
