Leaking channels and erlang processes - MQTT high connection churn


Colin Christ

Jun 25, 2019, 6:01:16 PM
to rabbitmq-users
We are seeing leaking channels and erlang processes with our current RabbitMQ setup. We are currently seeing 8k connections with 230k channels and 240k+ erlang processes on each node. Channels started off at roughly double the number of connections and have slowly grown to over 25x the number of connections.

Our setup is as follows:
RabbitMQ version 3.7.11 with the MQTT plugin
4 rabbit nodes in a cluster, 1 of which is a stats-only node (we did this because stats collection was causing high memory usage), and each node has 12GB of RAM
We have lowered the TCP FIN timeout to 15 seconds
sndbuf / recbuf set to 15k (a rough config sketch is below)
We also have a policy set on the MQTT subscription queues that turns on HA (mirrors to 1 other node).
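
For reference, those listener buffer settings look roughly like this in the classic Erlang-term config (rabbitmq.config / advanced.config) for 3.7.x. This is just a sketch, not a copy of our actual file:

%% Sketch only -- buffer values as described above, everything else illustrative
[
  {rabbitmq_mqtt, [
    {tcp_listen_options, [
      {nodelay, true},
      {sndbuf,  15000},   %% ~15k send buffer
      {recbuf,  15000}    %% ~15k receive buffer
    ]}
  ]}
].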

Anyone have any ideas what is causing channels and erlang processes to continue to climb? Memory usage is also continuing to grow over time.

Thanks,

Colin

Luke Bakken

Jun 25, 2019, 6:51:48 PM
to rabbitmq-users
Hi Colin,

My first suggestion is to use the latest version of RabbitMQ, 3.7.15. Version 3.7.12 has some MQTT fixes that may or may not be related to what you see.

Your subject says "high connection churn" ... can you be more specific about that? Do your MQTT apps gracefully disconnect?

Thanks -
Luke

Colin Christ

Jun 25, 2019, 7:11:38 PM
to rabbitmq-users
Hi Luke,

We have also been looking into upgrading to the latest version of Rabbit.

To answer your question, our MQTT devices are all on cellular connections and are in moving vehicles, constantly bouncing between cell towers etc., so they are likely to disconnect ungracefully and frequently. I did some quick math and it looks like we are seeing an average of 6 reconnects per day per device (some in bad cellular areas have more, others less).

Thanks,

Colin

Luke Bakken

Jul 1, 2019, 7:02:39 PM
to rabbitmq-users
Hi Colin,

I've had a difficult time reproducing your issue because every way I can think of to make an MQTT client disappear results in the TCP socket closing, and RabbitMQ cleaning up resources.

Are you using a firewall or proxy that may be holding connections open, or are your MQTT devices connecting directly to RabbitMQ?

Thanks -
Luke

Colin Christ

Jul 1, 2019, 7:54:51 PM
to rabbitmq-users
Hi Luke,

The connections are not being held open. Connection count is going down as expected, but channel count is not. In fact, the channel list has many orphaned channels; some of them we try to click into and they 404, others we click into and they have no connection associated with them (it's blank).

We have since scaled the cluster to 10 nodes; you can see in the attached image that the 3 original nodes (02 - 04) have much higher memory usage and erlang process counts for the same number of connections.

We bounced the 3 original nodes and things appear to have returned to normal over the last few days (the screenshot is from prior to the bounce), but we would like to know how this happened so we can avoid it in the future.

Thanks,

Colin
rabbit.png

Luke Bakken

Jul 1, 2019, 8:58:25 PM
to rabbitmq-users
Hi Colin,

Could you please clarify if your MQTT devices are connecting directly to RabbitMQ or not?

The connection count in the UI should be accurate, but the definitive method would be to use a tool like netstat, lsof or ss to directly count the number of open TCP connections to port 1883. Would you mind checking that on your nodes?
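
If shelling into each box for netstat is awkward, a rough cross-check can also be run from each node's Erlang shell (for example via rabbitmqctl eval or a remote shell). This sketch just counts the TCP sockets owned by the VM whose local port is 1883:

%% rough count of open TCP sockets on this node with local port 1883 (MQTT)
length([P || P <- erlang:ports(),
             erlang:port_info(P, name) =:= {name, "tcp_inet"},
             case inet:sockname(P) of
                 {ok, {_Addr, 1883}} -> true;
                 _ -> false
             end]).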

The reason I'd like to get an accurate picture of how your devices connect to RabbitMQ is to be able to reproduce what you observe in my environment. As I said before, I see resources cleaned up when TCP connections close.

Finally, a 10-node cluster is enormous. We really don't recommend more than 5 nodes. Are you mirroring any of your queues?

Thanks -
Luke

Colin Christ

Jul 1, 2019, 9:18:13 PM
to rabbitmq-users
Hey Luke,

We have an F5 configuration in front of rabbit that load balances (least connections) and terminates SSL. We are mirroring the device subscription queues (to 1 other node each). The leaking channel / erlang process issue resolved after doing a bounce 3 days ago, so I'm not sure getting the connection count now would be helpful.

We scaled out to 10 nodes to support a rollout to 3x more devices (in the short term), and quite a bit more than that in the long term. Why is scaling wide not recommended? For 5 nodes to support our use case it seems they would have to be pretty beefy; we were struggling to support 10k devices with 4 nodes (1 being stats-only) at 12 GB of RAM each.

Thanks,

Colin

Luke Bakken

Jul 2, 2019, 11:00:04 AM
to rabbitmq-users
Hi Colin,

Our main concern with 10 nodes is users who feel like they have to mirror queues to every other node, which isn't your case. You might want to consider two 5-node clusters instead of a single 10-node one, as that would limit the effects of a network partition to one cluster, for instance.

If you see the leaking channel / process issue again, please start by getting a count of connections to port 1883 via a command like netstat on each of your cluster nodes.

If you have time now, I would verify that when one of your MQTT devices goes away unexpectedly, the F5 load balancer closes the associated TCP connection to RabbitMQ within a reasonable time frame. RabbitMQ will produce a log message like this:

2019-07-02 07:59:18.828 [info] <0.641.0> MQTT connection "[::1]:30463 -> [::1]:1883" will terminate because peer closed TCP connection

Thanks,
Luke

Colin Christ

Jul 2, 2019, 2:05:02 PM
to rabbitmq-users
Hey Luke,

When I force an ungraceful disconnect (power off the device), I see the following in the rabbit logs about 2 minutes later, which I believe is double the 60-second keepalive:

2019-07-02 10:59:03.487 [error] <0.20373.50> closing MQTT connection "10.19.100.201:51994 -> 10.19.149.149:1883" (keepalive timeout)

I do see instances of:
2019-07-02 02:24:30.597 [info] <0.18295.87> MQTT connection "10.19.100.203:58683 -> 10.19.149.149:1883" will terminate because peer closed TCP connection

But I'm not sure how to make it happen.

Thanks,

Colin

Luke Bakken

Jul 2, 2019, 6:36:07 PM
to rabbitmq-users
Hi Colin,

I'm assuming that 10.19.100.201 and 10.19.100.203 are IP addresses associated with your load balancer.

The first message is interesting, because it suggests that the load balancer keeps the TCP connection open to RabbitMQ and it's only the MQTT keep alive feature that enables RabbitMQ to detect a dead connection. I would have expected to see the second message if the load balancer immediately closes the connection to RabbitMQ.


Colin Christ

Jul 2, 2019, 8:54:08 PM
to rabbitmq-users
Hi Luke,

For an ungraceful disconnect, the device does not close the TCP connection with the F5, so the F5 will keep the connection to rabbit open until TCP timeout (I think 5 min).

Thanks,

Colin

Michael Klishin

Jul 31, 2019, 7:16:43 PM
to rabbitmq-users
This is a classic problem with intermediaries. They must use the same timeout interval as their downstreams, or else confusion will set in.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Sean McEvoy

Aug 15, 2019, 12:42:19 PM
to rabbitmq-users
Hi Luke, Colin,
We're seeing something similar here.
We're using RabbitMQ version 3.7.8 with STOMP.
Been having issues lately with a handful of isolated messages being delayed by up to 20 seconds, but that may be irrelevant to this thread.
While looking closer today we realised we have a much larger than expected number of channels, way out of proportion to the number of connections.
Poking around in the source code and VM, I see that the stats are read from these tables:

(rab...@rabbitmq-prod-0.rabbitmq-prod.platform.svc.cluster.local)11> ets:info(connection_created_stats).
[{read_concurrency,false},
 {write_concurrency,false},
 {compressed,false},
 {memory,65639},
 {owner,<0.2126.0>},
 {heir,none},
 {name,connection_created_stats},
 {size,234},
 {node,'rab...@rabbitmq-prod-0.rabbitmq-prod.platform.svc.cluster.local'},
 {named_table,true},
 {type,set},
 {keypos,1},
 {protection,public}]
(rab...@rabbitmq-prod-0.rabbitmq-prod.platform.svc.cluster.local)12> ets:info(channel_created_stats).  
[{read_concurrency,false},
 {write_concurrency,false},
 {compressed,false},
 {memory,112250},
 {owner,<0.2126.0>},
 {heir,none},
 {name,channel_created_stats},
 {size,1470},
 {node,'rab...@rabbitmq-prod-0.rabbitmq-prod.platform.svc.cluster.local'},
 {named_table,true},
 {type,set},
 {keypos,1},
 {protection,public}]

Taking a sample from the top of the channel table I can see an application level session id from our system and a pid:

(rab...@rabbitmq-prod-0.rabbitmq-prod.platform.svc.cluster.local)13> T = channel_created_stats.     
channel_created_stats
(rab...@rabbitmq-prod-0.rabbitmq-prod.platform.svc.cluster.local)14> ets:lookup(T,ets:first(T)).
[{<0.7809.2431>,
  <<"100.98.14.136:52438 -> 100.122.224.203:15674 (1)">>,
  [{pid,<0.7809.2431>},
   {name,<<"100.98.14.136:52438 -> 100.122.224.203:15674 (1)">>},
   {connection,<0.10397.2433>},
   {number,1},
   {user,<<"9f279924-6b0b-4cfb-ae8e-c51915eed874">>},
   {vhost,<<"/">>},
   {user_who_performed_action,<<"9f279924-6b0b-4cfb-ae8e-c51915eed874">>}]}]

From our system I can tell that the application level session ended on the 18th of July, almost a month ago, and this channel should have ended then.
Looking at the pid I can see that it seems stuck in an rpc call and has 2 unprocessed messages in its inbox.

(rab...@rabbitmq-prod-0.rabbitmq-prod.platform.svc.cluster.local)15> process_info(pid(0,7809,2431)).
[{current_function,{rpc,do_call,3}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,2},
 {messages,[{'DOWN',#Ref<0.2422516239.3161980930.36842>,
                    process,<0.14151.2431>,normal},
            {'DOWN',#Ref<0.2422516239.1343750147.49402>,process,
                    <0.7096.2431>,normal}]},
 {links,[<0.16250.2431>]},
 {dictionary,[{rand_seed,{#{jump => #Fun<rand.16.15449617>,max => 288230376151711743,
                            next => #Fun<rand.15.15449617>,type => exsplus},
                          [284571379084928398|184058334907653253]}},
              {guid,{{2961902178,4273135356,688403066,666051968},0}},
              {'$ancestors',[<0.16250.2431>,rabbit_direct_client_sup,
                             rabbit_sup,<0.237.0>]},
              {{credit_from,<14497.1876.0>},398},
              {msg_size_for_gc,1197},
              {delegate,delegate_4},
              {channel_operation_timeout,15000},
              {{credit_from,<14498.1879.0>},398},
              {credit_flow_default_credit,{400,200}},
              {'$initial_call',{rabbit_channel,init,1}},
              {process_name,{rabbit_channel,{<<"100.98.14.136:52"...>>,
                                             1}}},
              {{credit_from,<0.1253.0>},399},
              {{credit_to,<0.15615.2431>},199},
              {{xtype_to_module,direct},rabbit_exchange_type_direct}]},
 {trap_exit,true},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.236.0>},
 {total_heap_size,2589},
 {heap_size,987},
 {stack_size,22},
 {reductions,28084},
 {garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,size => 0}},
                      {min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,14}]},
 {suspending,[]}]

From the source code I think it'll all clean up properly if it processes these 2 messages, but I can't trace the rpc call that it's making. I guess it used a timeout of infinity.
That's as far as I've gotten so far.
The ops guys here are scheduling a restart over the weekend to clear the stale channels; I can grab any other state info that ye think might help before that.
Let us know what ye think,
//Sean.

sean mcevoy

Aug 16, 2019, 10:51:34 AM
to rabbitmq-users
Hi again,
We have a little more info: we have these zombie channel processes building up whose connections have closed.
There are over 900 on the node in our system that I examined.
These processes all have the process info: {dictionary,[...{'$initial_call',{rabbit_channel,init,1}}...
and are all deadlocked with the process info: {current_function,{rpc,do_call,3}}
I can't see any obvious path from the rabbit_channel.erl process that makes an rpc call, but haven't yet followed it exhaustively.
Between the 900 zombie channel processes there are over 4,000 queued messages; each process has 2 unprocessed 'DOWN' messages, presumably from the connection process when it terminated.
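
For reference, here's the rough one-liner I'm using to count them from the Erlang shell (just a sketch, not an official rabbit command):

%% count processes started as rabbit_channel that are currently
%% sitting in rpc:do_call/3
length([P || P <- erlang:processes(),
             erlang:process_info(P, current_function) =:= {current_function, {rpc,do_call,3}},
             case erlang:process_info(P, dictionary) of
                 {dictionary, D} -> lists:keyfind('$initial_call', 1, D)
                                        =:= {'$initial_call', {rabbit_channel,init,1}};
                 _ -> false
             end]).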
There are also loads of application-level messages, which points to this being the root cause of our issue with delayed messages.
I've also found a handful of gen_server call requests; I'm investigating if there's a circular gen_server call that might be causing the deadlock.
Colin, when you experienced your problem did ye see any application messages getting lost or significantly delayed?
Did you get a resolution? If so, did it involve the load balancer?
//Sean.


Luke Bakken

Aug 20, 2019, 12:00:33 PM
to rabbitmq-users
Hi Sean,

Thanks for all of this information. I plan on doing some investigation this week into this. If I need more information I will follow up here.

Thanks,
Luke



Luke Bakken

Aug 20, 2019, 4:10:27 PM
to rabbitmq-users
Hi Sean,

Could you provide your complete RabbitMQ configuration file(s)? From your hostname it appears you're running a cluster ... how many nodes? Are you also using a load balancer?

Thanks -
Luke


Sean McEvoy

Aug 21, 2019, 6:09:20 AM
to rabbitmq-users
Hi Luke,

Thanks for taking the time!
Config files attached. Yes, we have a 5 node cluster and are using a load balancer.

We restarted the cluster over the weekend and there have been no new instances since, continuing to monitor.
Before that the cluster had been up since mid April, so around 4 months for the 900 channels to deadlock on one node. We haven't changed anything else so I expect it's just a matter of time until we get new instances.
All the old instances appear to have been on channels used by STOMP, not sure if I mentioned that already.
Am looking at the source code too but it's a steep learning curve.

Cheers!
//Sean.

Luke Bakken

Aug 27, 2019, 6:09:55 PM
to rabbitmq-users
Hi Sean,

There are quite a few places in the RabbitMQ code where rpc:call is used with an infinite timeout. If you find another channel process in the "stuck" state, the following function should return the current stack trace:

P = pid(0, 743, 0),
process_info(P, [current_stacktrace]).

That information is not included in the default output of process_info/1.
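
Since there may be hundreds of them, something like this sketch would collect the current stack trace of every process sitting in rpc:do_call/3 in one pass:

%% {Pid, {current_stacktrace, Trace}} for each process blocked in rpc:do_call/3
[{P, erlang:process_info(P, current_stacktrace)}
 || P <- erlang:processes(),
    erlang:process_info(P, current_function) =:= {current_function, {rpc,do_call,3}}].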

I'm continuing to investigate for a while but obviously a stack trace would be the most helpful!

Thanks,
Luke

sean mcevoy

Aug 29, 2019, 8:04:49 AM
to rabbitmq-users
Hi Luke,

After almost 2 weeks of normal operation we have a spike of 6 new instances of this today, shared between 2 nodes in the 5-node cluster.

They all have this current stacktrace:

  {current_stacktrace,
   [{rpc,do_call,3,[{file,"rpc.erl"},{line,339}]},
    {rabbit_amqqueue,notify_down_all,3,[{file,"src/rabbit_amqqueue.erl"},{line,920}]},
    {rabbit_channel,notify_queues,1,[{file,"src/rabbit_channel.erl"},{line,1807}]},
    {rabbit_channel,terminate,2,[{file,"src/rabbit_channel.erl"},{line,696}]},
    {gen_server2,terminate,3,[{file,"src/gen_server2.erl"},{line,1152}]},
    {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,257}]}
   ]},


Here's an example dictionary; they all have the channel_operation_timeout set to 15 seconds:

  {dictionary,
   [...
    {guid,{{2321157692,2655692948,1364756044,3330968543},26}},
    {{credit_to,<14498.16243.654>},173},
    {delegate,delegate_7},
    {channel_operation_timeout,15000},
    ...
    {process_name,{rabbit_channel,{<<"100.11"...>>,1}}},
    {'$initial_call',{rabbit_channel,init,1}},
    ...
   ]},


Here's an example message inbox:

  {messages,
    [{'DOWN',#Ref<14498.1728707486.3170107395.63313>,process,<14498.31100.632>,normal},
     {'DOWN',#Ref<14498.1728707486.4011327490.124065>,process,<14498.22570.632>,normal},
     {'$gen_call',{<14498.5676.0>,#Ref<14498.1728707486.3174039555.64401>},{info,[name]}},
     {'$gen_call',{<14500.5692.0>,#Ref<14500.2804211400.3725066242.201802>},{info,[name]}},
     {'$gen_call',{<14497.4055.0>,#Ref<14497.699163620.2283274243.154837>},{info,[name]}},
     {'$gen_call',{<0.5303.0>,#Ref<0.4005365105.2703491074.137248>},{info,[name]}},
     emit_stats,
     {'$gen_call',{<14499.5322.0>,#Ref<14499.1785546170.3314286596.209490>},{info,[name]}}
    ]}


From a quick code review I see that rabbit_amqqueue:notify_down_all/3 calls rpc with:
rpc:call(node(), delegate, invoke, [QPids, {gen_server2, call, [{notify_down, ChPid}, infinity]}], Timeout).

where the variable Timeout has the value 15000 and QPids is a list of pids of rabbit_prequeue.erl processes.
This translates to the call:
delegate:invoke(QPids, {gen_server2, call, [{notify_down, ChPid}, infinity]}).

where each rabbit_prequeue process gets a gen_server2:call/3 with timeout infinity.
But even if that never returns, I don't see why the rpc call wouldn't time out after 15 seconds.
At a purely erlang level, I'm confused :-P
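
Just to sanity-check my own understanding with a toy example (nothing to do with rabbit's code):

%% rpc:call/5 with a finite timeout normally gives up even when the remote
%% function never returns
rpc:call(node(), timer, sleep, [infinity], 5000).
%% => {badrpc, timeout} after ~5 seconds

So I'd expect the real call with Timeout = 15000 to come back with {badrpc, timeout} rather than hang forever.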

//Sean.



Luke Bakken

Aug 29, 2019, 12:07:54 PM
to rabbitmq-users
Hi Sean,

This is great and we should be able to figure out what is going on. Please follow this issue: https://github.com/rabbitmq/rabbitmq-server/issues/2090

Again, this is very helpful and we appreciate the time you've taken to dig into this. Issues like this that are hard (or impossible) for the RMQ team to reproduce depend on users like yourself to get fixed!

Thanks,
Luke

Luke Bakken

Aug 29, 2019, 12:29:54 PM
to rabbitmq-users
Hey Sean,

What version of Erlang are you using? I haven't asked that yet.

Thanks,
Luke

sean mcevoy

Aug 29, 2019, 5:19:41 PM
to rabbitmq-users
Hi Luke,
Very happy to make a contribution! We at least owe the rabbit project this!
We're using Rabbit 3.7.8 on OTP 20 (erts-9.3.3.3).
2 weeks of running with no instances of these deadlocked channels, then over the previous night we had 6, and now we're up to 19, with all new instances following the same patterns described last time.
I don't know if we can read anything into the way they've started building up all of a sudden.
Regards,
//Sean.


Luke Bakken

Aug 29, 2019, 6:12:30 PM
to rabbitmq-users
Hi Sean,

Here's something you can try to get a stuck channel process un-stuck:



Get the stuck channel process identifier like you have before (example numbers used here):

 ChPid = pid(0,7809,2431).

Then, send that process any message, like this:

ChPid ! {self(), test_message}.

Then, re-examine the process using the process_info function like you did before. I would be interested to see if the mailbox still has those two DOWN messages at the head.
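
Something like this (a sketch) would show whether those two DOWN messages are still at the head of the mailbox and whether the process is still sitting in rpc:do_call/3:

process_info(ChPid, [messages, message_queue_len, current_function]).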

Thanks again,
Luke

Luke Bakken

Aug 29, 2019, 8:35:06 PM
to rabbitmq-users
Hi Sean,

After digging through the code to try to figure out how the rpc:call implementation, or something called by it, could possibly be receiving the incorrect DOWN message, I decided to check out the release notes for erts-9.3.3.3 -

http://erlang.org/doc/apps/erts/notes.html#erts-9.3.3.3

Then I noticed, serendipitously, what was fixed in erts-9.3.3.5: http://erlang.org/doc/apps/erts/notes.html#erts-9.3.3.5

ERTS internal trees of monitor structures could get into an inconsistent state. This could cause 'DOWN' messages not to be delivered when they should, as well as delivery of 'DOWN' messages that should not be delivered

That pretty much describes exactly what is happening in your scenario.
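
A quick way to confirm the exact ERTS patch level on a running node (sketch):

%% OTP release and ERTS version of the node you're attached to
{erlang:system_info(otp_release), erlang:system_info(version)}.
%% e.g. {"20","9.3.3.3"} -- anything older than erts-9.3.3.5 still has the monitor bug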

So, what I think you should do is upgrade OTP to at least version 20.3.8.11...


But since you're using RabbitMQ 3.7.8 you can upgrade to the latest Erlang 21.X version, which is what I recommend - https://www.rabbitmq.com/which-erlang.html#compatibility-matrix

Upgrading Erlang hopefully won't be too much of a hassle but I'm pretty sure this is what is causing your issue.

Thanks,
Luke


Luke Bakken

Aug 29, 2019, 8:37:38 PM
to rabbitmq-users
Hi Colin,

I never asked what version of Erlang you're using, but there is a pretty good chance you're running into a bug in Erlang itself.


Thanks,
Luke

M K

Aug 30, 2019, 4:06:29 AM
to rabbitmq-users
Note that Erlang 20.3 support goes EOL on October 1st. Consider upgrading to at least 21.3 [1], and perhaps upgrading RabbitMQ to 3.7.17 as well [2].

sean mcevoy

Aug 30, 2019, 12:33:51 PM
to rabbitmq-users
Cheers Luke!
It's an rpc call to a queue process that hangs, and the queue process is a gen_server2 which monitors the multi-call.
So that fits perfectly with the theory of misplaced 'DOWN' messages breaking things. It also explains why I was confused at the purely erlang level; I hadn't considered an OTP bug.
Will get our ops guys to update the OTP version and monitor for a while afterwards; hopefully we'll never hear of this issue again.
Regards,
//Sean.

