Hi again,

We have a little more info: we have zombie channel processes building up whose connections have closed. There are over 900 on the node in our system that I examined. These processes all have the process info {dictionary,[...{'$initial_call',{rabbit_channel,init,1}}... and are all deadlocked with the process info {current_function,{rpc,do_call,3}}. I can't see any obvious path from the rabbit_channel.erl process that makes an rpc call, but I haven't yet followed it exhaustively.

Between the 900 zombie channel processes there are over 4,000 queued messages; each process has 2 unprocessed 'DOWN' messages, presumably from the connection process when it terminated. There are also loads of application-level messages, which points to this being the root cause of our issue with delayed messages. I've also found a handful of gen_server call requests, and I'm investigating whether there's a circular gen_server call that might be causing the deadlock.

Colin, when you experienced your problem did you see any application messages getting lost or significantly delayed? Did you get a resolution? If so, did it involve the load balancer?

//Sean.
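A rough sketch, not from the thread itself, of how such processes could be counted from a shell attached to the node, filtering on the two process_info values quoted above:

Zombies = [Pid || Pid <- processes(),
                  %% parked in rpc:do_call/3 ...
                  process_info(Pid, current_function) =:= {current_function, {rpc, do_call, 3}},
                  %% ... and started as a rabbit_channel
                  case process_info(Pid, dictionary) of
                      {dictionary, Dict} -> lists:member({'$initial_call', {rabbit_channel, init, 1}}, Dict);
                      _ -> false
                  end],
length(Zombies).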
On Thu, Aug 15, 2019 at 5:42 PM Sean McEvoy <sean....@gmail.com> wrote:
Hi Luke, Colin,
We're seeing something similar here.
We're using RabbitMQ version 3.7.8 with STOMP.
We've been having issues lately with a handful of isolated messages being delayed by up to 20 seconds, but that may be irrelevant to this thread.
While looking closer today we noticed a much larger than expected number of channels, way out of proportion to the number of connections.
Poking around in the source code and VM, I see that the stats are read from these tables:
(rabbit@rabbitmq-prod-0.rabbitmq-prod.platform.svc.cluster.local)11> ets:info(connection_created_stats).
[{read_concurrency,false},
{write_concurrency,false},
{compressed,false},
{memory,65639},
{owner,<0.2126.0>},
{heir,none},
{name,connection_created_stats},
{size,234},
{node,'rabbit@rabbitmq-prod-0.rabbitmq-prod.platform.svc.cluster.local'},
{named_table,true},
{type,set},
{keypos,1},
{protection,public}]
(rabbit@rabbitmq-prod-0.rabbitmq-prod.platform.svc.cluster.local)12> ets:info(channel_created_stats).
[{read_concurrency,false},
{write_concurrency,false},
{compressed,false},
{memory,112250},
{owner,<0.2126.0>},
{heir,none},
{name,channel_created_stats},
{size,1470},
{node,'rabbit@rabbitmq-prod-0.rabbitmq-prod.platform.svc.cluster.local'},
{named_table,true},
{type,set},
{keypos,1},
{protection,public}]
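A quicker way to see the mismatch, assuming the same attached shell, is to read just the table sizes instead of the full ets:info/1 dumps:

{ets:info(connection_created_stats, size), ets:info(channel_created_stats, size)}.
%% => {234,1470} for the node above, i.e. roughly six channels per connection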
Taking a sample from the top of the channel table, I can see an application-level session id from our system and a pid:
(rabbit@rabbitmq-prod-0.rabbitmq-prod.platform.svc.cluster.local)13> T = channel_created_stats.
channel_created_stats
(rabbit@rabbitmq-prod-0.rabbitmq-prod.platform.svc.cluster.local)14> ets:lookup(T,ets:first(T)).
[{<0.7809.2431>,
<<"100.98.14.136:52438 -> 100.122.224.203:15674 (1)">>,
[{pid,<0.7809.2431>},
{name,<<"100.98.14.136:52438 -> 100.122.224.203:15674 (1)">>},
{connection,<0.10397.2433>},
{number,1},
{user,<<"9f279924-6b0b-4cfb-ae8e-c51915eed874">>},
{vhost,<<"/">>},
{user_who_performed_action,<<"9f279924-6b0b-4cfb-ae8e-c51915eed874">>}]}]
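As a sanity check (a sketch, run on the same node), the connection pid recorded for this channel can be probed directly; it is expected to come back false if the connection has already gone away:

is_process_alive(pid(0, 10397, 2433)).
%% expected: false for an orphaned channel whose connection has closed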
From our system I can tell that the application level session ended on the 18th of July, almost a month ago, and this channel should have ended then.
Looking at the pid I can see that it seems stuck in an rpc call and has 2 unprocessed messages in its inbox.
(rabbit@rabbitmq-prod-0.rabbitmq-prod.platform.svc.cluster.local)15> process_info(pid(0,7809,2431)).
{dictionary,
    [...
     {guid,{{2321157692,2655692948,1364756044,3330968543},26}},
     {{credit_to,<14498.16243.654>},173},
     {delegate,delegate_7},
     {channel_operation_timeout,15000},
     ...
     {process_name,{rabbit_channel,{<<"100.11"...>>,1}}},
     {'$initial_call',{rabbit_channel,init,1}},
     ...
    ]},
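Rather than a full process_info/1 dump, the handful of fields of interest can be pulled in one call; a sketch using the pid from the ets:lookup earlier:

process_info(pid(0, 7809, 2431),
             [current_function, message_queue_len, messages, dictionary]).
%% for one of these stuck channels, current_function should be {rpc,do_call,3}
%% and message_queue_len should be 2 (the unprocessed 'DOWN' messages)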
ERTS internal trees of monitor structures could get into an inconsistent state. This could cause 'DOWN' messages not to be delivered when they should, as well as delivery of 'DOWN' messages that should not be delivered.
Hi Luke,

Very happy to make a contribution! We at least owe the rabbit project this!

We're using Rabbit 3.7.8 on OTP 20 (erts-9.3.3.3).

Two weeks of running with no instances of these deadlocked channels, then over the previous night we had 6, and now we're up to 19, with all new instances following the same patterns described the last time. I don't know if we can read anything into the way they've started building up all of a sudden.

Regards,
//Sean.
On Thu, Aug 29, 2019 at 5:30 PM Luke Bakken <lba...@pivotal.io> wrote:
Hey Sean,

What version of Erlang are you using? I haven't asked that yet.

Thanks,
Luke

From a quick code review I see that rabbit_amqqueue:notify_down_all/3 calls rpc with:
rpc:call(node(), delegate, invoke, [QPids, {gen_server2, call, [{notify_down, ChPid}, infinity]}], Timeout).
where the variable Timeout has the value 15000 and QPids is a list of pids of rabbit_prequeue.erl processes.
This translates to the call:
delegate:invoke(QPids, {gen_server2, call, [{notify_down, ChPid}, infinity]}).
where each rabbit_prequeue process gets a gen_server2:call/3 with timeout infinity.
But even if that never returns, I don't see why the rpc call wouldn't time out after 15 seconds.
At a purely Erlang level, I'm confused :-P
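In isolation the timeout does behave as expected; a minimal sketch, runnable in any Erlang shell and not RabbitMQ-specific, where the invoked function blocks forever but the call still returns once the supplied timeout expires:

rpc:call(node(), timer, sleep, [infinity], 15000).
%% returns {badrpc,timeout} after roughly 15 seconds, even though timer:sleep/1 never returns

That behaviour is what makes channels sitting in rpc:do_call/3 for weeks so puzzling.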
--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/rabbitmq-users/a724162f-f7c4-4b7e-9bab-a960c6de50f1%40googlegroups.com.