Crash in pg_local


jo...@cloudamqp.com

Jul 21, 2018, 12:57:25 AM
to rabbitm...@googlegroups.com
Hi,

This happened on RabbitMQ 3.6.14 / Erlang 20.1.
If possible I'd like to know what can cause the following.
Below is the first occurrence. It happened five more times, and one second after the last occurrence RabbitMQ as a whole crashed.

In the log:

=ERROR REPORT==== 20-Jul-2018::13:19:17 ===
** Generic server pg_local terminating
** Last message in was {'DOWN',#Ref<0.2549218045.20447237.19527>,process,
<0.22113.1037>,killed}
** When Server state == {state}
** Reason for termination ==
** {{badmatch,[]},
[{pg_local,member_died,1,[{file,"src/pg_local.erl"},{line,152}]},
{pg_local,handle_info,2,[{file,"src/pg_local.erl"},{line,124}]},
{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,616}]},
{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,686}]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}

In the SASL log (I include all events at 13:19:17):

=SUPERVISOR REPORT==== 20-Jul-2018::13:19:17 ===
Supervisor: {<0.15388.1024>,rabbit_channel_sup}
Context: shutdown_error
Reason: killed
Offender: [{pid,<0.30855.1023>},
{name,channel},
{mfargs,
{rabbit_channel,start_link,
[1,<0.19187.1022>,<0.7614.1024>,<0.19187.1022>,
<<"10.178.69.0:49010 -> 10.159.1.101:5672">>,
rabbit_framing_amqp_0_9_1,
{user,<<"username">>,[],
[{rabbit_auth_backend_internal,none}]},
<<"vhostname">>,
[{<<"consumer_cancel_notify">>,bool,true},
{<<"connection.blocked">>,bool,true}],
<0.19675.1022>,<0.23715.1023>]}},
{restart_type,intrinsic},
{shutdown,70000},
{child_type,worker}]


=SUPERVISOR REPORT==== 20-Jul-2018::13:19:17 ===
Supervisor: {<0.14460.1014>,rabbit_channel_sup}
Context: shutdown_error
Reason: killed
Offender: [{pid,<0.19778.1016>},
{name,channel},
{mfargs,
{rabbit_channel,start_link,
[1,<0.20896.1016>,<0.9145.1016>,<0.20896.1016>,
<<"10.178.69.0:40316 -> 10.159.1.101:5672">>,
rabbit_framing_amqp_0_9_1,
{user,<<"username">>,[],
[{rabbit_auth_backend_internal,none}]},
<<"vhostname">>,
[{<<"connection.blocked">>,bool,true},
{<<"consumer_cancel_notify">>,bool,true}],
<0.20693.1013>,<0.6444.1014>]}},
{restart_type,intrinsic},
{shutdown,70000},
{child_type,worker}]


=SUPERVISOR REPORT==== 20-Jul-2018::13:19:17 ===
Supervisor: {<0.27924.4489>,rabbit_channel_sup}
Context: shutdown_error
Reason: killed
Offender: [{pid,<0.22113.1037>},
{name,channel},
{mfargs,
{rabbit_channel,start_link,
[1,<0.14241.1039>,<0.25335.1036>,<0.14241.1039>,
<<"10.178.69.0:34004 -> 10.159.1.101:5672">>,
rabbit_framing_amqp_0_9_1,
{user,<<"username">>,[],
[{rabbit_auth_backend_internal,none}]},
<<"vhostname">>,
[{<<"connection.blocked">>,bool,true},
{<<"consumer_cancel_notify">>,bool,true}],
<0.14498.1039>,<0.32646.1038>]}},
{restart_type,intrinsic},
{shutdown,70000},
{child_type,worker}]


=CRASH REPORT==== 20-Jul-2018::13:19:17 ===
crasher:
initial call: pg_local:init/1
pid: <0.8199.7801>
registered_name: pg_local
exception error: no match of right hand side value []
in function pg_local:member_died/1 (src/pg_local.erl, line 152)
in call from pg_local:handle_info/2 (src/pg_local.erl, line 124)
in call from gen_server:try_dispatch/4 (gen_server.erl, line 616)
in call from gen_server:handle_msg/6 (gen_server.erl, line 686)
ancestors: [kernel_safe_sup,kernel_sup,<0.36.0>]
message_queue_len: 8
messages: [{'$gen_cast',{leave,rabbit_channels,<0.32345.18>}},
{'DOWN',#Ref<0.2549218045.280231941.256138>,process,
<0.32345.18>,normal},
{'$gen_cast',{leave,rabbit_channels,<0.18159.19>}},
{'DOWN',#Ref<0.2549218045.280231941.256124>,process,
<0.18159.19>,normal},
{'$gen_cast',{leave,rabbit_connections,<0.23930.16>}},
{'DOWN',#Ref<0.2549218045.2456551430.42010>,process,
<0.23930.16>,normal},
{'$gen_cast',{leave,rabbit_channels,<0.26121.19>}},
{'DOWN',#Ref<0.2549218045.280231941.256141>,process,
<0.26121.19>,normal}]
links: [<0.60.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 2586
stack_size: 27
reductions: 33205921599
neighbours:

=SUPERVISOR REPORT==== 20-Jul-2018::13:19:17 ===
Supervisor: {local,kernel_safe_sup}
Context: child_terminated
Reason: {{badmatch,[]},
[{pg_local,member_died,1,
[{file,"src/pg_local.erl"},{line,152}]},
{pg_local,handle_info,2,
[{file,"src/pg_local.erl"},{line,124}]},
{gen_server,try_dispatch,4,
[{file,"gen_server.erl"},{line,616}]},
{gen_server,handle_msg,6,
[{file,"gen_server.erl"},{line,686}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,247}]}]}
Offender: [{pid,<0.8199.7801>},
{id,pg_local},
{mfargs,{pg_local,start_link,[]}},
{restart_type,permanent},
{shutdown,4294967295},
{child_type,worker}]


=SUPERVISOR REPORT==== 20-Jul-2018::13:19:17 ===
Supervisor: {<0.9556.3717>,rabbit_channel_sup}
Context: shutdown_error
Reason: killed
Offender: [{pid,<0.29821.7589>},
{name,channel},
{mfargs,
{rabbit_channel,start_link,
[1,<0.28151.865>,<0.25356.3411>,<0.28151.865>,
<<"10.178.69.0:43124 -> 10.159.1.101:5672">>,
rabbit_framing_amqp_0_9_1,
{user,<<"username">>,[],
[{rabbit_auth_backend_internal,none}]},
<<"vhostname">>,
[{<<"connection.blocked">>,bool,true},
{<<"consumer_cancel_notify">>,bool,true}],
<0.24576.860>,<0.6743.864>]}},
{restart_type,intrinsic},
{shutdown,70000},
{child_type,worker}]


=SUPERVISOR REPORT==== 20-Jul-2018::13:19:17 ===
Supervisor: {<0.7325.7478>,rabbit_channel_sup_sup}
Context: shutdown_error
Reason: shutdown
Offender: [{nb_children,1},
{name,channel_sup},
{mfargs,{rabbit_channel_sup,start_link,[]}},
{restart_type,temporary},
{shutdown,infinity},
{child_type,supervisor}]

Michael Klishin

Jul 23, 2018, 6:09:34 PM
to rabbitm...@googlegroups.com
Looks like a race condition between two operations that use pg_local ETS tables.
One was looking up a local channel process and another one removed it from the table
so no results were returned.

pg_local can be made more defensive to do nothing in such cases.
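
To illustrate the idea, here is a minimal sketch of the difference between a strict match on an ETS lookup and a defensive one. It is not the actual pg_local source: the module name, table name, and helper functions are hypothetical, and the scenario (an entry removed before the 'DOWN' message for the same monitor reference is handled) is an assumption based on the description above.

```erlang
-module(member_died_sketch).
-export([demo/0]).

%% Hypothetical table name, chosen for this sketch only.
-define(TAB, member_died_sketch_table).

%% Strict version: assumes the monitor reference is always still present.
%% An empty lookup result fails the match with {badmatch, []}.
strict_member_died(Ref) ->
    [{{ref, Ref}, Pid}] = ets:lookup(?TAB, {ref, Ref}),
    forget(Ref, Pid).

%% Defensive version: an empty lookup result means another operation
%% already removed the entry, so there is nothing left to do.
defensive_member_died(Ref) ->
    case ets:lookup(?TAB, {ref, Ref}) of
        [{{ref, Ref}, Pid}] -> forget(Ref, Pid);
        [] -> ok
    end.

%% Remove both directions of the (hypothetical) ref <-> pid mapping.
forget(Ref, Pid) ->
    true = ets:delete(?TAB, {ref, Ref}),
    true = ets:delete(?TAB, {pid, Pid}),
    ok.

demo() ->
    ?TAB = ets:new(?TAB, [named_table, ordered_set]),
    Pid = self(),
    Ref = make_ref(),
    true = ets:insert(?TAB, [{{ref, Ref}, Pid}, {{pid, Pid}, Ref}]),
    %% One operation removes the entry first ...
    ok = forget(Ref, Pid),
    %% ... so a later lookup for the same reference finds nothing.
    Strict = try strict_member_died(Ref)
             catch error:Reason -> {crashed, Reason}
             end,
    Defensive = defensive_member_died(Ref),
    true = ets:delete(?TAB),
    {Strict, Defensive}.
```

Calling `member_died_sketch:demo()` returns `{{crashed,{badmatch,[]}},ok}`: the strict match fails with the same `{badmatch,[]}` reason seen in the log, while the defensive version treats the missing entry as a no-op.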



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Michael Klishin

Sep 14, 2018, 8:05:26 PM
to rabbitmq-users
This was reported again today and I put together a test that reproduces it with a PR that

jo...@cloudamqp.com

Sep 15, 2018, 9:59:06 PM
to rabbitm...@googlegroups.com
Thank you very much for fixing this and for following up!

Johan