Supervisor got noproc (looks like a bug)

Alexander Petrovsky

Sep 7, 2021, 1:21:27 PM
to Erlang Questions
Hello!

Recently, during a performance test, I noticed that when the supervisor
tries to terminate a child, it sometimes gets 'noproc'. My Erlang
version is OTP-23.2.5.

Under certain circumstances, the child (a process started with
'gen_server:start_link/3') decides to stop by returning '{stop,
normal, State}'. At the same time, another process decides to stop
this child with 'supervisor:terminate_child/2'.

In some (rare, but reproducible) cases I get this in the logs:

supervisor: {local,my_sup}
errorContext: shutdown_error
reason: noproc
offender: [{pid,<0.4890.1539>},
           {id,my},
           {mfargs,{my,start_link,[]}},
           {restart_type,temporary},
           {shutdown,5000},
           {child_type,worker}]
(supervisor:do_terminate/2)

According to https://github.com/erlang/otp/blob/OTP-23.2.5/lib/stdlib/src/supervisor.erl#L873-L919
this can happen when the child terminates with reason 'noproc', or when
"child did unlink and the child dies before monitor". But the child
always terminates via '{stop, normal, State}', and the
children never call 'erlang:unlink/1' at all!
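
A 'DOWN' message with reason 'noproc' is what a monitor produces when its target no longer exists. The following is a minimal sketch of that behaviour in isolation, not taken from the supervisor code; the module name 'noproc_demo' and the function 'run/0' are made up for illustration.

```erlang
-module(noproc_demo).
-export([run/0]).

%% Returns the reason carried by a 'DOWN' message when the monitor is
%% set up only after the target process is known to be dead.
run() ->
    Pid = spawn(fun() -> ok end),
    %% First monitor: used only to wait until Pid has terminated.
    %% Its reason may be normal or noproc, depending on timing.
    Ref0 = erlang:monitor(process, Pid),
    receive {'DOWN', Ref0, process, Pid, _} -> ok end,
    %% Second monitor: Pid is certainly dead by now.
    Ref1 = erlang:monitor(process, Pid),
    receive {'DOWN', Ref1, process, Pid, Reason} -> Reason end.
```

Calling 'noproc_demo:run()' returns 'noproc', because the process no longer exists when the second monitor request is handled.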

I gathered a trace with:
dbg:tracer(process, {fun(Msg,_) -> syslog:info_msg("Trace ~0p~n",
[Msg]), 0 end, 0}).
dbg:p(whereis(my_sup), [m, c]).
dbg:tp(erlang,exit,[{'_',[],[{return_trace}]}]).
dbg:tp(erlang,unlink,[{'_',[],[{return_trace}]}]).
dbg:tp(erlang,monitor,[{'_',[],[{return_trace}]}]).
dbg:p(all, c).

And I see two different flows! When the supervisor gets 'noproc', the
flow looks like this:

2021-09-07T15:53:56.431343Z <0.5314.0> my: started
2021-09-07T15:53:56.433941Z Trace
{trace,<0.1089.0>,'receive',{ack,<0.5314.0>,{ok,<0.5314.0>}}}
2021-09-07T15:53:56.434081Z Trace
{trace,<0.1089.0>,send,{#Ref<0.1919367765.3881566209.208162>,{ok,<0.5314.0>}},<0.5284.0>}
2021-09-07T15:53:56.442202Z <0.5314.0> my: finished
2021-09-07T15:53:56.447844Z Trace
{trace,<0.1089.0>,'receive',{'$gen_call',{<0.5284.0>,#Ref<0.1919367765.3881566210.214548>},{terminate_child,<0.5314.0>}}}
2021-09-07T15:53:56.543227Z Trace
{trace,<0.1089.0>,call,{erlang,monitor,[process,<0.5314.0>]}}
2021-09-07T15:53:56.543352Z Trace
{trace,<0.1089.0>,return_from,{erlang,monitor,2},#Ref<0.1919367765.3881566212.211028>}
2021-09-07T15:53:56.543467Z Trace
{trace,<0.1089.0>,call,{erlang,unlink,[<0.5314.0>]}}
2021-09-07T15:53:56.543676Z Trace
{trace,<0.1089.0>,'receive',{'DOWN',#Ref<0.1919367765.3881566212.211028>,process,<0.5314.0>,noproc}}
2021-09-07T15:53:56.543854Z Trace
{trace,<0.1089.0>,call,{erlang,exit,[<0.5314.0>,shutdown]}}

In all other (normal) cases the flow looks like this:

2021-09-07T15:54:24.489280Z <0.6838.0> my: started
2021-09-07T15:54:24.491185Z Trace
{trace,<0.1089.0>,'receive',{ack,<0.6838.0>,{ok,<0.6838.0>}}}
2021-09-07T15:54:24.491344Z Trace
{trace,<0.1089.0>,send,{#Ref<0.1919367765.3881566210.231347>,{ok,<0.6838.0>}},<0.6809.0>}
2021-09-07T15:54:24.514841Z <0.6838.0> my: finished
2021-09-07T15:54:24.529580Z Trace
{trace,<0.1089.0>,'receive',{'$gen_call',{<0.6809.0>,#Ref<0.1919367765.3881566209.224355>},{terminate_child,<0.6838.0>}}}
2021-09-07T15:54:24.616316Z Trace
{trace,<0.1089.0>,'receive',{'EXIT',<0.6838.0>,normal}}
2021-09-07T15:54:24.618596Z Trace
{trace,<0.1089.0>,call,{erlang,monitor,[process,<0.6838.0>]}}
2021-09-07T15:54:24.618760Z Trace
{trace,<0.1089.0>,call,{erlang,unlink,[<0.6838.0>]}}
2021-09-07T15:54:24.619175Z Trace
{trace,<0.1089.0>,'receive',{'DOWN',#Ref<0.1919367765.3881566212.227125>,process,<0.6838.0>,noproc}}
2021-09-07T15:54:24.619275Z Trace
{trace,<0.1089.0>,send,{#Ref<0.1919367765.3881566209.224355>,ok},<0.6809.0>}

Does this look like a very tricky race, or like Erlang losing 'EXIT' messages?

--
Alexander Petrovsky

Maria Scott

Sep 8, 2021, 7:09:36 AM
to Erlang-Questions Questions
Forgot to reply to the list =^^=
Resending so others can chime in.

> ---------- Original Message ----------
> From: Maria Scott <maria-1...@hnc-agency.org>
> To: Alexander Petrovsky <askj...@gmail.com>
> Date: 08.09.2021 12:52
> Subject: Re: Supervisor got noproc (looks like a bug)
>
>
> Hi :)
>
> first, this is partly guesswork, so take with a grain of salt.
>
> You have a situation where the child may be terminated by the supervisor (via terminate_child) and may at the same time be terminating by itself (via {stop, ...}), is that right?
>
> While your child is running, it is linked to the supervisor, but not monitored. When the supervisor is told to shut down (terminate) a child, what it does is this (simplified, see https://github.com/erlang/otp/blob/0bad25713b0bc4a875e9ef7d9b1abcb6a2f75061/lib/stdlib/src/supervisor.erl#L923-L982 for all the details):
> (a) monitor the child
> (b) unlink the child
> (c) check for an EXIT message (in case the child already terminated before the monitoring)
> (d) if there is an EXIT message, flush out the DOWN message and return the EXIT reason (and that's it in this case)
> (e) otherwise, if no EXIT message is there, call exit(Child, shutdown)
> (f) wait for a DOWN message; reasons shutdown and normal are normal exits, everything else produces a shutdown_error
>
> Intuitively, this flow should hold no matter if and when the child terminates by itself.
> The key to understanding how the shutdown_error you describe arises is this passage from the docs for monitor/2: "The monitor request is an asynchronous signal. That is, it takes time before the signal reaches its destination." unlink/1, while it is also an asynchronous request that takes time to reach the other process, does something more: it marks the link as inactive on the process calling unlink, and "The exit signal is silently dropped if ... the corresponding link has been deactivated".
>
> So what I think is happening when the error you describe occurs is this:
> - the supervisor calls monitor(process, Child) (see (a)), but the monitor signal does not reach the child immediately
> - the supervisor unlinks the child (see (b)), deactivating the link
> - the child dies (exits by itself as a result of {stop, ...}); but as it is now unlinked, there is no EXIT message (see (c) and (d))
> - the monitor signal reaches (or rather, fails to reach) the child, which is already dead, resulting in a DOWN message with reason noproc
> - the supervisor receives the DOWN message (see (f)), and as the reason is not shutdown or normal, it gets propagated, ultimately resulting in the shutdown_error with reason noproc
>
> As I said, this is pieced together from some (educated) guesswork ;) Don't rely on it until somebody else confirms it.
>
> Kind regards,
> Maria
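
Steps (a) to (f) described above can be sketched roughly as follows. This is a simplified, hypothetical model and not the OTP source: the module name 'terminate_sketch', the function names, the return values and the kill-on-timeout branch are assumptions made for illustration. The caller is assumed to be linked to the child and to trap exits, as a supervisor does.

```erlang
-module(terminate_sketch).
-export([shutdown/2, demo/0]).

%% Hypothetical model of steps (a)-(f); not the actual supervisor code.
shutdown(Pid, Timeout) ->
    MRef = erlang:monitor(process, Pid),          % (a) monitor the child
    unlink(Pid),                                  % (b) unlink the child
    receive
        {'EXIT', Pid, Reason} ->                  % (c) child already exited
            %% (d) drop the matching 'DOWN' and report the 'EXIT' reason
            erlang:demonitor(MRef, [flush]),
            {exited, Reason}
    after 0 ->
        exit(Pid, shutdown),                      % (e) ask the child to stop
        receive                                   % (f) wait for 'DOWN'
            {'DOWN', MRef, process, Pid, shutdown} -> ok;
            {'DOWN', MRef, process, Pid, normal} -> ok;
            %% Anything else, including noproc, is reported as an error.
            {'DOWN', MRef, process, Pid, Other} -> {error, Other}
        after Timeout ->
            %% Assumption: force termination if the child does not stop.
            exit(Pid, kill),
            receive
                {'DOWN', MRef, process, Pid, Why} -> {error, Why}
            end
        end
    end.

%% Terminates a child that is still alive when shutdown/2 is called.
demo() ->
    process_flag(trap_exit, true),
    Pid = spawn_link(fun() -> receive stop -> ok end end),
    shutdown(Pid, 5000).
```

In 'demo/0' the child is alive throughout, so the result is 'ok'. The problematic outcome, '{error, noproc}', needs the child to exit on its own after step (b) but before the monitor signal reaches it.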

Alexander Petrovsky

Sep 9, 2021, 6:00:56 AM
to Maria Scott, Erlang Questions
Hi!

I've carefully re-read the docs:
- https://erlang.org/doc/man/erlang.html#unlink-1
- https://erlang.org/doc/reference_manual/processes.html#links
- https://erlang.org/doc/apps/erts/erl_dist_protocol.html#link_protocol

And it seems you are absolutely right about the current situation:
it's a tricky race, not a bug:
(a) the monitor request is emitted and is still in flight (async nature);
(b) the child is unlinked (async nature):
(b.1) the supervisor sends UNLINK_ID and deactivates the link (after this
point all EXIT messages from the linked process are dropped);
(b.2) the linked process receives UNLINK_ID;
(b.3) the supervisor receives UNLINK_ID_ACK and removes the link state entirely;
(a.1) the monitor yields a 'DOWN' message with reason 'noproc'.

I found that the unlink protocol changed in OTP 23; there are old
and new protocols, and the description of the new one states:
"The receiver of an UNLINK_ID signal responds with an UNLINK_ID_ACK
signal. Upon reception of an UNLINK_ID signal, the corresponding
UNLINK_ID_ACK signal must be sent before any other signals are sent to
the sender of the UNLINK_ID signal."

So, the termination of the linked process could happen:
- between (a) and (b): in this case, the EXIT message will be emitted
and placed into the mailbox;
- between (b.1) and (b.2): in this case, the message will be emitted,
but dropped because the link is already deactivated;
- between (b.2) and (b.3): in this case, no messages can be emitted
by the linked process, according to the protocol.
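
The effect of the unlink on the calling side can be observed in isolation. Below is a minimal sketch, assuming a hypothetical module 'unlink_drop_demo': the caller traps exits, unlinks a linked child, and only then lets the child exit. It does not force the exact (b.1)/(b.2) window; it only shows that no 'EXIT' message is delivered once 'unlink/1' has returned, while the monitor still reports the real reason because the monitor signal reached the child while it was alive.

```erlang
-module(unlink_drop_demo).
-export([run/0]).

run() ->
    process_flag(trap_exit, true),
    %% The child stays alive until it is told to go.
    Pid = spawn_link(fun() -> receive go -> ok end end),
    unlink(Pid),                         % link is now inactive on our side
    MRef = erlang:monitor(process, Pid), % sent before 'go', so received first
    Pid ! go,                            % child returns, exiting with normal
    DownReason = receive {'DOWN', MRef, process, Pid, R} -> R end,
    %% The child is dead. Because the link was removed before it died,
    %% no 'EXIT' message was or will be delivered to this process.
    Exit = receive {'EXIT', Pid, _} -> got_exit after 0 -> no_exit end,
    {DownReason, Exit}.
```

'unlink_drop_demo:run()' returns '{normal, no_exit}'. In the supervisor case the monitor signal arrives too late instead, so the only remaining source of the exit reason is gone and 'noproc' is all that is left.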

It seems like, the behaviour of the monitor should be changed somehow,
and the code https://github.com/erlang/otp/blob/0bad25713b0bc4a875e9ef7d9b1abcb6a2f75061/lib/stdlib/src/supervisor.erl#L957-L982
seems a little bit outdated due to the async nature of the monitors
and such tricky race, also, it's 12 years old... :)

I would like to hear what others and the OTP maintainers think about this
behaviour.

--
Alexander Petrovsky

Alexander Petrovsky

Sep 9, 2021, 6:49:58 AM
to Maria Scott, Erlang Questions
One more thing, taking into account the async nature of the monitors
and links, I think the following statement should make sense: if the
process with Pid dies after the monitor signal is sent, the DOWN
message should have the real reason, not a 'noproc'.

--
Alexander Petrovsky

Maria Scott

Sep 9, 2021, 7:13:49 AM
to Alexander Petrovsky, Erlang Questions
> It seems like, the behaviour of the monitor should be changed somehow,

I don't think the general working of monitors can be changed easily, and not without causing far-reaching repercussions =\

> and the code https://github.com/erlang/otp/blob/0bad25713b0bc4a875e9ef7d9b1abcb6a2f75061/lib/stdlib/src/supervisor.erl#L957-L982
> seems a little bit outdated due to the async nature of the monitors
> and such tricky race, also, it's 12 years old... :)

Most of the supervisor code is old/old-fashioned, I had my hands in there not too long ago ^^; That said, as it is a very central component to OTP, there is understandably some reluctance to change the battle-tested-by-time old code just for the sake of following the latest trends ;)

> I would like to hear what others and the OTP maintainers think about this
> behaviour.

Open an issue at https://github.com/erlang/otp? That's more likely to catch their attention than the mailing list ;)

Regards,
Maria

Loïc Hoguin

Sep 9, 2021, 8:20:49 AM
to Erlang Questions
We are getting a variant of this sometimes when shutting down RabbitMQ:

2021-09-08 14:14:08.886 [error] <0.4805.15> Supervisor {<0.4805.15>,rabbit_channel_sup_sup} had child channel_sup started with rabbit_channel_sup:start_link() at undefined exit with reason shutdown in context shutdown_error
2021-09-08 14:14:08.887 [info] <0.29845.15> supervisor: {<0.29845.15>,rabbit_channel_sup_sup}, errorContext: shutdown_error, reason: shutdown, offender: [{nb_children,1},{id,channel_sup},{mfargs,{rabbit_channel_sup,start_link,[]}},{restart_type,temporary},{shutdown,infinity},{child_type,supervisor}]

This happens because channels are automatically closed when the relevant connection goes away. Because we are shutting down, all connections go away, channels shut themselves down automatically with reason shutdown, while at the same time the application is stopping itself and eventually it tries to shut down a channel that is itself in the process of shutting down.

The supervisor monitors, unlinks, then receives the shutdown 'DOWN' message, just before it would send its own exit(Pid, shutdown).

Perhaps there should be another clause here to exclude the shutdown/noproc reasons? It doesn't accomplish much other than polluting logs.

https://github.com/erlang/otp/blob/master/lib/stdlib/src/supervisor.erl#L952-L953

Cheers,

Loïc Hoguin

Rickard Green

Sep 9, 2021, 4:37:44 PM
to Alexander Petrovsky, Erlang Questions
You are right in that there is a race causing a 'noproc' exit reason when it should be possible to get the real exit reason. I'll write an internal ticket about this, but you are welcome to create a bug issue at <https://github.com/erlang/otp/issues> as well (as pointed out by Maria).

On Thu, Sep 9, 2021 at 12:49 PM Alexander Petrovsky <askj...@gmail.com> wrote:
> One more thing, taking into account the async nature of the monitors
> and links, I think the following statement should make sense: if the
> process with Pid dies after the monitor signal is sent, the DOWN
> message should have the real reason, not a 'noproc'.

No, in the distributed case we would need to keep exit reasons for a long time (hard to determine how long) for all terminated processes in order to satisfy such a behavior.

The behaviour is and should be: If the process with Pid dies after the monitor signal has been *received*, the DOWN message should have the real reason, not a 'noproc'. If the process is not alive at the time of the reception of the monitor signal, you will get a 'noproc' reason.
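
The "received while alive" half of this rule can be illustrated with a small sketch. The module name 'down_reason_demo' and the exit reason 'some_reason' are hypothetical. Since the monitor signal and the exit signal are sent by the same process to the same target, the signal order guarantee means the monitor is in place before the exit signal takes effect.

```erlang
-module(down_reason_demo).
-export([run/0]).

run() ->
    %% A process that stays alive until it is told to exit.
    Pid = spawn(fun() -> receive after infinity -> ok end end),
    %% The monitor signal is sent first ...
    MRef = erlang:monitor(process, Pid),
    %% ... and the exit signal second, so the monitor is received
    %% while the process is still alive.
    exit(Pid, some_reason),
    receive {'DOWN', MRef, process, Pid, Reason} -> Reason end.
```

'down_reason_demo:run()' returns 'some_reason' rather than 'noproc'.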
 
Regards,
Rickard, Erlang/OTP


--
Rickard Green, Erlang/OTP, Ericsson AB

Maria Scott

Sep 10, 2021, 8:23:46 AM
to Rickard Green, Alexander Petrovsky, Erlang Questions
> > One more thing, taking into account the async nature of the monitors
> > and links, I think the following statement should make sense: if the
> > process with Pid dies after the monitor signal is sent, the DOWN
> > message should have the real reason, not a 'noproc'.

Within the constraint you outline, "process dies after the monitor signal was sent", yes, it would make sense. But...

> No, in the distributed case we would need to keep exit reasons for a long time (hard to determine how long) for all terminated processes in order to satisfy such a behavior.

... what Rickard said, plus to achieve this the solution would have to use an order of events in global time, ie what happened in process A happened before what happened in process B, which is practically impossible to do.



Anyway, I have been rolling this over in my mind a bit, in the context of the supervisor terminate-a-child behavior, and... is it _necessary_ to unlink the child after setting a monitor on it? Isn't it in fact a bit dangerous even? What if the supervisor crashes or gets killed right after unlinking the child? It will be left running, unaware of the fact that it has become an orphan. Correct?

So instead, what if we just keep the link? (The following assumes that there will never be messages from a process after the 'EXIT' message, something which I think I did read somewhere once, but can't find right now).

If the child is well-behaved (has not unlinked itself), at the supervisor side after it told the child to exit, what we can expect to receive is either:
a) {'EXIT', ..., Reason} followed by {'DOWN', ..., noproc} if the child died before the monitor signal reached it
b) {'DOWN', ..., Reason} followed by {'EXIT', ..., Reason} if the monitor signal has reached the child before it died, or

If the child is naughty (unlinked itself), things are a bit trickier:
c) {'DOWN', ..., noproc} if the child died before the monitor signal reached it, or
d) {'DOWN', ..., Reason} _not_ followed by an 'EXIT' message

The following approach should take care of all of the above cases, I think:

* set up a monitor on the child, and leave the link in place
* send shutdown (+kill on shutdown timeout) or kill, whatever the shutdown strategy of the child requires
* use a selective receive with two clauses, one for 'EXIT', one for 'DOWN', and...
* if {'EXIT', ...} is received first, we have case (a), and can just flush out the associated 'DOWN' message via demonitor with flush
* if a {'DOWN', ..., noproc} message is received first, we have case c). The child is gone, and we will never get at the exit reason
* if a {'DOWN', ..., Reason} message is received, we have either case b) or d). We know the child exited and for what reason. We try to flush out the possibly existing 'EXIT' message, which may or may not be there

What do you think? Make any sense?
Regards,
Maria

Rickard Green

Sep 10, 2021, 3:40:57 PM
to Maria Scott, Erlang Questions, Rickard Green
On Fri, Sep 10, 2021 at 2:23 PM Maria Scott <maria-1...@hnc-agency.org> wrote:
> > One more thing, taking into account the async nature of the monitors
> >  and links, I think the following statement should make sense: if the
> >  process with Pid dies after the monitor signal is sent, the DOWN
> >  message should have the real reason, not a 'noproc'.

> Within the constraint you outline, "process dies after the monitor signal was sent", yes, it would make sense. But...
>
> > No, in the distributed case we would need to keep exit reasons for a long time (hard to determine how long) for all terminated processes in order to satisfy such a behavior.
>
> ... what Rickard said, plus to achieve this the solution would have to use an order of events in global time, ie what happened in process A happened before what happened in process B, which is practically impossible to do.
>
> Anyway, I have been rolling this over in my mind a bit, in the context of the supervisor terminate-a-child behavior, and... is it _necessary_ to unlink the child after setting a monitor on it? Isn't it in fact a bit dangerous even? What if the supervisor crashes or gets killed right after unlinking the child? It will be left running, unaware of the fact that it has become an orphan. Correct?

Yes
 
> So instead, what if we just keep the link?

Yes one wants to keep the link as long as possible, but we eventually want to perform the unlink unless we see an 'EXIT' message.
 
> (The following assumes that there will never be messages from a process after the 'EXIT' message, something which I think I did read somewhere once, but can't find right now).


Yes. The signal order guarantee <https://erlang.org/doc/reference_manual/processes.html#signal-delivery> promises that two signals sent from one process to another are received in the same order as they were sent (if both are received). 'EXIT' and 'DOWN' signals are sent after the process has entered an exiting state, so signals sent by the process itself when it was alive, such as normal messages, have been sent before 'EXIT' and 'DOWN' signals. There might however be other signals sent on behalf of the terminated process after 'EXIT' and 'DOWN' signals have been sent. For example, a process-info-reply informing the sender of a process-info-request that the process is not alive. Note that the order between 'EXIT' and 'DOWN' signals from a process is undefined. That is, if you have both a link and a monitor to a process, you don't know which of the 'DOWN' and 'EXIT' signals will be received first.

> If the child is well-behaved (has not unlinked itself), at the supervisor side after it told the child to exit, what we can expect to receive is either:
> a) {'EXIT', ..., Reason} followed by {'DOWN', ..., noproc} if the child died before the monitor signal reached it
> b) {'DOWN', ..., Reason} followed by {'EXIT', ..., Reason} if the monitor signal has reached the child before it died, or
>
> If the child is naughty (unlinked itself), things are a bit trickier:
> c) {'DOWN', ..., noproc} if the child died before the monitor signal reached it, or
> d) {'DOWN', ..., Reason} _not_ followed by an 'EXIT' message
>
> The following approach should take care of all of the above cases, I think:
>
> * set up a monitor on the child, and leave the link in place
> * send shutdown (+kill on shutdown timeout) or kill, whatever the shutdown strategy of the child requires
> * use a selective receive with two clauses, one for 'EXIT', one for 'DOWN', and...
>   * if {'EXIT', ...} is received first, we have case (a), and can just flush out the associated 'DOWN' message via demonitor with flush
>   * if a {'DOWN', ..., noproc} message is received first, we have case c). The child is gone, and we will never get at the exit reason
>   * if a {'DOWN', ..., Reason} message is received, we have either case b) or d). We know the child exited and for what reason. We try to flush out the possibly existing 'EXIT' message, which may or may not be there
>
> What do you think? Make any sense?

Yes it makes sense, but in case we see no 'EXIT' message after a 'DOWN' message we want to perform an unlink and flush the message queue for an 'EXIT' message. If we got a 'DOWN' message with 'noproc' and an 'EXIT' message appears (in a flush, or before) we use the exit reason of the 'EXIT' message. If the 'DOWN' message with a non-noproc reason arrived we can just ignore the exit reason in the 'EXIT' message if it should have arrived.

Without the unlink and 'EXIT' message flush, we might end up with a stray 'EXIT' message in the message queue of the supervisor. Whether that is a problem for the supervisor, however, I do not know.
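
Put together, the approach discussed above (keep the link, monitor, then unlink and flush only after the child is known to be gone) might look roughly like this. It is a hypothetical sketch, not the OTP implementation: the module 'keep_link_sketch', the helper 'unlink_flush/2', the return values and the treatment of 'normal' as a clean exit are all assumptions made for illustration. The caller is assumed to be linked to the child and to trap exits.

```erlang
-module(keep_link_sketch).
-export([shutdown/2, demo/0]).

%% Hypothetical sketch: monitor, keep the link, ask the child to stop,
%% then accept whichever of 'EXIT' and 'DOWN' arrives first.
shutdown(Pid, Timeout) ->
    MRef = erlang:monitor(process, Pid),
    exit(Pid, shutdown),
    Reason =
        receive
            {'EXIT', Pid, ExitReason} ->
                %% 'EXIT' first: drop the matching 'DOWN' message.
                erlang:demonitor(MRef, [flush]),
                ExitReason;
            {'DOWN', MRef, process, Pid, noproc} ->
                %% Prefer the reason of an 'EXIT' message if there is one.
                unlink_flush(Pid, noproc);
            {'DOWN', MRef, process, Pid, DownReason} ->
                %% Real reason known; discard a possible 'EXIT' message.
                _ = unlink_flush(Pid, DownReason),
                DownReason
        after Timeout ->
            %% Assumption: force termination after the shutdown timeout.
            exit(Pid, kill),
            receive
                {'DOWN', MRef, process, Pid, KillReason} ->
                    _ = unlink_flush(Pid, KillReason),
                    KillReason
            end
        end,
    case Reason of
        shutdown -> ok;
        normal -> ok;
        Other -> {error, Other}
    end.

%% Unlink, then return the reason of a pending 'EXIT' message from Pid,
%% or Default if no such message is in the mailbox.
unlink_flush(Pid, Default) ->
    unlink(Pid),
    receive
        {'EXIT', Pid, Reason} -> Reason
    after 0 ->
        Default
    end.

%% A child that exits by itself while it is being shut down.
demo() ->
    process_flag(trap_exit, true),
    Pid = spawn_link(fun() -> ok end),
    shutdown(Pid, 5000).
```

In 'demo/0' the child may be dead or alive when the monitor signal arrives; in either case the reason ends up being 'normal' or 'shutdown', so the call returns 'ok' and no stray 'EXIT' message is left behind.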
 
> Regards,
> Maria

Regards,
Rickard