[erlang-questions] Understanding supervisor / start_link behaviour


Steve Strong

Jun 1, 2011, 3:26:39 PM
to erlang-q...@erlang.org
Hi,

I've got some strange behaviour with gen_event within a supervision tree which I don't fully understand.  Consider the following supervisor (completely standard, feel free to skip over):

<snip>

-module(sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).
-define(SERVER, ?MODULE).

start_link() ->
    supervisor:start_link({local, ?SERVER}, ?MODULE, []).

init([]) ->
    Child1 = {child, {child, start_link, []}, permanent, 2000, worker, [child]},
    %% allow up to 1000 restarts in 3600 seconds before the supervisor gives up
    {ok, {{one_for_all, 1000, 3600}, [Child1]}}.

</snip>

and the corresponding gen_server (the interesting code is in init/1 and handle_info/2):

<snip>

-module(child).
-behaviour(gen_server).
-export([start_link/0, init/1, handle_call/3, handle_cast/2, 
handle_info/2, terminate/2, code_change/3]).

start_link() ->
    gen_server:start_link({local, child}, child, [], []).

init([]) ->
    io:format("about to start gen_event~n"),
    X = gen_event:start_link({local, my_gen_event}),
    io:format("gen_event started with ~p~n", [X]),
    {ok, _Pid} = X,

    {ok, {}, 2000}.    %% 2000 ms timeout -> handle_info(timeout, State)

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info(_Info, State) ->
    io:format("about to crash...~n"),
    1 = 2,    %% deliberate badmatch to crash the server
    {noreply, State}.

terminate(_Reason, _State) ->
    ok.

code_change(_OldVsn, State, _Extra) ->
    {ok, State}.

</snip>

If I run this from an erl shell like this:

<snip>

--> erl
Erlang R14B01 (erts-5.8.2) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.8.2  (abort with ^G)
1> application:start(sasl), supervisor:start_link(sup, []).

</snip>

Then the supervisor & server start as expected.  After 2 seconds the server gets a timeout message and crashes itself; the supervisor obviously spots this and restarts it.  Within the init of the gen_server, it also does a start_link on a gen_event process.  By my understanding, whenever the gen_server process exits, the gen_event will also be terminated.

However, every now and then I see the following output (a ton of sasl trace omitted for clarity!):

<snip>

about to crash...
about to start gen_event
gen_event started with {error,{already_started,<0.79.0>}}
about to start gen_event
gen_event started with {error,{already_started,<0.79.0>}}
about to start gen_event

</snip>

What is happening is that the gen_server is crashing but on its restart the gen_event process is still running - hence the gen_server fails in its init and gets restarted again.  Sometimes this loop clears after a few iterations, other times it can continue until the parent supervisor gives up, packs its bags and goes home.

So, my question is whether this is expected behaviour or not.  I assume that the termination of the linked child is happening asynchronously, and that the supervisor is hence restarting its children before things have cleaned up correctly - is that correct?

I can fix this particular scenario by trapping exits within the gen_server, and then calling gen_event:stop within the terminate.  Is this type of processing necessary whenever a process is start_link'ed within a supervisor tree, or is what I'm doing considered bad practice?
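For reference, that workaround looks roughly like this (a sketch; keeping the event manager pid as the server state is just for illustration):

```erlang
%% Trap exits so terminate/2 runs, and stop the linked gen_event there,
%% freeing the registered name before the supervisor restarts us.
init([]) ->
    process_flag(trap_exit, true),
    {ok, Pid} = gen_event:start_link({local, my_gen_event}),
    {ok, Pid, 2000}.

terminate(_Reason, EventMgr) when is_pid(EventMgr) ->
    %% Synchronous stop: the name is guaranteed free when we return.
    gen_event:stop(EventMgr);
terminate(_Reason, _State) ->
    ok.
```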

Thanks for your time,

Steve

-- 
Steve Strong, Director, id3as


Roberto Ostinelli

Jun 1, 2011, 5:15:29 PM
to Steve Strong, erlang-q...@erlang.org
hi steve,

your gen_event should be started by your supervisor too. in this case, since you specified a one_for_all behaviour, when gen_server crashes, gen_event will be restarted too.
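for example, in the supervisor's init/1 (a sketch; note that for a bare event manager the modules field of the child spec should be `dynamic`):

```erlang
init([]) ->
    %% start the event manager as a sibling of the gen_server, under
    %% the same one_for_all supervisor
    Event = {my_gen_event,
             {gen_event, start_link, [{local, my_gen_event}]},
             permanent, 2000, worker, dynamic},
    Child1 = {child, {child, start_link, []},
              permanent, 2000, worker, [child]},
    {ok, {{one_for_all, 1000, 3600}, [Event, Child1]}}.
```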

r.

Ahmed Omar

Jun 1, 2011, 5:57:37 PM
to Roberto Ostinelli, erlang-q...@erlang.org
Agree with Roberto, you should put it under a supervisor. Regarding your case, I would guess you are trapping exits in your init in my_gen_event?

_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions




--
Best Regards,
- Ahmed Omar
Follow me on twitter

Steve Strong

Jun 2, 2011, 3:15:17 AM
to Ahmed Omar, erlang-q...@erlang.org, Roberto Ostinelli
Yeah, that makes perfect sense and would obviously solve the problem.

The reason we'd gone down this path was that we had a number of "sub" processes (the gen_event just being one example) that we felt would be "polluting" the supervisor; these sub-processes were just helpers of the primary gen_servers that the supervisor was controlling - using a start_link in the primary gen_servers felt like a very clean and easy way of spinning up these other processes in a way that (we thought) would still be resilient to failures.

The thing that bit us was that we naively thought that, due to the sub-process being linked, it would die when the parent died.  Of course, it does, but its death is asynchronous to the notification that the supervisor receives and hence it may well still be alive (doomed, but alive) when the supervisor begins the restart cycle.  Our servers don't crash that often, and when they do, this race condition was rarely seen, which reinforced our misconceptions.  The only thing that does surprise me is how many times the supervisor can go round the restart loop before the doomed process finally exits - we have seen it thrash round this loop about 1000 times before the supervisor itself finally fails; I guess it's just down to how things are being scheduled by the VM, and in those cases we were just getting unlucky.

Sounds like best-practice within the OTP world is to have everything started via a supervisor - is that a fair comment?

Cheers,

Steve

-- 
Steve Strong, Director, id3as

Ladislav Lenart

Jun 2, 2011, 4:23:05 AM
to Steve Strong, Roberto Ostinelli, erlang-q...@erlang.org
Hello.

I am by no means an expert on the topic but I would like to point out
that the only reason you get the {already_started, ...} error is that
you attempt to register the helper process with {local, ...}. If it is
a helper, there should be no reason for it to be globally accessible.
And if it weren't registered, the gen_server would be restarted without
issues, creating a new helper process. The old helper would die
eventually, just as you expect it to.
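In other words (a sketch; the handle_info clause is hypothetical, just to show how the pid gets used later):

```erlang
%% Start the event manager unregistered and keep its pid in the
%% gen_server state; nothing else needs to find it by name.
init([]) ->
    {ok, EventMgr} = gen_event:start_link(),
    {ok, EventMgr, 2000}.

handle_info(timeout, EventMgr) ->
    gen_event:notify(EventMgr, tick),
    {noreply, EventMgr}.
```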


Ladislav Lenart



Ahmed Omar

Jun 2, 2011, 5:03:18 AM
to Ladislav Lenart, erlang-q...@erlang.org, Roberto Ostinelli
True ({local, Name} registers locally, not globally, with Name), but you have to be careful about whether having two instances alive at the same time is acceptable or not.

Steve, let's put it this way: it's better to start processes under a supervisor, especially if you want to benefit from the standard restart strategies; it keeps your application cleaner.
(As a hack, your case can also be solved by a monitor in the gen_server's init before starting the gen_event:

    Ref = erlang:monitor(process, my_gen_event),
    receive
        {'DOWN', Ref, process, _Pid, _Reason} ->
            ok
    end,

If the old instance is already gone, the monitor delivers an immediate 'DOWN' with reason noproc, so this doesn't block.)
But using a supervisor is much cleaner and safer, and easier to design with, in my opinion.

Ladislav Lenart

Jun 2, 2011, 5:14:57 AM
to Ahmed Omar, erlang-q...@erlang.org, Roberto Ostinelli
On 2.6.2011 11:03, Ahmed Omar wrote:
> True ({local, Name} -> register locally, not globally, with Name) but you have to be careful if having two instances alive in the same time is acceptable or not.

My bad. By "globally accessible" I meant that the locally
registered process will be available to all processes on
the local node.


Ladislav Lenart



Antoine Koener

Jun 2, 2011, 4:50:43 AM
to Steve Strong, Roberto Ostinelli, erlang-q...@erlang.org
On Jun 2, 2011, at 09:15 , Steve Strong wrote:

> Yeah, that makes perfect sense and would obviously solve the problem.
>
> The reason we'd gone down this path was that we had a number of "sub" processes (the gen_event just being one example) that we felt would be "polluting" the supervisor [...]

I dealt with such a situation using a supervisor that starts all mandatory processes and a child supervisor.
The child supervisor starts all the other processes that need those mandatory processes.
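Roughly like this (a sketch; the module names are made up for illustration, and rest_for_one is my choice of strategy so that the dependent sub-tree restarts whenever a mandatory worker dies):

```erlang
init([]) ->
    %% mandatory worker first
    Mandatory = {core_srv, {core_srv, start_link, []},
                 permanent, 2000, worker, [core_srv]},
    %% then a sub-supervisor for everything that depends on it
    Helpers = {helper_sup, {helper_sup, start_link, []},
               permanent, infinity, supervisor, [helper_sup]},
    {ok, {{rest_for_one, 5, 10}, [Mandatory, Helpers]}}.
```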


> The thing that bit us was that we naively thought that, due to the sub-process being linked, it would die when the parent died. [...]

> Sounds like best-practice within the OTP world is to have everything started via a supervisor - is that a fair comment?

Yes, and

    application:start(sasl).

is a way to "observe" the whole startup process and see what could be wrong.

Mazen Harake

Jun 2, 2011, 5:53:02 AM
to Steve Strong, erlang-q...@erlang.org
Steve,

I wouldn't say that you are wrong. I think your reasoning about not
putting the gen_event module under a supervisor is sound, because
*that is what links are for*. Just because you have a supervisor
doesn't mean that you should shove everything underneath it! If the
gen_server and the gen_event are truly linked (meaning: the gen_server
doesn't act as a "supervisor" keeping track of its gen_event process
and restarting it all the time, but rather they really are linked
and they crash together) then your approach, in my opinion, is good.

There are great benefits in doing it that way. Many will claim that
it is best practice to put *everything* under a supervisor, but this
is simply not true. In 90% of cases it *is* the best thing to do, and
many times it is more about how you designed your application than
about where to put the supervisors and their children, but doing it
the way you did is not necessarily wrong.

The only problem I see with your approach is that you have registered
the gen_event process, which clearly isn't useful (since only the
gen_server should know about it; after all, it started it). Other than
that, this approach is extremely helpful and a nice way to clean
things up after they die/shut down (again: assuming truly linked).

There is a big misconception in the community that everything
should/must look like the supervisor-tree model, which shows
gen_servers put under supervisors and more supervisors under the
"top" supervisor, but that is not enforced, and the design principles
don't take into account the many cases where this setup actually
brings more headache to the table than just exiting and cleaning up
using linked processes (because they do exist).

/M

Tim Watson

Jun 2, 2011, 5:55:13 AM
to Ladislav Lenart, Roberto Ostinelli, erlang-q...@erlang.org
On 2 June 2011 10:14, Ladislav Lenart <lena...@volny.cz> wrote:
> On 2.6.2011 11:03, Ahmed Omar wrote:
>>
>> True ({local, Name}  -> register locally, not globally, with Name) but you
>> have to be careful if having two instances alive in the same time is
>> acceptable or not.
>
> My bad. By "globally accessible" I meant that the locally
> registered process will be available to all processes on
> the local node.

You might consider gproc for these kinds of use cases. It provides a
great deal of simplification around synchronising startups and
registering names etc.
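For instance (a hypothetical sketch of the pattern, assuming the gproc application is started; `my_gen_event` is the name from the thread):

```erlang
%% In the helper's init/1: register a local (l) unique name (n).
init([]) ->
    true = gproc:reg({n, l, my_gen_event}),
    {ok, []}.

%% In a client that needs the helper:
lookup_helper() ->
    %% gproc:await/1 blocks until the name is registered, so clients
    %% don't race against a restart the way a raw register/2 does.
    {Pid, _Value} = gproc:await({n, l, my_gen_event}),
    Pid.
```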

Steve Strong

Jun 2, 2011, 8:23:23 AM
to Mazen Harake, erlang-q...@erlang.org
That makes a good deal of sense.  I guess the point at which something should get promoted up to a supervision tree, rather than being start_link'ed, is when it reaches a complexity level where it may have issues if multiple instances of the process run simultaneously.  At that point it stops sounding like a trivial helper process and starts sounding like something that should be managed more actively.

Completely agree that having the gen_event registered wasn't useful, and that not registering it would solve the problem - that was pretty obvious as soon as I spotted the issue; this thread was more to get opinions on how things are best structured.


-- 
Steve Strong, Director, id3as

Frédéric Trottier-Hébert

Jun 2, 2011, 10:10:53 AM
to Mazen Harake, erlang-q...@erlang.org
There are disadvantages to *not* putting workers under the supervision tree, though. Namely, you'll lose the ability to have the release handler walk down the supervision tree to find which processes to suspend/update, and you'll then need to find a different way of doing things.

This is a serious point to consider if you ever plan on going the way of releases/appups if the workers you use are to be long-lived (you don't want them to be killed during a purge). I'm not saying you didn't know this, but I felt I should point it out for the sake of having the arguments clear on the mailing list.

--
Fred Hébert
http://www.erlang-solutions.com

Steve Strong

Jun 2, 2011, 10:23:40 AM
to Frédéric Trottier-Hébert, erlang-q...@erlang.org
That is an interesting point, and not something I'd considered to date.

-- 
Steve Strong
Sent with Sparrow

Jachym Holecek

Jun 2, 2011, 10:37:39 AM
to Mazen Harake, erlang-q...@erlang.org
# Mazen Harake 2011-06-02:

> I wouldn't say that you are wrong. I think that you are reasoning good
> about not putting the gen_event module under a supervisor because
> *that is what links are for*. Just because you have a supervisor
> doesn't mean the you shove everything underneath there! If the
> gen_server and the gen_event are truly linked (meaning: gen_server
> doesn't act as a "supervisor" keeping track of its gen_event process
> and restarts it all the time but rather that they really are linked
> and they crash together) then your approach, in my opinion, is good.

FWIW, couldn't agree more with this. For completeness (it's obvious and you're
no doubt aware of it): 'normal' exits don't kill linked peers, which takes a
little getting used to but is trivial to manage.
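That is (a sketch; the demo/0 helper is hypothetical):

```erlang
%% An exit with reason 'normal' does not propagate over a link, while
%% any other reason kills the linked peer (unless it traps exits).
demo() ->
    spawn_link(fun() -> exit(normal) end),
    timer:sleep(100),   %% give the spawned process time to exit
    still_alive.        %% the caller survives the 'normal' exit
```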

As a more general point, designing sensible supervision trees was probably
the most difficult engineering aspect of OTP for me to learn, so I guess
people shouldn't feel too bad if it feels intimidating initially. :-)

BR,
-- Jachym

Mazen Harake

Jun 3, 2011, 5:21:53 AM
to Frédéric Trottier-Hébert, erlang-q...@erlang.org
True. This is a very valid point.

Personally I have very rarely used the live upgrade tools of a node
(relup/appup/release_handler etc.), so I don't really know the downside
of not putting everything under a supervision tree. But then again, I
simply don't think the fuss of specifying every single thing to
reload/change is worth the "uptime" mark.

The strategy I prefer is to have an architecture which enables me to:
take down a node gracefully (detaching it from the cluster),
manually install a release (i.e. untar the release and change
start_erl.data to point to it), and start the node up again. This
should not affect the system, which should still be operational (say
you have 10 nodes and you do this upgrade one by one). Should the new
release not work, or something unexpected turn up, then just change the
start_erl.data file to point to the old release and bounce the node
(your version handling on your applications should support this,
meaning v1.32.424 in this release has *exactly* the same code as
v1.32.424 in the previous release).

This way of working has proven very successful for me (and the
systems I took part in building). Specifying relups and appups for
this kind of work is, in my opinion, tedious, but some seem to think it
is worth the effort. However, you do have a very important point to
consider when not hanging everything under a supervisor tree. If I had
only 2 nodes to consider, maybe I'd want them up at all times, but then
again they would be built in a way that handles one going down (e.g.
when I upgrade them).


