[erlang-questions] Why have a supervisor behaviour?


Roger Lipscombe

May 21, 2015, 10:11:18 AM
to erlang-q...@erlang.org
I find myself writing a custom supervisor (because I need restarts to
be delayed[1]), and I find myself wondering why OTP has a supervisor
behaviour?

That is: why does it require us to provide the Module:init/1 function?
Surely we could just pass the restart strategy, child specs, etc. to
supervisor:start_link directly?

Is there something I'm missing, that I'm going to regret if I don't do
it the same way in my custom supervisor?

Regards,
Roger.

[1] http://erlang.org/pipermail/erlang-patches/2012-January/002575.html
discusses a supervisor with delayed child restart (and has some code),
but (a) it seems to have died a death, and (b) it's not exactly what I
need.
_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions

Jesper Louis Andersen

May 21, 2015, 10:24:21 AM
to Roger Lipscombe, erlang-q...@erlang.org
On Thu, May 21, 2015 at 4:11 PM, Roger Lipscombe <ro...@differentpla.net> wrote:
> I find myself writing a custom supervisor (because I need restarts to
> be delayed[1]), and I find myself wondering why OTP has a supervisor
> behaviour?

Another way around this is to let the supervisor just start the server and then let the server itself handle the delay by querying a delay_manager, or something such.
 
> That is: why does it require us to provide the Module:init/1 function?
> Surely we could just pass the restart strategy, child specs, etc. to
> supervisor:start_link directly?

Because init/1 runs in the context of the supervisor process, not the invoker of start_link/1. If you create an ETS table in the supervisor, for instance, its protection is relative to who created it. And so is its lifetime. You can't easily do that if you pass in the data in start_link/1, since you would have to pass a fun anyway.
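
[Editor's sketch] A minimal illustration of the point above, assuming a hypothetical module name (table_sup) and table name (shared_tab) that are not from the thread. Because init/1 runs inside the supervisor process, the ETS table it creates is owned by the supervisor and lives exactly as long as the supervisor does:

```erlang
-module(table_sup).
-behaviour(supervisor).

-export([start_link/0]).
-export([init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Runs in the supervisor process, not in the caller of start_link/0,
%% so the supervisor becomes the owner of the table.
init([]) ->
    shared_tab = ets:new(shared_tab, [named_table, public, set]),
    SupFlags = {one_for_one, 5, 10},
    %% Child specs omitted; children could read and write shared_tab.
    {ok, {SupFlags, []}}.
```

After start_link/0 returns, ets:info(shared_tab, owner) should be the supervisor's pid rather than the pid of the process that called start_link/0.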



--
J.

Roger Lipscombe

May 21, 2015, 10:58:17 AM
to Jesper Louis Andersen, erlang-q...@erlang.org
On 21 May 2015 at 15:23, Jesper Louis Andersen
<jesper.lou...@gmail.com> wrote:
> Another way around this is to let the supervisor just start the server and
> then let the server itself handle the delay by querying a delay_manager, or
> something such.

I considered that, but I have a small wrinkle: I need find_or_start
semantics, such that if the child is already started, it's returned;
if the child doesn't exist, it's started and returned (so far this is
just one_for_one, ish); but -- if it's in the sin bin -- it needs to
be restarted immediately. The caller _needs_ a valid pid. Having a
delay_manager might complicate that?

>> That is: why does it require us to provide the Module:init/1 function?
>> Surely we could just pass the restart strategy, child specs, etc. to
>> supervisor:start_link directly?
>
> Because init/1 runs in the context of the supervisor process, not the
> invoker of start_link/1. If you create an ETS table in the supervisor, for
> instance, its protection is relative to who created it. And so is its
> lifetime. You can't easily do that if you pass in the data in start_link/1,
> since you would have to pass a fun anyway.

But the supervisor's not supposed to _do_ anything, right? It only has
Mod:init. If you want an ETS table, you should have it owned by the
supervisor's first child, right?

Cheers,
Roger.

Roger Lipscombe

May 21, 2015, 11:00:16 AM
to erlang-q...@erlang.org
On 21 May 2015 at 15:56, Vance Shipley <van...@motivity.ca> wrote:
> Return {ok, State, Timeout} from your gen_server:init/1 callback. Continue
> initialization after the timeout in your handle_info/2 handler.

I need delayed _restart_. Is this what Jesper refers to when he talks
about "a delay_manager"? Such that init queries that and then
might/might not delay?
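
[Editor's sketch] For reference, the suggestion quoted above could look roughly like this. The module name (lazy_worker), the ready/1 helper and the 100 ms value are assumptions made for illustration only:

```erlang
-module(lazy_worker).
-behaviour(gen_server).

-export([start_link/0, ready/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(INIT_DELAY_MS, 100).  %% assumed value

start_link() ->
    gen_server:start_link(?MODULE, [], []).

ready(Pid) ->
    gen_server:call(Pid, ready).

%% Return quickly so the supervisor is not blocked, and ask gen_server
%% to deliver a 'timeout' message if nothing else arrives in time.
init([]) ->
    {ok, #{ready => false}, ?INIT_DELAY_MS}.

handle_call(ready, _From, State = #{ready := Ready}) ->
    {reply, Ready, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% The rest of the initialization happens here.
handle_info(timeout, State) ->
    {noreply, State#{ready := true}};
handle_info(_Other, State) ->
    {noreply, State}.
```

One caveat with this pattern: any message that arrives before the timeout fires cancels it, unless the callback that handles that message returns a timeout again. Sending a message to self() from init/1 avoids that.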

Roger Lipscombe

May 21, 2015, 11:49:16 AM
to Jesper Louis Andersen, erlang-q...@erlang.org
On 21 May 2015 at 15:58, Roger Lipscombe <ro...@differentpla.net> wrote:
> On 21 May 2015 at 15:23, Jesper Louis Andersen
> <jesper.lou...@gmail.com> wrote:
>> Because init/1 runs in the context of the supervisor process, not the
>> invoker of start_link/1. If you create an ETS table in the supervisor, for
>> instance, its protection is relative to who created it. And so is its
>> lifetime.
>
> But the supervisor's not supposed to _do_ anything, right? It only has
> Mod:init. If you want an ETS table, you should have it owned by the
> supervisor's first child, right?

OK. On thinking about this more, it makes sense: if you've got an ETS
table that every child needs to access, having it owned by the
supervisor might make sense in some scenarios. Gotcha.

Jesper Louis Andersen

May 21, 2015, 11:59:06 AM
to Roger Lipscombe, erlang-q...@erlang.org

On Thu, May 21, 2015 at 5:00 PM, Roger Lipscombe <ro...@differentpla.net> wrote:
> I need delayed _restart_. Is this what Jesper refers to when he talks
> about "a delay_manager"? Such that init queries that and then
> might/might not delay?

%% The delay manager decouples delay policy from a worker by tracking delays in one place.
%% As such, it has global knowledge and can opt to delay registered processes more or less
%% depending on current load.
-module(delay_mgr).
-behaviour(gen_server).

[..]

%% Call this from a newly started worker, but not in its init/1 callback, since that blocks the supervisor.
%% Instead, have the worker send itself a message in init/1 and make the call when it handles that message.
delay(Reg) ->
    gen_server:call(?MODULE, {delay, Reg}, infinity).

[..]

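%% Look up the delay configured for Reg and postpone the reply until it has elapsed.
handle_call({delay, Reg}, From, #state { conf = Conf } = State) ->
    Delay = maps:get(Reg, Conf),
    erlang:send_after(Delay, self(), {go, From}),
    {noreply, State};

%% The delay has elapsed: release the waiting caller.
handle_info({go, From}, State) ->
    gen_server:reply(From, ok),
    {noreply, State};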

[..]

This is a static skeleton, but you could:

* Add monitoring of delayed processes. Increase the delay for processes that respawn too often
* Decay delays for processes which operate as they should
* Add metrics and stats
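
[Editor's sketch] The calling side described in the skeleton's comment might look like the following. The module name (delayed_worker), the status/1 helper and the DelayFun hook are hypothetical; in the design above, DelayFun would be something like fun() -> delay_mgr:delay(Reg) end.

```erlang
-module(delayed_worker).
-behaviour(gen_server).

-export([start_link/1, status/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% DelayFun is an assumed hook that blocks until the worker may proceed
%% and then returns ok.
start_link(DelayFun) when is_function(DelayFun, 0) ->
    gen_server:start_link(?MODULE, DelayFun, []).

status(Pid) ->
    gen_server:call(Pid, status).

%% Do not block here: the supervisor is waiting for init/1 to return.
%% The message to self() is the first one in the mailbox.
init(DelayFun) ->
    self() ! wait_for_go,
    {ok, {waiting, DelayFun}}.

handle_call(status, _From, {Status, _DelayFun} = State) ->
    {reply, Status, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Blocking here only delays this worker, not the supervisor.
handle_info(wait_for_go, {waiting, DelayFun}) ->
    ok = DelayFun(),
    {noreply, {running, DelayFun}};
handle_info(_Other, State) ->
    {noreply, State}.
```

Requests sent to the worker while it waits simply queue up behind wait_for_go, so callers see the delay as latency rather than as a missing process.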

--
J.

Karolis Petrauskas

May 21, 2015, 1:08:53 PM
to Jesper Louis Andersen, erlang-q...@erlang.org
Another reason for the supervisor to be a behaviour is code upgrades.
The init/1 function is called on process startup and on code upgrades.

Karolis

Fred Hebert

May 21, 2015, 4:33:15 PM
to Roger Lipscombe, erlang-q...@erlang.org
On 05/21, Roger Lipscombe wrote:
>I need delayed _restart_. Is this what Jesper refers to when he talks
>about "a delay_manager"? Such that init queries that and then
>might/might not delay?

That's a classic question, and one I started answering differently.
Requiring a timeout in your supervisor rebooting function means that you
are letting things crash or restart for the wrong reason.

The thing is, it's all about the guarantees[1]. In a nutshell, a
supervisor should exit on any error, and ideally bring you back to a
known, stable state.

So of course all expected or unexpected errors should be able to bring
you back to that state properly, specifically transient errors.

But the distinction is that because supervisors boot synchronously for
all apps, they also represent a chain of dependencies of what should be
available to all processes started *after* them.

That's why failure modes such as 'one for all' or 'rest for one' exist.
They allow you to specify that the processes there are related to each
other in ways that their death violates some guarantee or invariant in
the system and that the only good way to restart is by refreshing all of
them.

In a nutshell, if you expect disconnections or events that require a
backoff to happen frequently enough that they are to be expected by the
processes depending on yours, then that connection or that event is not
a thing that should take place in your process' init function. Otherwise
you're indirectly stating that without this thing working, the system
should just not boot.

See the example in [2] for an idea of how to respect this. This does not
change the code in any major way, but moves function calls around to
properly respect these semantics.
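
[Editor's sketch] This is not the example from the book, only a rough illustration of the same idea under stated assumptions: the module name (conn_worker), the ConnectFun hook and the fixed 50 ms retry interval are all hypothetical. The worker boots in a known "disconnected" state and does the connecting outside init/1:

```erlang
-module(conn_worker).
-behaviour(gen_server).

-export([start_link/1, status/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(RETRY_MS, 50).  %% assumed fixed backoff

%% ConnectFun is an assumed hook returning {ok, Conn} | {error, Reason}.
start_link(ConnectFun) when is_function(ConnectFun, 0) ->
    gen_server:start_link(?MODULE, ConnectFun, []).

status(Pid) ->
    gen_server:call(Pid, status).

%% init/1 only guarantees that the worker exists, not that it is connected.
init(ConnectFun) ->
    self() ! reconnect,
    {ok, #{connect => ConnectFun, conn => undefined}}.

handle_call(status, _From, State = #{conn := undefined}) ->
    {reply, {error, disconnected}, State};
handle_call(status, _From, State = #{conn := Conn}) ->
    {reply, {ok, Conn}, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% A failed connection is an expected event, so retry instead of crashing.
handle_info(reconnect, State = #{connect := ConnectFun}) ->
    case ConnectFun() of
        {ok, Conn} ->
            {noreply, State#{conn := Conn}};
        {error, _Reason} ->
            erlang:send_after(?RETRY_MS, self(), reconnect),
            {noreply, State}
    end;
handle_info(_Other, State) ->
    {noreply, State}.
```

Callers get a clean {error, disconnected} while the resource is away, and the supervisor's restart intensity is never consumed by an outage it cannot fix.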

My position is that this isn't a problem with supervisors' interface,
but in how they are being used and what they mean for your system. I know
this is not the most helpful response, but oh well.


[1]: http://ferd.ca/it-s-about-the-guarantees.html
[2]: http://www.erlang-in-anger.com, section 2.2.3

Christopher Phillips

May 22, 2015, 8:08:57 AM
to Roger Lipscombe, erlang-q...@erlang.org

Message: 17
Date: Thu, 21 May 2015 16:32:40 -0400
From: Fred Hebert <mono...@ferd.ca>
To: Roger Lipscombe <ro...@differentpla.net>
Cc: "erlang-q...@erlang.org" <erlang-q...@erlang.org>
Subject: Re: [erlang-questions] Why have a supervisor behaviour?
Message-ID: <20150521203...@ferdair.local>
Content-Type: text/plain; charset=us-ascii; format=flowed
I wanted to add: every time I've seen this pattern brought up (a supervisor with a delayed restart), it's been because something on the network has become unavailable, or overloaded to the point where it can't respond in a reasonable amount of time. The developers involved were seeing restart limits being hit and the system coming down, even though there was every reason to expect the resource to become available again in time.

If that's the case for you, those types of issues would likely be better addressed using a circuit breaker or similar[1] than a custom supervisor, for the reasons Fred mentions.

In general, having this thing be unavailable is either due to bad internal state (in which case a supervisor can help you), or something external to your program (in which case the supervisor can't), and in the latter case you should be thinking about how the system should behave when it's unavailable (since it would be unavailable even with a delayed supervisor in any case). Making it a clean response that means "this resource is not available" allows you to address it in a well defined way, rather than having a child process that may or may not be there being called from elsewhere in your system. 

[1]: https://github.com/jlouis/fuse

Roger Lipscombe

May 22, 2015, 9:51:52 AM
to Fred Hebert, erlang-q...@erlang.org
On 21 May 2015 at 21:32, Fred Hebert <mono...@ferd.ca> wrote:
> [...a bunch of useful stuff, but most importantly this next bit...]
> My position is that this isn't a problem with supervisors' interface, but in
> how they are being used and what they mean for your system. I know this is
> not the most helpful response, but oh well.

It turns out that I probably don't need a supervisor at all, then.

Currently I have:

- A supervisor, which supervises:
- A collection of host processes, each of which owns:
- An instance of a Squirrel VM [1] implemented in a C++ NIF.
- When the Squirrel VM wishes to communicate with its host process, it
sends it a message.
- For some of those messages (divide by zero exception, syntax error,
etc.), my host process responds via
handle_info({exception, Whatever}, State) -> {stop, {shutdown,
Whatever}, State}.
- This causes the supervisor to restart the host, which fires up a
fresh instance of the Squirrel VM.

Because the Squirrel VM is running arbitrary code, it can get itself
into a state where that code constantly crashes, so the host kills
itself, and the supervisor restarts it constantly. My existing custom
supervisor doesn't handle restart intensity, for precisely this
reason. If this happens really quickly, it can lead to bad effects
downstream (log spamming, etc.). Hence the business requirement to
delay the restart to give the rest of the system a breather.

It seems, however, that I *don't* really want a supervisor to handle
restarting the Squirrel VM; it looks like the host should do it, and I
might be able to remove my custom supervisor in favour of a standard
'simple_one_for_one' supervisor to handle crashes in the host process.
Not sure about that last -- I don't want one process hitting max
restart intensity to bring down the other host processes.
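
[Editor's sketch] One way the host could restart the VM itself, after a delay, instead of stopping and relying on a supervisor. Everything here is an assumption for illustration: the module name (vm_host), the StartVm hook standing in for the NIF call, the status/1 helper and the 100 ms delay.

```erlang
-module(vm_host).
-behaviour(gen_server).

-export([start_link/1, status/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(RESTART_DELAY_MS, 100).  %% assumed value

%% StartVm is an assumed hook that creates a fresh VM and returns a handle.
start_link(StartVm) when is_function(StartVm, 0) ->
    gen_server:start_link(?MODULE, StartVm, []).

status(Pid) ->
    gen_server:call(Pid, status).

init(StartVm) ->
    {ok, #{start => StartVm, vm => StartVm()}}.

handle_call(status, _From, State = #{vm := undefined}) ->
    {reply, down, State};
handle_call(status, _From, State = #{vm := Vm}) ->
    {reply, {up, Vm}, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% The VM reported a fatal error: drop it and schedule a delayed restart.
%% The host process itself stays alive and can answer status requests.
handle_info({exception, _Whatever}, State) ->
    erlang:send_after(?RESTART_DELAY_MS, self(), restart_vm),
    {noreply, State#{vm := undefined}};
handle_info(restart_vm, State = #{start := StartVm}) ->
    {noreply, State#{vm := StartVm()}};
handle_info(_Other, State) ->
    {noreply, State}.
```

With this shape, a VM that crashes in a loop costs one timer per crash and never touches the supervisor's restart intensity, so it cannot bring down the other hosts.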

[1] http://www.squirrel-lang.org/

Fred Hebert

May 22, 2015, 10:19:34 AM
to Roger Lipscombe, erlang-q...@erlang.org
On 05/22, Roger Lipscombe wrote:
>It turns out that I probably don't need a supervisor at all, then.
>
> [project description]
>
>It seems, however, that I *don't* really want a supervisor to handle
>restarting the Squirrel VM; it looks like the host should do it, and I
>might be able to remove my custom supervisor in favour of a standard
>'simple_one_for_one' supervisor to handle crashes in the host process.
>Not sure about that last -- I don't want one process hitting max
>restart intensity to bring down the other host processes.
>

Ah that's interesting. To reason about this, one question to ask is:
what is it that your system guarantees to its subsequent processes? So
if you have some form of front-end or client handling the order of
spawning and restarting a VM (who do you do it on behalf of?), there's
likely a restricted set of operations you provide, right?

Something like:

- Run task
- Interrupt task
- Get task status or state report
- Has the task completed?

Or possibly, if you're going event-based, the following events are to be
expected:

- Task accepted
- VM booted
- VM failed
- Task aborted
- Task completion

Those are probably things you expect to provide and should work fine,
because those are the kinds of failures you do expect all the time from
the Squirrel VM itself. Furthermore, it's possible you'd eventually add
in a backpressure mechanism ("only 10 VMs can run at a time for a user")
or something like that. This means what you might want is the host
process to always be able to provide that information, and isolate your
user from the VM process' fickle behaviour.

So what does this tell us? What you guarantee when the supervision tree
is booted is therefore:

- I can contact the system to know if I can host a VM and run it
- Once I am given a process, there's a manager (the host process) I can
talk to or expect to get information from.

There is no guarantee about the Squirrel VM being up and running and
available; there's a good likelihood it's gonna be there, but in
reality, it can go terribly bad and we just can't pretend it's not gonna
take place.

This means that these two types of processes are those you want to be
ready and available as soon as 'init/1' has been executed. That a VM is
available or not is not core functionality; what's core is that you can
ask to get one, and know if it didn't work.

To really help figure this out, simply ask "Can my system still run if X
is not there?" If it can run without it, then your main recovery
mechanism should probably not be the supervisor through failed `init/1`
calls; it's a thing that likely becomes your responsibility as a
developer because it's a common event. It might need to move to
`handle_info/2`. If the system can't run without it, encode it in the
`init/1` function. It's a guarantee you have to make.

You'll find out that for some database connections, it's true. For some
it's not and the DB *needs* to be there for the system to make sense.
The supervisors then let you encode these requirements in your program
structure, and their boot and shutdown sequences. Same for anything you
may depend on.

Does this make sense?

Then to pick the exact supervision strategy and error handling
mechanism, you can ask yourself what do you do when the host process
dies. Can a new one take its place seamlessly? If not, then it's
possible the error needs to bubble up (through a monitor or some
message) to the caller so *they* decide whether to give up or try again.
If you can make it transparently or it's a best effort mechanism, then
yeah, just restarting the worker is enough.

"Let it crash" is a fun fun way to get going and to grow a system, but
when it has reached some level of growth, we can't avoid starting to
really reason about how we want things to fail. It lets us slowly
discover the properties we want to expose to our users, and after a few
solid crashes, it's entirely fine to reorganize a few bits of code to
reflect the real world and its constraints.

What's great is that we've got all the building blocks and tools to
reason about it and implement the solution properly.

Regards,
Fred.

Roger Lipscombe

May 22, 2015, 11:02:57 AM
to Fred Hebert, erlang-q...@erlang.org
On 22 May 2015 at 15:19, Fred Hebert <mono...@ferd.ca> wrote:
> On 05/22, Roger Lipscombe wrote:
>>
>> It turns out that I probably don't need a supervisor at all, then.
>
> Ah that's interesting. To reason about this, one question to ask is: what is
> it that your system guarantees to its subsequent processes. So if you have
> some form of front-end or client handling the order of spawning and
> restarting a VM (who do you do it on behalf of?), there's likely a
> restricted set of operations you provide, right?

Yes:
- Find (or start) a VM by ID.
- Stop a VM, either by ID or by sending it a message.
- Send a message to a VM. Some of these are gen_server:call, because
either we need back-pressure, or we need a response; some are
gen_server:cast (or Pid ! foo), because we don't.

> To really help figure this out, simply ask "Can my system still run if X is
> not there?"

In this case, yes. If we get into a situation where we're consistently
failing to start the squirrel VM, or if they're _all_ consistently
failing, we'll spot that through metrics or another form of alerting.

> Does this make sense?

Yes. Absolutely. I need the squirrel_vm_manager (to give it a name) to
be up, so that I can find_or_start and stop the VMs. Whether a
particular VM is up or not is not a problem for *my* application as a
whole.
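
[Editor's sketch] A rough outline of such a manager with find_or_start semantics. The module name (vm_manager) and the StartFun hook are assumptions; in a real system StartFun would probably start a host under a simple_one_for_one supervisor.

```erlang
-module(vm_manager).
-behaviour(gen_server).

-export([start_link/1, find_or_start/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% StartFun is an assumed hook: fun(Id) -> {ok, Pid}.
start_link(StartFun) when is_function(StartFun, 1) ->
    gen_server:start_link(?MODULE, StartFun, []).

find_or_start(Mgr, Id) ->
    gen_server:call(Mgr, {find_or_start, Id}).

init(StartFun) ->
    {ok, #{start => StartFun, hosts => #{}}}.

%% Return the existing host for Id, or start one and remember it.
handle_call({find_or_start, Id}, _From,
            State = #{start := StartFun, hosts := Hosts}) ->
    case maps:find(Id, Hosts) of
        {ok, Pid} ->
            {reply, {ok, Pid}, State};
        error ->
            {ok, Pid} = StartFun(Id),
            erlang:monitor(process, Pid),
            {reply, {ok, Pid}, State#{hosts := maps:put(Id, Pid, Hosts)}}
    end.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Forget hosts that have died, so the next find_or_start starts a new one.
handle_info({'DOWN', _Ref, process, Pid, _Reason}, State = #{hosts := Hosts}) ->
    Alive = maps:filter(fun(_Id, P) -> P =/= Pid end, Hosts),
    {noreply, State#{hosts := Alive}};
handle_info(_Other, State) ->
    {noreply, State}.
```

The manager is the part that must be up once the supervision tree has booted; whether any individual host or VM is alive is answered at call time.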

Thanks,
Roger.