[erlang-questions] About behavior of OTP's supervisor-worker architecture

Tushar Deshpande

Sep 13, 2010, 4:47:01 PM

to Erlang-Questions Questions

Hi,

I've a question about OTP's supervisor-worker architecture.

I understand that OTP allows us to write fault-tolerant apps.
This is made possible by the supervisor-worker architecture. A
supervisor manages several workers. If a worker (or a group
of workers) fails, the supervisor is able to restart it. The worker
is restarted and it resumes with the same state that it had
before the crash.

Now, let's consider the following situation.

A worker process has two possible implementations, P and Q.
Worker P runs under normal conditions. Worker Q is supposed
to run in case P fails.

If worker P crashes, the supervisor is notified of the crash.
Typically, the supervisor would restart worker P.

But, I would like the supervisor to behave in a different manner.
In case the worker P fails, the supervisor should start the worker Q.
The worker Q should begin its execution with the same state that
P had at the point of crash.

Is it possible to write an OTP application that does this? If so,
do I need to customize the supervisor code?

Best Regards,

Tushar Deshpande

Torben Hoffmann

Sep 13, 2010, 5:24:18 PM

to Tushar Deshpande, Erlang-Questions Questions

Hi Tushar,

The OTP supervisor module does not support your desired behaviour, but you
can use the supervisor and the monitor facility of Erlang to implement it.

A rough design:
The supervisor should have dynamic children. In the function that calls
supervisor:start_child/2, the pid of the started child should be passed on
to a separate gen_server process, which calls erlang:monitor/2 on it.
When the child dies, the monitoring process will get a 'DOWN' message, which
can carry enough information to start a new process: you just have to
include the state data of the process in the Reason for termination.
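
A minimal sketch of the monitoring process described above, under several assumptions that are not in the thread: the dynamic supervisor is registered locally as 'dynamic_sup', the two worker modules are called worker_p and worker_q, each exports start_link/1 taking the state to start from, and worker P terminates with exit({crashed, State}) so that its state travels in the exit reason. Children are started as 'temporary' so that the supervisor itself does not restart them and the decision is left to the monitor.

```erlang
-module(th_monitor_proc).
-behaviour(gen_server).

-export([start_link/0, start_worker/1, failover/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Hypothetical names: dynamic_sup, worker_p, worker_q.

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Start worker P under dynamic_sup and monitor it.
start_worker(InitState) ->
    gen_server:call(?MODULE, {start, worker_p, InitState}).

%% Decide what to do when a monitored worker goes down.
%% Assumes P exits with {crashed, State}; anything else is ignored.
failover(worker_p, {crashed, State}) -> {start, worker_q, State};
failover(_Module, _Reason) -> ignore.

%% Server state: a map from monitor reference to worker module.
init([]) ->
    {ok, #{}}.

handle_call({start, Module, State}, _From, Monitors) ->
    {Pid, Monitors1} = start_and_monitor(Module, State, Monitors),
    {reply, {ok, Pid}, Monitors1}.

handle_cast(_Msg, Monitors) ->
    {noreply, Monitors}.

handle_info({'DOWN', MRef, process, _Pid, Reason}, Monitors) ->
    case maps:take(MRef, Monitors) of
        {Module, Rest} ->
            case failover(Module, Reason) of
                {start, Next, State} ->
                    {_, Monitors1} = start_and_monitor(Next, State, Rest),
                    {noreply, Monitors1};
                ignore ->
                    {noreply, Rest}
            end;
        error ->
            {noreply, Monitors}
    end;
handle_info(_Other, Monitors) ->
    {noreply, Monitors}.

%% Ask the supervisor to start the child, then monitor the new pid.
start_and_monitor(Module, State, Monitors) ->
    Spec = #{id => make_ref(),
             start => {Module, start_link, [State]},
             restart => temporary},
    {ok, Pid} = supervisor:start_child(dynamic_sup, Spec),
    MRef = erlang:monitor(process, Pid),
    {Pid, Monitors#{MRef => Module}}.
```

Note that an ordinary crash (a badmatch, say) does not put the worker's state into the exit reason by itself, so in this sketch worker P has to catch the failure and exit with the state explicitly.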

So you have to write some code to make it work, but you can reuse the
supervisor and implement the monitoring process using gen_server. If you are
really keen on this you can even implement your own OTP module with its own
behaviour and all, but I would recommend that you get the thing to work
first in order to avoid too many balls in the air in the beginning.

Cheers,
Torben

--
http://www.linkedin.com/in/torbenhoffmann

Tushar Deshpande

Sep 14, 2010, 12:11:21 AM

to Torben Hoffmann, Erlang-Questions Questions

Hi Torben,

I appreciate your help a lot.

I think that I could build a prototype system without
using OTP. As you suggested, I can use pure erlang
processes with one process as a monitor and other
two processes as workers.

Best Regards,

Tushar Deshpande

________________________________________________________________
erlang-questions (at) erlang.org mailing list.
See http://www.erlang.org/faq.html
To unsubscribe; mailto:erlang-questio...@erlang.org

Torben Hoffmann

Sep 14, 2010, 2:06:43 AM

to Tushar Deshpande, Erlang-Questions Questions

Hi Tushar!

Use OTP for everything!
Add the missing special functionality with plain Erlang.
If it is common enough, make your own behaviour.

I have tried to code without OTP and ended up coding gen_server and gen_fsm
myself.

Cheers,
Torben

Scott Lystig Fritchie

Sep 14, 2010, 12:45:43 PM

to Torben Hoffmann, Tushar Deshpande, Erlang-Questions Questions

Hi Torben & Tushar & fellow hackers.

Torben Hoffmann <torben...@gmail.com> wrote:

th> The OTP supervisor module does not support your desired behaviour,
th> but you can use the supervisor and the monitor facility of Erlang to
th> implement it.

Torben's description is certainly one way to do it. It looks simpler
than what I was going to describe ... and it'll probably work, depending
on the nature of the application.

You would probably put Torben's gen_server-based monitoring proc
elsewhere in the supervisor hierarchy, e.g.

... where 'th_monitor_proc' is the monitoring proc that Torben
described.

The first problem is that 'th_monitor_proc' will start before
'dynamic_sup' starts, because supervisors start their children from
left-to-right, and a static supervisor won't finish its init until all
of its children have initialized. One (usually unstated) reason
supervisors are so useful is that they start everything in a
deterministic, well-ordered manner ... that's very important when
controlling hardware (which can be very fussy about the order of
operations) or software with strict must-run-before dependencies.

So, it would be better to flip the tree like this:

                top_sup
                   |
         +---------+-----------+
         |                     |
    dynamic_sup           static_sup
                               |
                        th_monitor_proc

Then we don't have who-started-first problems at startup. Then the
'th_monitor_proc' can start the dynamic children:

The 'top_sup' supervisor should use a one_for_all restart strategy.
After all, the 'dynamic_sup' might fail ... unlikely ... but one of my
favorite testing gimmicks is to run the 'appmon' application, get the
tree view of supervisor & worker processes, then start clicking 'Kill'
on random processes in the tree. :-) Try it. Watching the restarts is
entertaining *and* instructive. And it helps demonstrate how the
various restart strategies behave.
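
A minimal sketch of what such a 'top_sup' could look like, assuming that 'dynamic_sup' and 'static_sup' are supervisor modules exporting start_link/0 (those module interfaces and the intensity/period values are illustrative assumptions, not from the thread). The child list order is the start order, so 'dynamic_sup' is up before 'static_sup' starts 'th_monitor_proc'.

```erlang
-module(top_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% one_for_all: if either child supervisor dies, both are restarted,
    %% so th_monitor_proc never outlives dynamic_sup.
    %% intensity/period are placeholder values for this sketch.
    SupFlags = #{strategy => one_for_all, intensity => 1, period => 5},
    Children = [
        %% Started first, therefore stopped last on shutdown.
        #{id => dynamic_sup,
          start => {dynamic_sup, start_link, []},
          type => supervisor},
        %% Started second; hypothetically owns th_monitor_proc.
        #{id => static_sup,
          start => {static_sup, start_link, []},
          type => supervisor}
    ],
    {ok, {SupFlags, Children}}.
```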

You probably do not want to do this:

... especially if 'another_sup' has a different restart strategy. If
it's possible for 'dynamic_sup' to be killed while 'th_monitor_proc' is
alive, then there is a race in which 'th_monitor_proc' tries to restart
a child but fails because 'dynamic_sup' isn't alive. That situation is
best avoided.

-Scott

P.S. An astute reader might have a question about the tree below. The
question is, "What if the 'top_sup' wants to shut down? Won't
'dynamic_sup' be killed first, and won't I still have the same problem
where 'th_monitor_proc' is alive but 'dynamic_sup' is dead?"

                top_sup
                   |
         +---------+-----------+
         |                     |
    dynamic_sup           static_sup
                               |
                        th_monitor_proc

The answer is "No". The Design Principles guide says:

5.10 Stopping

Since the supervisor is part of a supervision tree, it will
automatically be terminated by its supervisor. When asked to shutdown,
it will terminate all child processes in reversed start order
according to the respective shutdown specifications, and then
terminate itself.

P.P.S. Another solution is to write a module R that calls module P's
callbacks the first time that it's run and calls module Q's callbacks
when restarted. The "problem" is then shifted to making R be able to
figure out if a child is running the first time or it's been restarted.
You could write a file in the file system, or keep something in Mnesia
or a system-wide ETS table, or other stateful thing.

Managing such state usually costs more than it is worth, but it can be
useful in some situations, especially if you already have to manage
state like that for other parts of the app.
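
A minimal sketch of the "figure out if it's the first run" part of module R, using the ETS option. The module name worker_r, the table name worker_r_runs, and the helper functions are hypothetical. It assumes the table is created by some long-lived process (an ETS table disappears when its owner dies), and it only picks the callback module; it does not carry over P's state.

```erlang
-module(worker_r).

-export([ensure_table/0, pick_impl/1]).

%% Create the run-counter table if it does not exist yet.
%% Call this from a long-lived process, since that process owns the table.
ensure_table() ->
    case ets:info(worker_r_runs) of
        undefined -> ets:new(worker_r_runs, [named_table, public, set]);
        _ -> worker_r_runs
    end.

%% Count this start for worker Id and choose the callback module:
%% worker_p on the first start, worker_q on every restart.
pick_impl(Id) ->
    %% The default object {Id, 0} is inserted if Id is not in the table yet.
    case ets:update_counter(worker_r_runs, Id, 1, {Id, 0}) of
        1 -> worker_p;
        _ -> worker_q
    end.
```

Module R's init would call pick_impl/1 with its own identifier and then dispatch to the returned module's callbacks.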

-s
