I've a question about OTP's supervisor-worker architecture.
I understand that OTP allows us to write fault-tolerant apps.
This is made possible by supervisor-worker architecture. A
supervisor manages several workers. If a worker (or a group
of workers) fails, the supervisor is able to restart it. The worker
is restarted and it resumes with the same state that it had
before the crash.
Now, let's consider the following situation.
A worker process has two possible implementations, P and Q.
Worker P runs under normal conditions. Worker Q is supposed
to run in case P fails.
If worker P crashes, the supervisor is notified about the crash.
Typically, the supervisor would restart worker P.
But, I would like the supervisor to behave in a different manner.
In case the worker P fails, the supervisor should start the worker Q.
The worker Q should begin its execution with the same state that
P had at the point of crash.
Is it possible to write an OTP application that does this? If yes,
then do I need to customize the supervisor code?
Best Regards,
Tushar Deshpande
The OTP supervisor module does not support your desired behaviour, but you
can use the supervisor and the monitor facility of Erlang to implement it.
A rough design:
The supervisor should have dynamic children. In the function that calls
supervisor:start_child/2, pass the pid of the started child on to a separate
gen_server process, which then calls erlang:monitor/2 on that pid.
When the child dies the monitoring process will get a 'DOWN' message which
ought to contain enough information to start a new process - you just have
to include the state data of the process in the Reason for termination.
So you have to write some code to make it work, but you can reuse the
supervisor and implement the monitoring process using gen_server. If you are
really keen on this you can even implement your own OTP module with its own
behaviour and all, but I would recommend that you get the thing to work
first in order to avoid too many balls in the air in the beginning.
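
A minimal sketch of the monitoring gen_server described above, under several
assumptions that are not in the original message: 'dynamic_sup' is a locally
registered one_for_one supervisor with no static children, 'worker_p' and
'worker_q' are hypothetical modules that export start_link/1, and worker P
cooperates by exiting with the reason {crashed, State}. A crash with any other
reason carries no state and is simply forgotten here.

```erlang
-module(th_monitor_proc).
-behaviour(gen_server).

-export([start_link/0, start_worker/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Start worker P with an initial state and begin monitoring it.
start_worker(InitState) ->
    gen_server:call(?MODULE, {start_worker, InitState}).

%% The server state is a map from monitor reference to worker module.
init([]) ->
    {ok, #{}}.

handle_call({start_worker, InitState}, _From, Monitored) ->
    {reply, ok, start_and_monitor(worker_p, InitState, Monitored)}.

handle_cast(_Msg, Monitored) ->
    {noreply, Monitored}.

%% Assumed convention: P exits with {crashed, LastState}.
handle_info({'DOWN', Ref, process, _Pid, {crashed, LastState}}, Monitored) ->
    case maps:take(Ref, Monitored) of
        {worker_p, Rest} ->
            %% P failed: start Q with the state P had when it died.
            {noreply, start_and_monitor(worker_q, LastState, Rest)};
        {_Other, Rest} ->
            {noreply, Rest};
        error ->
            {noreply, Monitored}
    end;
handle_info({'DOWN', Ref, process, _Pid, _Reason}, Monitored) ->
    %% No state in the exit reason, so there is nothing to hand over.
    {noreply, maps:remove(Ref, Monitored)};
handle_info(_Info, Monitored) ->
    {noreply, Monitored}.

start_and_monitor(Module, State, Monitored) ->
    %% 'temporary' keeps the supervisor from restarting the child itself.
    Spec = #{id => make_ref(),
             start => {Module, start_link, [State]},
             restart => temporary},
    {ok, Pid} = supervisor:start_child(dynamic_sup, Spec),
    Ref = erlang:monitor(process, Pid),
    Monitored#{Ref => Module}.
```

If the child dies between start_child/2 and monitor/2, the 'DOWN' message
arrives with reason noproc and the state is lost; a real implementation would
need to decide how to handle that case.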
Cheers,
Torben
I appreciate your help a lot.
I think that I could build a prototype system without
using OTP. As you suggested, I can use pure erlang
processes with one process as a monitor and other
two processes as workers.
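
A prototype along those lines might look like the sketch below. The module
name 'failover_demo', the counter used as worker state, and the exit reasons
{crashed, State} and {done, State} are all made up for illustration.

```erlang
-module(failover_demo).
-export([run/0]).

%% Starts the monitor process and waits for the value that Q finishes with.
run() ->
    Parent = self(),
    Monitor = spawn(fun() -> monitor_loop(Parent) end),
    receive
        {Monitor, Result} -> Result
    after 5000 ->
        timeout
    end.

%% Runs P, and when P dies with {crashed, State}, runs Q from that state.
monitor_loop(Parent) ->
    {_PPid, PRef} = spawn_monitor(fun() -> worker_p(0) end),
    receive
        {'DOWN', PRef, process, _, {crashed, State}} ->
            {_QPid, QRef} = spawn_monitor(fun() -> worker_q(State) end),
            receive
                {'DOWN', QRef, process, _, {done, Final}} ->
                    Parent ! {self(), Final}
            end
    end.

%% Worker P counts up and "crashes" at 3, passing its state in the exit reason.
worker_p(3) -> exit({crashed, 3});
worker_p(N) -> worker_p(N + 1).

%% Worker Q resumes from the inherited state and stops at 5.
worker_q(5) -> exit({done, 5});
worker_q(N) -> worker_q(N + 1).
```

Calling failover_demo:run() returns 5: P counts from 0 to 3 and exits, and Q
continues from 3 to 5.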
Best Regards,
Tushar Deshpande
________________________________________________________________
erlang-questions (at) erlang.org mailing list.
Use OTP for everything!
Add the missing special functionality with plain Erlang.
If common enough make your own behaviour.
I have tried to code without OTP and ended up coding gen_server and gen_fsm
myself.
Cheers,
Torben
Torben Hoffmann <torben...@gmail.com> wrote:
th> The OTP supervisor module does not support your desired behaviour,
th> but you can use the supervisor and the monitor facility of Erlang to
th> implement it.
Torben's description is certainly one way to do it. It looks simpler
than what I was going to describe ... and it'll probably work, depending
on the nature of the application.
You would probably put Torben's gen_server-based monitoring proc
elsewhere in the supervisor hierarchy, e.g.
                top_sup
                   |
         +---------+-----------+
         |                     |
    static_sup            dynamic_sup
         |                 |   |   |
  th_monitor_proc       proc1 proc2 proc3
... where 'th_monitor_proc' is the monitoring proc that Torben
described.
The first problem is that 'th_monitor_proc' will start before
'dynamic_sup' starts, because supervisors start their children from
left-to-right, and a static supervisor won't finish its init until all
of its children have initialized. One (usually unstated) reason
supervisors are so useful is that they start everything in a
deterministic, well-ordered manner ... that's very important when
controlling hardware (which can be very fussy about the order of
operations) or software with strict must-run-before dependencies.
So, it would be better to flip the tree like this:
                top_sup
                   |
         +---------+-----------+
         |                     |
    dynamic_sup           static_sup
                               |
                        th_monitor_proc
Then we don't have who-started-first problems at startup. Then the
'th_monitor_proc' can start the dynamic children:
                top_sup
                   |
         +---------+-----------+
         |                     |
    dynamic_sup           static_sup
     |   |   |                 |
 proc1 proc2 proc3      th_monitor_proc
The 'top_sup' supervisor should use a one_for_all restart strategy.
After all, the 'dynamic_sup' might fail ... unlikely ... but one of my
favorite testing gimmicks is to run the 'appmon' application, get the
tree view of supervisor & worker processes, then start clicking 'Kill'
on random processes in the tree. :-) Try it. Watching the restarts is
entertaining *and* instructive. And it helps demonstrate what the
various restart strategies do.
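
A sketch of what such a 'top_sup' could look like. The intensity and period
values are arbitrary placeholders, and 'dynamic_sup' and 'static_sup' are
assumed to be supervisor modules that export start_link/0.

```erlang
-module(top_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% one_for_all: if either child supervisor dies, both are restarted.
    SupFlags = #{strategy => one_for_all, intensity => 3, period => 10},
    Children = [
        %% Listed first, so it is already running when th_monitor_proc
        %% starts adding children to it.
        #{id => dynamic_sup,
          start => {dynamic_sup, start_link, []},
          type => supervisor,
          shutdown => infinity},
        %% Listed second; it supervises th_monitor_proc.
        #{id => static_sup,
          start => {static_sup, start_link, []},
          type => supervisor,
          shutdown => infinity}
    ],
    {ok, {SupFlags, Children}}.
```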
You probably do not want to do this:
                   top_sup
                      |
            +---------+-----------+
            |                     |
       another_sup           static_sup
            |                     |
   +--------+-----+        th_monitor_proc
   |              |
dynamic_sup  other_whatever
 |   |   |
proc1 proc2 proc3
... especially if 'another_sup' has a different restart strategy. If
it's possible for 'dynamic_sup' to be killed while 'th_monitor_proc' is
alive, then there is a race where 'th_monitor_proc' tries to restart a
child but fails because 'dynamic_sup' isn't alive ... that situation is
best avoided.
-Scott
P.S. An astute reader might have a question about the tree below. The
question is, "What if the 'top_sup' wants to shut down? Won't
'dynamic_sup' be killed first, and won't I still have the same problem
where 'th_monitor_proc' is alive but 'dynamic_sup' is dead?"
                top_sup
                   |
         +---------+-----------+
         |                     |
    dynamic_sup           static_sup
                               |
                        th_monitor_proc
The answer is "No". The Design Principles guide says:
5.10 Stopping
Since the supervisor is part of a supervision tree, it will
automatically be terminated by its supervisor. When asked to shutdown,
it will terminate all child processes in reversed start order
according to the respective shutdown specifications, and then
terminate itself.
P.P.S. Another solution is to write a module R that calls module P's
callbacks the first time that it's run and calls module Q's callbacks
when restarted. The "problem" is then shifted to making R be able to
figure out if a child is running the first time or it's been restarted.
You could write a file in the file system, or keep something in Mnesia
or a system-wide ETS table, or other stateful thing.
Managing such state usually costs more than it is worth, but it can be
useful in some situations, especially if you already have to manage
state like that for other parts of the app.
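
One way the module R idea could be sketched, using a system-wide ETS table as
the "stateful thing". The names 'worker_r', 'worker_r_runs', 'worker_p' and
'worker_q' are hypothetical, and this version picks the implementation once at
start time rather than dispatching each callback.

```erlang
-module(worker_r).
-export([init_tracking/0, start_link/1]).

%% Call once from a long-lived process (for example in the init/1 of the
%% top supervisor), so that the table outlives individual worker crashes.
init_tracking() ->
    ets:new(worker_r_runs, [named_table, public, set]).

%% Use as the start function in the child spec instead of P or Q directly.
start_link(Id) ->
    %% update_counter/4 inserts {Id, 0} if the key is absent and then
    %% increments it: the first start sees 1, every restart sees more.
    Run = ets:update_counter(worker_r_runs, Id, 1, {Id, 0}),
    Module = case Run of
                 1 -> worker_p;   % first run
                 _ -> worker_q    % restarted after a failure
             end,
    Module:start_link(Id).
```

Note that this only tells R that it has been restarted; it does not carry over
the state P had when it crashed.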
-s