Maybe we need to have a staged startup:
1. Check if we need to run the installer
2. Check if we need to upgrade
3. Start the system
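To make the three steps concrete, here is a minimal sketch in Python (not Zotonic's Erlang) that runs them as a strict sequence in which each check gates its own stage. The callables `needs_install`, `run_installer`, `needs_upgrade`, `run_upgrade` and `start_system` are hypothetical hooks, not an existing API.

```python
from typing import Callable, List


def staged_startup(
    needs_install: Callable[[], bool],
    run_installer: Callable[[], None],
    needs_upgrade: Callable[[], bool],
    run_upgrade: Callable[[], None],
    start_system: Callable[[], None],
) -> List[str]:
    """Run the startup stages strictly in order and return which ones ran.

    All five callables are hypothetical hooks, not an existing Zotonic API.
    """
    ran: List[str] = []
    # Stage 1: only run the installer on a site that has not been installed yet.
    if needs_install():
        run_installer()
        ran.append("install")
    # Stage 2: the upgrade check runs after the install stage, so it sees the installed schema.
    if needs_upgrade():
        run_upgrade()
        ran.append("upgrade")
    # Stage 3: start the normal site processes last.
    start_system()
    ran.append("start")
    return ran


# Example: a fresh site whose installer already produces the current version.
print(staged_startup(lambda: True, lambda: None, lambda: False, lambda: None, lambda: None))
# prints ['install', 'start']
```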
The question is: can we do that by just loading different sets of children into the site supervisor?
Otherwise we need a more complex mechanism where modules wait for a certain kind of startup signal.
Another option is to rework z_supervisor into a z_supervisor_staged.
There we could give each task a 'stage', and only continue loading the next stage of processes once all tasks from the previous stage have reported back.
A crash should restart the whole process (I think).
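A rough model of the z_supervisor_staged idea, assuming tasks are plain Python callables that "report back" by returning a value: the tasks of one stage run concurrently, `Future.result()` acts as the barrier before the next stage starts, and any crash restarts the whole sequence from the first stage, up to a hypothetical `max_restarts` limit. It only illustrates the control flow; a real version would be an Erlang supervisor.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Task = Callable[[], str]


def run_staged(stages: Dict[int, List[Task]], max_restarts: int = 3) -> List[str]:
    """Start tasks stage by stage, in ascending stage number.

    The next stage is only started once every task of the current stage has
    reported back. If any task crashes, the whole staged startup is restarted
    from the first stage, at most `max_restarts` times.
    """
    if max_restarts < 0:
        raise ValueError("max_restarts must be >= 0")
    for attempt in range(max_restarts + 1):
        reports: List[str] = []
        try:
            for stage in sorted(stages):
                # Start all tasks of this stage concurrently.
                with ThreadPoolExecutor() as pool:
                    futures = [pool.submit(task) for task in stages[stage]]
                    # Barrier: result() blocks until the task reports back and
                    # re-raises the task's exception if it crashed.
                    reports.extend(f.result() for f in futures)
            return reports
        except Exception:
            # A crash anywhere restarts everything, unless we are out of restarts.
            if attempt == max_restarts:
                raise
    raise AssertionError("unreachable")


attempts = {"db": 0}


def start_db() -> str:
    attempts["db"] += 1
    if attempts["db"] == 1:
        raise ConnectionError("db still warming up")  # the first start crashes
    return "db"


print(run_staged({1: [start_db], 3: [lambda: "site_a", lambda: "site_b"]}))
# prints ['db', 'site_a', 'site_b']
```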
> The question is: can we do that by just loading different sets of children into the site supervisor?
> Otherwise we need a more complex mechanism where modules wait for a certain kind of startup signal.
Dynamic children are a possibility. I was thinking of separate supervisors for each stage. For the installer you need to have the db running, etc., etc. That would also help with a problem I sometimes have: during the shutdown phase of a restart, the db is already gone and some processes can't save their stuff.
> Another option is to rework z_supervisor into a z_supervisor_staged.
Hmm, do we really need that?
> There we could give each task a 'stage', and only continue loading the next stage of processes once all tasks from the previous stage have reported back.
> A crash should restart the whole process (I think).
That depends on where the crash is, I guess. If you crash during the install stage, there is not much point in trying to restart forever. If the system has already been started, the supervisor can just restart that part without running the installer and update checks. Or not?
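The restart rule proposed here can be written as a small decision function. This is a hypothetical sketch: the stage names `"install"`, `"upgrade"` and `"system"` and the `Action` values are made up for illustration.

```python
from enum import Enum


class Action(Enum):
    GIVE_UP = "give_up"              # stop the site and report the error
    RESTART_ALL = "restart_all"      # rerun the install/upgrade checks, then start
    RESTART_CHILD = "restart_child"  # restart only the crashed part


def on_crash(stage: str, system_started: bool) -> Action:
    """Decide what to do when a child crashes, based on where the crash happened.

    `stage` is one of the hypothetical names "install", "upgrade" or "system".
    """
    if system_started:
        # The install and upgrade checks already passed; don't run them again.
        return Action.RESTART_CHILD
    if stage == "install":
        # Restarting a failing installer forever is pointless.
        return Action.GIVE_UP
    # Crash during startup after the install stage: redo the whole sequence.
    return Action.RESTART_ALL


print(on_crash("install", system_started=False).value)  # give_up
```

In OTP terms, the "don't restart forever" part corresponds to a supervisor's restart intensity and period: once its children restart more often than allowed within that window, the supervisor terminates its children and then itself.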
On Monday, May 7, 2012 10:49:58 AM UTC+2, Marc Worrell wrote:
>> The question is: can we do that by just loading different sets of children into the site supervisor?
>> Otherwise we need a more complex mechanism where modules wait for a certain kind of startup signal.
> Dynamic children are a possibility. I was thinking of separate supervisors for each stage. For the installer you need to have the db running, etc., etc. That would also help with a problem I sometimes have: during the shutdown phase of a restart, the db is already gone and some processes can't save their stuff.
So we need a process of site startup/teardown that is more like the *nix startup/shutdown? I don't think the normal OTP supervisors support this. But then, it is not a new problem; someone must have written something for this:
Sometimes problems are really transient. Think of network hiccups, a hard disk that needs to spin up and causes timeouts, or a database that is still booting/recovering/warming up, etc.
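One common way to ride out such transient failures is to retry with exponential backoff instead of failing (or restarting) immediately. A minimal Python sketch follows; which exceptions count as transient, the attempt count and the delays are all assumptions, and `sleep` is injectable so the behavior can be checked without actually waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Exceptions treated as transient: an assumption, adjust to the real client library.
TRANSIENT = (ConnectionError, TimeoutError)


def retry_with_backoff(
    action: Callable[[], T],
    attempts: int = 6,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action`, retrying transient failures with exponential backoff.

    Non-transient exceptions propagate immediately; after `attempts` failed
    tries, the last transient exception is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except TRANSIENT:
            if attempt == attempts:
                raise
            sleep(delay)
            # Double the wait each time, but never beyond max_delay.
            delay = min(delay * 2, max_delay)
    raise AssertionError("unreachable")
```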
Right. According to the OTP manual, children are terminated in reverse start order before the supervisor terminates itself. If the db process is started in stage 1 (in the install_sup) and the system_sup processes in stage 3, the site processes are terminated before the db process. That should leave them with a functional db. Hmmz, why doesn't it work like that right now?
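The shutdown order described above can be modeled in a few lines. In this hypothetical Python model, supervisors start their children in list order and stop them in reverse; `install_sup` and `system_sup` come from the discussion, while `site_sup`, `site_a` and `site_b` are made-up names and stage 2 is left out.

```python
from typing import List


class Supervisor:
    """Tiny model of an OTP-style supervisor: start children in order,
    stop them in reverse start order. Children are names or supervisors."""

    def __init__(self, name: str, children: list, log: List[str]) -> None:
        self.name = name
        self.children = children
        self.log = log

    def start(self) -> None:
        for child in self.children:
            if isinstance(child, Supervisor):
                child.start()
            else:
                self.log.append(f"start {child}")

    def stop(self) -> None:
        # Reverse start order: whatever started last is stopped first.
        for child in reversed(self.children):
            if isinstance(child, Supervisor):
                child.stop()
            else:
                self.log.append(f"stop {child}")


log: List[str] = []
install_sup = Supervisor("install_sup", ["db"], log)               # stage 1
system_sup = Supervisor("system_sup", ["site_a", "site_b"], log)   # stage 3
site_sup = Supervisor("site_sup", [install_sup, system_sup], log)
site_sup.start()
site_sup.stop()
print(log)
# ['start db', 'start site_a', 'start site_b', 'stop site_b', 'stop site_a', 'stop db']
```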
Oh yeah, forgot that. Currently Zotonic sometimes has a hard time recovering from a Postgres restart caused by the dreaded OOM killer. Postgres can be busy with the transaction log for quite some time after that.
I guess because they are terminated, but not _requested_ to terminate. So I think they are killed in the correct order, but they don't get a chance to do anything to clean up...
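The guess is consistent with how OTP shutdown works: when a supervisor shuts a child down, a `gen_server` only gets its `terminate/2` callback if it traps exits (`process_flag(trap_exit, true)`) and its child spec uses a shutdown timeout rather than `brutal_kill`; otherwise the exit signal simply kills it. Whether that is the cause in Zotonic would still need checking. The Python model below (hypothetical names, not Erlang) shows the effect with the db stopped last: the worker that handles the shutdown request saves its state, the one that is killed does not.

```python
from typing import List


class Db:
    def __init__(self) -> None:
        self.up = True
        self.saved: List[str] = []

    def save(self, item: str) -> None:
        if not self.up:
            raise ConnectionError("db is already gone")
        self.saved.append(item)


class Worker:
    def __init__(self, name: str, db: Db, traps_exit: bool) -> None:
        self.name = name
        self.db = db
        # Models Erlang's process_flag(trap_exit, true): only then is the
        # shutdown delivered as a request the process can act on.
        self.traps_exit = traps_exit

    def terminate(self) -> None:
        # Cleanup hook, comparable to gen_server's terminate/2.
        self.db.save(f"{self.name} state")

    def shutdown(self) -> None:
        if self.traps_exit:
            self.terminate()  # requested: gets a chance to clean up
        # otherwise: killed outright, nothing is saved


def stop_site(workers: List[Worker], db: Db) -> None:
    # Workers go first (reverse start order), the db last.
    for worker in reversed(workers):
        worker.shutdown()
    db.up = False


db = Db()
stop_site([Worker("a", db, traps_exit=True), Worker("b", db, traps_exit=False)], db)
print(db.saved)  # ['a state']
```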