Hi Rocco,
Many thanks for your quick answer! Unfortunately, the solution you provided only works partially. There are still some cases where the "fork bomb" message appears :(
Here is one such case: under some configurations our product starts an nginx instance. It writes the configuration file and then starts nginx pointing at that file. BUT if the configuration file cannot be written (the directory does not exist, etc.), an error is raised, and I have not found any way to handle it:
DEBUG - Created nginx temporary directory /opt/tmp/pull/instance1
DEBUG - Created nginx configuration directory /opt/etc/pull/instance1
DEBUG - Created nginx log directory /opt/log/pull/instance1
DEBUG - creating nginx configfile for instance 1 in /opt/etc/pull/instance1
=== 13991 === !!! Kernel has 1 child process(es).
=== 13991 === !!! At least one child process is still running when POE::Kernel->run() is ready to return.
=== 13991 === !!! Be sure to use sig_child() to reap child processes.
=== 13991 === !!! In extreme cases, failure to reap child processes has
=== 13991 === !!! resulted in a slow 'fork bomb' that has halted systems.
Could not open file: No such file or directory
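A minimal sketch of one way to keep a failed configuration write from escaping as an uncaught die. The helper name write_config_file() and its (ok, error) return convention are hypothetical, chosen only for illustration; they are not part of POE or of the product described here. The idea is to trap the exception with eval and report it to the caller, so the caller can skip starting nginx and shut down in an orderly way.

```perl
use strict;
use warnings;

# Hypothetical helper: write $content to $path.
# Returns (1, undef) on success, or (0, $error_message) on failure,
# instead of letting die() propagate up into the event loop.
sub write_config_file {
    my ($path, $content) = @_;
    my $ok = eval {
        open(my $fh, '>', $path) or die "Could not open file: $!\n";
        print {$fh} $content     or die "Could not write file: $!\n";
        close($fh)               or die "Could not close file: $!\n";
        1;
    };
    return (1, undef) if $ok;

    my $error = $@ || "unknown error\n";
    chomp $error;
    return (0, $error);
}
```

A caller would check the first return value and only start nginx when it is true; on failure it can log the message and stop any child it has already started before exiting.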
I've added a DIE handler in the main session to try to handle this:
$sig_session = POE::Session->create(
    inline_states => {
        _start => sub {
            $_[HEAP]{RELOADED} = 0;
            $_[KERNEL]->sig(TERM => '_sigterm');
            $_[KERNEL]->sig(INT => '_sigterm');
            # Route exceptions to the _sigdie state defined below.
            $_[KERNEL]->sig(DIE => '_sigdie');
            $_[KERNEL]->sig(nginx_reload => '_sig_nginx_reload');
            $_[KERNEL]->alias_set('sighandler');
        },
        _sigdie => sub {
            print "Handling exception, calling stop";
            POE::Kernel->call($sig_session, '_stop');
        },
        _stop => sub {
            # Reap any existing pid (# 1825119)
            print "Handling stop";
            POE::Kernel->sig_child();
            use POSIX ":sys_wait_h";
            # waitpid() takes the PID first (-1 = any child), then the flags.
            1 while waitpid(-1, WNOHANG) > 0;
            # Clear signal handlers...
            $_[KERNEL]->sig('TERM');
But, as I said above, it's not working. Checking POE's code, I can see that the message lines are generated in Resources/Signals.pm, in the _data_sig_finalize() method (where POE already does what you recommended, waiting for the pid).
But _data_sig_finalize() is called in Kernel.pm just after all the signals have been unregistered (Kernel.pm => _finalize_kernel):
    my $self = shift;
    # Disable signal watching since there's now no place for them to go.
    foreach ($self->_data_sig_get_safe_signals()) {
        $self->loop_ignore_signal($_);
    }
    # Remove the kernel session's signal watcher.
    $self->_data_sig_remove($self->ID, "IDLE");
    # The main loop is done, no matter which event library ran it.
    # sig before loop so that it clears the signal_pipe file handler
    $self->_data_sig_finalize();
    $self->loop_finalize();
Once execution reaches this point, none of the signal handlers in my main session can run, because the signals have already been unregistered. So how can I handle an exception (die) raised while POE::Kernel->run() is running?
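For reference, a minimal sketch of the non-blocking reaping loop discussed in this thread, written in plain Perl with no POE involved. The function name reap_exited_children() is hypothetical. Note that Perl's waitpid() takes the PID first (-1 means "any child") and the flags second, and that with WNOHANG it only collects children that have already exited. A child that is still running, such as a live nginx process, would presumably have to be signalled and given time to exit before this loop can collect it.

```perl
use strict;
use warnings;
use POSIX ":sys_wait_h";

# Hypothetical helper: reap every child process that has already exited,
# without blocking. Returns the list of PIDs that were collected.
sub reap_exited_children {
    my @reaped;
    # waitpid(-1, WNOHANG) returns a PID when a child was reaped,
    # 0 when children exist but none has exited yet,
    # and -1 when there are no children left.
    while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
        push @reaped, $pid;
    }
    return @reaped;
}
```

Calling this before POE::Kernel->run() returns (or right after wrapping run() in an eval) is one place it could go, but whether that is early enough to silence the warning depends on when the leftover child actually exits.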
Thanks a lot
Alberto
---- On Mon, 24 Mar 2014 13:45:45 +0100 Rocco Caputo wrote ----
>Hi, Alberto.
>
>At program end time, POE runs a quick waitpid() check for child processes that may have leaked. This check was added after a bug report where POE locked up a server after several days of running. It turned out to be the reporter's application, but it was hard to debug.
>
>Your program seems to have created two processes that it didn't reap: PIDs 5373 and 5374. The ideal solution is to reap those processes before exiting. Your program can do this using POE::Kernel's sig_child() method.
>
>In some cases, a third-party library will create processes and not properly clean them up. It can be impossible to solve this case without modifying other people's code.
>
>If you just want to ignore the problem, this might do the trick. Put these lines in your last _stop handler. They should reap the processes you've leaked before POE's check:
>
>use POSIX ":sys_wait_h";
>1 while waitpid(-1, WNOHANG) > 0;
>
>It's a bit of a pain, but I think it's better to explicitly ignore the problem than for it to go unnoticed by default.
>
>Please let me know whether that resolves your problem. It may not. For example, the processes may still be open until an object is destroyed at global destruction time.
>
>--
>Rocco Caputo
>