
Slow fork bomb message in latest version of POE


albertocurro

Mar 24, 2014, 5:46:26 AM
to p...@perl.org
Guys,

We have a product built on POE as its base framework, along with some other libraries such as log4perl. It is basically a forward proxy composed of several modules, each of which is a POE::Session; all of them share an internal queue of tasks to be performed. Each module performs several tasks on initialization, and if anything goes wrong, croak() is called to stop the service. We consider this acceptable, since croak() is only called during initialization, while validation is being performed.

The product is stable and works fine, but I recently updated POE to the latest version, and since then we see this message in the logs:

registering pdu failed: 263!
=== 5267 === 5 -> on_handle (from Handler/StoreRemote.pm at 87)
=== 5267 === 5 -> on_retry (from Handler/StoreRemote.pm at 141)
=== 5267 === 9 -> on_handle (from Handler/StoreRemote.pm at 87)
=== 5267 === 9 -> on_retry (from Handler/StoreRemote.pm at 141)
=== 5267 === !!! Kernel has child processes.
=== 5267 === !!! Stopped child process (PID 5373) reaped when POE::Kernel->run() is ready to return.
=== 5267 === !!! Stopped child process (PID 5374) reaped when POE::Kernel->run() is ready to return.
=== 5267 === !!! At least one child process is still running when POE::Kernel->run() is ready to return.
=== 5267 === !!! Be sure to use sig_child() to reap child processes.
=== 5267 === !!! In extreme cases, failure to reap child processes has
=== 5267 === !!! resulted in a slow 'fork bomb' that has halted systems.
mkdir /mnt/nfs99: Permission denied at Handler/Store.pm line 147

The first lines and the last line above are the errors themselves, but this part is new since the upgrade:

=== 5267 === !!! Kernel has child processes.
=== 5267 === !!! Stopped child process (PID 5373) reaped when POE::Kernel->run() is ready to return.
=== 5267 === !!! Stopped child process (PID 5374) reaped when POE::Kernel->run() is ready to return.
=== 5267 === !!! At least one child process is still running when POE::Kernel->run() is ready to return.
=== 5267 === !!! Be sure to use sig_child() to reap child processes.
=== 5267 === !!! In extreme cases, failure to reap child processes has
=== 5267 === !!! resulted in a slow 'fork bomb' that has halted systems.

I see it every time the service is stopped because of an unhandled condition, even when POE's event loop has already been running for hours. It did not appear before the upgrade, and I can't get rid of it. I've tried several ways to avoid it, with no luck.

Any advice or alternative approach on this?

Many thanks
Alberto

Rocco Caputo

Mar 24, 2014, 8:45:45 AM
to albertocurro, p...@perl.org
Hi, Alberto.

At program end time, POE runs a quick waitpid() check for child processes that may have leaked. This check was added after a bug report where POE locked up a server after several days of running. It turned out to be the reporter's application, but it was hard to debug.

Your program seems to have created two processes that it didn't reap: PIDs 5373 and 5374. The ideal solution is to reap those processes before exiting. Your program can do this using POE::Kernel's sig_child() method.

In some cases, a third-party library will create processes and not properly clean them up. It can be impossible to solve this case without modifying other people's code.

If you just want to ignore the problem, this might do the trick. Put these lines in your last _stop handler. They should reap the processes you've leaked before POE's check:

use POSIX ":sys_wait_h";
1 while waitpid(-1, WNOHANG) > 0;

It's a bit of a pain, but I think it's better to explicitly ignore the problem than for it to go unnoticed by default.

Please let me know whether that resolves your problem. It may not. For example, the processes may still be open until an object is destroyed at global destruction time.
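
As a rough sketch of the reaping loop described above, using only core Perl and no POE: waitpid() takes the PID first and the flags second. With WNOHANG it returns the PID it reaped, 0 if children exist but none has exited yet, and -1 when there are no children left. The helper name reap_exited_children is made up for illustration.

```perl
use strict;
use warnings;
use POSIX ":sys_wait_h";

# Reap every child that has already exited, without blocking.
# Returns a list of (pid => exit code) pairs for the children reaped.
# The name reap_exited_children is hypothetical, not part of POE.
sub reap_exited_children {
    my %reaped;
    # -1 means "any child"; WNOHANG means "do not wait for running ones".
    while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
        $reaped{$pid} = $? >> 8;    # high byte of $? is the exit code
    }
    return %reaped;
}
```

A loop like this only collects children that have already exited. A child that is still running at that moment stays unreaped, which is one way the warning can still appear.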

--
Rocco Caputo <rca...@pobox.com>

albertocurro

Mar 24, 2014, 11:44:36 AM
to Rocco Caputo, p...@perl.org
Hi Rocco,

Many thanks for your quick answer! Unfortunately, the suggested solution only partially works. There are still some cases where the "fork bomb" message appears :(

One such case: under some configurations an nginx instance is started, so our product writes a configuration file and starts nginx pointing at it. But if the configuration file cannot be written (the directory does not exist, etc.), an error is raised, and I have not found any way to handle it:

DEBUG - Created nginx temporary directory /opt/tmp/pull/instance1
DEBUG - Created nginx configuration directory /opt/etc/pull/instance1
DEBUG - Created nginx log directory /opt/log/pull/instance1
DEBUG - creating nginx configfile for instance 1 in /opt/etc/pull/instance1
=== 13991 === !!! Kernel has 1 child process(es).
=== 13991 === !!! At least one child process is still running when POE::Kernel->run() is ready to return.
=== 13991 === !!! Be sure to use sig_child() to reap child processes.
=== 13991 === !!! In extreme cases, failure to reap child processes has
=== 13991 === !!! resulted in a slow 'fork bomb' that has halted systems.
Could not open file: No such file or directory

I've added a DIE handler in the main session to try to handle this:

$sig_session = POE::Session->create(
    inline_states => {
        _start => sub {
            $_[HEAP]{RELOADED} = 0;
            $_[KERNEL]->sig(TERM => '_sigterm');
            $_[KERNEL]->sig(INT => '_sigterm');
            $_[KERNEL]->sig(DIE => '_sigterm');
            $_[KERNEL]->sig(nginx_reload => '_sig_nginx_reload');
            $_[KERNEL]->alias_set('sighandler');
        },
        _sigdie => sub {
            print "Handling exception, calling stop";
            POE::Kernel->call($sig_session, '_stop');
        },
        _stop => sub {
            # Reap any existing pid (# 1825119)
            print "Handling stop";
            POE::Kernel->sig_child();
            use POSIX ":sys_wait_h";
            1 while waitpid(-1, WNOHANG) > 0;

            # Clear signal handlers...
            $_[KERNEL]->sig('TERM');

But, as I said above, it's not working. Checking POE's code, I can see that the message lines are generated in Resources/Signals.pm, in the _data_sig_finalize() method (where POE already does what you recommended, waiting for the PIDs).

But _data_sig_finalize() is called in Kernel.pm just after all the signals have been unregistered (Kernel.pm => _finalize_kernel):

my $self = shift;

# Disable signal watching since there's now no place for them to go.
foreach ($self->_data_sig_get_safe_signals()) {
    $self->loop_ignore_signal($_);
}

# Remove the kernel session's signal watcher.
$self->_data_sig_remove($self->ID, "IDLE");

# The main loop is done, no matter which event library ran it.
# sig before loop so that it clears the signal_pipe file handler
$self->_data_sig_finalize();
$self->loop_finalize();

At that point, none of the signal handlers in my main session would work, since the signals have already been unregistered. So how can I handle an exception (die) raised while POE::Kernel->run() is running?

Thanks a lot
Alberto

albertocurro

Mar 24, 2014, 11:48:11 AM
to albertocurro, Rocco Caputo, p...@perl.org
Hi again,

Sorry! There's a mistake in the code in my previous message: the DIE signal is shown linked to _sigterm, while in the real code it points to _sigdie. Just clarifying before someone says "it can't work, you are pointing to the wrong method!" :D

Alberto

Rocco Caputo

Mar 24, 2014, 11:59:49 AM
to albertocurro, p...@perl.org
You are not using sig_child() as intended. When used as intended, sig_child() will prevent shutdown until the child process has exited and has been reaped. The timing issues you're worried about should not exist.

--
Rocco Caputo <rca...@pobox.com>

albertocurro

Mar 24, 2014, 12:15:22 PM
to p...@perl.org

Hi,

Sorry, but I don't quite understand what you mean by "not using sig_child() as intended". Do you mean calling it from the main session so that each child process is closed properly?

The issue I have is how to handle unexpected exceptions. They seem to be thrown without any control, killing POE's kernel along the way. I could be thinking about the timing the wrong way, though...

Alberto

Rocco Caputo

Mar 24, 2014, 1:30:39 PM
to p...@perl.org, albertocurro
Hi again.

What I mean is that I don't think you know what sig_child() does exactly, or how to use it. I base this impression on two things: First, you're calling sig_child() from a place where it will never work and at a time that is obviously too late to do anything. Second, it needs at least two parameters to work, but you're passing it nothing.
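
A minimal sketch of the two-argument form, assuming POE is installed: sig_child() is given the PID to watch and the name of the event to deliver once that PID has exited and been reaped. It is called right after the child is created, not from _stop. The child here is a stand-in Perl one-liner rather than nginx, and the event name child_exited and the variable %exit_code_for are made up for illustration.

```perl
use strict;
use warnings;
use POSIX ();
use POE;

my %exit_code_for;    # pid => exit code, filled in by the handler

POE::Session->create(
    inline_states => {
        _start => sub {
            my $kernel = $_[KERNEL];

            my $pid = fork();
            die "fork failed: $!" unless defined $pid;
            if ($pid == 0) {
                # Child: a stand-in for the real nginx process.
                exec($^X, '-e', 'sleep 1; exit 7') or POSIX::_exit(127);
            }

            # Two arguments: the PID to watch, and the event to deliver
            # when that PID has exited and been reaped.
            $kernel->sig_child($pid, 'child_exited');
        },
        child_exited => sub {
            my ($pid, $status) = @_[ARG1, ARG2];    # ARG2 is $? for the child
            $exit_code_for{$pid} = $status >> 8;
            print "child $pid exited with code $exit_code_for{$pid}\n";
        },
    },
);

# Returns only after the watched child has been reaped.
POE::Kernel->run();
```

Because the watcher is registered while the session is still active, the session stays alive until the child has been reaped, so nothing is left over for the end-of-run check to complain about.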

I recommend not using SIGDIE for common exception handling. Its scope is too broad, and your code will get ugly. It's probably cleaner to use eval{} or Try::Tiny to convert your unexpected exceptions into expected ones. If you catch them explicitly, then POE won't need to raise them, and there should be less strange behavior.
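
A small sketch of the eval{} approach in core Perl, using the failing configuration-file write as the example. The helper names write_config and try_write_config are hypothetical, and the real code presumably looks different. The idea is that the die is caught where it happens and turned into an ordinary return value, so the caller can shut down in an orderly way.

```perl
use strict;
use warnings;

# Hypothetical helper: writes a config file and dies on failure.
sub write_config {
    my ($path, $text) = @_;
    open(my $fh, '>', $path) or die "Could not open file $path: $!\n";
    print {$fh} $text;
    close($fh) or die "Could not close file $path: $!\n";
    return 1;
}

# Wraps the risky call in eval{} and reports failure as a value.
# Returns (1, undef) on success or (0, $error_message) on failure.
sub try_write_config {
    my ($path, $text) = @_;
    # The trailing "1" makes the eval return true only when nothing died.
    my $ok = eval { write_config($path, $text); 1 };
    return (1, undef) if $ok;
    my $error = $@ || 'unknown error';
    chomp $error;
    return (0, $error);
}
```

A session handler could call try_write_config() and, on failure, log the message and skip starting the child process, instead of letting the exception propagate out of the event handler.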

The problem seems to be migrating. I recommend caution against further clouding the original issue until it's resolved.

If you resolve your exceptions issue, and if you resolve your sig_child() usage issue, then your program should not be interrupted at inopportune times, and it should reap the nginx process before it exits. This should resolve all outstanding issues, as I currently understand them.

--
Rocco Caputo <rca...@pobox.com>