fix to htmlgend zombies

Graham Eddy

Jul 13, 2009, 2:02:57 AM

to WView Google Group

wview 5.5.0.
i only noticed today that i too was getting lots of "[htmlgend] defunct"
zombies, since upgrading from wview 4.0.0.
interestingly, they were slowly getting picked off, but eventually
filled the proctab as more were being created than destroyed

i use both pre-generate.sh and post-generate.sh (for creating timestamps
that get ftp'ed with the generated files, from which a cron job hourly
checks the timestamp, in an attempt to have some sort of end-to-end
automated test in place)

in htmlgenerator/html.c, i changed from existing:
    while (waitpid(-1, NULL, WNOHANG) == 0)
    {
        radUtilsSleep(5);
    }
to
    while (waitpid(-1, NULL, WNOHANG) > 0)
        ;
and i'm not getting zombies any more

my reasoning:
the original code would flush only one zombie per entry into
defaultHandler because that zombie's PID would be returned by waitpid
(i.e. not 0).
and if there weren't any zombies, waitpid would return 0 because of the
shadow child process radlib creates - so it should gently spin around
radUtilsSleep(5) until a child process terminated then flush exactly one
zombie and exit the loop

if this theory is correct, then anyone using just one of pre-generate.sh
| post-generate.sh | emailNotification should see between zero and one
htmlgend zombies at any given time but no more than that, but using two
or more of those would lead to runaway zombies (hmm, michael jackson's
thriller comes to mind..)
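
The difference between the two loops can be reproduced outside wview. What follows is a minimal sketch, not wview code: `spawn_zombies`, `reap_original`, `reap_all_ready` and `short_pause` are hypothetical names, and `short_pause` only stands in for radUtilsSleep with an arbitrarily chosen interval. It uses plain POSIX calls (fork, waitid, waitpid, nanosleep).

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Stand-in for radUtilsSleep(); the 5 ms interval is an arbitrary choice. */
static void short_pause(void)
{
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 5 * 1000 * 1000 };
    nanosleep(&ts, NULL);
}

/* Fork n children that exit at once, and block until each one has exited
 * WITHOUT reaping it (WNOWAIT), so each is left behind as a zombie.
 * Returns the number of zombies created. */
static int spawn_zombies(int n)
{
    int made = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        siginfo_t info;
        pid_t pid = fork();

        if (pid == 0)
            _exit(0);               /* child: terminate immediately */
        if (pid < 0)
            break;                  /* fork failed: stop early */
        if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) == 0)
            made++;
    }
    return made;
}

/* Shape of the original loop: it pauses while waitpid returns 0 and leaves
 * as soon as waitpid returns anything else, so it reaps at most one child. */
static int reap_original(void)
{
    pid_t pid;

    while ((pid = waitpid(-1, NULL, WNOHANG)) == 0)
        short_pause();
    return pid > 0 ? 1 : 0;
}

/* Shape of the modified loop: it keeps reaping for as long as waitpid
 * hands back a pid, and never sleeps. */
static int reap_all_ready(void)
{
    int reaped = 0;

    while (waitpid(-1, NULL, WNOHANG) > 0)
        reaped++;
    return reaped;
}
```

With three exited children waiting, a single call to reap_original() collects one of them and returns, while reap_all_ready() then collects the remaining two.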

in htmlgenerator/html.c, i also changed:
    signal (signum, defaultSigHandler);
to
    radProcessSignalCatch (signum, defaultSigHandler);
as it is ambiguous to mix calls to sigaction (called from
radProcessSignalCatch) with calls to signal
i continue to be impressed with how solid & resilient wview is
------------------------------------------------------------------------
*Graham Eddy*

Michael Nausch

Jul 13, 2009, 6:02:03 AM

to wv...@googlegroups.com

Hi Eddy!

Quoting Graham Eddy <graha...@gmail.com>:

> if this theory is correct, then anyone using just one of pre-generate.sh
> | post-generate.sh | emailNotification should see between zero and one
> htmlgend zombies at any given time but no more than that, but using two
> or more of those would lead to runaway zombies

# top

top - 11:53:20 up 12 days, 4:40, 1 user, load average: 1.48, 1.96, 1.94
Tasks: 300 total, 1 running, 291 sleeping, 0 stopped, 8 zombie
Cpu0 : 71.6%us, 17.6%sy, 0.0%ni, 10.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 14.7%us, 2.0%sy, 0.0%ni, 82.4%id, 0.0%wa, 0.0%hi, 1.0%si, 0.0%st
Mem: 3886940k total, 3795004k used, 91936k free, 473964k buffers
Swap: 8193140k total, 192k used, 8192948k free, 2123520k cached

Oh yes, I can sing a song 'bout those zombies. 8 zombies are not very
many; on some days there are hundreds of them. ;(

Restarting wview helps a few days ...

I'm using:
/usr/local/etc/wview/post-generate.sh
and
emailNotification

I'll disable emailNotification and report what happens, to see whether
your theory is right

Bye,
Django
--
"Bonnie & Clyde of the postmaster scene!" approved by Postfix-God

http://wetterstation-pliening.info
http://dokuwiki.nausch.org

Graham Eddy

Jul 13, 2009, 6:07:05 AM

to wv...@googlegroups.com

after i posted i realised i'd forgotten the ftp child...

rather than disabling some facilities, just apply the waitpid()>0 code
change
------------------------------------------------------------------------
*Graham Eddy*

Mark S. Teel

Jul 13, 2009, 8:32:23 AM

to wv...@googlegroups.com

Nice work - thanks - see comments below.

Please try my proposed solution to confirm it works for you.

Mark

Graham Eddy wrote:
> wview 5.5.0.
> i only noticed today that i too was getting lots of "[htmlgend] defunct"
> zombies, since upgrading from wview 4.0.0.
> interestingly, they were slowly getting picked off, but eventually
> filled the proctab as more were being created than destroyed
>
> i use both pre-generate.sh and post-generate.sh (for creating timestamps
> that get ftp'ed with the generated files, from which a cron job hourly
> checks the timestamp, in an attempt to have some sort of end-to-end
> automated test in place)
>
> in htmlgenerator/html.c, i changed from existing:
> while (waitpid(-1, NULL, WNOHANG) == 0)
> {
> radUtilsSleep(5);
> }
> to
> while (waitpid(-1, NULL, WNOHANG) > 0)
> ;
> and i'm not getting zombies any more
>

My approach was trying to mitigate this scenario (the race condition you
originally proposed): that the parent process was calling waitpid BEFORE
the child process was ready to be cleaned up.

Your new approach assumes the opposite: that at least one child is
already ready for cleanup the first time waitpid is called. If none is,
the loop exits at once, which reintroduces your original race condition
as a possibility. But it has the advantage of cleaning up multiple
children (if they are ready in time).

> my reasoning:
> the original code would flush only one zombie per entry into
> defaultHandler because that zombie's PID would be returned by waitpid
> (i.e. not 0).
>

You are correct: my current approach will only clean up one child then
pop out of the while loop.

> and if there weren't any zombies, waitpid would return 0 because of the
> shadow child process radlib creates - so it should gently spin around
> radUtilsSleep(5) until a child process terminated then flush exactly one
> zombie and exit the loop
>

No and yes. The radlib shadow process exists for the htmlgend process
but is not created for the pre and post processes. Further, the reason 0
was returned until a child pid was present is that, with the WNOHANG
option, waitpid returns 0 if no child process has finished yet. The
rest of this remark is correct.
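
The three possible outcomes of a WNOHANG poll can be told apart from the return value alone. A small sketch, assuming nothing about wview or radlib; `poll_children`, `spawn_paused_child`, `release_child` and the enum names are hypothetical, introduced only for illustration:

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef enum {
    NO_CHILDREN,    /* waitpid returned -1 (typically errno == ECHILD) */
    NONE_FINISHED,  /* waitpid returned 0: children exist, none has exited */
    REAPED_ONE      /* waitpid returned the pid of a child it just reaped */
} poll_result;

/* One non-blocking poll for any child of this process. */
static poll_result poll_children(void)
{
    pid_t pid = waitpid(-1, NULL, WNOHANG);

    if (pid > 0)
        return REAPED_ONE;
    if (pid == 0)
        return NONE_FINISHED;
    return NO_CHILDREN;
}

/* Fork a child that blocks on a pipe until the parent closes its end.
 * Returns the write end of the pipe, or -1 on failure. */
static int spawn_paused_child(void)
{
    int fds[2];
    pid_t pid;

    if (pipe(fds) != 0)
        return -1;
    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        char c;
        close(fds[1]);
        /* read() returns 0 (EOF) once the parent closes the write end */
        if (read(fds[0], &c, 1) < 0)
            _exit(1);
        _exit(0);
    }
    close(fds[0]);
    return fds[1];
}

/* Closing the write end gives the paused child EOF, so it exits. */
static void release_child(int fd)
{
    assert(fd >= 0);
    close(fd);
}
```

While the child is still blocked, poll_children() reports NONE_FINISHED; after release_child() it eventually reports REAPED_ONE, and once no children are left it reports NO_CHILDREN.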

> if this theory is correct, then anyone using just one of pre-generate.sh
> | post-generate.sh | emailNotification should see between zero and one
> htmlgend zombies at any given time but no more than that, but using two
> or more of those would lead to runaway zombies (hmm, michael jackson's
> thriller comes to mind..)
>

If someone is just using one of them, there should be no zombies (only
one child process to clean up). Unfortunately, the default configuration
has both scripts (empty), and on some platforms that shows up as the
zombie buildup described here.

Thus, the solution is to do both:

    // Wait until the first child process is ready:
    sleepHappened = FALSE;

    while (waitpid(-1, NULL, WNOHANG) == 0)
    {
        radUtilsSleep(5);
        sleepHappened = TRUE;
    }

    // Make sure at least one sleep occurs
    if (! sleepHappened)
    {
        radUtilsSleep(5);
    }

    // Wait for any remaining processes:
    while (waitpid(-1, NULL, WNOHANG) > 0)
        ;

What I don't like about this is that it guarantees one sleep period,
regardless of whether it is needed. If I don't enforce it, though, we
are back to the initial race condition if the first child is immediately
ready...
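
One possible way around the unconditional sleep, offered only as a sketch and not as what wview does: if the code that forks the helper processes kept a count of how many it started (an assumption; nothing in this thread says htmlgend tracks this), the cleanup could poll until that many children have been reaped and pause only when a poll finds nothing ready. `reap_outstanding`, `spawn_exiting_children` and `short_pause` are hypothetical names, and `short_pause` merely stands in for radUtilsSleep.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Stand-in for radUtilsSleep(); the 5 ms interval is an arbitrary choice. */
static void short_pause(void)
{
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 5 * 1000 * 1000 };
    nanosleep(&ts, NULL);
}

/* Fork n children that exit immediately; returns how many were started. */
static int spawn_exiting_children(int n)
{
    int started = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();

        if (pid == 0)
            _exit(0);
        if (pid > 0)
            started++;
    }
    return started;
}

/* Reap until *outstanding reaches zero or max_pauses pauses have been
 * spent waiting. A pause happens only when a poll finds no child ready,
 * so a child that is ready at once costs no sleep at all.
 * Returns the number of children still outstanding. */
static int reap_outstanding(int *outstanding, int max_pauses)
{
    int pauses = 0;

    assert(outstanding != NULL);
    while (*outstanding > 0 && pauses < max_pauses) {
        pid_t pid = waitpid(-1, NULL, WNOHANG);

        if (pid > 0) {
            (*outstanding)--;       /* reaped one; poll again right away */
        } else if (pid == 0) {
            short_pause();          /* children exist, none ready yet */
            pauses++;
        } else {
            *outstanding = 0;       /* -1: no children left to wait for */
        }
    }
    return *outstanding;
}
```

The bound on pauses keeps the caller from waiting forever on a child that never exits; whether such a bound suits htmlgend, and what value it should have, is not something this thread settles.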

> in htmlgenerator/html.c, i also changed:
> signal (signum, defaultSigHandler);
> to
> radProcessSignalCatch (signum, defaultSigHandler);
> as it is ambiguous to mix calls to sigaction (called from
> radProcessSignalCatch) with calls to signal
>

OK, but the net result is the same. The wview signal handlers started
out prior to the radlib signal utilities, thus another of those legacy
code deals (this one is harmless).

Graham Eddy

Jul 13, 2009, 8:47:33 AM

to wv...@googlegroups.com

my understanding is that using sigaction (rather than signal) avoids the
race condition - if a child terminates while inside the signal handler,
sigaction keeps SIGCHLD blocked and marks it pending, so on exiting the
signal handler, it jumps right back in again.
but because the pending set is only a mask, not a queue, we have to loop
on pending zombies as we don't know how many there are

.. which, translated into english, means zombies are flushed over and
over in batches by defaultHandler until no new arrivals left, with no
race condition

using signal (rather than sigaction), all sorts of convoluted pushups
need to be done because - depending upon version and implementation (and
flags set, etc) - signal might (or might not, the original semantic) set
the sigmask while inside the signal handler, leading to the potential
race condition
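
The sigaction-plus-drain-loop pattern described above can be sketched with plain POSIX calls. This is an illustration only, not radlib's radProcessSignalCatch and not wview's defaultSigHandler; `on_sigchld`, `install_sigchld_handler`, `spawn_exiting_children` and `reaped_total` are hypothetical names.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Count of children reaped so far, updated only inside the handler. */
static volatile sig_atomic_t reaped_total = 0;

/* SIGCHLD handler: one delivery may stand for several exited children,
 * so drain everything that is ready instead of reaping just one. */
static void on_sigchld(int signum)
{
    int saved_errno = errno;    /* waitpid may overwrite errno */

    (void)signum;
    while (waitpid(-1, NULL, WNOHANG) > 0)
        reaped_total++;
    errno = saved_errno;
}

/* Install the handler with sigaction. SIGCHLD stays blocked while the
 * handler runs; a child that exits in that window leaves the signal
 * pending, so the handler is entered again after it returns. */
static int install_sigchld_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   /* restart interrupted system calls */
    return sigaction(SIGCHLD, &sa, NULL);
}

/* Fork n children that exit immediately; returns how many were started. */
static int spawn_exiting_children(int n)
{
    int started = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();

        if (pid == 0)
            _exit(0);
        if (pid > 0)
            started++;
    }
    return started;
}
```

The handler here never sleeps: a child that is not yet ready when one invocation finishes is picked up by the next SIGCHLD delivery.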

*ouch* my brain hurts
------------------------------------------------------------------------
*Graham Eddy*

Graham Eddy

Jul 14, 2009, 8:43:28 AM

to wv...@googlegroups.com

re my proposed fix, after 24 hours, not a zombie in sight.
i would recommend this for next release

we can try more complex approaches if more corner cases appear
------------------------------------------------------------------------
*Graham Eddy*
