os.wait() losing child?

Jason Zheng

Jul 10, 2007, 7:52:58 PM

This may be a silly question, but is it possible for os.wait() to lose track
of child processes? I'm running Python 2.4.4 on Linux kernel 2.6.20
(i686), gcc 4.1.1, and glibc 2.5.

Here's what happened in my situation. I first created a few child
processes with Popen, then in a while(True) loop I wait for any of the
child processes to exit, then restart that child process:

import os
from subprocess import Popen

pids = {}

for i in xrange(3):
    p = Popen('sleep 1', shell=True, cwd='/home/user',
              stdout=file(os.devnull,'w'))
    pids[p.pid] = i

while (True):
    pid = os.wait()
    i = pids[pid]
    del pids[pid]
    print "Child Process %d terminated, restarting" % i
    if (someCondition):
        break
    p = Popen('sleep 1', shell=True, cwd='/home/user',
              stdout=file(os.devnull,'w'))
    pids[p.pid] = i

Soon after I started running this program, I discovered that some of the
processes stopped showing up, and eventually os.wait() raised an
error saying that there were no more child processes to wait on. Can anyone
tell me what I did wrong?

greg

Jul 10, 2007, 8:59:31 PM

Jason Zheng wrote:
> while (True):
>     pid = os.wait()
> ...
>     if (someCondition):
>         break
> ...

Are you sure that someCondition() always becomes true
when the list of pids is empty? If not, you may end
up making more wait() calls than there are children.

It might be safer to do

    while pids:
        ...

--
Greg

Jason Zheng

Jul 10, 2007, 9:05:52 PM

Hate to reply to my own thread, but this is the working program that can
demonstrate what I posted earlier:

import os
from subprocess import Popen

pids = {}
counts = [0,0,0]

for i in xrange(3):
    p = Popen('sleep 1', shell=True, cwd='/home',
              stdout=file(os.devnull,'w'))
    pids[p.pid] = i
    print "Starting child process %d (%d)" % (i,p.pid)

while (True):
    (pid,exitstat) = os.wait()

    i = pids[pid]
    del pids[pid]

    counts[i]=counts[i]+1

    #terminate if count>10
    if (counts[i]==10):
        print "Child Process %d terminated." % i
        if reduce(lambda x,y: x and (y>=10), counts):
            break
        continue

    print "Child Process %d terminated, restarting" % i

    p = Popen('sleep 1', shell=True, cwd='/home',

Jason Zheng

Jul 10, 2007, 9:09:41 PM

greg wrote:
> Jason Zheng wrote:
>> while (True):
>>     pid = os.wait()
>> ...
>>     if (someCondition):
>>         break
>> ...
>
> Are you sure that someCondition() always becomes true
> when the list of pids is empty? If not, you may end
> up making more wait() calls than there are children.
>

Regardless of the nature of someCondition, what I see from the print
output of my Python program is that some child processes never trigger
the unblocking of the os.wait() call.

~Jason

greg

Jul 11, 2007, 4:09:31 AM

Jason Zheng wrote:
> Hate to reply to my own thread, but this is the working program that can
> demonstrate what I posted earlier:

I've figured out what's going on. The Popen class has a
__del__ method which does a non-blocking wait of its own.
So you need to keep the Popen instance for each subprocess
alive until your wait call has cleaned it up.

The following version seems to work okay.

import os
from subprocess import Popen

pids = {}
counts = [0,0,0]

p = [None, None, None]

for i in xrange(3):
    p[i] = Popen('sleep 1', shell=True)
    pids[p[i].pid] = i
    print "Starting child process %d (%d)" % (i,p[i].pid)

while (True):
    (pid,exitstat) = os.wait()
    i = pids[pid]
    del pids[pid]
    counts[i]=counts[i]+1

    #terminate if count>10
    if (counts[i]==10):
        print "Child Process %d terminated." % i
        if reduce(lambda x,y: x and (y>=10), counts):
            break
        continue

    print "Child Process %d (%d) terminated, restarting" % (i, pid),
    p[i] = Popen('sleep 1', shell=True)
    pids[p[i].pid] = i
    print "(%d)" % p[i].pid

--
Greg

Jason Zheng

Jul 11, 2007, 1:43:44 PM

to greg

Greg,

That explains it! Thanks a lot for your help. I guess this is something
they do to prevent zombie processes?

~Jason

Matthew Woodcraft

Jul 11, 2007, 2:25:51 PM

greg <gr...@cosc.canterbury.ac.nz> wrote:
> I've figured out what's going on. The Popen class has a
> __del__ method which does a non-blocking wait of its own.
> So you need to keep the Popen instance for each subprocess
> alive until your wait call has cleaned it up.

I don't think this will be enough for the poster, who has Python 2.4:
in that version, opening a new Popen object would trigger the wait on
all 'outstanding' Popen-managed subprocesses.

It seems to me that subprocess.py assumes that it will do all wait()ing
on its children itself; I'm not sure if it's safe to rely on the
details of how this is currently arranged.

Perhaps a better way would be for subprocess.py to provide its own
variant of os.wait() for people who want 'wait-for-any-child' (though
it would be hard to support programs which also had children not
managed by subprocess.py).

-M-

Jason Zheng

Jul 11, 2007, 2:24:33 PM

greg wrote:
> Jason Zheng wrote:
>> Hate to reply to my own thread, but this is the working program that
>> can demonstrate what I posted earlier:
>
> I've figured out what's going on. The Popen class has a
> __del__ method which does a non-blocking wait of its own.
> So you need to keep the Popen instance for each subprocess
> alive until your wait call has cleaned it up.
>
> The following version seems to work okay.
>

It still doesn't work on my machine. I took a closer look at the Popen
class, and I think the problem is that the __init__ method always calls
a method _cleanup, which polls every existing Popen instance. The poll
method does a nonblocking wait.

If one of my child processes finishes as I create a new Popen instance,
then the _cleanup method effectively de-zombifies the child process, so
I can no longer expect to see that pid returned by os.wait().
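
A minimal sketch of that effect, written for a modern Python 3 on a POSIX system rather than the 2.4 discussed here (the function name is made up for illustration). Once poll() has collected the exit status, a wait-for-any-child call has nothing left to report. os.waitpid(-1, os.WNOHANG) is used as a non-blocking stand-in for os.wait() so the sketch cannot hang:

```python
import os
import subprocess
import sys
import time


def reaped_by_poll_is_invisible_to_wait():
    """Return True if wait-for-any-child finds nothing after poll() reaped it."""
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    # poll() does a non-blocking waitpid() on this child; once it returns an
    # exit status, the kernel has already discarded the zombie entry.
    while child.poll() is None:
        time.sleep(0.01)
    try:
        # Same "any child" semantics as os.wait(), but it never blocks.
        os.waitpid(-1, os.WNOHANG)
    except ChildProcessError:  # errno ECHILD, "No child processes"
        return True
    return False
```

In a process with no other children this should return True, which is the same "No child processes" error the original program eventually hits.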

~Jason

Jason Zheng

Jul 11, 2007, 2:48:40 PM

Thanks, that's exactly what I need, my program really needs the
os.wait() to be reliable. Perhaps I could pass a flag to Popen to tell
it to never os.wait() on the new pid (but it's ok to os.wait() on other
Popen instances upon _cleanup()).

Nick Craig-Wood

Jul 11, 2007, 4:30:03 PM

The problem you are having is you are letting Popen do half the job
and doing the other half yourself.

Here is a way which works, done completely with Popen. Polling the
subprocesses is slightly less efficient than using os.wait() but does
work. In practice you want to do this anyway to see if your children
exceed their time limits etc.

import os
import time
from subprocess import Popen

processes = []
counts = [0,0,0]

for i in xrange(3):
    p = Popen('sleep 1', shell=True, cwd='/home', stdout=file(os.devnull,'w'))
    processes.append(p)
    print "Starting child process %d (%d)" % (i, p.pid)

while (True):
    for i,p in enumerate(processes):
        exitstat = p.poll()
        pid = p.pid
        if exitstat is not None:
            break
    else:
        time.sleep(0.1)
        continue

    counts[i]=counts[i]+1

    #terminate if count>10
    if (counts[i]==10):
        print "Child Process %d terminated." % i
        if reduce(lambda x,y: x and (y>=10), counts):
            break
        continue

    print "Child Process %d terminated, restarting" % i
    processes[i] = Popen('sleep 1', shell=True, cwd='/home', stdout=file(os.devnull,'w'))

--
Nick Craig-Wood <ni...@craig-wood.com> -- http://www.craig-wood.com/nick

Jason Zheng

Jul 11, 2007, 8:41:53 PM

Nick Craig-Wood wrote:
> The problem you are having is you are letting Popen do half the job
> and doing the other half yourself.

Except that I never wanted Popen to do any process management for me to
begin with. The Popen class is advertised as a replacement for
os.popen, popen2, popen4, etc., and IMHO it should leave the
clean-up to the users, or at least leave it as an option.

> Here is a way which works, done completely with Popen. Polling the
> subprocesses is slightly less efficient than using os.wait() but does
> work. In practice you want to do this anyway to see if your children
> exceed their time limits etc.

I think your polling way works; it seems there is no other way around this
problem other than polling or extending the Popen class.

thanks,

Jason

Nick Craig-Wood

Jul 12, 2007, 6:30:04 AM

I think polling is probably the right way of doing it...

Internally, subprocess uses os.waitpid(pid), waiting only for its own
specific pids. IMHO this is the right way of doing it, rather than
os.wait(), which waits for any pid. os.wait() can reap children that
you weren't expecting (say some library uses os.system())...

Hrvoje Niksic

Jul 12, 2007, 8:32:18 AM

Jason Zheng <Xin....@jpl.nasa.gov> writes:

> greg wrote:
>> Jason Zheng wrote:
>>> Hate to reply to my own thread, but this is the working program
>>> that can demonstrate what I posted earlier:
>> I've figured out what's going on. The Popen class has a
>> __del__ method which does a non-blocking wait of its own.
>> So you need to keep the Popen instance for each subprocess
>> alive until your wait call has cleaned it up.
>> The following version seems to work okay.
>>
> It still doesn't work on my machine. I took a closer look at the Popen
> class, and I think the problem is that the __init__ method always
> calls a method _cleanup, which polls every existing Popen
> instance.

Actually, it's not that bad. _cleanup only polls the instances that
are no longer referenced by user code, but still running. If you hang
on to Popen instances, they won't be added to _active, and __init__
won't reap them (_active is only populated from Popen.__del__).

This version is a trivial modification of your code to that effect.
Does it work for you?

#!/usr/bin/python

import os
from subprocess import Popen

pids = {}
counts = [0,0,0]

for i in xrange(3):
    p = Popen('sleep 1', shell=True, cwd='/home', stdout=file(os.devnull,'w'))
    pids[p.pid] = p, i
    print "Starting child process %d (%d)" % (i,p.pid)

while (True):
    pid, ignored = os.wait()
    try:
        p, i = pids[pid]
    except KeyError:
        # not one of ours
        continue
    del pids[pid]
    counts[i] += 1

    #terminate if count>10
    if (counts[i]==10):
        print "Child Process %d terminated." % i
        if reduce(lambda x,y: x and (y>=10), counts):
            break
        continue

    print "Child Process %d terminated, restarting" % i

    p = Popen('sleep 1', shell=True, cwd='/home', stdout=file(os.devnull,'w'))
    pids[p.pid] = p, i

Hrvoje Niksic

Jul 12, 2007, 8:29:32 AM

Nick Craig-Wood <ni...@craig-wood.com> writes:

>> I think your polling way works; it seems there is no other way around this
>> problem other than polling or extending the Popen class.
>
> I think polling is probably the right way of doing it...

It requires the program to wake up every 0.1s to poll for freshly
exited subprocesses. That doesn't consume excess CPU cycles, but it
does prevent the kernel from swapping it out when there is nothing to
do. Sleeping in os.wait allows the operating system to know exactly
what the process is waiting for, and to move it out of the way until
those conditions are met. (Pedants would also notice that polling
introduces on average 0.1/2 seconds delay between the subprocess dying
and the parent reaping it.)

In general, a program that waits for something should do so in a
single call to the OS. OP's usage of os.wait was exactly correct.

Fortunately the problem can be worked around by hanging on to Popen
instances until they are reaped. If all of them are kept referenced
when os.wait is called, they will never end up in the _active list
because the list is only populated in Popen.__del__.
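
For anyone trying this on a current interpreter, here is a rough Python 3 rendering of the same workaround, assuming a POSIX system and assuming subprocess still only puts unreferenced instances into _active (the names supervise, slots and runs_per_slot are illustrative, not from the thread's code). Every Popen object stays in a dict until os.wait() has reported its pid:

```python
import os
import subprocess
import sys


def supervise(slots=3, runs_per_slot=3):
    """Restart each slot's child until it has run runs_per_slot times."""
    counts = [0] * slots
    live = {}  # pid -> (Popen object, slot); holding the object is the point

    def launch(slot):
        proc = subprocess.Popen([sys.executable, "-c", "pass"])
        live[proc.pid] = (proc, slot)

    for slot in range(slots):
        launch(slot)

    while live:  # stop waiting once no child of ours is left
        pid, status = os.wait()  # one blocking call, woken by any child exit
        if pid not in live:
            continue  # not one of ours
        proc, slot = live.pop(pid)
        # Record the result on the Popen object ourselves, so it has no
        # reason to try to reap the (already reaped) child later.
        if os.WIFEXITED(status):
            proc.returncode = os.WEXITSTATUS(status)
        else:
            proc.returncode = -os.WTERMSIG(status)
        counts[slot] += 1
        if counts[slot] < runs_per_slot:
            launch(slot)
    return counts
```

Calling supervise(3, 3) should return [3, 3, 3]. The loop condition is greg's "while pids:" idea, so os.wait() is never called with no children outstanding.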

> Internally, subprocess uses os.waitpid(pid), waiting only for its own
> specific pids. IMHO this is the right way of doing it, rather than
> os.wait(), which waits for any pid. os.wait() can reap children
> that you weren't expecting (say some library uses os.system())...

system calls waitpid immediately after the fork. This can still be a
problem for applications that call wait in a dedicated thread, but the
program can always ignore the processes it doesn't know anything
about.

Jason Zheng

Jul 12, 2007, 12:27:18 PM

Hrvoje Niksic wrote:

>> greg wrote:
>
> Actually, it's not that bad. _cleanup only polls the instances that
> are no longer referenced by user code, but still running. If you hang
> on to Popen instances, they won't be added to _active, and __init__
> won't reap them (_active is only populated from Popen.__del__).
>

Perhaps that's the difference between Python 2.4 and 2.5. In 2.4,
Popen's __init__ always appends self to _active:

def __init__(...):
    _cleanup()
    ...
    self._execute_child(...)
    ...
    _active.append(self)

> This version is a trivial modification of your code to that effect.
> Does it work for you?
>

Nope it still doesn't work. I'm running python 2.4.4, tho.

$ python test.py
Starting child process 0 (26497)
Starting child process 1 (26498)
Starting child process 2 (26499)
Child Process 2 terminated, restarting
Child Process 2 terminated, restarting
Child Process 2 terminated, restarting
Child Process 2 terminated, restarting
Child Process 2 terminated, restarting
Child Process 2 terminated, restarting
Child Process 2 terminated, restarting
Child Process 2 terminated, restarting
Child Process 2 terminated, restarting
Child Process 2 terminated.
Traceback (most recent call last):
  File "test.py", line 15, in ?
    pid, ignored = os.wait()
OSError: [Errno 10] No child processes

Nick Craig-Wood

Jul 12, 2007, 2:30:04 PM

Hrvoje Niksic <hni...@xemacs.org> wrote:
> Nick Craig-Wood <ni...@craig-wood.com> writes:
>
> >> I think your polling way works; it seems there is no other way around this
> >> problem other than polling or extending the Popen class.
> >
> > I think polling is probably the right way of doing it...
>
> It requires the program to wake up every 0.1s to poll for freshly
> exited subprocesses. That doesn't consume excess CPU cycles, but it
> does prevent the kernel from swapping it out when there is nothing to
> do. Sleeping in os.wait allows the operating system to know exactly
> what the process is waiting for, and to move it out of the way until
> those conditions are met. (Pedants would also notice that polling
> introduces on average 0.1/2 seconds delay between the subprocess dying
> and the parent reaping it.)

Sure!

You could get rid of this by sleeping until a SIGCHLD arrived maybe.

> In general, a program that waits for something should do so in a
> single call to the OS. OP's usage of os.wait was exactly correct.

Disagree for the reason below.

> > Internally, subprocess uses os.waitpid(pid), waiting only for its own
> > specific pids. IMHO this is the right way of doing it, rather than
> > os.wait(), which waits for any pid. os.wait() can reap children
> > that you weren't expecting (say some library uses os.system())...
>
> system calls waitpid immediately after the fork.

os.system probably wasn't the best example, but you take my point I
think!

> This can still be a problem for applications that call wait in a
> dedicated thread, but the program can always ignore the processes
> it doesn't know anything about.

Ignoring them isn't good enough because it means that the bit of code
which was waiting for that process to die with os.getpid() will never
get called, causing a deadlock in that bit of code.

What is really required is a select() like interface to wait which
takes more than one pid. I don't think there is such a thing though,
so polling is your next best option.
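
One way to approximate that without a timer is the SIGCHLD idea mentioned above: sleep until the signal arrives, then check only the pids you started. The following is a sketch under assumptions, not a tested recipe for the 2.4 setup in this thread: it needs Python 3.3+ on POSIX (for signal.pthread_sigmask and signal.sigwait), it must run in the main thread, and the function name is hypothetical.

```python
import signal
import subprocess
import sys


def reap_on_sigchld(n_children=3):
    """Sleep on SIGCHLD, then reap only the children started here."""
    # A do-nothing handler so SIGCHLD is not treated as ignored, and a
    # blocked mask so the signal stays pending until sigwait() collects it.
    old_handler = signal.signal(signal.SIGCHLD, lambda signum, frame: None)
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
    try:
        children = {}
        for _ in range(n_children):
            proc = subprocess.Popen([sys.executable, "-c", "pass"])
            children[proc.pid] = proc

        exit_codes = {}
        while children:
            signal.sigwait({signal.SIGCHLD})  # sleeps in the kernel
            # One pending signal may stand for several exits, so look at
            # every pid we own; poll() is a non-blocking waitpid(pid).
            for pid, proc in list(children.items()):
                code = proc.poll()
                if code is not None:
                    exit_codes[pid] = code
                    del children[pid]
    finally:
        # Put the process-wide signal state back the way it was.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
        if old_handler is not None:
            signal.signal(signal.SIGCHLD, old_handler)
    return exit_codes
```

Because only known pids are passed to waitpid, children started by other code are left for that code to reap. The per-wakeup scan is still O(n) in the number of children, but it only happens when something has actually exited.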

Matthew Woodcraft

Jul 12, 2007, 3:42:41 PM

Jason Zheng <Xin....@jpl.nasa.gov> wrote:

>Hrvoje Niksic wrote:
>> Actually, it's not that bad. _cleanup only polls the instances that
>> are no longer referenced by user code, but still running. If you hang
>> on to Popen instances, they won't be added to _active, and __init__
>> won't reap them (_active is only populated from Popen.__del__).

> Perhaps that's the difference between Python 2.4 and 2.5. In 2.4,
> Popen's __init__ always appends self to _active:

Yes, that changed between 2.4 and 2.5.

Note that if you take a copy of 2.5's subprocess.py, it ought to work
fine with 2.4.

-M-

Jason Zheng

Jul 12, 2007, 4:03:32 PM

Nick Craig-Wood wrote:
> Sure!
>
> You could get rid of this by sleeping until a SIGCHLD arrived maybe.

Yah, I could also just dump Popen class and use fork(). But then what's
the point of having an abstraction layer any more?

>> This can still be a problem for applications that call wait in a
>> dedicated thread, but the program can always ignore the processes
>> it doesn't know anything about.
>
> Ignoring them isn't good enough because it means that the bit of code
> which was waiting for that process to die with os.getpid() will never
> get called, causing a deadlock in that bit of code.
>

Are you talking about something like os.waitpid(os.getpid())? If the
process has completed and been de-zombified by another os.wait() call, I
thought it would just throw an exception; it won't cause a deadlock by
hanging the process.

~Jason

Hrvoje Niksic

Jul 12, 2007, 4:39:41 PM

Nick Craig-Wood <ni...@craig-wood.com> writes:

>> This can still be a problem for applications that call wait in a
>> dedicated thread, but the program can always ignore the processes
>> it doesn't know anything about.
>
> Ignoring them isn't good enough because it means that the bit of
> code which was waiting for that process to die with os.getpid() will
> never get called, causing a deadlock in that bit of code.

It won't deadlock, it will get an ECHILD or equivalent error because
it's waiting for a PID that doesn't correspond to a running child
process. I agree that this can be a problem if and when you use
libraries that can call system. (In that case sleeping for SIGCHLD is
probably a good solution.)
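
A small sketch of the ECHILD behaviour, assuming Python 3 on POSIX (the function name is only for illustration). The first os.waitpid stands in for an unrelated os.wait() loop that gets to the child first; the second is the code that was waiting on that specific pid:

```python
import errno
import os
import subprocess
import sys


def waitpid_after_reap():
    """Return the errno raised when waiting on a pid that was already reaped."""
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    # Stand-in for some other part of the program reaping this child first.
    os.waitpid(child.pid, 0)
    child.returncode = 0  # note the exit ourselves so Popen will not re-wait
    try:
        os.waitpid(child.pid, 0)  # the code that wanted this particular pid
    except ChildProcessError as exc:
        return exc.errno  # fails immediately instead of blocking
    return None
```

The second call fails with errno.ECHILD straight away, so the waiting code sees an error it can handle rather than hanging.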

> What is really required is a select() like interface to wait which
> takes more than one pid. I don't think there is such a thing
> though, so polling is your next best option.

Except for the problems outlined in my previous message. And the fact
that polling becomes very expensive (O(n) per check) once the number
of processes becomes large. Unless one knows that a library can and
does call system, wait is the preferred solution.

Hrvoje Niksic

Jul 13, 2007, 2:52:21 AM

Jason Zheng <Xin....@jpl.nasa.gov> writes:

> Hrvoje Niksic wrote:
>>> greg wrote:
>> Actually, it's not that bad. _cleanup only polls the instances that
>> are no longer referenced by user code, but still running. If you hang
>> on to Popen instances, they won't be added to _active, and __init__
>> won't reap them (_active is only populated from Popen.__del__).
>>
>
> Perhaps that's the difference between Python 2.4 and 2.5.

[...]

> Nope it still doesn't work. I'm running python 2.4.4, tho.

That explains it, then, and also why greg's code didn't work. You
still have the option to try to run 2.5's subprocess.py under 2.4.

Jason Zheng

Jul 13, 2007, 10:40:32 AM

Is it more convenient to just inherit the Popen class? I'm concerned
about portability of my code. It will be run on multiple machines with
mixed Python 2.4 and 2.5 environments.

Hrvoje Niksic

Jul 13, 2007, 12:57:14 PM

Jason Zheng <Xin....@jpl.nasa.gov> writes:

>>> Nope it still doesn't work. I'm running python 2.4.4, tho.
>> That explains it, then, and also why greg's code didn't work. You
>> still have the option to try to run 2.5's subprocess.py under 2.4.
> Is it more convenient to just inherit the Popen class?

You'd still need to change its behavior to not call _cleanup. For
example, by removing "your" instances from subprocess._active before
chaining up to Popen.__init__.

> I'm concerned about portability of my code. It will be run on
> multiple machines with mixed Python 2.4 and 2.5 environments.

I don't think there is a really clean way to handle this.

Jason Zheng

Jul 13, 2007, 7:52:09 PM

Hrvoje Niksic wrote:
> Jason Zheng <Xin....@jpl.nasa.gov> writes:

>> I'm concerned about portability of my code. It will be run on
>> multiple machines with mixed Python 2.4 and 2.5 environments.
>
> I don't think there is a really clean way to handle this.

I think the following might just work, albeit not "clean":

#!/usr/bin/python

import os,subprocess
from subprocess import Popen

pids = {}
counts = [0,0,0]

def launch(i):
    p = Popen('sleep 1', shell=True, cwd='/home',
              stdout=file(os.devnull,'w'))
    pids[p.pid] = p, i

    if p in subprocess._active:
        subprocess._active.remove(p)

    print "Starting child process %d (%d)" % (i,p.pid)

for i in xrange(3):
    launch(i)

while (True):
    pid, ignored = os.wait()
    try:
        p, i = pids[pid]
    except KeyError:
        # not one of ours
        continue
    del pids[pid]
    counts[i] += 1

    #terminate if count>10
    if (counts[i]==10):
        print "Child Process %d terminated." % i
        if reduce(lambda x,y: x and (y>=10), counts):
            break
        continue

    print "Child Process %d terminated, restarting" % i

    launch(i)