Hmm... this is wacky. Here's your test program, after fixing
Deja.com's mangling of the code:
import threading
import os, sys

class MyThread(threading.Thread):
    def run(self):
        for i in range(50):
            print 'calling once', i
            self.once()
        print 'Returning from run()'

    def once(self):
        print ' calling fork'
        pid = os.fork()
        print ' fork output', pid
        if pid == 0:
            print " hello mom"
            sys.stdout.flush()
            os._exit(0)
        print 'Calling waitpid', pid
        pid2, sts = os.waitpid(pid, 0)
        print " bye baby"
        sys.stdout.flush()

threads = []
for i in range(10):
    threads.append(MyThread())
print threads
map(lambda x: x.start(), threads)
print 'All started'

import time
print 'Final sleep'
time.sleep(5)
print "DONE"
I've added debugging printouts and changed some of the numbers; run()
only loops 50 times instead of 100, and the final delay is only 5
seconds, not 1000.
On Solaris 2.6 and Python 1.5.2, it hangs very quickly:
[<MyThread(Thread-1, initial)>, <MyThread(Thread-2, initial)>, ... ]
calling once 0
calling fork
fork output 0
hello mom
fork output 19633
Calling waitpid 19633
bye baby
calling once 1
calling fork
fork output 0
hello mom
fork output 19634
Calling waitpid 19634
Apparently hanging in waitpid... The current CVS tree on the same
machine, however, doesn't hang; it runs through the whole sequence.
Looking through the CVS logs, I can't find a relevant checkin, so I
don't know what caused the change. So, try the current CVS tree and
see if that improves matters. (But Zope won't work with the current
CVS tree; still, at least you can determine if the current CVS helps,
and then look for the precise bugfix.)
Can anyone suggest what's going on here?
--
A.M. Kuchling http://starship.python.net/crew/amk/
And how often do we meet the man who prefaces his remarks with: "I was reading
a book last night..." in the too loud, overenunciated fashion of one who might
be saying: "I keep a hippogryph in my basement." Reading confers status.
-- Robertson Davies, _A Voice from the Attic_
I don't know much about threading but here is my small
contribution. The attached program locks up on my machine when I
increase the number of forking processes to more than one. It
happens for the CVS version of Python as well as 1.5.2.
Also, there don't seem to be any major changes to threading.py,
posix.waitpid or threadmodule.c between 1.5.2 and the current CVS
source. Andrew, are you sure it doesn't lock up on Solaris?
Maybe you need to increase the number of threads.
Neil
import threading
import os, sys

running = threading.Semaphore(20)  # about 5 is enough on my machine
forking = threading.Semaphore(1)   # more than 1 seems to cause deadlocks

class MyThread(threading.Thread):
    def start(self):
        running.acquire()
        threading.Thread.start(self)

    def run(self):
        print ' calling fork'
        forking.acquire()
        pid = os.fork()
        print ' fork output', pid
        if pid == 0:
            print " hello mom"
            sys.stdout.flush()
            os._exit(0)
        forking.release()
        print 'Calling waitpid', pid
        pid2, sts = os.waitpid(pid, 0)
        print " bye baby"
        sys.stdout.flush()
        running.release()

while 1:
    t = MyThread().start()
It doesn't crash on Solaris; I increased the number of forking
processes to 10. I wonder if all those threads might be causing so
much scheduler overhead that it only looks like a deadlock on Linux,
but really it's just spending lots of time in the kernel trying to
pick the next process to run.
>Also, there don't seem to be any major changes to threading.py,
>posix.waitpid or threadmodule.c between 1.5.2 and the current CVS
>source. Andrew, are you sure it doesn't lock up on Solaris?
>Maybe you need to increase the number of threads.
Pretty sure, I think; I ran it with 50 threads, and while it was
pretty slow (scheduling overhead, I imagine), the program *did*
complete. I thought a change to signal handling might be responsible
for the fix, perhaps a different treatment of SIGCHLD, but there are
no relevant changes since 1.5.
Unless ... I noticed something suspicious along the way; look at this
code from floatsleep() in Modules/timemodule.c:
    Py_BEGIN_ALLOW_THREADS
    if (select(0, (fd_set *)0, (fd_set *)0, (fd_set *)0, &t) != 0) {
        Py_BLOCK_THREADS
#ifdef EINTR
        if (errno != EINTR) {
#else
        if (1) {
#endif
            PyErr_SetFromErrno(PyExc_IOError);
            return -1;
        }
    }
    Py_END_ALLOW_THREADS
Py_BLOCK_THREADS is for leaving a {BEGIN,END}_ALLOW_THREADS block (see
ceval.h), but this code doesn't always exit the block; if errno == EINTR,
the flow would be Py_BEGIN_ALLOW_THREADS; Py_BLOCK_THREADS;
Py_END_ALLOW_THREADS, which restores the thread state twice. I suspect
the Py_BLOCK_THREADS should be moved inside the inner if, so it's only
executed when the function actually returns early on error. But I don't
know if this might be the root of the problem.
Any threading wizards such as Tim Peters or Greg Stein want to offer
some insight, whether into this possible bug or the original problem?
--
A.M. Kuchling http://starship.python.net/crew/amk/
Science itself, therefore, may be regarded as a minimal problem, consisting of
the completest possible presentment of facts with the least possible
expenditure of thought.
-- Ernst Mach
Are you using a CVS copy of Python? Because my source for floatsleep()
doesn't look like that. I'm using the 1.5.2 source and the code looks like
this:
    Py_BEGIN_ALLOW_THREADS
    if (select(0, (fd_set *)0, (fd_set *)0, (fd_set *)0, &t) != 0) {
        Py_BLOCK_THREADS
        PyErr_SetFromErrno(PyExc_IOError);
        return -1;
    }
    Py_END_ALLOW_THREADS
I agree that the usage you quoted is fishy, but the above looks fine. And
forking in multiple threads locks up on my RH6.1 box also, so I don't think
the problem is in sleep.
I ran the code that you cleaned up with buffering turned off, and it seemed
to me to be locking up on the os._exit() call. I remember Tim Peters
posting a while back about an obscure race condition with a lot of processes
being created and killed. I'm no guru, but I'd place my bet that this is
the very problem.
My solution to the problem goes like this:
Patient: "Doctor! My arm hurts when I go like this."
Doctor: "So don't do that."
David
The CPU usage is 100% but only a small percentage is system time.
The scheduling overhead should show up as system time, shouldn't it?
I don't think sleep() or wait() is the problem either; Python seems
to hang in the same place quite consistently. The only task
running (using 100% CPU) is doing this:
pthread_cond_wait () from /lib/libpthread.so.0
PyThread_acquire_lock at thread_pthread.h:318
PyEval_AcquireThread at ceval.c:150
t_bootstrap at ./threadmodule.c:223
pthread_start_thread () from /lib/libpthread.so.0
None of the threads are in a call to sleep(), wait() or fork().
This comment from the pthread_atfork man page seems like it
_might_ be relevant:
    To understand the purpose of pthread_atfork, recall that
    fork(2) duplicates the whole memory space, including
    mutexes in their current locking state, but only the
    calling thread: other threads are not running in the child
    process. Thus, if a mutex is locked by a thread other than
    the thread calling fork, that mutex will remain locked
    forever in the child process, possibly blocking the
    execution of the child process.
I don't see how this could cause a problem for Python though.
AFAIK, Python uses only one lock, and the thread that holds that lock
has to be the one that is calling fork().
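For what it's worth, the scenario the man page describes is easy to
reproduce at the Python level with an ordinary threading.Lock rather
than the interpreter lock. A minimal sketch (Unix only; the delays
are arbitrary):

import threading, os, time, signal

lock = threading.Lock()

def holder():
    # Grab the lock and sit on it while the main thread forks.
    lock.acquire()
    time.sleep(30)
    lock.release()

t = threading.Thread(target=holder)
t.setDaemon(1)
t.start()
time.sleep(1)        # make sure the other thread really owns the lock

pid = os.fork()
if pid == 0:
    # The holder thread does not exist in the child, but the lock was
    # copied in its locked state, so this acquire() blocks forever.
    lock.acquire()
    print "child got the lock"        # never reached
    os._exit(0)

time.sleep(5)
# WNOHANG: check whether the child is still running instead of blocking.
print "child still stuck:", os.waitpid(pid, os.WNOHANG) == (0, 0)
os.kill(pid, signal.SIGKILL)          # clean up the wedged child
os.waitpid(pid, 0)

That only demonstrates the generic behaviour with a user-level lock,
though; whether the interpreter lock can end up in the same state is
exactly what I can't see happening.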
If someone has any theories about what is happening here I would
be eager to hear them.
Neil
--
Real programmers don't make mistrakes
Not for my problem on Linux. My code didn't call sleep(). It
may be a bug with the pthreads in libc6 for Linux. I can't
reproduce it with C code though.
Neil
> Not for my problem on Linux. My code didn't call sleep(). It
> may be a bug with the pthreads in libc6 for Linux. I can't
> reproduce it with C code though.
I think I can explain what happens. Look at the following code:
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>

void *
thread(void * v)
{
    for (;;) {
        char buf[40];
        int len = sprintf(buf, "%lu\n", (unsigned long) getpid());
        write(1, buf, len);
        sleep(1);
    }
}

main()
{
    pthread_t t;
    pthread_create(&t, NULL, &thread, NULL);
    fork();
    sleep(3600);
}
This clearly shows that the thread is not duplicated on fork().
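The same experiment can be done from Python itself. A minimal sketch
(Unix only; it uses the low-level thread module just to keep the
example short):

import os, time, thread

def ticker():
    # Print our pid once a second, forever.
    while 1:
        print os.getpid()
        time.sleep(1)

thread.start_new_thread(ticker, ())
time.sleep(2)
os.fork()
time.sleep(10)
# Only the parent's pid keeps appearing: the ticker thread exists only
# in the parent and was not duplicated into the child.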
Now look at the Python source code:
static PyObject *
posix_fork(self, args)
    PyObject *self;
    PyObject *args;
{
    int pid;
    if (!PyArg_ParseTuple(args, ":fork"))
        return NULL;
    pid = fork();
    if (pid == -1)
        return posix_error();
    PyOS_AfterFork();
    return PyInt_FromLong((long)pid);
}

void
PyOS_AfterFork()
{
#ifdef WITH_THREAD
    main_thread = PyThread_get_thread_ident();
    main_pid = getpid();
#endif
}

long PyThread_get_thread_ident _P0()
{
    volatile pthread_t threadid;
    if (!initialized)
        PyThread_init_thread();
    /* Jump through some hoops for Alpha OSF/1 */
    threadid = pthread_self();
    return (long) *(long *) &threadid;
}
In the child process, all threads except the one that called fork()
have disappeared, but the code doesn't seem to be prepared to handle this.
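One Python-visible symptom of that: the threading module's bookkeeping
is simply copied into the child, so the child still believes the
vanished threads exist. A minimal sketch (Unix only):

import threading, os, time

def sleeper():
    time.sleep(30)

for i in range(3):
    t = threading.Thread(target=sleeper)
    t.setDaemon(1)
    t.start()

pid = os.fork()
if pid == 0:
    # Only the main thread really survived the fork, but the copied
    # bookkeeping still counts the three sleepers from the parent.
    print "child thinks it has", threading.activeCount(), "threads"
    os._exit(0)
os.waitpid(pid, 0)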
No, that's just the suspicious bit of code that I noticed in
timemodule.c; GvR confirmed that it looks wrong, so I checked in the
patch.
--
A.M. Kuchling http://starship.python.net/crew/amk/
But we cannot disguise our abhorrence of modern communication devices.
-- Queen Victoria on teleconferencing, in SEBASTIAN O #1
I have a few comments on your discussion that might be helpful.
1) According to POSIX, when fork is done inside a thread, only
   that thread is cloned into the new process.
   a) That's exactly why, in the code shown earlier, none of
      the threads are cloned (only the main thread).
   b) This is a known problem for the child process.
      If the cloned thread needs to acquire a lock that
      some other thread held at the time of the fork, it
      will deadlock (since in the child process there
      is no thread left to release that lock).
      pthread_atfork is there to help the child process
      handle this problem.
2) The strange thing in our case is that it is the parent
   thread that starts behaving badly. And as I will soon
   point out, it depends on what the child process is doing.
3) We ran into this problem because we were using popen in
   a thread. popen basically just does a fork and exec plus
   some file-handling stuff. The popen module is implemented
   in Python. I replaced that module with one written in C and
   the problem went away (so I have a workaround :).
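For concreteness, the pattern in question looks roughly like this (a
minimal sketch; it uses the popen2 module as a stand-in for the
Python-level popen described above, since popen2 does the fork, the
file-descriptor handling and the exec all in Python code):

import threading, popen2

def worker(cmd):
    # Because popen2 is written in Python, the child executes a fair
    # amount of Python code between the fork and the exec, which is
    # exactly the window in which it can block on a lock that some
    # other thread held at fork time.
    out, inp = popen2.popen2(cmd)
    print out.read(),
    out.close()
    inp.close()

threads = []
for i in range(20):
    t = threading.Thread(target=worker, args=("echo hello",))
    threads.append(t)
    t.start()
for t in threads:
    t.join()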
I did some more experiments, and it seems that if the child
process leaves an atomic operation (i.e. enters Python again),
our problem may occur.
I don't know the implementation of Python but everything
I have seen indicates that the child process is still sharing
some lock with the parent process.
Hope this will help
Snorri