Hmm... this is wacky. Here's your test program, after fixing
Deja.com's mangling of the code:
import threading
import os, sys

class MyThread(threading.Thread):
    def run(self):
        for i in range(50):
            print 'calling once', i
            self.once()
        print 'Returning from run()'

    def once(self):
        print ' calling fork'
        pid = os.fork()
        print ' fork output', pid
        if pid == 0:
            print " hello mom"
            sys.stdout.flush()
            os._exit(0)
        print 'Calling waitpid', pid
        pid2, sts = os.waitpid(pid, 0)
        print " bye baby"
        sys.stdout.flush()

threads = []
for i in range(10):
    threads.append(MyThread())
print threads
map(lambda x: x.start(), threads)
print 'All started'

import time
print 'Final sleep'
time.sleep(5)
print "DONE"
I've added debugging printouts and changed some of the numbers; run()
only loops 50 times instead of 100, and the final delay is only 5
seconds, not 1000.
On Solaris 2.6 and Python 1.5.2, it hangs very quickly:
[<MyThread(Thread-1, initial)>, <MyThread(Thread-2, initial)>, ... ]
calling once 0
calling fork
fork output 0
hello mom
fork output 19633
Calling waitpid 19633
bye baby
calling once 1
calling fork
fork output 0
hello mom
fork output 19634
Calling waitpid 19634
Apparently hanging in waitpid... The current CVS tree on the same
machine, however, doesn't hang; it runs through the whole sequence.
Looking through the CVS logs, I can't find a relevant checkin, so I
don't know what caused the change. So, try the current CVS tree and
see if that improves matters. (But Zope won't work with the current
CVS tree; still, at least you can determine if the current CVS helps,
and then look for the precise bugfix.)
Can anyone suggest what's going on here?
--
A.M. Kuchling http://starship.python.net/crew/amk/
And how often do we meet the man who prefaces his remarks with: "I was reading
a book last night..." in the too loud, overenunciated fashion of one who might
be saying: "I keep a hippogryph in my basement." Reading confers status.
-- Robertson Davies, _A Voice from the Attic_
I don't know much about threading but here is my small
contribution. The attached program locks up on my machine when I
increase the number of forking processes to more than one. It
happens for the CVS version of Python as well as 1.5.2.
Also, there don't seem to be any major changes to threading.py,
posix.waitpid or threadmodule.c between 1.5.2 and the current CVS
source. Andrew, are you sure it doesn't lock up on Solaris?
Maybe you need to increase the number of threads.
Neil
import threading
import os, sys

running = threading.Semaphore(20)  # about 5 is enough on my machine
forking = threading.Semaphore(1)   # more than 1 seems to cause deadlocks

class MyThread(threading.Thread):
    def start(self):
        running.acquire()
        threading.Thread.start(self)

    def run(self):
        print ' calling fork'
        forking.acquire()
        pid = os.fork()
        print ' fork output', pid
        if pid == 0:
            print " hello mom"
            sys.stdout.flush()
            os._exit(0)
        forking.release()
        print 'Calling waitpid', pid
        pid2, sts = os.waitpid(pid, 0)
        print " bye baby"
        sys.stdout.flush()
        running.release()

while 1:
    t = MyThread().start()
It doesn't crash on Solaris; I increased the number of forking
processes to 10. I wonder if all those threads might be causing so
much scheduler overhead that it only looks like a deadlock on Linux,
but really it's just spending lots of time in the kernel trying to
pick the next process to run.
>Also, there don't seem to be any major changes to threading.py,
>posix.waitpid or threadmodule.c between 1.5.2 and the current CVS
>source. Andrew, are you sure it doesn't lock up on Solaris?
>Maybe you need to increase the number of threads.
Pretty sure, I think; I ran it with 50 threads, and while it was
pretty slow (scheduling overhead, I imagine), the program *did*
complete. I thought a change to signal handling might be responsible
for the fix, perhaps a different treatment of SIGCHLD, but there are
no relevant changes since 1.5.
Unless ... I noticed something suspicious along the way; look at this
code from floatsleep() in Modules/timemodule.c:
    Py_BEGIN_ALLOW_THREADS
    if (select(0, (fd_set *)0, (fd_set *)0, (fd_set *)0, &t) != 0) {
        Py_BLOCK_THREADS
#ifdef EINTR
        if (errno != EINTR) {
#else
        if (1) {
#endif
            PyErr_SetFromErrno(PyExc_IOError);
            return -1;
        }
    }
    Py_END_ALLOW_THREADS
Py_BLOCK_THREADS is for leaving a {BEGIN,END}_ALLOW_THREADS block (see
ceval.h), but this code doesn't always exit the block; if errno == EINTR,
the flow would be Py_BEGIN_ALLOW_THREADS; Py_BLOCK_THREADS;
Py_END_ALLOW_THREADS, which restores the thread state twice. I suspect
the Py_BLOCK_THREADS should be moved inside the inner if, so it's only
executed when the function actually returns early on error. But I don't
know if this might be the root of the problem.
Any threading wizards such as Tim Peters or Greg Stein want to offer
some insight, whether into this possible bug or the original problem?
--
A.M. Kuchling http://starship.python.net/crew/amk/
Science itself, therefore, may be regarded as a minimal problem, consisting of
the completest possible presentment of facts with the least possible
expenditure of thought.
-- Ernst Mach
Are you using a CVS copy of Python? Because my source for floatsleep()
doesn't look like that. I'm using the 1.5.2 source and the code looks like
this:
    Py_BEGIN_ALLOW_THREADS
    if (select(0, (fd_set *)0, (fd_set *)0, (fd_set *)0, &t) != 0) {
        Py_BLOCK_THREADS
        PyErr_SetFromErrno(PyExc_IOError);
        return -1;
    }
    Py_END_ALLOW_THREADS
I agree that the usage you quoted is fishy, but the above looks fine. And
forking in multiple threads locks up on my RH6.1 box also, so I don't think
the problem is in sleep.
I ran the code that you cleaned up with buffering turned off, and it seemed
to me to be locking up on the os._exit() call. I remember Tim Peters
posting a while back about an obscure race condition with a lot of processes
being created and killed. I'm no guru, but I'd place my bet that this is
the very problem.
My solution to the problem goes like this:
Patient: "Doctor! My arm hurts when I go like this."
Doctor: "So don't do that."
David
The CPU usage is 100% but only a small percentage is system time.
The scheduling overhead should show up as system time, shouldn't it?
I don't think sleep() or wait() is the problem either; Python seems
to hang in the same place quite consistently. The only task
running (using 100% CPU) is doing this:
pthread_cond_wait () from /lib/libpthread.so.0
PyThread_acquire_lock at thread_pthread.h:318
PyEval_AcquireThread at ceval.c:150
t_bootstrap at ./threadmodule.c:223
pthread_start_thread () from /lib/libpthread.so.0
None of the threads are in a call to sleep(), wait() or fork().
This comment from the pthread_atfork man page seems like it
_might_ be relevant:
    To understand the purpose of pthread_atfork, recall that
    fork(2) duplicates the whole memory space, including
    mutexes in their current locking state, but only the
    calling thread: other threads are not running in the child
    process. Thus, if a mutex is locked by a thread other than
    the thread calling fork, that mutex will remain locked
    forever in the child process, possibly blocking the
    execution of the child process.
I don't see how this could cause a problem for Python though.
AFAIK, Python uses only one lock, and the thread that holds that lock
has to be the one that is calling fork().
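For what it's worth, the scenario the man page describes is easy to
reproduce at the Python level with an ordinary threading.Lock rather
than the interpreter lock. A minimal sketch (Unix only; the delays
are arbitrary):

import threading, os, time, signal

lock = threading.Lock()

def holder():
    # Grab the lock and sit on it while the main thread forks.
    lock.acquire()
    time.sleep(30)
    lock.release()

t = threading.Thread(target=holder)
t.setDaemon(1)
t.start()
time.sleep(1)        # make sure the other thread really owns the lock

pid = os.fork()
if pid == 0:
    # The holder thread does not exist in the child, but the lock was
    # copied in its locked state, so this acquire() blocks forever.
    lock.acquire()
    print "child got the lock"        # never reached
    os._exit(0)

time.sleep(5)
# WNOHANG: check whether the child is still running instead of blocking.
print "child still stuck:", os.waitpid(pid, os.WNOHANG) == (0, 0)
os.kill(pid, signal.SIGKILL)          # clean up the wedged child
os.waitpid(pid, 0)

That only demonstrates the generic behaviour with a user-level lock,
though; whether the interpreter lock can end up in the same state is
exactly what I can't see happening.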
If someone has any theories about what is happening here I would
be eager to hear them.
Neil
--
Real programmers don't make mistrakes
Not for my problem on Linux. My code didn't call sleep(). It
may be a bug with the pthreads in libc6 for Linux. I can't
reproduce it with C code though.
Neil
> Not for my problem on Linux. My code didn't call sleep(). It
> may be a bug with the pthreads in libc6 for Linux. I can't
> reproduce it with C code though.
I think I can explain what happens. Look at the following code:
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>

void *
thread(void * v)
{
    for (;;) {
        char buf[40];
        int len = sprintf(buf, "%lu\n", (unsigned long) getpid());
        write(1, buf, len);
        sleep(1);
    }
}

main()
{
    pthread_t t;
    pthread_create(&t, NULL, &thread, NULL);
    fork();
    sleep(3600);
}
This clearly shows that the thread is not duplicated on fork().
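The same experiment can be done from Python itself. A minimal sketch
(Unix only; it uses the low-level thread module just to keep the
example short):

import os, time, thread

def ticker():
    # Print our pid once a second, forever.
    while 1:
        print os.getpid()
        time.sleep(1)

thread.start_new_thread(ticker, ())
time.sleep(2)
os.fork()
time.sleep(10)
# Only the parent's pid keeps appearing: the ticker thread exists only
# in the parent and was not duplicated into the child.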
Now look at the Python source code:
static PyObject *
posix_fork(self, args)
    PyObject *self;
    PyObject *args;
{
    int pid;
    if (!PyArg_ParseTuple(args, ":fork"))
        return NULL;
    pid = fork();
    if (pid == -1)
        return posix_error();
    PyOS_AfterFork();
    return PyInt_FromLong((long)pid);
}

void
PyOS_AfterFork()
{
#ifdef WITH_THREAD
    main_thread = PyThread_get_thread_ident();
    main_pid = getpid();
#endif
}

long PyThread_get_thread_ident _P0()
{
    volatile pthread_t threadid;
    if (!initialized)
        PyThread_init_thread();
    /* Jump through some hoops for Alpha OSF/1 */
    threadid = pthread_self();
    return (long) *(long *) &threadid;
}
In the child process, all threads except the one that called fork()
have disappeared, but the code doesn't seem to be prepared to handle this.
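One Python-visible symptom of that: the threading module's bookkeeping
is simply copied into the child, so the child still believes the
vanished threads exist. A minimal sketch (Unix only):

import threading, os, time

def sleeper():
    time.sleep(30)

for i in range(3):
    t = threading.Thread(target=sleeper)
    t.setDaemon(1)
    t.start()

pid = os.fork()
if pid == 0:
    # Only the main thread really survived the fork, but the copied
    # bookkeeping still counts the three sleepers from the parent.
    print "child thinks it has", threading.activeCount(), "threads"
    os._exit(0)
os.waitpid(pid, 0)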
No, that's just the suspicious bit of code that I noticed in
timemodule.c; GvR confirmed that it looks wrong, so I checked in the
patch.
--
A.M. Kuchling http://starship.python.net/crew/amk/
But we cannot disguise our abhorrence of modern communication devices.
-- Queen Victoria on teleconferencing, in SEBASTIAN O #1
I have a few comments on your discussion that might be helpful.
1) According to POSIX, when fork is done inside a thread, only
   that thread is cloned into the new process.
   a) That's exactly why, in the code shown earlier, none of
      the threads are cloned (only the main thread).
   b) This is a known problem for the child process.
      If the cloned thread needs to acquire a lock that
      some other thread held at the time of the fork, it
      will deadlock (since in the child process there
      is no thread left to release that lock).
      pthread_atfork is there to help the child process
      handle this problem.
2) The strange thing in our case is that it is the parent
   thread that starts behaving badly. And as I will soon
   point out, it depends on what the child process is doing.
3) We ran into this problem because we were using popen in
   a thread. popen basically just does a fork and exec plus
   some file-handling stuff. The popen module is implemented
   in Python. I replaced that module with one written in C and
   the problem went away (so I have a workaround :).
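For concreteness, the pattern in question looks roughly like this (a
minimal sketch; it uses the popen2 module as a stand-in for the
Python-level popen described above, since popen2 does the fork, the
file-descriptor handling and the exec all in Python code):

import threading, popen2

def worker(cmd):
    # Because popen2 is written in Python, the child executes a fair
    # amount of Python code between the fork and the exec, which is
    # exactly the window in which it can block on a lock that some
    # other thread held at fork time.
    out, inp = popen2.popen2(cmd)
    print out.read(),
    out.close()
    inp.close()

threads = []
for i in range(20):
    t = threading.Thread(target=worker, args=("echo hello",))
    threads.append(t)
    t.start()
for t in threads:
    t.join()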
I did some more experiments, and it seems that if the child
process leaves an atomic operation (i.e. enters Python again),
our problem may occur.
I don't know the implementation of Python but everything
I have seen indicates that the child process is still sharing
some lock with the parent process.
Hope this will help
Snorri