I have a question primarily for Anne Archibald, the author of the
cookbook entry on multithreading,
http://www.scipy.org/Cookbook/Multithreading.
I tried replacing the 'if __name__=='__main__'' clause in the attachment
handythread.py with

    from numpy import ones, exp

    def f(x):
        print x
        y = ones(10000000)
        exp(y)
and the wall-clock time with foreach was 4.72s vs 6.68s for a simple for-loop.
First of all, that's amazing! I've been internally railing against the
GIL for months. But it looks like only a portion of f is being done
concurrently. In fact if I comment out the 'exp(y)', I don't see any
speedup at all.
It makes sense that you can't malloc simultaneously from different
threads... but if I replace 'ones' with 'empty', the time drops
precipitously, indicating that most of the time taken by 'ones' is
spent actually filling the array with ones. It seems like you should
be able to do that concurrently.
So my question is, what kinds of numpy functions tend to release the
GIL? Is there a system to it, so that one can figure out ahead of time
where a speedup is likely, or do you have to try and see? Do
third-party f2py functions with the 'threadsafe' option release the
GIL?
Thanks,
Anand
_______________________________________________
SciPy-user mailing list
SciPy...@scipy.org
http://projects.scipy.org/mailman/listinfo/scipy-user
Simple example:
I want to evaluate a function using a C extension
(implemented with ctypes) for several parameters in
the function. The parameters are in a list. Then I
use the handythread.py approach and for each thread
call the C extension function with a new parameter
value from the list and, when the thread returns, I
add the result (say, a float number) to a result list.
Will the GIL let the threads run independently? I
hope my example is clear.
Thanks for any info.
--- Anand Patil <anand.prab...@gmail.com>
wrote:
-- Lou Pecora, my views are my own.
> First of all, that's amazing! I've been internally railing against the
> GIL for months. But it looks like only a portion of f is being done
> concurrently. In fact if I comment out the 'exp(y)', I don't see any
> speedup at all.
>
> It makes sense that you can't malloc simultaneously from different
> threads... but if I replace 'ones' with 'empty', the time drops
> precipitously, indicating that most of the time taken by 'ones' is
> spent actually filling the array with ones. It seems like you should
> be able to do that concurrently.
>
> So my question is, what kinds of numpy functions tend to release the
> GIL? Is there a system to it, so that one can figure out ahead of time
> where a speedup is likely, or do you have to try and see? Do
> third-party f2py functions with the 'threadsafe' option release the
> GIL?
In general, the answer is that if a C extension can function outside
the GIL, it has to explicitly release it. TBH, I'm not sure what it
has to do first to make sure the interpreter is in a safe state -
maybe nothing - but it has to explicitly declare that it's not going
to modify any interpreter state.
Many numpy functions - exp is obviously an example - do this. Others
don't. It would be useful to go through the code looking at which ones
do and don't release the GIL, and put it in their docstrings; it might
be possible to make more release the GIL. It's a pretty safe bet that
the ufuncs do; I would guess that the linear algebra functions do too.
Probably not much else.
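One way to see the effect empirically, without touching numpy at all, is to compare a call known to release the GIL (time.sleep does) against a pure-Python loop that cannot. A minimal sketch using only the standard library (the helper name run_threads is mine, just for illustration):

```python
import threading
import time

def run_threads(work, n=2):
    """Run `work` in n threads and return the wall-clock time."""
    threads = [threading.Thread(target=work) for _ in range(n)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start

# time.sleep releases the GIL, so two threads overlap almost completely
sleeping = run_threads(lambda: time.sleep(0.2))

# a pure-Python loop holds the GIL, so two threads cannot overlap
busy = run_threads(lambda: sum(i * i for i in range(200000)))
```

With two threads, the sleeping version finishes in roughly 0.2s rather than 0.4s; the busy loop gets no such benefit.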
If an extension uses ctypes, whether it releases the GIL is up to
ctypes. I would guess that it doesn't, since ctypes knows nothing
about the C function, but I have never actually used ctypes.
Anne
> It would be useful to go through the code looking at which ones
> do and don't release the GIL, and put it in their docstrings; it might
> be possible to make more release the GIL. It's a pretty safe bet that
> the ufuncs do; I would guess that the linear algebra functions do too.
> Probably not much else.
I second that suggestion. In fact I'd be willing to help out if it's a
tedious but simple job.
> If an extension uses ctypes, whether it releases the GIL is up to
> ctypes. I would guess that it doesn't, since ctypes knows nothing
> about the C function, but I have never actually used ctypes.
Makes sense. Does anyone know about f2py extensions with 'cf2py
threadsafe' set? From the f2py user's guide, the threadsafe option
will
Use Py_BEGIN_ALLOW_THREADS .. Py_END_ALLOW_THREADS block around the
call to Fortran/C function.
Is that sufficient to release the GIL? What if the functions have callbacks?
Anand
Also, please ensure that you have at least 3 processors available (the
default number of threads). If not, you may introduce problems,
especially if you only have two processors, because one processor will
be used by the system for other tasks.
Without knowing your 'simple for-loop' I do not see what you apparently see.
    from numpy import ones, exp
    from handythread import foreach
    import time

    if __name__=='__main__':
        def f(x):
            y = ones(10000000)
            exp(y)
        t1=time.time()
        foreach(f,range(100))
        t2=time.time()
        for ndx in range(100):
            y = ones(10000000)
            exp(y)
        t3=time.time()
        print 'simple loop / handythread =', (t3-t2)/(t2-t1)
With this code, the 'for loop' takes about 2.7 times as long as the
handythread loop for a quad-core system. Further, on my Linux system I
can see via 'top' that handythread is using 3 (of the four cores) and
then this drops to 1 with the loop. Note this is not 3 to 1 as would
be expected if the speedup were linear, but rather close - there is
overhead involved. If you have limited resources (i.e. memory or
processors) or another OS that is not fully multithreaded, you may run
into additional problems since handythread.py assumes everything is
possible.
Regards
Bruce
>     from numpy import ones, exp
>     from handythread import foreach
>     import time
>
>     if __name__=='__main__':
>         def f(x):
>             y = ones(10000000)
>             exp(y)
>         t1=time.time()
>         foreach(f,range(100))
>         t2=time.time()
>         for ndx in range(100):
>             y = ones(10000000)
>             exp(y)
>         t3=time.time()
>         print 'simple loop / handythread =', (t3-t2)/(t2-t1)
>
> With this code, the 'for loop' takes about 2.7 times as long as the
> handythread loop for a quad-core system.
That's very interesting. I set the 'threads' option to 2, since I have
a dual-core system, and the handythread example is still only about
1.5x faster than the for-loop example, even though I can see that both
my cores are being fully utilized. That could be because my machine
devotes a good fraction of one of its cores to just being a Mac, but
it doesn't look like that's what is making the difference.
The strange thing is that for me the 'for-loop' version above takes
67s, whereas a version with f modified as follows:
    def f(x):
        y = ones(10000000)
        # exp(y)
takes 13s whether I use handythread or a for-loop. I think that means
'ones' can only be executed by one thread at a time. Based on that, if
my machine had three free cores I would expect about a 2.16X speedup
tops, but you're seeing a 2.7X speedup.
That means our machines are doing something differently (yours is
better). Do you see any speedup from handythread with the modified
version of f?
Anne, Thanks for your answers. They are helping, but
I'm still vague on the GIL. I have a few more
questions, two on your handythread.py code and one on
releasing the GIL for a C extension. Thanks for your
patience and help. BTW, I have a MacBook Pro with 2
CPUs.
(1) In your code if return_ = True I get a return
value from the foreach function only when nthreads>1,
but not when nthreads=1. Looking at the code the
nthreads=1 ends up in the else: at the bottom which
looks like:
    else:
        if return_:
            for v in l:
                f(v)
        else:
            return
and is puzzling. Nothing is returned in the if part
and f is not even called in the else part. Is this a
bug?
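For what it's worth, it does look like the two branches got swapped. A plausible corrected single-threaded fallback (the function name foreach_serial is mine, just for illustration) would be:

```python
def foreach_serial(f, l, return_=True):
    # single-threaded fallback: apply f to every element of l,
    # collecting the results only when return_ is True
    if return_:
        return [f(v) for v in l]
    for v in l:
        f(v)
```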
(2) If I replace the sleep(0.5) call in your f
function with a loop that just does a simple
calculation to eat up time, then in the call to
foreach when nthreads=2 the time to run the code goes
up by factors of ~100 or so. I'm guessing here that
it's because the GIL is not released for my version,
but is released in the sleep(0.5) function in your
version. Is that right?
(3) You mention that ctypes probably doesn't release
the GIL. I would guess that too, since it would be
dangerous as I (vaguely) understand the GIL. But does
the GIL have to be released in the C extension or can
it be released in the step just before I call the C
extension from Python? I.e. is releasing it on the Python
side possible? If not, I guess I will have to look
over the numpy code as you suggest. If possible, I
suppose the GIL must be reacquired immediately on return
from the C extension.
Thanks, again.
-- Lou Pecora, my views are my own.
Removing the exp(y) gives about the same time, which you can take
either way. But really you need to understand Python.
From http://docs.python.org/api/threads.html:
"Therefore, the rule exists that only the thread that has acquired the
global interpreter lock may operate on Python objects or call Python/C
API functions. In order to support multi-threaded Python programs, the
interpreter regularly releases and reacquires the lock -- by default,
every 100 bytecode instructions (this can be changed with
sys.setcheckinterval()). "
If the operation is fast enough, it will be done before the interpreter
releases and reacquires the lock, so threading gives no advantage in
that case. As you do more work per call, this release/reacquire action
becomes more important to the overall performance.
This behavior is also part of the reason why you cannot get a linear
speedup for this using Python.
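(As an aside, the knob named in that quote has since moved: sys.setcheckinterval counted bytecodes in Python 2, while Python 3.2+ replaced it with sys.setswitchinterval, which takes seconds. A small sketch on a modern interpreter:)

```python
import sys

# Python 2 exposed sys.setcheckinterval(n_bytecodes);
# Python 3.2+ uses sys.setswitchinterval(seconds) instead.
old = sys.getswitchinterval()
sys.setswitchinterval(0.001)  # ask for more frequent GIL handoffs
interval = sys.getswitchinterval()
sys.setswitchinterval(old)    # restore the previous setting
```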
It is better to set the number of threads in handythread.py explicitly:

    N threads    Ratio of handythread.py to a for loop
        1        0.995360257543
        2        1.81112657674
        3        2.51939329739
        4        2.95551097958
        5        3.04222213598
I do not get 100% of cpu time of each processor even for the for-loop
part. So until that happens, threads are not going to be as good as
they could be. Also, I can not comment on the OS but I do know some
are better than others for threading performance.
Regards
Bruce
Well, what needs to happen is that someone needs to go through and
track down occurrences of Py_BEGIN_ALLOW_THREADS ..
Py_END_ALLOW_THREADS in numpy.
A brute-force way of finding code that probably doesn't do it would be
to simply run each function in a foreach() with two threads and then
with one and see if there's any speedup. Messy and crude; probably
better just to look at the code, but numpy can be labyrinthine.
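That brute-force test can be sketched in a few lines of pure Python. The helper below (my naming, not part of handythread.py) times a callable serially and then in parallel threads; a ratio well above 1 suggests the function releases the GIL:

```python
import threading
import time

def gil_speedup_ratio(work, reps=2):
    """Time `reps` serial calls vs the same calls in parallel threads."""
    start = time.time()
    for _ in range(reps):
        work()
    serial = time.time() - start

    threads = [threading.Thread(target=work) for _ in range(reps)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    parallel = time.time() - start
    return serial / parallel

# time.sleep releases the GIL, so this comes out near 2.0;
# a GIL-holding function stays near 1.0 (or below, due to overhead)
ratio = gil_speedup_ratio(lambda: time.sleep(0.2))
```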
> Makes sense. Does anyone know about f2py extensions with 'cf2py
> threadsafe' set? From the f2py user's guide, the threadsafe option
> will
>
> Use Py_BEGIN_ALLOW_THREADS .. Py_END_ALLOW_THREADS block around the
> call to Fortran/C function.
>
> Is that sufficient to release the GIL? What if the functions have callbacks?
That's exactly what is needed to release the GIL.
I think, from looking at the code, that F2PY does nothing to reacquire
the GIL if it's entering a callback; this would mean that using
callbacks in a "threadsafe" function would cause a crash. So Don't Do
That. (But I'm not totally sure; maybe try generating one just to
check.)
Anne
Yep. Oops. Fixed in the v2 versions of the files. The wiki doesn't
make a very good version control system. Is it worth incorporating
those files into scipy?
> (2) If I replace the sleep(0.5) call in your f
> function with a loop that just does a simple
> calculation to eat up time, then in the call to
> foreach when nthreads=2 the time to run the code goes
> up by factors of ~100 or so. I'm guessing here that
> it's because the GIL is not released for my version,
> but is released in the sleep(0.5) function in your
> version. Is that right?
Depends what your function does, really. If your function takes half a
second but never releases the GIL, it should take twice as long. If
your function releases the GIL, then it should take about the same
time. But it's quite tricky to write a function that works hard and
releases the GIL. A good rule of thumb is to count the lines of python
that are getting executed. If there are only a few - say you're doing
sum(log(exp(arange(1000000)))) - there's a good chance the GIL will be
released. If you're running millions of python instructions, the GIL
is held all that time, and you won't get a speedup.
> (3) You mention that ctypes probably doesn't release
> the GIL. I would guess that too, since it would be
> dangerous as I (vaguely) understand the GIL. But does
> the GIL have to be released in the C extension or can
> it be released in the step just before I call the C
> extension from Python? I.e. is releasing it on the Python
> side possible? If not, I guess I will have to look
> over the numpy code as you suggest. If possible, I
> suppose the GIL must be reacquired immediately on return
> from the C extension.
You can't execute any python bytecodes without holding the GIL, so
it's impossible for python code to release the GIL. But it would be
perfectly possible, in principle, for SWIG, F2PY, or ctypes to put a
"release the GIL" in their wrappers. This will be a problem for some
functions - either ones that aren't reentrant, or ones that call back
to python (though in principle it might be possible to reacquire the
GIL for the duration of a callback). But for a typical C function that
acts only on data you give it and that doesn't know anything about
python, it should be safe to run it without the GIL engaged. It seems
like f2py can actually do this for functions marked as threadsafe; I
don't know about ctypes or SWIG.
Anne
I vote yes. In my opinion the following would combine to form a killer feature:
- The handythread idea is developed a little, maybe to provide
functionality comparable to OpenMP
- Instructions for releasing the GIL in different extension types
(swig, f2py, pyrex) are combined in one place
- The numpy functions that release the GIL are clearly enumerated.
Seriously, this is too big of a deal to be just a cookbook entry. I
spent a full week last month beating my head against OpenMP trying to
do something embarrassingly parallel in an f2py extension. I had to
apply a patch to gcc 4.2's libgomp, compile it manually, learn how
linking works, and try several other options because OpenMP was so
frustrating. Now it works but I have tons of bug-prone code
duplication in Fortran because I couldn't figure out how to just apply
the same parallelism structure to all subroutines. The ability to
multithread from Python would have saved me all of that work.
Anand
> Yep. Oops. Fixed in the v2 versions of the files.
> The wiki doesn't
> make a very good version control system. Is it worth
> incorporating
> those files into scipy?
If you mean put a new version of the example code up
to SciPy cookbook, then yes because bugs confuse
newbies like me. :-)
> Depends what your function does, really. If your
> function takes half a
> second but never releases the GIL, it should take
> twice as long. If
> your function releases the GIL, then it should take
> about the same
> time. But it's quite tricky to write a function that
> works hard and
> releases the GIL. A good rule of thumb is to count
> the lines of python
> that are getting executed. If there are only a few -
> say you're doing
> sum(log(exp(arange(1000000)))) - there's a good
> chance the GIL will be
> released. If you're running millions of python
> instructions, the GIL
> is held all that time, and you won't get a speedup.
Hmmm... gotta think about that.
Sounds like it's better to call those macros:
Py_BEGIN_ALLOW_THREADS
and
Py_END_ALLOW_THREADS
on the C side. Is that all that's needed? Then will
code like your handythread.py work with threads if f
calls a C extension that uses those macros? Or is
there more that needs to be done to set this up?
-- Lou Pecora, my views are my own.
> > Yep. Oops. Fixed in the v2 versions of the files.
> > The wiki doesn't make a very good version control
> > system. Is it worth incorporating those files into
> > scipy?
>
> I vote yes. In my opinion the following would
> combine to form a killer feature:
>
> - The handythread idea is developed a little, maybe
> to provide
> functionality comparable to OpenMP
> - Instructions for releasing the GIL in different
> extension types
> (swig, f2py, pyrex) are combined in one place
> - The numpy functions that release the GIL are
> clearly enumerated.
Yes, this is good, but I recognize that it's laying a
lot of work on someone with initials A.A. I would be
happy to have the handythread.py along with simple
instructions of how to use Py_BEGIN_ALLOW_THREADS and
Py_END_ALLOW_THREADS in the C extension to make it
all work together ... Providing that can be done
easily with a few C calls. Maybe it's more
complicated than I realize. In which case: OY !
-- Lou Pecora, my views are my own.
I guess I was kind of thinking other people might jump in. :-) Surely
there are lots of us who want to multithread from Python?
I've already volunteered to look through the numpy functions and find
which ones release the GIL. I'd be happy to contribute to the
handythread-like library, too.
Anand
Karl Young
Center for Imaging of Neurodegenerative Disease, UCSF
VA Medical Center, MRS Unit (114M)
Phone: (415) 221-4810 x3114
FAX: (415) 668-2864
Email: karl young at ucsf edu
> In general, the answer is that if a C extension can function outside
> the GIL, it has to explicitly release it. TBH, I'm not sure what it
> has to do first to make sure the interpreter is in a safe state -
> maybe nothing - but it has to explicitly declare that it's not going
> to modify any interpreter state.
>
> Many numpy functions - exp is obviously an example - do this. Others
> don't. It would be useful to go through the code looking at which ones
> do and don't release the GIL, and put it in their docstrings; it might
> be possible to make more release the GIL. It's a pretty safe bet that
> the ufuncs do; I would guess that the linear algebra functions do too.
> Probably not much else.
>
> If an extension uses ctypes, whether it releases the GIL is up to
> ctypes. I would guess that it doesn't, since ctypes knows nothing
> about the C function, but I have never actually used ctypes.
Of course ctypes releases the GIL on foreign function calls. And the GIL
is reacquired when Python-implemented callback functions call back into
Python code.
There is nothing that ctypes needs to know about the C function - if the
C function is not thread safe, you must not call it from other threads.
Except - if the C function makes Python API calls, the GIL must not be
released. In this case you should use the Python calling convention;
for details look up the docs (pydll and such).
This is even documented ;-)
Thomas
> Of course ctypes releases the GIL on foreign function
> calls. And the GIL is reacquired when Python-implemented
> callback functions call back into Python code.
I'm sorry, I don't understand what you just said. Can
you restate it? I will also check the ctypes docs.
> There is nothing that ctypes needs to know about the
> C function - if the
> C function is not thread safe, you must not call it
> from other threads.
How do I tell if the C function is thread safe?
> Except - if the C function makes Python API calls,
> the GIL must not be released. In this case you
> should use the Python calling convention; for
> details look up the docs (pydll and such).
My C function will make NO Python API calls. Can I
just call the Py_BEGIN_ALLOW_THREADS and
Py_END_ALLOW_THREADS macros in the C function to
allow return to another thread while the C function
calculates? Can the C function be called from another
thread? There are lots of docs. Which do you suggest
for me?
-- Lou Pecora, my views are my own.
ctypes releases the GIL when it calls a C function. Some C functions
take callbacks; ctypes lets you pass Python functions as these
callbacks. There is a C stub wrapped around the Python function to
handle the communication. This stub reacquires the GIL before calling
the Python function.
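Both halves of that can be seen in the classic qsort example from the ctypes tutorial: ctypes drops the GIL around the C call, and the generated stub reacquires it each time the Python comparator runs. A sketch, assuming find_library can locate a standard C library on your platform:

```python
import ctypes
import ctypes.util

# assumption: a findable libc (works on typical Linux/macOS setups)
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# comparator type: int (*)(const void *, const void *)
CMPFUNC = ctypes.CFUNCTYPE(ctypes.c_int,
                           ctypes.POINTER(ctypes.c_int),
                           ctypes.POINTER(ctypes.c_int))

def py_cmp(a, b):
    # runs with the GIL held: ctypes' C stub reacquires it for us
    return a[0] - b[0]

values = (ctypes.c_int * 5)(5, 1, 7, 33, 99)
libc.qsort(values, len(values), ctypes.sizeof(ctypes.c_int),
           CMPFUNC(py_cmp))  # sorts in place via the Python callback
```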
> > There is nothing that ctypes needs to know about the
> > C function - if the
> > C function is not thread safe, you must not call it
> > from other threads.
>
> How do I tell if the C function is thread safe?
You have to analyze the C function and the way you are calling it.
It's not necessarily an easy thing. Basically, you have to make sure
that concurrent calls to your functions don't touch the same data.
> > Except - if the C function makes Python API calls,
> > the GIL must not be released. In this case you
> > should use the Python calling convention; for
> > details look up the docs (pydll and such).
>
> My C function will make NO Python API calls. Can I
> just call the Py_BEGIN_ALLOW_THREADS and
> Py_END_ALLOW_THREADS macros in the C function to
> allow return to another thread while the C function
> calculates?
With ctypes, this is not necessary.
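In other words, a plain CDLL call already runs outside the GIL, with no macros on either side. For example, calling the C math library's sqrt (assuming find_library can locate it on your platform):

```python
import ctypes
import ctypes.util

# load the C math library; ctypes releases the GIL around each call
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

root = libm.sqrt(9.0)
```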
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
> ctypes releases the GIL when it calls a C function.
> Some C functions
> take callbacks; ctypes lets you pass Python
> functions as these
> callbacks. There is a C stub wrapped around the
> Python function to
> handle the communication. This stub reacquires the
> GIL before calling
> the Python function.
> > How do I tell if the C function is thread safe?
> You have to analyze the C function and the way you
> are calling it.
> It's not necessarily an easy thing. Basically, you
> have to make sure
> that concurrent calls to your functions don't touch
> the same data.
> > My C function will make NO Python API calls. Can I
> > just call the Py_BEGIN_ALLOW_THREADS and
> > Py_END_ALLOW_THREADS macros in the C function to
> > allow return to another thread while the C function
> > calculates?
> With ctypes, this is not necessary.
Robert, thanks very much for clarifying that. I get
it. ctypes is certainly more sophisticated than I
realized! Very nice. I am even more in debt to those
who pushed me to use it.
-- Lou Pecora, my views are my own.