bypassing the GIL

Alex van Houten

Feb 23, 2011, 9:27:17 AM

to cython...@googlegroups.com

Hi,

I am aware that the statements after "with nogil:" should not touch any Python
objects. However, the data processing I am doing includes convolution,
deconvolution, interpolation and a little complicated algebra. These routines
use routines from NumPy and SciPy and are very hard to translate to Cython;
doing so would take me quite some time.
It all runs with the Python multiprocessing module. The actual calculations take
0.30s per configuration, but the overhead from IPC is huge (using 8 cores),
increasing the processing time to 0.97s per configuration!
Now I was thinking of reverting back to threads to reduce the IPC overhead. In
this way I would not have the shared array issue. I dealt with this by passing
Numpy arrays to a multiprocessing managed queue, hence the overhead.
So my question is: is there a way to release the GIL without rewriting
everything in Cython?
I know that one can put Py_BEGIN_ALLOW_THREADS
and Py_END_ALLOW_THREADS around blocks from C-extensions, but I have never seen
any examples on the web with a Cython generated C-extension.
I could generate the C code using "cython module.pyx", but then I wouldn't know
where in module.c I would have to put these macros (Py_BEGIN_ALLOW_THREADS and
Py_END_ALLOW_THREADS). Can anyone help?
Btw, the problem is embarrassingly parallel. There is a lot of processing time to
win.

Thanks,
Alex.

Stefan Behnel

Feb 23, 2011, 9:41:07 AM

to cython...@googlegroups.com

Alex van Houten, 23.02.2011 15:27:

> I am aware that the statements after "with nogil:" should not touch any Python
> objects. However, the data processing I am doing includes convolution,
> deconvolution, interpolation and a little complicated algebra. These routines
> use routines from numpy and scipy and are very hard to translate to Cython.
> Would take me quite some time.
> It all runs with the Python multiprocessing module. The actual calculations take
> 0.30s per configuration, but the overhead from IPC is huge (using 8 cores),
> increasing the processing time to 0.97s per configuration!
> Now I was thinking of reverting back to threads to reduce the IPC overhead. In
> this way I would not have the shared array issue. I dealt with this by passing
> Numpy arrays to a multiprocessing managed queue, hence the overhead.
> So my question is: is there a way to release the gil without rewriting
> everything to Cython?

Do I understand correctly that you want to call into NumPy functionality,
but want to do it with the GIL released? I doubt that that will help that
much. For one, it can't work, because calling Python functions/methods
requires the GIL, be it in C, Cython or Python. Then, AFAIR, NumPy frees
the GIL itself during lengthy operations.
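
As a rough illustration of that last point, here is a minimal sketch of plain Python threads running work that releases the GIL internally. It uses only the standard library: `hashlib.sha256` stands in for a NumPy routine that drops the GIL during a long operation (hashlib does this for large buffers). The function names, buffer size, and worker count are assumptions made for the example, not anything from the original code.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def process_configuration(seed):
    """Stand-in for one heavy computation that releases the GIL internally."""
    data = bytes([seed % 256]) * (4 * 1024 * 1024)  # 4 MiB input buffer
    return hashlib.sha256(data).hexdigest()

def run_threaded(seeds, workers=8):
    # Threads share one address space, so results are handed back
    # without any pickling or inter-process copies.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_configuration, seeds))
```

Calling `run_threaded(range(8))` returns the eight digests in input order. Whether the threads actually overlap depends on how much of each call runs with the GIL released, which has to be measured for the real routines.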

> I know that one can put Py_BEGIN_ALLOW_THREADS
> and Py_END_ALLOW_THREADS around blocks from C-extensions, but I have never seen
> any examples on the web with a Cython generated C-extension.
> I could generate the C code using "cython module.pyx", but then I wouldn't know
> where in module.c I would have to put these macros (Py_BEGIN_ALLOW_THREADS and
> Py_END_ALLOW_THREADS). Can anyone help?

There isn't much to gain on that path. "with nogil" gives you exactly that,
with the additional benefit of having Cython check that you are not doing
something harmful. That's a *feature*. Working around it would likely get
you hard crashing code.

> Btw, the problem is embarrassingly parallel. There is a lot of processing time to
> win.

What size are your arrays? Is the amount of operations your problem or the
size of the arrays?

Stefan

Dag Sverre Seljebotn

Feb 23, 2011, 9:45:27 AM

to cython...@googlegroups.com

No, I think this is the problem: NumPy and SciPy have many cases.

I think the simplest thing to do is to submit a patch to NumPy/SciPy for
the code you are calling!

An alternative is to keep using multiprocessing, but allocate the arrays
in memory-mapped files that can be shared between processes. A search
for NumPy and memory mapped files should get you there.

Dag Sverre

Dag Sverre Seljebotn

Feb 23, 2011, 9:46:44 AM

to cython...@googlegroups.com

Sorry: ...where they do not properly release the GIL when they can. (In
many cases this is simply because nobody who knows how has had a need
to scratch the itch.)

Dag

Alex van Houten

Feb 23, 2011, 10:07:15 AM

to cython...@googlegroups.com

Stefan Behnel <stefan_ml <at> behnel.de> writes:

> What size are your arrays? Is the amount of operations your problem or the
> size of the arrays?
>
> Stefan
>
>

8 processes each return a (12,256,256) array of floats, every 0.3s. They put
these in multiprocessing result queues, managed using multiprocessing.Manager.
Each process has its own result queue. The main process spawns 8 threads, one
for each process result queue, to unload them and to update an array where all
the data is accumulated.
This may seem cumbersome, but the size of the result queues should be kept small
and they should be unloaded quickly or it will cause delays or memory errors.
Hence the threads for unloading. Without them delays are even larger.
So the array sizes are the problem, they cause the IPC overhead. The actual
calculations take 0.3s per core, which is fast enough.

Alex.

Alex van Houten

Feb 23, 2011, 10:54:20 AM

to cython...@googlegroups.com

Dag Sverre Seljebotn <dagss <at> student.matnat.uio.no> writes:

> An alternative is to keep using multiprocessing, but allocate the arrays
> in memory-mapped files that can be shared between processes. A search
> for NumPy and memory mapped files should get you there.
>
> Dag Sverre
>
>

I did have a look at sharedmem, but I am not sure how stable or fast it is, or up
to what array size it is reliable. Has any serious testing been done?
Konrad Hinsen
http://calcul.math.cnrs.fr/Documents/Ecoles/2010/cours_multiprocessing.pdf
has a few slides on sharedmem:
"Portability: there is no shared memory under Windows." Does that mean it will
not run under Windows? Sorry, but my code will be running on a Windows machine.
"don't modify shared memory contents in the slave processes". But that is
necessary in my case!
"only to transfer data from the master to the slaves." But I need it the other
way round!
And then there is this slide "shared memory with in-place modification" with all
the warning signs.
Does not seem I want to go down that road.
But please comment.

Alex.

Francesc Alted

Feb 23, 2011, 10:55:47 AM

to cython...@googlegroups.com

On Wednesday 23 February 2011 16:07:15, Alex van Houten wrote:

Yeah, I agree that the bottleneck is most probably the copy/IPC overhead in
the multiprocessing module. However, you must be aware that
multiprocessing does not launch threads, but *processes*, which is why
releasing the GIL does not improve the performance of your program at
all.

I'd say that the best solution for this is to use a pure *threaded*
approach. Unfortunately, calling most NumPy/SciPy routines from
Cython implies going through the Python interface, which acquires the GIL
even if you call them within a 'with nogil:' statement, so you
won't see any apparent improvement by using Python threads (even if they
are called from Cython). So, the most productive avenue should be
a pure *C* threaded solution. That means implementing your worker in
pure C and using a pure C threading library (like pthreads or OpenMP). Then
you can call this routine from Cython and hope for a speedup if you are
lucky (it is not always easy to get one).

Having the possibility to manipulate C threads directly from Cython is
considered a good thing and probably a hot topic for the next
months/years to come, and we will hopefully start some work in this
direction during the next Cython workshop:

http://wiki.cython.org/workshop2011

Hope this helps,

--
Francesc Alted

Francesc Alted

Feb 23, 2011, 11:10:42 AM

to cython...@googlegroups.com

On Wednesday 23 February 2011 16:54:20, Alex van Houten wrote:

> Dag Sverre Seljebotn <dagss <at> student.matnat.uio.no> writes:
> > An alternative is to keep using multiprocessing, but allocate the
> > arrays in memory-mapped files that can be shared between
> > processes. A search for NumPy and memory mapped files should get
> > you there.
> >
> > Dag Sverre
>
> I did have a look at sharedmem, but I am not sure how stable or fast
> it is, up to what arraysize it is reliable. Has any serious testing
> been done? Konrad Hinsen
> http://calcul.math.cnrs.fr/Documents/Ecoles/2010/cours_multiprocessing.pdf
> has a few slides on sharedmem:

> "Portability: there is no shared memory under Windows." Does that
> mean it will not run under Windows? Sorry, but my code will be
> running on a Windows machine. "don't modify shared memory contents
> in the slave processes". But that is necessary in my case!
> "only to transfer data from the master to the slaves." But I need it
> the other way round!
> And then there is this slide "shared memory with in-place
> modification" with all the warning signs.
> Does not seem I want to go down that road.

I understand that you are wary of getting into this, but Konrad is
quite right: programming with shared memory (be it with threads
or by explicitly sharing memory among processes) is extremely
dangerous, and users should be aware of this.

Having said this, depending on your needs, you can still assess the level
of 'danger' of a shared memory approach. For example, if you use
your shared memory only for 'read' purposes, you are safe. Or, if you
partition your dataset well and let each thread write to a
*unique* memory area, you are safe again. The basic danger of
shared memory arises when you try to write to the same area at
the same time. You can deal with this last situation too (by using
locking techniques), but this is much trickier to implement correctly
than it might seem at first sight, hence Konrad's warnings.
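
The "unique memory area per worker" case can be sketched as follows. This is only an illustration using the standard library; the function names, the `array` buffer, and the chunking rule are assumptions chosen for the example. Each thread writes to its own slice of one shared buffer, so no locking is needed.

```python
import threading
from array import array

def fill_partition(shared, start, stop, worker_id):
    """Write only to shared[start:stop]; no other thread touches this range."""
    for i in range(start, stop):
        shared[i] = float(worker_id)

def run_partitioned(n_items=1000, n_workers=4):
    shared = array("d", [0.0]) * n_items          # one buffer visible to all threads
    chunk = (n_items + n_workers - 1) // n_workers  # ceiling division
    threads = []
    for w in range(n_workers):
        start = w * chunk
        stop = min((w + 1) * chunk, n_items)
        t = threading.Thread(target=fill_partition, args=(shared, start, stop, w))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return shared
```

If two workers were given overlapping ranges, the result would depend on thread timing, which is exactly the situation that needs locks.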

--
Francesc Alted

Stefan Behnel

Feb 23, 2011, 11:14:09 AM

to cython...@googlegroups.com

Francesc Alted, 23.02.2011 17:10:

> On Wednesday 23 February 2011 16:54:20, Alex van Houten wrote:
>> Dag Sverre Seljebotn<dagss<at> student.matnat.uio.no> writes:
>>> An alternative is to keep using multiprocessing, but allocate the
>>> arrays in memory-mapped files that can be shared between
>>> processes. A search for NumPy and memory mapped files should get
>>> you there.
>>>
>>> Dag Sverre
>>
>> I did have a look at sharedmem, but I am not sure how stable or fast
>> it is, up to what arraysize it is reliable. Has any serious testing
>> been done? Konrad Hinsen
>> http://calcul.math.cnrs.fr/Documents/Ecoles/2010/cours_multiprocessing.pdf
>> has a few slides on sharedmem:
>> "Portability: there is no shared memory under Windows." Does that
>> mean it will not run under Windows? Sorry, but my code will be
>> running on a Windows machine. "don't modify shared memory contents
>> in the slave processes". But that is necessary in my case!
>> "only to transfer data from the master to the slaves." But I need it
>> the other way round!
>> And then there is this slide "shared memory with in-place
>> modification" with all the warning signs.
>> Does not seem I want to go down that road.
>

> [...]if you
> are partitioning well your dataset and let each thread write on a
> *unique* memory area, you are safe

From what I understood so far, this is exactly the use case here: distinct
arrays being distributed to separate threads, each of which works on it and
then hands it back.

Perfect use case for shared memory, IMHO.

Stefan

Alex van Houten

Feb 23, 2011, 11:15:11 AM

to cython...@googlegroups.com

Francesc Alted <faltet <at> gmail.com> writes:

> Yeah, I agree that most probably the bottleneck is copy/IPC overhead in
> the multiprocessing module. However, you must be aware that
> multiprocessing does not launch threads, but *processes*, so this is why
> releasing the GIL does not improve at all the performance of your
> program.

Thank you, I was aware of that. The idea was to revert from multiprocessing to
multithreading, bypassing the IPC overhead. But that would raise the GIL
problem. If I could release the GIL when using only threads, the code would be
really fast.
Alex.

Francesc Alted

Feb 23, 2011, 11:22:19 AM

to cython...@googlegroups.com

On Wednesday 23 February 2011 17:15:11, Alex van Houten wrote:

As I said before, you cannot do that just by calling Python statements
within a 'with nogil:' block: as soon as you enter Python code again
(and calling NumPy/SciPy functions will do this), the GIL will be acquired
again (and if not, your code will surely crash very early).

--
Francesc Alted

Stefan Behnel

Feb 23, 2011, 11:22:54 AM

to cython...@googlegroups.com

Stefan Behnel, 23.02.2011 17:14:

... separate processes ...

Francesc Alted

Feb 23, 2011, 11:23:49 AM

to cython...@googlegroups.com

On Wednesday 23 February 2011 17:14:09, Stefan Behnel wrote:

> Francesc Alted, 23.02.2011 17:10:
> > On Wednesday 23 February 2011 16:54:20, Alex van Houten wrote:
> >> Dag Sverre Seljebotn<dagss<at> student.matnat.uio.no> writes:
> >>> An alternative is to keep using multiprocessing, but allocate the
> >>> arrays in memory-mapped files that can be shared between
> >>> processes. A search for NumPy and memory mapped files should get
> >>> you there.
> >>>
> >>> Dag Sverre
> >>
> >> I did have a look at sharedmem, but I am not sure how stable or
> >> fast it is, up to what arraysize it is reliable. Has any serious
> >> testing been done? Konrad Hinsen
> >> http://calcul.math.cnrs.fr/Documents/Ecoles/2010/cours_multiprocessing.pdf
> >> has a few slides on sharedmem:

> >> "Portability: there is no shared memory under Windows." Does that
> >> mean it will not run under Windows? Sorry, but my code will be
> >> running on a Windows machine. "don't modify shared memory contents
> >> in the slave processes". But that is necessary in my case!
> >> "only to transfer data from the master to the slaves." But I need
> >> it the other way round!
> >> And then there is this slide "shared memory with in-place
> >> modification" with all the warning signs.
> >> Does not seem I want to go down that road.
> >
> > [...]if you
> > are partitioning well your dataset and let each thread write on a
> > *unique* memory area, you are safe
>
> From what I understood so far, this is exactly the use case here:
> distinct arrays being distributed to separate threads, each of which
> works on it and then hands it back.
>
> Perfect use case for shared memory, IMHO.

Yep, but apparently the OP uses Windows, so he won't be able to use
multiprocessing+memshared. The only option is then using Windows
threads or OpenMP.

--
Francesc Alted

Alex van Houten

Feb 23, 2011, 11:28:29 AM

to cython...@googlegroups.com

Stefan Behnel <stefan_ml <at> behnel.de> writes:

> From what I understood so far, this is exactly the use case here: distinct
> arrays being distributed to separate processes, each of which works on it and
> then hands it back.
>
> Perfect use case for shared memory, IMHO.
>
> Stefan
>
>

Thanks, is it fast?
If so, I will try it.
Alex.

Alex van Houten

Feb 23, 2011, 11:36:32 AM

to cython...@googlegroups.com

Francesc Alted <faltet <at> gmail.com> writes:

> Iep, but apparently OP uses Windows, so he won't be able to use
> multiprocessing+memshared. The only option is then using Windows
> threads or OpenMP.
>

Btw, why does sharedmem not work on Windows? I guess numpy.memmap should be
available on Windows.
Alex.

Francesc Alted

Feb 23, 2011, 11:59:12 AM

to cython...@googlegroups.com

On Wednesday 23 February 2011 17:36:32, Alex van Houten wrote:

Oops, from looking at Konrad's slides, I thought that sharedmem was not
supported on Windows, but after looking at the docs:

http://docs.python.org/library/mmap.html

apparently you can get shared memory support for Windows too (just pass
-1 as `fileno` to map anonymous memory).

Don't know about numpy.memmap on Win, you should check this yourself.
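
For reference, mapping anonymous memory with the standard `mmap` module looks roughly like this. It is a minimal single-process sketch: the function name and sizes are made up for the example, and it does not exercise the sharing itself, which depends on inheritance at fork time on Unix or on passing a `tagname` on Windows.

```python
import mmap
import struct

def anonymous_map_demo(n_doubles=4):
    # fileno=-1 asks for anonymous memory with no backing file.
    buf = mmap.mmap(-1, n_doubles * 8)
    raw = memoryview(buf)
    view = raw.cast("d")                 # treat the bytes as C doubles
    view[0] = 1.5
    view[n_doubles - 1] = -2.0
    values = list(view)
    # The same bytes read back through the plain buffer interface.
    first = struct.unpack_from("d", buf, 0)[0]
    # Views must be released before the map can be closed.
    view.release()
    raw.release()
    buf.close()
    return values, first
```

The untouched elements come back as 0.0 because anonymous maps start zero-filled. A NumPy array could presumably be wrapped around the same buffer in a similar way, but that is not shown here.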

--
Francesc Alted

Sturla Molden

Feb 23, 2011, 12:21:09 PM

to cython...@googlegroups.com

On 23.02.2011 17:59, Francesc Alted wrote:
>
> Don't know about numpy.memmap on Win, you should check this yourself.
>
>

The problem is that NumPy arrays are pickled by value. So when using them
with multiprocessing, you get a copy of the buffer's content instead of
a pointer to the shared memory segment.

So the solution is to use named shared memory instead of anonymous (i.e.
System V IPC instead of BSD mmap). Windows has a similar distinction.
Gaël Varoquaux and I made an implementation of shared memory NumPy
arrays a couple of years ago. Basically, they use Cython to create a
named shared memory segment, and change how they are pickled by
pickling the name (a string) instead of the buffer content. They are then
unpickled by opening the shared memory buffer by name. This way
multiprocessing can share memory between processes by just passing these
arrays over multiprocessing.Queue. It runs on Windows and any Unix/Linux
supporting System V IPC.
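
The pickle-the-name idea can be sketched with the standard library's `multiprocessing.shared_memory` module, which was added in Python 3.8, long after this thread. This is not the Cython implementation described above; the helper names are hypothetical, and the "peer" here lives in the same process purely to keep the example self-contained.

```python
import pickle
from multiprocessing import shared_memory

def create_segment(size):
    """Create a named shared memory segment (the OS picks a unique name)."""
    return shared_memory.SharedMemory(create=True, size=size)

def export_handle(segment):
    """Pickle only the segment's name and size, never its contents."""
    return pickle.dumps((segment.name, segment.size))

def attach(payload):
    """Re-open the same segment from the pickled name, as a receiver would."""
    name, _size = pickle.loads(payload)
    return shared_memory.SharedMemory(name=name)

def demo():
    owner = create_segment(1024 * 1024)   # 1 MiB of shared data
    owner.buf[:5] = b"hello"
    payload = export_handle(owner)        # only the name and size are sent
    peer = attach(payload)                # stands in for another process
    seen = bytes(peer.buf[:5])
    peer.buf[0:1] = b"J"                  # in-place write, nothing copied back
    changed = bytes(owner.buf[:5])
    peer.close()
    owner.close()
    owner.unlink()                        # explicit delete, needed on POSIX
    return len(payload), seen, changed
```

The payload that would travel over a queue is a small pickle of the name, regardless of how large the segment is.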

The other option is handle inheritance. Basically, create all the shared
memory you need before forking processes (multiprocessing.Array), and
then recreate NumPy arrays by using it as a buffer. It is not nearly as
elegant as System V IPC based ndarrays. You have to allocate everything
in advance, and there will be a lot of boilerplate code to set up the
correct arrays. E.g. one must pass the starting address, strides and shape
to view a part of the multiprocessing.Array as a numpy.ndarray.

Third, other means of IPC (e.g. pipes, Windows named pipes, Unix sockets)
are very fast, so this might not be needed. The overhead is hardly more
than a memcpy in the kernel. Also beware that MPI depends on passing
array copies, and usually performs and scales better than OpenMP. If IPC
is the bottleneck, there is usually something wrong with the algorithm,
and chances are that shared memory (or threads) will not be better due
to false sharing of cache lines.

Sturla

Sturla Molden

Feb 23, 2011, 12:28:21 PM

to cython...@googlegroups.com

On 23.02.2011 15:27, Alex van Houten wrote:
> So my question is: is there a way to release the gil without rewriting
> everything to Cython?
>

C and C++:
Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS macros.

Fortran (f2py):
Declare as 'threadsafe' in the interface.

Shared libraries from C or Fortran (ctypes):
Methods in a CDLL and WinDLL will release the GIL on invocation, PyDLL
will not.

Cython and Pyrex:
Use a 'with nogil:' block.

Sturla

Sturla Molden

Feb 23, 2011, 12:46:49 PM

to cython...@googlegroups.com

On 23.02.2011 16:55, Francesc Alted wrote:
> Having the possibility to manipulate C threads directly from Cython is
> considered a good thing

Not needed. Python threads are native OS threads that run freely, but
with a nicer API. If I write my own load balancer (e.g. a guided
scheduler with the same rules as OpenMP's), the combination of Cython and
Python threads performs similarly to C and OpenMP.

OpenMP hides away the nastiness of manual load balancing, but proper
support for closures in Cython will improve on this.
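
A guided scheduler of this kind might look roughly like the sketch below, written in plain Python rather than Cython. It is an assumption-laden illustration, not the scheduler referred to above: the class and function names are invented, an explicit `threading.Lock` is used instead of relying on the GIL, and the chunk rule (remaining work divided by the number of workers, never below a minimum) only approximates OpenMP's guided schedule.

```python
import threading

class GuidedScheduler:
    """Hands out shrinking chunks of range(n_items) to worker threads."""

    def __init__(self, n_items, n_workers, min_chunk=1):
        self._next = 0
        self._n = n_items
        self._workers = n_workers
        self._min = min_chunk
        self._lock = threading.Lock()

    def next_chunk(self):
        with self._lock:
            remaining = self._n - self._next
            if remaining <= 0:
                return None
            size = min(remaining, max(self._min, remaining // self._workers))
            start = self._next
            self._next += size
            return start, start + size

def parallel_apply(func, items, n_workers=4):
    results = [None] * len(items)
    sched = GuidedScheduler(len(items), n_workers)

    def worker():
        while True:
            chunk = sched.next_chunk()
            if chunk is None:
                return
            for i in range(*chunk):
                results[i] = func(items[i])  # each index is written by one thread only

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Large chunks go out first to keep scheduling overhead low, and small chunks at the end even out the finish times. A real speedup still requires that `func` spends its time with the GIL released.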

Sturla

Francesc Alted

Feb 23, 2011, 1:08:43 PM

to cython...@googlegroups.com

On Wednesday 23 February 2011 18:46:49, Sturla Molden wrote:

> On 23.02.2011 16:55, Francesc Alted wrote:
> > Having the possibility to manipulate C threads directly from Cython
> > is considered a good thing
>
> Not needed. Python threads are native OS threads that runs freely,
> but with a nicer API. If I write my own load balancer (e.g. a guided
> sheduler with the same rules as OpenMP's), the combination of Cython
> and Python threads performs similarly to C and OpenMP.

Sorry, I should have said "native OS threads" instead of "C threads".
What I meant is to design a way to write truly threaded code inside
Cython. And I agree that the Python API is a 'good enough' one, so it can
be chosen without problems.

Hmm, now that I think about it, pure Python threads should perform like native OS
threads, provided that:

- GIL is released
- You don't call Python code from these threads

Do you have any experience writing Python threaded code in Cython, and did
you manage to get actual speed-ups? Just curious.

--
Francesc Alted

Sturla Molden

Feb 23, 2011, 1:44:21 PM

to cython...@googlegroups.com

On 23.02.2011 15:27, Alex van Houten wrote:

> I could generate the C code using "cython module.pyx", but then I wouldn't know
> where in module.c I would have to put these macros (Py_BEGIN_ALLOW_THREADS and
> Py_END_ALLOW_THREADS). Can anyone help?

You can put them around any C code that does not touch PyObject* pointers.

Beware that C code that avoids PyObject* is not necessarily thread-safe and
re-entrant. If it is not, you need to use your own synchronization
(mutex, semaphore, critical section, spinlock) when you don't rely on the
GIL.

You can use OpenMP in C even though the GIL is kept, as long as only one
OpenMP thread has simultaneous access to the Python C API.

Another thing:

Most numerical code requiring parallel processing doesn't require any
special programming on our part. Just let numerical libraries do it for
you. Some of the libraries that will do this are:

Linear algebra (BLAS, LAPACK):
- Intel MKL
- AMD ACML
- GotoBLAS
- ATLAS
- AMD ACML-GPU
- Nvidia CUBLAS

FFTs:
- FFTW
- Intel MKL
- AMD ACML

General science:
- IMSL linked with Intel MKL
- NAG

You can spend hours implementing your own parallel matrix multiplication
routine for complex valued arrays, but not come near the performance
of ZGEMM from MKL or ACML. Don't waste your time on it.

You can see a lot of complaints on the Internet about Python's GIL from
people with Java or .NET experience. It usually looks like this:

- They try to solve problems that are I/O bound, for which the GIL is
not a problem anyway. They think the GIL is a problem just because it is
there.

- Sometimes they try to solve computational problems that are better
handled by libraries. Also they tend to complain that "programming for
multicore CPUs is difficult", to which another of this crowd replies
"yes but we run many programs at once" (as if they did not use
multitasking on their single-core a couple of years ago).

These people (which includes 99% of the world's Java programmers) tend
to have two problems:

- The only tool they know to work with is Java threads. They don't know
anything about the performance characteristics of threads, neither for
i/o bound nor compute-bound code.

- They don't know anything about numerical libraries, some of which have
been around for decades, and always try to reinvent the wheel.

Finally they end up writing immensely buggy and slow multi-threaded
code, for something that would be better solved by a single function
call to an optimized BLAS library. And you can be sure they will
conclude that "Python's GIL is the problem".

Time is better spent selecting the best algorithm and selecting the best
numerical library! It does not matter from which language you call the
performance libraries. If 95% of the time is spent inside BLAS or FFTW,
you can just as well use plain Python. Using Cython or C, or even
multi-threading, is optimizing at the wrong end: the remaining 5% of the
runtime! If you can double the performance there, you have pushed the
time in BLAS from 95% to about 97.4%. That will be hardly noticeable in the
overall performance. That is what Hoare and Knuth called "premature
optimization". You can get better performance from using IMSL et al.
with Python than from writing your own computational code in C.
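
The arithmetic behind the 95% example can be checked with a few lines of Python. The function name and parameters are made up for illustration. It computes what share of the runtime the library takes after the remaining code is sped up, and the resulting overall speedup.

```python
def library_share_after_speedup(library_share, speedup_of_rest):
    """Return (new library share of runtime, overall speedup).

    library_share   -- fraction of the original runtime spent in the library
    speedup_of_rest -- factor by which the remaining code is made faster
    """
    rest = (1.0 - library_share) / speedup_of_rest
    total = library_share + rest          # new runtime, as a fraction of the old
    return library_share / total, 1.0 / total

share, overall = library_share_after_speedup(0.95, 2.0)
print(f"library share: {share:.1%}, overall speedup: {overall:.3f}x")
```

With 95% in the library and the other 5% made twice as fast, the new runtime is 97.5% of the old one, so the whole program gets only about 2.6% faster.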

If these libraries do not help, the next step is to try an
autovectorizing compiler (Intel, Cray, Absoft, Portland).

If that does not help, try OpenMP.

When everything else has failed, you could consider threads or
multiprocessing, but not before.

It is important to start optimizing at the correct end, which is
selecting the best algorithm and library!

Even if you think you are writing "code from scratch", chances are you
are not. Almost all numerical computing depends on linear algebra and
array operations, for which optimized BLAS and LAPACK is the answer.

Sturla

Alex van Houten

Feb 23, 2011, 1:45:27 PM

to cython...@googlegroups.com

Sturla Molden <sturlamolden <at> yahoo.no> writes:

> So the solution is to use named shared memory instead of anonymous (i.e.
> System V IPC instead of BSD mmap). Windows have a similar distinction.
> Gaël Varoquaux and I made an implementation of shared memory NumPy
> arrays a couple of years ago. Basically they use Cython to create a
> named shared memory segment, and changes how they are pickled by
> pickling the name (a string) instead of buffer content. Then they are
> unpickled by opening the shrerad memory buffer by name. This way
> multiprocessing can share memory between processes by just passing these
> arrays over multiprocessing.Queue. It runs on Windows and any Unix/Linux
> supporting System V IPC.
>

Thanks, does this mean that sharedmem works on Windows? Then I don't understand
Konrad's remark.

> Third other means of IPC (e.g. pipes, Windows named pipes, Unix sockets)
> are very fast, so this might not be needed. The overhead is hardly more
> than a memcpy in the kernel.

Do you mean this:
http://docs.python.org/library/multiprocessing.html#multiprocessing.Pipe
?
I could try that, too. It says
"Very large pickles (approximately 32 MB+, though it depends on the OS) may
raise a ValueError exception." But my arrays are smaller.

Thanks,
Alex

Sturla Molden

Feb 23, 2011, 2:05:09 PM

to cython...@googlegroups.com

On 23.02.2011 19:08, Francesc Alted wrote:
> Hmm, now that I think, pure Python threads should perform like native OS
> threads, provided that:
>
> - GIL is released
> - You don't call Python code from these threads
>

Indeed.

> Do you have some experience writing Python threaded code in Cython and
> managed to get actual speed-ups? Just curious.
>

I'll attach a benchmark I did of scipy.spatial.cKDTree a couple of years
ago (smaller is better). The red line is a similar C implementation, but
optimized with OpenMP. The green line is Cython and Python threads
running a "with nogil:" block, and using a hand-crafted guided load
scheduler (a Cython class synchronized by the GIL). The conclusion
should be obvious.

Sturla

benchmark-27022009.png

Sturla Molden

Feb 23, 2011, 2:18:01 PM

to cython...@googlegroups.com

On 23.02.2011 19:45, Alex van Houten wrote:
> Thanks, does this mean that sharedmem works on Windows? Then I don't understand
> Konrad's remark.
>

Which implementation for Cython are you thinking of?

Windows has shared memory, but the API is different from Linux.

Shared memory can be named or anonymous. Anonymous shared memory must be
shared using handle inheritance (i.e. it must be created before the call
to fork on Linux or CreateProcess on Windows). Named shared memory can
be opened from any process.

The one Gaël and I wrote certainly works on Linux and Windows alike,
but there is a problem with a memory leak on Linux. It is due to a bug
in multiprocessing, not our code, so there is nothing we can do about
it. (If you care to know: Shared memory must be manually deleted on
Linux, but not on Windows. Multiprocessing exits processes by calling
os._exit instead of sys.exit, which prevents any clean-up code from
executing -- not just ours. And as a result the segment is left orphaned.)

multiprocessing.Pipe is a pipe, multiprocessing.Queue is a synchronized
duplex pipe.

Sturla

Sturla Molden

Feb 23, 2011, 2:43:16 PM

to cython...@googlegroups.com

On 23.02.2011 20:18, Sturla Molden wrote:
>
> The one me and Gaël wrote certainly works on Linux and Windows alike,
> but there is a problem with a memory leak on Linux. It is due to a bug
> in multiprocessing, not our code, so there is nothing we can do about it.

'Nothing' is too strong wording, but I have not had time to correct it
yet. The allocator must e.g. run in a server thread on the main process,
instead of letting any process allocate freely. And since I don't use
Linux, I just left it as it is...

Sturla

Sturla Molden

Feb 23, 2011, 3:07:14 PM

to cython...@googlegroups.com

On 23.02.2011 20:18, Sturla Molden wrote:

> On 23.02.2011 19:45, Alex van Houten wrote:
>> Thanks, does this mean that sharedmem works on Windows? Then I don't
>> understand
>> Konrad's remark.
>>
> Which implementation for Cython are you thinking of?

Here is some old sharedmem code I found on my computer. It needs to be
cleaned up a bit (fix the Linux os._exit issue, update for 64-bit
support), but it shows how to use shared memory with Cython,
multiprocessing and NumPy.

Generally I'd recommend AGAINST shared memory, and recommend just
passing array copies instead. If the communication overhead is too big,
the algorithm should be changed. I would also recommend MPICH2 and
mpi4py instead of multiprocessing, if applicable.

Sturla

sharedmem.zip

Dag Sverre Seljebotn

Feb 23, 2011, 3:24:41 PM

to cython...@googlegroups.com

On 02/23/2011 07:44 PM, Sturla Molden wrote:
> On 23.02.2011 15:27, Alex van Houten wrote:
>> I could generate the C code using "cython module.pyx", but then I
>> wouldn't know
>> where in module.c I would have to put these macros
>> (Py_BEGIN_ALLOW_THREADS and
>> Py_END_ALLOW_THREADS). Can anyone help?
>
> You can put them around any C code that do not touch PyObject* pointers.
>
> Beware that all C code not using PyObject* might not be thread-safe
> and re-entrant. In this case you need to use your own synchronization
> (mutex, semaphore, critical section, spinlock) if you don't rely on
> the GIL.
>
> You can use OpenMP in C even though the GIL is kept, as long as only
> one OpenMP thread has simultaneous access to the Python C API.

<snip>

Thanks for the rant. But to return to the OP's question, it specifically
mentioned *embarrassingly* parallel problems, which is another can of
worms entirely. He also mentioned that the algorithms in question did
call NumPy and SciPy (presumably for the heavy lifting, like
convolution). I.e., the real problem is simply that SciPy does not
release the GIL. Didn't you complain yourself on the SciPy list some
time ago that the GIL isn't released often enough in SciPy?

Dag Sverre

Dag Sverre Seljebotn

Feb 23, 2011, 3:33:12 PM

to cython...@googlegroups.com

It's not to me, since the plot comes without confidence intervals, and I
don't know how you did the benchmarks. Or, the conclusion seems to be
that there's no difference between them?

Francesc: I did it on some code once and got a nice speedup. It does
work. For embarrassingly parallel problems, my main issue is usability:
It was a major pain to debug (Ctrl+C doesn't work -- although Sage has
signal-based macros one can use for this), you can't use exceptions, you
can't use debug print statements but must do printf, and so on. In
short, one may as well (or perhaps rather, with OpenMP) write C code.

And if it is not embarrassingly parallel, you're quite a few levels down
in portability and level of abstraction due to the lack of OpenMP. But
you can use pthreads, sure.

Dag Sverre

Sturla Molden

Feb 23, 2011, 4:14:49 PM

to cython...@googlegroups.com

On 23.02.2011 21:33, Dag Sverre Seljebotn wrote:
> It's not to me, since the plot comes without confidence intervals, and
> I don't know how you did the benchmarks. Or, the conclusion seems to
> be that there's no difference between them?

The conclusion is that OpenMP threads and Python threads perform
similarly. (Which is no surprise, but answers Francesc's question.)

It is a problem that SciPy does not release the GIL as often as it
should. Some parts of SciPy are specifically written to depend on the
GIL, e.g. the interface to FFTPACK. In the interfaces to LAPACK, the GIL
is kept for no reason. Etc. It is often better to use the numerical
libraries directly than to depend on SciPy.

Some parts of SciPy must depend on the GIL: E.g. the Fortran library
MINPACK depends on global data, and is not thread safe. Many old Fortran
codes also have the 'save' statement for local variables, and are
therefore not re-entrant. So it's not just bad programming.

As for NumPy, it does not depend on old Fortran codes, so there is no
excuse for neglecting to release the GIL.

For those that have money, the 'gold standard' of scientific libraries
(Visual Numerics IMSL) is available for Python as an alternative to SciPy:

http://www.roguewave.com/products/imsl-numerical-libraries/pyimsl-studio.aspx

Sturla

Dag Sverre Seljebotn

unread,

Feb 24, 2011, 4:01:02 AM2/24/11

to cython...@googlegroups.com

On 02/23/2011 10:14 PM, Sturla Molden wrote:
> On 23.02.2011 21:33, Dag Sverre Seljebotn wrote:
>> It's not to me, since the plot comes without confidence intervals,
>> and I don't know how you did the benchmarks. Or, the conclusion seems
>> to be that there's no difference between them?
>
> The conclusion is that OpenMP threads and Python threads perform
> similarly. (Which is no surprise, but answers Francesc's question.)
>
> It is a problem that SciPy does not release the GIL as often as it
> should. Some parts of SciPy are specifically written to depend on the
> GIL, e.g. the interface to FFTPACK. In the interfaces to LAPACK, the
> GIL is kept for no reason. Etc. It is often better to use the
> numerical libraries directly than to depend on SciPy.
>
> Some parts of SciPy must depend on the GIL: E.g. the Fortran library
> MINPACK depends on global data, and is not thread safe. Many old
> Fortran codes also have the 'save' statement for local variables, and
> are therefore not re-entrant. So it's not just bad programming.

Well, presumably the Fortran codes could be fixed without *that* much
effort. But yes.

Note: I just finished my consultancy for Enthought with rewrapping most
of the Fortran code in SciPy. This means that, e.g.,
scipy.linalg.fblas/flapack is now written in Cython, so releasing the
GIL is as simple as inserting "with nogil". (I didn't, though: the
time to insert nogil is after this much has been accepted upstream and
has had more testing.)

https://github.com/jasonmccampbell/scipy-refactor/tree/fwrap

(Only numscons for now, though the adaptation to distutils should be trivial.)

Dag Sverre

Yosef Meller

unread,

Feb 24, 2011, 4:20:37 AM2/24/11

to cython...@googlegroups.com

On Wednesday 23 February 2011 23:14:49 Sturla Molden wrote:
> Some parts of SciPy must depend on the GIL: E.g. the Fortran library
> MINPACK depends on global data, and is not thread safe. Many old Fortran
> codes also have the 'save' statement for local variables, and are
> therefore not re-entrant. So it's not just bad programming.

I've done the work once to change that, but it didn't get applied for some
reason. I don't have the time to push it now.
http://projects.scipy.org/scipy/ticket/713

Francesc Alted

unread,

Feb 24, 2011, 5:00:27 AM2/24/11

to cython...@googlegroups.com

On Thursday 24 February 2011 10:01:02, Dag Sverre Seljebotn wrote:

> > Some parts of SciPy must depend on the GIL: E.g. the Fortran
> > library MINPACK depends on global data, and is not thread safe.
> > Many old Fortran codes also have the 'save' statement for local
> > variables, and are therefore not re-entrant. So it's not just bad
> > programming.
>
> Well, presumably the Fortran codes could be fixed without *that* much
> effort. But yes.
>
> Note: I just finished my consultancy for Enthought with rewrapping
> most of the Fortran code in SciPy. This means that, e.g.,
> scipy.linalg.fblas/flapack is now written in Cython, so releasing the
> GIL is as simple as inserting "with nogil".

That's really great, and that was my impression: for taking advantage of
threading inside Cython, you should call either C code or Cython
functions defined as 'cdef' *and* release the GIL.

I was a bit confused because it seemed to me that somebody suggested
that calling pure NumPy/SciPy functions (say, ``array.sum()``) in
threaded code from Cython and releasing the GIL was enough to get
speed-ups. Can somebody confirm that this is not the case? Or can Cython
really do this kind of magic?

--
Francesc Alted

Dag Sverre Seljebotn

unread,

Feb 24, 2011, 5:11:46 AM2/24/11

to cython...@googlegroups.com

This is not the case, your understanding is correct.

Dag Sverre
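
A rough way to check the point confirmed above (threads only pay off when the called code really releases the GIL) is to time the same function serially and in threads. The sketch below is not from the thread: `cpu_bound`, `run_serial` and `run_threaded` are hypothetical names, and it assumes a standard CPython build with the GIL. A function that holds the GIL, like the pure-Python loop here, should show no threaded speed-up; a function that releases it around its heavy work should.

```python
import threading
import time


def cpu_bound(n):
    """Pure-Python loop: holds the GIL for its whole duration."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_serial(func, args):
    """Call func once per argument, one after another; return (results, seconds)."""
    start = time.perf_counter()
    results = [func(a) for a in args]
    return results, time.perf_counter() - start


def run_threaded(func, args):
    """Call func once per argument, each in its own thread; return (results, seconds)."""
    results = [None] * len(args)

    def worker(i, a):
        results[i] = func(a)

    threads = [threading.Thread(target=worker, args=(i, a)) for i, a in enumerate(args)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.perf_counter() - start


if __name__ == "__main__":
    work = [500_000] * 4
    serial_results, serial_s = run_serial(cpu_bound, work)
    threaded_results, threaded_s = run_threaded(cpu_bound, work)
    assert serial_results == threaded_results
    print(f"serial {serial_s:.2f}s, threaded {threaded_s:.2f}s")
```

Swapping `cpu_bound` for the NumPy/SciPy call of interest gives a quick indication of whether that call releases the GIL in the installed build; the timings are only indicative and vary with machine and load.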

Sturla Molden

unread,

Feb 24, 2011, 7:28:43 AM2/24/11

to cython...@googlegroups.com

On 24.02.2011 10:01, Dag Sverre Seljebotn wrote:
>
> Note: I just finished my consultancy for Enthought with rewrapping
> most of the Fortran code in SciPy. This means that, e.g.,
> scipy.linalg.fblas/flapack is now written in Cython, so releasing the
> GIL is as simple as inserting "with nogil". (I didn't, though: the
> time to insert nogil is after this much has been accepted upstream
> and has had more testing.)
>

fwrap is nice, and the 'correct' way to wrap Fortran, unlike f2py which
merely happens to work by accident for Fortran 90 and later :-)

The problem with Fortran is that binary code might not be compiled to
something that is threadsafe. I had to track down a bug in one of my
programs yesterday: Absoft's runtime crashed with the error "allocatable
array already allocated". It only happened when I used OpenMP threads.
It turned out I had forgotten to remove a compiler switch that gave all
my locals the SAVE attribute. This is a rather common optimisation in
Fortran. Sometimes it is hard-coded, but often it is done by the compiler.

If you use "with nogil", it will only work with threadsafe LAPACK
libraries. And that depends on how it was compiled. :-(

How can we tell with which LAPACK SciPy was linked? Not just version,
but also compiler settings. What if someone wants to compile against a
"single-processing" version of MKL or ACML for better speed (e.g. when
using MPI)?

I guess that is some of the reason why SciPy does not release the GIL as
much as it could. Fortran compilers can be evil to threads.

It is easier with C, where we don't expect compilers to make all locals
static.

Sturla

Jon Olav Vik

unread,

Feb 24, 2011, 11:58:17 AM2/24/11

to cython-users

It is. To use it for parallel I/O, however, I hacked support for
`offset` and `shape` keywords in np.load and
np.lib.format.open_memmap. I've taken my first baby steps to submit
patches for this:
http://article.gmane.org/gmane.comp.python.numeric.general/42619
That post is for the current trunk, a patch for 1.4 is here:
http://article.gmane.org/gmane.comp.python.numeric.general/42626

Hope this helps,
Jon Olav

Sturla Molden

unread,

Feb 24, 2011, 1:13:37 PM2/24/11

to cython...@googlegroups.com

On 24.02.2011 17:58, Jon Olav Vik wrote:
> It is. To use it for parallel I/O, however, I hacked support for
> `offset` and `shape` keywords in np.load and
> np.lib.format.open_memmap. I've taken my first baby steps to submit
> patches for this:
> http://article.gmane.org/gmane.comp.python.numeric.general/42619
> That post is for the current trunk, a patch for 1.4 is here:
> http://article.gmane.org/gmane.comp.python.numeric.general/42626

Ok, I am going to say this only one more time:

* NumPy arrays are pickled by taking a copy of the buffer -- even if
they are 'shared'. If you pass an ndarray over a Queue, or any other IPC
in multiprocessing, you get a pickled copy of the array. IT DOES NOT DO
WHAT WE WANT. A copy of shared memory is not faster than a copy of
private memory!

* NumPy arrays referencing mmap.mmap cannot be shared with
multiprocessing because the shared memory is 'anonymous'. Thus it must
be instantiated BEFORE the subprocesses are spawned. All BSD mmap can do
is give us a big, static segment in advance. That is Fortran 66 style
programming. When was dynamic memory invented?

* You can work around these limitations like this, but it still feels
like old Fortran:
http://folk.uio.no/sturlamo/python/multiprocessing-tutorial.pdf

* See the attachment I posted previously for how to use shared memory
with NumPy and multiprocessing. We need to use named segments (System V
IPC), not anonymous mappings (BSD mmap). This actually works the way we
expect. The memory buffer remains shared, only the array descriptor is
communicated. These arrays can be allocated and deallocated at will.

Sturla
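
The named-segment idea in the last bullet can be sketched with the standard library's `multiprocessing.shared_memory` module (Python 3.8 and later, so it did not exist when this was written). This is not the attachment referred to above, only an illustration under stated assumptions: `create_shared` and `attach_shared` are hypothetical helper names, and the "descriptor" is a plain dict holding the segment name and layout. Only that small descriptor would need to travel over a Queue; the data itself stays in place.

```python
from multiprocessing import shared_memory


def create_shared(n_items, fmt="d", itemsize=8):
    """Allocate a named segment for n_items values; return (segment, descriptor)."""
    shm = shared_memory.SharedMemory(create=True, size=n_items * itemsize)
    desc = {"name": shm.name, "n_items": n_items, "fmt": fmt, "itemsize": itemsize}
    return shm, desc


def attach_shared(desc):
    """Open an existing segment by name; return (segment, typed view of its data)."""
    shm = shared_memory.SharedMemory(name=desc["name"])
    nbytes = desc["n_items"] * desc["itemsize"]
    # The OS may round the segment up to a page multiple, so slice before casting.
    view = shm.buf[:nbytes].cast(desc["fmt"])
    return shm, view


if __name__ == "__main__":
    owner, desc = create_shared(4)
    writer = owner.buf[:32].cast("d")
    writer[0] = 3.5

    # In real use, desc would be sent to another process, which attaches by name.
    other, view = attach_shared(desc)
    print(view[0])

    # Views must be released before the segments can be closed.
    view.release()
    writer.release()
    other.close()
    owner.close()
    owner.unlink()
```

With NumPy, the same buffer can be wrapped without copying via `numpy.ndarray(shape, dtype=..., buffer=shm.buf)`, so the descriptor would carry shape and dtype instead of `n_items` and `fmt`.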

Jon Olav Vik

unread,

Feb 25, 2011, 4:51:23 AM2/25/11

to cython-users

On Feb 24, 7:13 pm, Sturla Molden <sturlamol...@yahoo.no> wrote:
> On 24.02.2011 17:58, Jon Olav Vik wrote:
>
> > It is. To use it for parallel I/O, however, I hacked support for
> > `offset` and `shape` keywords in np.load and
> > np.lib.format.open_memmap. I've taken my first baby steps to submit
> > patches for this:
> >http://article.gmane.org/gmane.comp.python.numeric.general/42619
> > That post is for the current trunk, a patch for 1.4 is here:
> >http://article.gmane.org/gmane.comp.python.numeric.general/42626
>
> Ok, I am going to say this only one more time:

Thank you, but I fear the lesson may be lost on me 8-)

I probably quoted more text than I should have; my reply was just to
the implied question:

> I guess numpy.memmap should be available on Windows.

I can confirm that it is, and point out a convenient way to have
different processes read and write to different parts of the same
file. I realize that this has nothing to do with shared memory, but
for my purposes (trivially parallel computation, large data) it has
proved useful. (There remains the issue of telling each process what
part of the data to work with; I've been hacking around with
environment variables and/or MPI.)

That said, my main work has been on Linux; on 32-bit Python on
Windows, I cannot allocate files > 2 GB in this way. But at least it
allows me to develop on my Windows laptop.

(Googling around, I see that you have had issues with memmap before:
http://thread.gmane.org/gmane.comp.python.numeric.general/15850
I haven't tested how memmap currently performs in 64-bit Python on
Windows.)
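
The "different processes work on different parts of the same file" pattern can be sketched with the standard library's `mmap` module alone. This is an illustration, not the patched `np.load` / `open_memmap` code mentioned above: the helper names are hypothetical, and the sketch assumes each chunk size is a multiple of `mmap.ALLOCATIONGRANULARITY`, because plain `mmap` requires offsets aligned to it. `numpy.memmap` accepts `offset` and `shape` arguments and gives the same effect with typed arrays.

```python
import mmap


def create_data_file(path, n_chunks, chunk_bytes):
    """Pre-size the file so every worker's region exists before it is mapped."""
    with open(path, "wb") as f:
        f.truncate(n_chunks * chunk_bytes)


def write_chunk(path, index, chunk_bytes, payload):
    """Map only chunk `index` of the file and copy payload to its start."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), chunk_bytes, offset=index * chunk_bytes) as m:
            m[: len(payload)] = payload
            m.flush()


def read_chunk(path, index, chunk_bytes, n):
    """Map chunk `index` read-only and return its first n bytes."""
    with open(path, "rb") as f:
        with mmap.mmap(
            f.fileno(), chunk_bytes, access=mmap.ACCESS_READ, offset=index * chunk_bytes
        ) as m:
            return bytes(m[:n])
```

Each worker only needs to know its own `index`, which is the piece of information that would be passed via an environment variable or an MPI rank.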

Alex van Houten

unread,

Feb 25, 2011, 9:59:43 AM2/25/11

to cython...@googlegroups.com

Francesc Alted <faltet <at> gmail.com> writes:

> Yeah, I agree that most probably the bottleneck is copy/IPC overhead in
> the multiprocessing module.

Problem solved. The IPC overhead was not the problem; it was the
interpolation (256,256) grids, which were calculated and fed to the task
queue by the main process for each configuration. This was done in pure
CPython. I thought it was simple and quick to compute, so I didn't
optimise it. Then I found that it took a long time. I moved that
computation to the worker processes, so that each process calculates its
own grid. Also, I have rewritten the grid calculations in Cython.

These were the numbers per configuration on an Intel Xeon dual quadcore E5520
(previously I quoted compute times on an i7 machine, which is somewhat faster):
Total processing time : 1.22s
actual computations : 0.37s
IPC overhead : 0.12s
grid setup in main : 0.75s

Now the grids are computed in the worker processes and are more than 100 times
faster after rewriting in Cython! That time is now part of the actual
computations, but the effect is negligible, because it takes less than 0.01s.

These are my new numbers on that machine:
Total processing time : 0.49s
actual computations : 0.37s
IPC overhead : 0.12s

This means that the overhead from IPC in multiprocessing is 0.12/0.49 = just
less than 25% on 8 cores!

Lessons to be learned:
- Do more thorough profiling
- Cython is great!
- The IPC overhead of Python's multiprocessing, with 8 cores that each return
an array of floats of size (12,256,256) every 0.37s, is about 25%.

Sharedmem may speed things up even more, but I don't need it. There is not a lot
to win anyway and this code is fast enough!

Thanks for all your help guys! A happy weekend for me!
Cheers,
Hanno.

Francesc Alted

unread,

Feb 25, 2011, 10:57:53 AM2/25/11

to cython...@googlegroups.com

On Friday 25 February 2011 15:59:43, Alex van Houten wrote:

> Problem solved. The IPC overhead was not the problem; it was the
> interpolation (256,256) grids, which were calculated and fed to the
> task queue by the main process for each configuration. This was done
> in pure CPython. I thought it was simple and quick to compute, so I
> didn't optimise it. Then I found that it took a long time. I moved
> that computation to the worker processes, so that each process
> calculates its own grid. Also, I have rewritten the grid calculations
> in Cython.

Excellent news. Perhaps you may want to make your code public, as it
may serve for inspiration for others.

[clip]

> Lessons to be learned:
> - Do more thorough profiling
> - Cython is great!
> - The IPC overhead from Python's multiprocessing from 8 cores that
> return arrays of floats of size (12,256,256) every 0.37s is about
> 25%.

For what it's worth, I think what you call IPC overhead is rather the time
to pickle/unpickle arrays to be transported to workers (as Sturla
suggested):

>>> a = np.arange(12*256*256).reshape(12,256,256)
>>> timeit s = cPickle.dumps(a,protocol=-1); b = cPickle.loads(s)
100 loops, best of 3: 13.2 ms per loop
>>> 13.2*8 # you have 8 threads
105.59999999999999

i.e. 105 ms is pretty close to the 120 ms on your machine.

And now that I see this, a nice speed-up for multiprocessing would be to
replace the pickle/unpickle mechanism by a pure copy when using NumPy
arrays:

>>> timeit a.copy()
100 loops, best of 3: 1.97 ms per loop

which is almost 7x faster (I suppose that packages like mpi4py are
actually using the copy approach). Not that this overhead is important
for your case, but that's food for thought anyways.

--
Francesc Alted
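
The comparison above uses `cPickle`, which no longer exists in Python 3. A hedged sketch of the same measurement with today's `pickle` and `timeit` follows. The helper names are hypothetical, and the payload is a plain `bytearray` with the same byte count as a (12,256,256) array of 8-byte values, standing in for the NumPy array so that the sketch needs only the standard library. The ratio it prints depends on the Python version, pickle protocol and object type, so it should not be expected to reproduce the 7x figure.

```python
import pickle
import timeit


def time_pickle_roundtrip(obj, repeat=5, number=10):
    """Best per-call time, in seconds, to pickle and then unpickle obj."""

    def roundtrip():
        return pickle.loads(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))

    return min(timeit.repeat(roundtrip, repeat=repeat, number=number)) / number


def time_copy(obj, repeat=5, number=10):
    """Best per-call time, in seconds, for a plain in-memory copy of a bytearray."""
    return min(timeit.repeat(lambda: bytearray(obj), repeat=repeat, number=number)) / number


if __name__ == "__main__":
    payload = bytearray(12 * 256 * 256 * 8)  # same size as the array discussed above
    t_pickle = time_pickle_roundtrip(payload)
    t_copy = time_copy(payload)
    print(f"pickle round trip: {t_pickle * 1e3:.2f} ms, copy: {t_copy * 1e3:.2f} ms")
```

To measure a real array, pass it to `time_pickle_roundtrip` and replace the copy lambda with a call to the array's own `copy()` method.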

Sturla Molden

unread,

Feb 25, 2011, 1:23:24 PM2/25/11

to cython...@googlegroups.com

On 25.02.2011 10:51, Jon Olav Vik wrote:
>> I guess numpy.memmap should be available on Windows.
> I can confirm that it does, and point out a convenient way to have
> different processes read and write to different parts of the same
> file. I realize that this has nothing to do with shared memory, but
> for my purposes (trivially parallel computation, large data) it has
> proved useful.

The difference is that a physical file has a "name in the filesystem".
Any process that knows the filename can memory map the file. What I
suggested therefore, is to give the shared memory a name, just like a
file has a name.

numpy.memmap cannot do this, whereas the Cython code I posted can.

On Windows it will look like a physical file on a drive called r"\\."
instead of "C:". When we know this name, we can use numpy.memmap (or
similar means) to memory map it from anywhere.

Sturla

Sturla Molden

unread,

Feb 25, 2011, 1:59:57 PM2/25/11

to cython...@googlegroups.com

On 25.02.2011 15:59, Alex van Houten wrote:
> - The IPC overhead from Python's multiprocessing from 8 cores that return arrays
> of floats of size (12,256,256) every 0.37s is about 25%.
>
> Sharedmem may speed things up even more, but I don't need it. There is not a lot
> to win anyway and this code is fast enough!

Shared memory is a form of IPC, it's not an alternative to IPC.

Shared memory is also how IPC like pipes, named pipes (Windows), fifos
(Linux), and Unix sockets (Linux) are implemented. When you write to a
pipe you memcpy to a shared memory segment maintained by the OS. When
you read from a pipe, you memcpy from it. In effect, the overhead is
hardly more than two calls to memcpy protected by a spinlock.

Even TCP/IP sockets can give you gigabit/sec transfer rates. Remember
Beowulf clusters used for "High Performance Computing"? Even network
connections are fast enough IPC for the world's fastest supercomputers
-- and IPC on localhost is a lot faster than that.

When we use NumPy arrays in IPC, the expensive part is actually
"pickling the array". Even with my "shared memory ndarrays", pickling is
the major bottleneck. That is the major reason I recommended against
using them.

What we should have is an "IPC protocol for the PEP 3118 Py_buffer". We
don't need the heavy machinery of cPickle to serialize a Py_buffer. It
could e.g. be based on shared memory on localhost and tcp/ip between
remote computers. Anyone care to join me in making that? It could even
result in a future PEP. I want an interface for Python, C and Fortran :-)

Sturla
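
A very small sketch of that idea, using only what `multiprocessing` already offers: send a tiny pickled header describing the buffer, then the raw bytes with `Connection.send_bytes`, and receive straight into preallocated memory with `recv_bytes_into`. This is not the proposed protocol, just an illustration; `send_buffer` and `recv_buffer` are hypothetical names, and `array.array` stands in for any object exporting the buffer interface.

```python
import array
from multiprocessing import Pipe


def send_buffer(conn, arr):
    """Send an array.array as a small pickled header plus its raw bytes."""
    conn.send((arr.typecode, len(arr)))  # header only: cheap to pickle
    conn.send_bytes(arr)                 # payload: raw buffer, not pickled


def recv_buffer(conn):
    """Receive an array sent by send_buffer into a preallocated array."""
    typecode, n = conn.recv()
    itemsize = array.array(typecode).itemsize
    out = array.array(typecode, bytes(n * itemsize))  # zero-filled destination
    conn.recv_bytes_into(out)
    return out


if __name__ == "__main__":
    # Both ends live in one process here, so keep the payload small enough
    # to fit in the pipe's buffer; real use would put them in two processes.
    left, right = Pipe()
    send_buffer(left, array.array("d", [1.0, 2.5, -3.0]))
    print(recv_buffer(right))
    left.close()
    right.close()
```

The header could equally carry shape, strides and a format string, which is the information a Py_buffer exposes; the payload path stays the same.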

Sturla Molden

unread,

Feb 25, 2011, 2:16:00 PM2/25/11

to cython...@googlegroups.com

On 25.02.2011 16:57, Francesc Alted wrote:
> And now that I see this, a nice speed-up for multiprocessing would be to
> replace the pickle/unpickle mechanism by a pure copy when using NumPy
> arrays:

Yes. But it should be generalised to any PEP 3118 Py_buffer, not just
NumPy ndarrays. Does NumPy use the Py_buffer ABI now?

Sturla

Alex van Houten

unread,

Feb 25, 2011, 4:57:53 PM2/25/11

to cython...@googlegroups.com

Sturla Molden <sturlamolden <at> yahoo.no> writes:

> > Sharedmem may speed things up even more, but I don't need it.
> > There is not a lot
> > to win anyway and this code is fast enough!
>
> Shared memory is a form of IPC, it's not an alternative to IPC.

When I write "sharedmem", I mean your sharedmem package.

https://bitbucket.org/cleemesser/numpy-sharedmem/src

Cheers,
Alex.

Alex van Houten

unread,

Feb 26, 2011, 7:09:21 AM2/26/11

to cython...@googlegroups.com

Francesc Alted <faltet <at> gmail.com> writes:

> >>> a = np.arange(12*256*256).reshape(12,256,256)
> >>> timeit s = cPickle.dumps(a,protocol=-1); b = cPickle.loads(s)
> 100 loops, best of 3: 13.2 ms per loop
> >>> 13.2*8 # you have 8 threads
> 105.59999999999999
>
> i.e. 105 ms is pretty close to 120 ms in your machine.

Ok, thanks. I guess that means that switching to Sturla's sharedmem package will
not speed things up.

Alex.

Francesc Alted

unread,

Feb 26, 2011, 7:47:10 AM2/26/11

to cython...@googlegroups.com

On Friday 25 February 2011 20:16:00, Sturla Molden wrote:

I think so. Look at this:

https://github.com/numpy/numpy/commit/f553be91f4905fee9bfa3760791ba49c721cef90

This should be included in NumPy 1.5.x (you need Python 2.6 or higher).

--
Francesc Alted

Francesc Alted

unread,

Feb 26, 2011, 8:15:47 AM2/26/11

to cython...@googlegroups.com

On Friday 25 February 2011 19:59:57, Sturla Molden wrote:

> When we use NumPy arrays in IPC, the expensive part is actually
> "pickling the array". Even with my "shared memory ndarrays", pickling
> is the major bottleneck. That is the major reason I recommended
> against using them.

Good to know.

> What we should have is an "IPC protocol for the PEP 3118 Py_buffer".
> We don't need the heavy machinery of cPickle to serialize a
> Py_buffer. It could e.g. be based on shared memory on localhost and
> tcp/ip between remote computers. Anyone care to join me in making
> that? It could even result in a future PEP. I want an interface for
> Python, C and Fortran :-)

That would be really nice. But I'm wondering whether the pickle code
for a NumPy array could be optimized, i.e. I don't see a good reason why
pickling a binary (NumPy) object should take much more than a
simple memcpy (I mean, for large enough arrays).

But again, using PEP 3118 for transmitting objects among processes would
represent a major speed-up in the multiprocessing module indeed.
Although perhaps this list is not the best for discussing this. Should
we move this discussion to numpy or python-dev list?

--
Francesc Alted

Alex van Houten

unread,

Feb 27, 2011, 3:43:54 PM2/27/11

to cython...@googlegroups.com

Francesc Alted <faltet <at> gmail.com> writes:

>
> Excellent news. Perhaps you may want to make your code public, as it
> may serve for inspiration for others.
>

Sure. But where should I post it?
Cheers,
Alex.
