[ANN] Introduction to Cython parallel features

Francesc Alted

unread,

Sep 16, 2011, 5:32:47 AM9/16/11

to cython...@googlegroups.com

Just a brief note to announce that I just have given a quick
introduction to the new parallel capabilities in Cython. This has
been part of our traditional Python school that this year is taking
place at the University of St Andrews, Scotland. The lecture comes
with exercises and solutions:

https://python.g-node.org/wiki/parallel#parallel_cython

The material is free, so anybody is encouraged to make use of it for
learning purposes, or as the basis for other courses/talks.

Cheers!

--
Francesc Alted

mark florisson

unread,

Sep 16, 2011, 6:16:32 AM9/16/11

to cython...@googlegroups.com

Looks good! Next time you might want to announce it in advance to have
more interested people join the presentation, I was actually not so
far away :).

Dag Sverre Seljebotn

unread,

Sep 16, 2011, 6:28:31 AM9/16/11

to cython...@googlegroups.com

Thanks!

Note that there's a bug in the slides: "nogil" on a function indicates
that it can safely be called without holding the GIL, but it does not
release the GIL.

Myself I lay the fault strictly on Cython being too complicated here.

(There was a thread a while back where we discussed how we could perhaps
simplify this)..

Dag Sverre

Francesc Alted

unread,

Sep 16, 2011, 6:49:57 AM9/16/11

to cython...@googlegroups.com

Hey Mark,

2011/9/16 mark florisson <markflo...@gmail.com>:

Well, unfortunately the places for the school were limited to 30
people, and you need to apply at least several months before. Even
with this, you still need to be selected because we normally receive
way more applications than places we have (this year the ratio has
been 4:1).

At any rate, if you are interested, in future editions we may call you
as a lecturer instead :)

--
Francesc Alted

Francesc Alted

unread,

Sep 16, 2011, 6:54:02 AM9/16/11

to cython...@googlegroups.com

2011/9/16 Dag Sverre Seljebotn <d.s.se...@astro.uio.no>:

> Note that there's a bug in the slides: "nogil" on a function indicates that
> it can safely be called without holding the GIL, but it does not release the
> GIL.

Ah, that's a good point, thanks! Will try to remember it for the next time.

> Myself I lay the fault strictly on Cython being too complicated here.
>
> (There was a thread a while back where we discussed how we could perhaps
> simplify this)..

Exactly. In fact, I was bit surprised about the cdef function needing
the 'nogil' qualifier, and asked to the list. I don't remember the
outcome of the discussion, but my understanding was that this was not
going to change anytime soon.

--
Francesc Alted

mark florisson

unread,

Sep 16, 2011, 6:54:15 AM9/16/11

to cython...@googlegroups.com

On 16 September 2011 11:49, Francesc Alted <fal...@pytables.org> wrote:
> Hey Mark,
>
> 2011/9/16 mark florisson <markflo...@gmail.com>:
>> On 16 September 2011 10:32, Francesc Alted <fal...@pytables.org> wrote:
>> Looks good! Next time you might want to announce it in advance to have
>> more interested people join the presentation, I was actually not so
>> far away :).
>
> Well, unfortunately the places for the school were limited to 30
> people, and you need to apply at least several months before. Even
> with this, you still need to be selected because we normally receive
> way more applications than places we have (this year the ratio has
> been 4:1).

Oh wow, didn't know that. Might be a good idea to get a bigger room
next time. I'm currently at the university of Edinburgh, which has a
HPC department (they have HECTOR, the UK supercomputer). Perhaps they
would also welcome presentations. If there will be any such plan in
the future, I could talk to them.

Francesc Alted

unread,

Sep 16, 2011, 7:46:27 AM9/16/11

to cython...@googlegroups.com

2011/9/16 mark florisson <markflo...@gmail.com>:

>> Well, unfortunately the places for the school were limited to 30
>> people, and you need to apply at least several months before. Even
>> with this, you still need to be selected because we normally receive
>> way more applications than places we have (this year the ratio has
>> been 4:1).
>
> Oh wow, didn't know that. Might be a good idea to get a bigger room
> next time. I'm currently at the university of Edinburgh, which has a
> HPC department (they have HECTOR, the UK supercomputer). Perhaps they
> would also welcome presentations. If there will be any such plan in
> the future, I could talk to them.

Actually it is not a problem of small rooms (in fact, the room we are
right now could easily host 2x or 3x more people). The thing is that
we are auto-limiting ourselves in order to better tutorise students.

Anyway, we are still looking for a new host for the next edition of
the school, so if your university would be interested in hosting it,
please tell us about that.

Thanks!

--
Francesc Alted

mark florisson

unread,

Sep 16, 2011, 7:52:12 AM9/16/11

to cython...@googlegroups.com

Ok, for when is that planned? Is there a website with some more
information about this (I think the website you linked only contains
information for this session)?

> --
> Francesc Alted
>

Francesc Alted

unread,

Sep 16, 2011, 8:00:38 AM9/16/11

to cython...@googlegroups.com

2011/9/16 mark florisson <markflo...@gmail.com>:

Oh, there is not a special scheduling for the next school, so timing
is flexible. And here it is the official web page of the school:

https://python.g-node.org/wiki/

BTW, the contact person is Tiziano Zito:

http://www.cognition.tu-berlin.de/zito/

If you detect interest in your institution, please contact him directly.

Thanks,

--
Francesc Alted

mark florisson

unread,

Sep 16, 2011, 8:10:53 AM9/16/11

to cython...@googlegroups.com

Ok, I'll ask around, let's see what happens.

Francesc Alted

unread,

Sep 16, 2011, 11:14:39 AM9/16/11

to cython...@googlegroups.com

2011/9/16 Francesc Alted <fal...@pytables.org>:

> 2011/9/16 Dag Sverre Seljebotn <d.s.se...@astro.uio.no>:
>> Note that there's a bug in the slides: "nogil" on a function indicates that
>> it can safely be called without holding the GIL, but it does not release the
>> GIL.
>
> Ah, that's a good point, thanks! Will try to remember it for the next time.

And just for completeness, I think that:

for i in prange(N, nogil=True):
some computations here

should be a shortcut of:

with nogil:
for i in prange(N):
some computations here

Is that correct?

--
Francesc Alted

mark florisson

unread,

Sep 16, 2011, 11:23:54 AM9/16/11

to cython...@googlegroups.com

Yes, it actually releases the GIL, so those two are equivalent.

Francesc Alted

unread,

Sep 16, 2011, 11:40:02 AM9/16/11

to cython...@googlegroups.com

2011/9/16 mark florisson <markflo...@gmail.com>:

>> And just for completeness, I think that:
>>
>> for i in prange(N, nogil=True):
>> some computations here
>>
>> should be a shortcut of:
>>
>> with nogil:
>> for i in prange(N):
>> some computations here
>>
>> Is that correct?
>

> Yes, it actually releases the GIL, so those two are equivalent.
>

Ah, fine. And just out of curiosity, why is it necessary to release
the GIL before the parallel loop? It is because OpenMP requires the
calling thread to not become blocked? Or it is because some other
reason?

Thanks,

--
Francesc Alted

mark florisson

unread,

Sep 16, 2011, 11:52:40 AM9/16/11

to cython...@googlegroups.com

Well, if you keep the GIL, then you should allow GIL code, which means
you have to grab the GIL in each of the threads, which basically means
the entire point of the parallel loop is undone (it will in fact be
slower, unless you release the GIL inside the block) , so it's
basically to prevent the careless user from doing this possibly
non-obvious mistake.

You can still obtain the GIL in the parallel sections using the 'with
gil' block though, and you may even raise exceptions. Another
advantage of having a nogil prange is that the compiler will prevent
you from using python operations, so you're sure to use the fast
buffer indexing, etc.

Your code will also be better optimized if you don't have
break/continue/return/propagating exceptions, as it doesn't have to
flush a flag every iteration to see if it needs to stop. And basically
every python operation can raise an exception.

Sturla Molden

unread,

Sep 16, 2011, 12:05:54 PM9/16/11

to cython...@googlegroups.com

Den 16.09.2011 17:40, skrev Francesc Alted:
> Ah, fine. And just out of curiosity, why is it necessary to release
> the GIL before the parallel loop? It is because OpenMP requires the
> calling thread to not become blocked? Or it is because some other reason?

The OpenMP threads are registered with Python. I.e. they are
synchronized with the GIL unless you release it.

Sturla

Sturla Molden

unread,

Sep 16, 2011, 12:10:32 PM9/16/11

to cython...@googlegroups.com

Den 16.09.2011 18:05, skrev Sturla Molden:
>
> The OpenMP threads are registered with Python. I.e. they are
> synchronized with the GIL unless you release it.

... which means it is legal to use Python objects in an OpenMP thread if
it owns the GIL. Notice that the GIL can be re-acquired using "with gil"
instead of "with nogil".

Sturla

mark florisson

unread,

Sep 16, 2011, 12:10:47 PM9/16/11

to cython...@googlegroups.com

Well, with the cython.parallel support you *have* to release the GIL
(unless you are already in a nogil block, or if you are defined in a
nogil function that is called with or without the gil).

If you use the 'with gil' statement (or if you call a with gil
function), then the OpenMP threads will be registered with Python, but
your threads will run with the GIL released until you obtain it. This
is done to allow propagating exceptions, you don't really have to
worry about these details as a user.

Francesc Alted

unread,

Sep 16, 2011, 1:19:59 PM9/16/11

to cython...@googlegroups.com

2011/9/16 mark florisson <markflo...@gmail.com>:

Ok (after reading this several times), that makes sense.

Thanks for your patience,

--
Francesc Alted

Sturla Molden

unread,

Sep 16, 2011, 3:00:51 PM9/16/11

to cython...@googlegroups.com

Den 16.09.2011 18:10, skrev mark florisson:
> If you use the 'with gil' statement (or if you call a with gil
> function), then the OpenMP threads will be registered with Python, but
> your threads will run with the GIL released until you obtain it. This
> is done to allow propagating exceptions, you don't really have to
> worry about these details as a user.

What do you mean by "propagating exceptions" in the context of OpenMP?
OpenMP disallows jumps out of a block decorated with a pragma: goto,
setjmp/longjmp, and exceptions (C++ or Windows SEH) are considered evil.
An exception must be caught inside the OpenMP-block that raised it. You
cannot raise an exception from within an OpenMP thread and expect it to
"propagate" nicely out of the prange loop. Thus, how do we catch an
exception in a nogil block? I see numerous fail modes here...

Sturla

Dag Sverre Seljebotn

unread,

Sep 16, 2011, 3:09:53 PM9/16/11

to cython...@googlegroups.com

Here's a hint: It's normally a better strategy to assume that others are
at or above your own level of expertise, than to boastingly assume that
they are below you. That way you don't risk offending anybody. In
particular when said others have spent weeks implementing a feature.

For an answer, please type

python runtests.py --no-cleanup parallel

and then go look in the C source in BUILD. I assure you there's no
breaks or longjmp's to be found.

Dag Sverre

Sturla Molden

unread,

Sep 17, 2011, 8:45:24 AM9/17/11

to cython...@googlegroups.com

Den 16.09.2011 21:09, skrev Dag Sverre Seljebotn:
>
> Here's a hint: It's normally a better strategy to assume that others
> are at or above your own level of expertise, than to boastingly assume
> that they are below you. That way you don't risk offending anybody. In
> particular when said others have spent weeks implementing a feature.

Sorry, I did not mean to offend you.

I know that Python exceptions are not implemented with jump and stack
unroll statements (unlike C++). But still, it is nice to know which
thread will eventually get the Python exception, and what the other
threads will do.

But I guess I could trust you and Mark to having ensured that it
eventually ends up in the mother thread, so the programmer need not
worry which thread gets the exception at all. And if so, Cython would be
the only language (that I know of) where exception handling actually
works with OpenMP code. Which again means that parallel.prange does not
suffer from the biggest disadvantage of OpenMP, which is no standardized
error handling. A C++ programmer cannot raise an exception inside an
OpenMP parallel block and catch it outside. A C or Fortran programmer
cannot break, goto or longjmp out of an OpenMP parallel block. But if
you made sure that Cython can safely raise a Python exception inside a
parallel loop, and all threads will shut down and exit gracefully (more
or less), that is a major argument for using parallel.prange instead of
OpenMP in C, C++ or Fortran.

I would really like to run a parallel.prange loop inside a "try" block,
and catch an exception if it fails -- without having to worry about
which thread raised or caught the exception.

Sturla

mark florisson

unread,

Sep 17, 2011, 8:55:43 AM9/17/11

to cython...@googlegroups.com

You can break, continue, return and raise exceptions. Currently nested
parallelism is disabled, so it doesn't matter which thread raises the
exception, it re-raises it in the master thread after the parallel
construct. A prange block is exited by jumping to the end of the loop
body and by then skipping any subsequent iterations, a parallel
section by jumping to the end and then "falling off" the end of the
block. Of course you may use any combination of try/except/finally
(try/finally works in nogil mode, try/except doesn't).

Basically, you can break in multiple ways out of a parallel/prange
section, but it will always prefer exceptions over return or break
(e.g. if you raise at the same time in one thread as you return in
other thread). You can use try/except all you want around any such
section, and there's also nogil try/finally of course. If you raise
multiple exceptions you will simply get the first one (i.e., it's
undefined which one you get as the executed order of the iterations is
undefined).

mark florisson

unread,

Sep 17, 2011, 8:57:37 AM9/17/11

to cython...@googlegroups.com

BTW, this also works with nested parallelism, where the "mother
thread" is simply the master thread of the inner parallel section,
i.e. the encountering thread. It's gcc that was the problem.

Dag Sverre Seljebotn

unread,

Sep 17, 2011, 9:00:39 AM9/17/11

to cython...@googlegroups.com

Yes, as you say, other threads are more or less gracefully shut down. I
don't remember the exact rules if the documentation doesn't say, it was
all Mark's work.

I think each thread (except threads that raises) will complete its
current loop iteration, and that if multiple threads raise an exception
then all but the first exception will be ignored. But I'm not totally
sure without checking the docs or generated C sources (yes, I'm lazy too
:-) ).

But I'm sure that eventually all threads are exited and the first
exception raised ends up in the scope and thread that invoked the prange
in the first place.

Dag Sverre

Sturla Molden

unread,

Sep 17, 2011, 9:29:52 AM9/17/11

to cython...@googlegroups.com

Den 17.09.2011 14:55, skrev mark florisson:
> You can break, continue, return and raise exceptions. Currently nested
> parallelism is disabled, so it doesn't matter which thread raises the
> exception, it re-raises it in the master thread after the parallel
> construct. A prange block is exited by jumping to the end of the loop
> body and by then skipping any subsequent iterations, a parallel
> section by jumping to the end and then "falling off" the end of the
> block. Of course you may use any combination of try/except/finally
> (try/finally works in nogil mode, try/except doesn't). Basically, you
> can break in multiple ways out of a parallel/prange section, but it
> will always prefer exceptions over return or break (e.g. if you raise
> at the same time in one thread as you return in other thread). You can
> use try/except all you want around any such section, and there's also
> nogil try/finally of course. If you raise multiple exceptions you will
> simply get the first one (i.e., it's undefined which one you get as
> the executed order of the iterations is undefined).

Thanks for the explanation :-)

This kind of coding style would be impossible with C++ and OpenMP. There
we need to write error handling the same way you explained that Cython
generates C. I have used OpenMP with C++ and Fortran 95 at lot, and the
way I must do error handling always annoys me. This sounds much better.

I cannot recall ever having used a nested parallel block. But it is
useful for authours of libraries like MKL and ACML of course (but they
are not using Cython).

Sturla

Sturla Molden

unread,

Sep 17, 2011, 9:57:21 AM9/17/11

to cython...@googlegroups.com

Den 17.09.2011 15:00, skrev Dag Sverre Seljebotn:
>
> But I'm sure that eventually all threads are exited and the first
> exception raised ends up in the scope and thread that invoked the
> prange in the first place.
>

OpenMP puts an implicit barrier synchronization at the end of a parallel
block. So when the exception is caught outside the prange loop, we can
be sure that all threads have shut down (it doesn't matter how).

If you don't make sure the other threads shut down, and just let them
complete their iterations, the program would just wait on the implicit
barrier, possibly for a very long time. This would be easy to notice. A
test case could just use a timer and see how long it takes for the
exception to be caught.

One can use an atomic read/write on a shared variable as a stop flag in
a nogil context, but it would often be inefficient to check it on every
iteration of prange. It's one of those things that could be controlled
with a keyword argument to prange. And every GIL aquirement ("with gil")
should preferably check if another thread has raised an exception.
Grabbing the GIL has a big overhead anyway, so it does not matter if we
check a stop flag as well.

Sturla

mark florisson

unread,

Sep 17, 2011, 10:20:34 AM9/17/11

to cython...@googlegroups.com

On 17 September 2011 14:57, Sturla Molden <sturla...@yahoo.no> wrote:
> Den 17.09.2011 15:00, skrev Dag Sverre Seljebotn:
>>
>> But I'm sure that eventually all threads are exited and the first
>> exception raised ends up in the scope and thread that invoked the prange in
>> the first place.
>>
>
> OpenMP puts an implicit barrier synchronization at the end of a parallel
> block. So when the exception is caught outside the prange loop, we can be
> sure that all threads have shut down (it doesn't matter how).
>
> If you don't make sure the other threads shut down, and just let them
> complete their iterations, the program would just wait on the implicit
> barrier, possibly for a very long time. This would be easy to notice. A test
> case could just use a timer and see how long it takes for the exception to
> be caught.
>
> One can use an atomic read/write on a shared variable as a stop flag in a
> nogil context, but it would often be inefficient to check it on every
> iteration of prange.

That's what it does, it writes to a shared flag and flushes it. Every
iteration does the check to see if it needs to skip the loop body. I'm
not sure you want another option for that, I think it's
micro-optimization that is too hard to explain concisely, I don't
think many people would be interest in that (unless you can provide
benchmarks that show that the check is actually slow).

> It's one of those things that could be controlled with
> a keyword argument to prange. And every GIL aquirement ("with gil") should
> preferably check if another thread has raised an exception. Grabbing the GIL
> has a big overhead anyway, so it does not matter if we check a stop flag as
> well.
>
> Sturla
>
>

We could additionally do that, have more of those checkpoints. It
would be very easy to implement.

Sturla Molden

unread,

Sep 17, 2011, 11:11:37 AM9/17/11

to cython...@googlegroups.com

Den 17.09.2011 16:20, skrev mark florisson:
> That's what it does, it writes to a shared flag and flushes it. Every
> iteration does the check to see if it needs to skip the loop body.

The flush pragma provides a full memory fence, and will force a
synchronization to the main memory. But the other threads might still
have a local copy, until they flush as well.

If a particular code will work depends on:

1. The compiler -- the flag could be kept in a register.

2. The OpenMP implementation -- how shared objects are stored, how a
flush is implemented.

3. The hardware -- a multi-core CPU with shared cache might behave
differently from multiple processors with private caches.

The flag must be flushed before it is checked, which is the expensive
part I was taking about. Otherwise you have no guarrantee that the value
you read is consistent with the value stored in memory.

See e.g. this example:

http://msdn.microsoft.com/en-us/library/sz9sd6et%28v=vs.80%29.aspx

Sturla

mark florisson

unread,

Sep 17, 2011, 11:31:10 AM9/17/11

to cython...@googlegroups.com

On 17 September 2011 16:11, Sturla Molden <sturla...@yahoo.no> wrote:
> Den 17.09.2011 16:20, skrev mark florisson:
>>
>> That's what it does, it writes to a shared flag and flushes it. Every
>> iteration does the check to see if it needs to skip the loop body.
>
> The flush pragma provides a full memory fence, and will force a
> synchronization to the main memory. But the other threads might still have a
> local copy, until they flush as well.

Yes, all threads flush the variable at the end of their loop body each
iteration. So how much of an impact are we talking here? We can't rely
on just with gil functions/statements as checkpoints (they may not
even be present), so "every iteration" is the plan we chose. If you
can provide a benchmark that shows we need an option to disable the
check, I'd be happy to implement it (note: this check is only in place
if you break/return/propagate an exception).

And please, before you try to teach me more things I already know,
could you inspect the generated code? It already took all these
considerations you've been listing into account. So if you find that
we can actually generate better code after inspecting it, we're open
to suggestions.

Sturla Molden

unread,

Sep 17, 2011, 11:57:00 AM9/17/11

to cython...@googlegroups.com

Den 17.09.2011 17:31, skrev mark florisson:
> Yes, all threads flush the variable at the end of their loop body each
> iteration. So how much of an impact are we talking here?

A full memory fence (Interlocked*) on x86 takes between 150 and 250 CPU
cycles, as I understand it. That should be comparable to about four or
five floating point additions. With full contention on a 4 core CPU, it
could e.g. be in the order of 16-20 floating point additions per
iteration of the loop. It might not be noticed, but it depends on the
tightness of the prange loop.

Sturla

mark florisson

unread,

Sep 17, 2011, 12:10:55 PM9/17/11

to cython...@googlegroups.com

Indeed there seems to be quite some overhead, it's nearly twice as
fast without the flush for a loop that doesn't do anything. But if you
need to do something simple then you wouldn't be using
break/return/acquiring the GIL I assume (so the check wouldn't be
there), and otherwise I'm assuming the flush is negligible compared to
the work done in the loop, so I'm not sure if this is a big deal.

mark florisson

unread,

Sep 17, 2011, 12:14:26 PM9/17/11

to cython...@googlegroups.com

With a 'with gil' block in there the speed difference is about 0.3%.

Dag Sverre Seljebotn

unread,

Sep 17, 2011, 12:35:23 PM9/17/11

to cython...@googlegroups.com

Yes. If there's a better way of getting the same semantics, of course
one should do it. But the "manual switch" is simply to avoid raising
exceptions and code things the old-fashioned way, if the loop is very
tight and yet is able to fail.

Dag Sverre

Robert Bradshaw

unread,

Sep 17, 2011, 6:25:17 PM9/17/11

to cython...@googlegroups.com

On Sat, Sep 17, 2011 at 6:29 AM, Sturla Molden <sturla...@yahoo.no> wrote:
> Den 17.09.2011 14:55, skrev mark florisson:
>>
>> You can break, continue, return and raise exceptions. Currently nested
>> parallelism is disabled, so it doesn't matter which thread raises the
>> exception, it re-raises it in the master thread after the parallel
>> construct. A prange block is exited by jumping to the end of the loop body
>> and by then skipping any subsequent iterations, a parallel section by
>> jumping to the end and then "falling off" the end of the block. Of course
>> you may use any combination of try/except/finally (try/finally works in
>> nogil mode, try/except doesn't). Basically, you can break in multiple ways
>> out of a parallel/prange section, but it will always prefer exceptions over
>> return or break (e.g. if you raise at the same time in one thread as you
>> return in other thread). You can use try/except all you want around any such
>> section, and there's also nogil try/finally of course. If you raise multiple
>> exceptions you will simply get the first one (i.e., it's undefined which one
>> you get as the executed order of the iterations is undefined).
>
> Thanks for the explanation :-)
>
> This kind of coding style would be impossible with C++ and OpenMP. There we
> need to write error handling the same way you explained that Cython
> generates C. I have used OpenMP with C++ and Fortran 95 at lot, and the way
> I must do error handling always annoys me. This sounds much better.

Yes, this is one of many things that makes prange much better than a
simple transliteration of OpenMP into Cython. Note that all this extra
code is only used iff there's a (possibly exception raising) with gil
block or other non-standard exit. If flushing on every iteration is an
issue, I imagine we could flush only when it's written to or on every
Nth iteration otherwise.

- Robert

mark florisson

unread,

Sep 17, 2011, 6:38:57 PM9/17/11

to cython...@googlegroups.com

On 17 September 2011 23:25, Robert Bradshaw

Well, you can't do it after it's being written to only, as you don't
know when it will be written to, that's why you check the flag :)
Every Nth iteration is possible though. I'm still unconvinced it's an
issue at all though.

Robert Bradshaw

unread,

Sep 17, 2011, 6:49:09 PM9/17/11

to cython...@googlegroups.com

Yeah. I meant a thread flushes immediately it writes.

> Every Nth iteration is possible though.

But of course one would want to avoid %N, that can be expensive too.
Probably best to do a counter down to 0 and then check.

> I'm still unconvinced it's an
> issue at all though.

Your timings seem to indicate it's not, at least on your platform. And
much, *much,* better than forcing users to worry about the
restrictions in breaking out of a loop themselves.

- Robert

mark florisson

unread,

Sep 17, 2011, 6:52:00 PM9/17/11

to cython...@googlegroups.com

On 17 September 2011 23:49, Robert Bradshaw

Right, it does that, but OpenMP requires that other threads (that need
to read it) should also flush the variable. So both the writer and the
readers need to flush.

Robert Bradshaw

unread,

Sep 17, 2011, 6:59:28 PM9/17/11

to cython...@googlegroups.com

On Sat, Sep 17, 2011 at 3:52 PM, mark florisson

Yes, I understand that. (Flush is a bad name, should be "sync" or
something.) I meant the writer flushes, the readers flush every Nth
iteration (if it is a performance issue to flush every iteration).

- Robert

mark florisson

unread,

Sep 17, 2011, 7:06:43 PM9/17/11

to cython...@googlegroups.com

On 17 September 2011 23:59, Robert Bradshaw

Ah, ok. Indeed, it wouldn't make sense to flush your write any time
later than immediately.

Sturla Molden

unread,

Sep 17, 2011, 7:16:20 PM9/17/11

to cython...@googlegroups.com

Den 18.09.2011 00:52, skrev mark florisson:
> On 17 September 2011 23:49, Robert Bradshaw
> <robe...@math.washington.edu> wrote:
>>
>> Yeah. I meant a thread flushes immediately it writes.
> Right, it does that, but OpenMP requires that other threads (that need
> to read it) should also flush the variable. So both the writer and the
> readers need to flush.
>

The specification of OpenMP seems to indicate that a shared variable
should be consistent across all threads after a flush. But in practice
it works like Mark says, that also reader threads must flush to get a
consistent view of the variable.

In practice, a flush synchronizes a shared variable with RAM, but might
not affect the view that other threads have of the shared variable,
countrary to making the views consistent across all threads. The OpenMP
docs does not indicate that reader threads must flush, a flush should
only be needed to commit a write, but it might be that the specification
was impossible or too awkward to implement.

Sturla

Robert L Cloud

unread,

Sep 17, 2011, 8:40:58 PM9/17/11

to cython-users

Hi, I've read the thread but it was a bit technical for me and I am
confused on the best way to implement my cython function in parallel.

Here is the original:
cimport numpy as np
cimport cython
ctypedef np.float32_t dtype_t
@cython.boundscheck(False)
@cython.wraparound(False)
def solve(np.ndarray[dtype_t, ndim=2] u,
np.ndarray[dtype_t, ndim=2] u_new,
int dimy,
int dimx):
cdef int y, x
for y in range(1, dimy - 1):
for x in range(1, dimx - 1):
u_new[y,x] = (u[y + 1,x] + u[y - 1,x] +
u[y,x + 1] + u[y,x - 1]) / 4

Now, I have a nested loop in here, and if I release the gil and use
prange for the y loop, I will not be able to use the regular python
range function for the inner x loop. The solution would be to use
prange on the x loop and release the gil, but from my limited parallel
programming experience, it would be better to decompose with larger
blocks(the y loop).

Can I release the gil with the outer loop and use prange, and then
reclaim the gil with the inner loop? If even possible, might there be
significant overhead by doing this?

What would be the recommended decomposition for this particular
problem. This function must be called many thousands of times until
my system converges.

Lastly, thanks for all the great work with Cython. I greatly
appreciate it as a bridge between python and high performance and with
the new parallel features, it somewhat complicates the arguments for
multi language programming. I mean, in the project i'm associated
with, we are using Fortran for the heavy lifting, but aren't even
using shared memory programming, only distributed. We are constantly
reevaluating our methods and now with trivially easy shared memory
programming coupled with python's expressivity, it may sway the
balance away from using Fortran except for interfacing with some
legacy codes.

Thanks & Regards
Robert L Cloud
http://www.robertlouiscloud.com

Robert Bradshaw

unread,

Sep 18, 2011, 12:17:09 AM9/18/11

to cython...@googlegroups.com

On Sat, Sep 17, 2011 at 5:40 PM, Robert L Cloud <rcl...@gmail.com> wrote:
> Hi, I've read the thread but it was a bit technical for me and I am
> confused on the best way to implement my cython function in parallel.
>
> Here is the original:
> cimport numpy as np
> cimport cython
> ctypedef np.float32_t dtype_t
> @cython.boundscheck(False)
> @cython.wraparound(False)
> def solve(np.ndarray[dtype_t, ndim=2] u,
> np.ndarray[dtype_t, ndim=2] u_new,
> int dimy,
> int dimx):
> cdef int y, x
> for y in range(1, dimy - 1):
> for x in range(1, dimx - 1):
> u_new[y,x] = (u[y + 1,x] + u[y - 1,x] +
> u[y,x + 1] + u[y,x - 1]) / 4
>
>
> Now, I have a nested loop in here, and if I release the gil and use
> prange for the y loop, I will not be able to use the regular python
> range function for the inner x loop.

Actually, Cython doesn't use Python's range here either (as the bounds
and index variable are typed, it can turn this into a C for loop).

> The solution would be to use
> prange on the x loop and release the gil, but from my limited parallel
> programming experience, it would be better to decompose with larger
> blocks(the y loop).
>
> Can I release the gil with the outer loop and use prange, and then
> reclaim the gil with the inner loop? If even possible, might there be
> significant overhead by doing this?
>
> What would be the recommended decomposition for this particular
> problem.

Put the prange on the outer loop, and it should just work.

> This function must be called many thousands of times until
> my system converges.

You may also want to consider making your solve function into a cpdef
function, for faster calling.

> Lastly, thanks for all the great work with Cython.

Thanks.

> I greatly
> appreciate it as a bridge between python and high performance and with
> the new parallel features, it somewhat complicates the arguments for
> multi language programming. I mean, in the project i'm associated
> with, we are using Fortran for the heavy lifting, but aren't even
> using shared memory programming, only distributed. We are constantly
> reevaluating our methods and now with trivially easy shared memory
> programming coupled with python's expressivity, it may sway the
> balance away from using Fortran except for interfacing with some
> legacy codes.

I think you're not the only one in that boat...

- Robert

Dag Sverre Seljebotn

unread,

Sep 18, 2011, 2:37:38 AM9/18/11

to cython...@googlegroups.com

How about using a (private) pointer to a shared flag memory area?

The problem with the modulo-iteration is that the number of iterations you want to skip strongly depend on what the loop does. Each iteration could take an hour or a microsecond....checking is probably out of the question though...
--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

Sturla Molden

unread,

Sep 18, 2011, 3:47:09 AM9/18/11

to cython...@googlegroups.com

Den 18.09.2011 08:37, skrev Dag Sverre Seljebotn:
> How about using a (private) pointer to a shared flag memory area?

It should possibly work, if we don't need to synchronize reads and
writes. (I don't think we do.) But we would need the volatile qualifier,
just in case someone has a super smart compiler. If we do this,
synchronization with RAM should only happen when the cache line gets
dirty (i.e. after a write).

Sturla

mark florisson

unread,

Sep 18, 2011, 4:40:39 AM9/18/11

to cython...@googlegroups.com

That's not how I read it (I quote from the OpenMP 3.0 spec):

"The flush operation provides a guarantee of consistency between a
thread’s temporary view and memory. Therefore, the flush operation can
be used to guarantee that a value written to a variable by one thread
may be read by a second thread. To accomplish this, the programmer
must ensure that the second thread has not written to the variable
since its last flush of the variable, and that the following sequence
of events happens in the specified order:
1. The value is written to the variable by the first thread.
2. The variable is flushed by the first thread.
3. The variable is flushed by the second thread.
4. The value is read from the variable by the second thread."

mark florisson

unread,

Sep 18, 2011, 4:42:41 AM9/18/11

to cython...@googlegroups.com

Won't volatile force a read from memory every time you try to read the variable?

Sturla Molden

unread,

Sep 18, 2011, 9:20:18 AM9/18/11

to cython...@googlegroups.com

Den 18.09.2011 10:42, skrev mark florisson:

Won't volatile force a read from memory every time you try to read the variable?

Yes it will, but not with a memory fence.

I did not properly understand what "volatile" does until recently, and it is not what most of us expect. The only thing "volatile" will do is to prevent an optimizing compiler from storing the variable in a register. It does not enforce a read or write to main memory. And a modern computer has hierarchical memory (i.e. RAM and several cache layers on the CPU). Forgive me if this is already known to you (it might very well be common knowledge):

If you dereference a pointer a lot, and it does not point to a volatile value, an optimizing compiler can legally decide to keep a local copy in a register. This will typically happen the compiler can decide that the pointer is not aliased. If you keep on polling *flag in a loop, the compiler can decide to store the value *flag temporarily in a register, and you will not detect that another thread has changed the value. However: Volatile does not ensure that the value is written to RAM if there is hierarchical memory. It only prevents register allocation.

Getting a value from cache is very fast, so that's not a problem.

To read or write all the way down to main memory (RAM), we need to use a "full memory fence". That is, Interlocked* API on Windows or __synch_* primitives on GCC. Setting up a memory fence is much more expensive than merely reading from memory (cache or RAM) without a fence (by a factor of 10 or 100 in terms of CPU cycles). So what a "flush" pragma does in OpenMP, is to set a value with full memory fence (i.e. acquire a lock), issue a memcpy to synchronize the local copy of the shared variable, and release the lock (again with a full fence). The memcpy might not reach down to RAM, but the hardware will make sure that hierachical memory on different processors are kept in synchrony.

What happens then with reading or writing a volatile value *flag, is that the value is intially written to a location in L1 or L2 cache instead of a temporary register. That's all volatile will do for us.

But then the hardware will respond like this: The cache line with the address of the variable (typically less than 128 bytes) is tagged dirty. If two processors share cache line (e.g. they both are polling the value of *flag), they stop whatever they are doing and synchronize their cache with RAM. This is called "false sharing". But in this case, it only happens when someone writes to *flag, not when *flag is polled. When *flag is polled, and not changed since the previous poll, you just get a value quickly from the nearest cache.

If two threads access the same variable simultaneously, it can result in a garbage value being written or read. But here there is only two possible values, 0 or 1, so I don't think it matters. The worst thing that can happen is that the thread reading garbage competes one iteration in surplus, which is not very relevant.

Sturla

mark florisson

unread,

Sep 18, 2011, 9:47:40 AM9/18/11

to cython...@googlegroups.com

I thought false sharing meant that you access data that never changes,
but your cache line gets flushed because other (unrelated) data
happens to be in the same cache line and got written to by another
party.

> If two threads access the same variable simultaneously, it can result in a
> garbage value being written or read. But here there is only two possible
> values, 0 or 1, so I don't think it matters. The worst thing that can happen
> is that the thread reading garbage competes one iteration in surplus, which
> is not very relevant.
>
> Sturla
>
>
>

We also assign other values, basically 1, 2, or 3 (the default is 0,
i.e. "don't do anything). Could sig_atomic_t help here? Although I
think the standard just specifies it to be atomic with respect to
signal handlers.

Anyway, thanks for your explanation, I knew most of it, but it's good
to get a complete picture. So if an OpenMP flush ensures that
everything is synchronized, then what exactly does a flush for reading
do? Does it use this __synch__ to invalidate the cache (for what
reason, the hardware should have already done that)?

Robert L Cloud

unread,

Sep 18, 2011, 4:15:25 PM9/18/11

to cython-users

Hi guys, I've parallelized the solver I posted earlier in this thread
and have graphed the runtimes vs number of threads. I realized very
excellent speedup, nearly linear. The new code and the graphs are at
the following link
http://www.rcloud.me/2011/09/18/parallel-cython-implementation/

Robert

mark florisson

unread,

Sep 18, 2011, 4:27:28 PM9/18/11

to cython...@googlegroups.com

Cool, thanks for sharing :)

Robert L Cloud

unread,

Sep 18, 2011, 6:04:44 PM9/18/11

to cython-users

I've run into some interesting results regarding the parallel cython
Warmup runs have had a tremendous performance impact when running
sequential cython.

In fact, because I didn't know this, this page was in error(now fixed)
http://www.rcloud.me/2011/09/18/cython-implementation/

also at that page there is a comparison of speeds resulting from using
intel's compiler vs the gnu compiler to make the cython shared
libraries. Both use optimization level 3 in the respective
compiler.

However, warmup runs have virtually no impact on the performance using
parallel cython, even when using a single thread. Just an interesting
observation I thought I would share. Science is fun!

Robert L Cloud

unread,

Sep 18, 2011, 7:41:30 PM9/18/11

to cython-users

I did some poor analysis in my first post about parallel cython, and
did not graph against the regularly compiled sequential cython program
running without openmp. I found that the sequential cython program
was faster than a two thread openmp implementation using prange.
Using 4 and 8 cores, however, the openmp code ran faster. These tests
were done with the intel compiler. I've posted the full code, in
hopes someone can determine a way to make the baseline openmp code
faster. They should also be run with the gnu compiler to see if there
is any difference.

The updated analysis can be found here:
http://www.rcloud.me/2011/09/18/parallel-cython-implementation/

mark florisson

unread,

Sep 19, 2011, 4:45:44 AM9/19/11

to cython...@googlegroups.com

That's surprising, I'll see if I can look into that tonight. There is
definitely some overhead to the OpenMP as it is generic with respect
to start/stop/step. In OpenMP you can't have an empty loop and you
don't know which way you're iterating (ascending or descending, i.e.,
whether you need the sign '<' or '>'), so what it does is it computes
the amount of steps it will have to take, and then computes the index
in the loop body.

I see that iterations over range() without a step are optimized, but
range() with a step isn't. If this performance difference can be
attributed to the index computation, then we could do the same thing
as range for prange.
There's also the initialization of privates to invalid values each iteration.

It might also be that due to OpenMP the C compiler can't make certain
optimizations. That's just an assumption, I'll have a look tonight.

Dag Sverre Seljebotn

unread,

Sep 19, 2011, 5:50:28 AM9/19/11

to cython...@googlegroups.com

One should also take into account the O(1) overhead of entering the loop
and spawning a thread etc. It's a good idea to increase the number of
iterations and see if the difference gets smaller...

Dag Sverre

mark florisson

unread,

Sep 19, 2011, 5:53:03 AM9/19/11

to cython...@googlegroups.com

On 19 September 2011 10:50, Dag Sverre Seljebotn

Yeah, but his runs take at least ~40 seconds.

Sturla Molden

unread,

Sep 19, 2011, 10:26:46 AM9/19/11

to cython...@googlegroups.com

Den 18.09.2011 15:47, skrev mark florisson:
> We also assign other values, basically 1, 2, or 3 (the default is 0,
> i.e. "don't do anything).

Then leave it as it is :-)

> So if an OpenMP flush ensures that everything is synchronized, then
> what exactly does a flush for reading do?

I doesn't. We have to flush before reading.

Sturla

mark florisson

unread,

Sep 19, 2011, 3:38:32 PM9/19/11

to cython...@googlegroups.com

Ah, so you mean to say that you need it because it may not be
volatile, so the reader may have it in a register. That makes sense,
thanks.

mark florisson

unread,

Sep 19, 2011, 4:31:33 PM9/19/11

to cython...@googlegroups.com

I didn't actually look very well at your code last time, but you're
calling the function with the prange lots of times, so there is always
communication overhead even if OpenMP is using a thread pool. My
results were not quite as bad though:

range() : 34.2953708172
1 thread : 43.5137341022
2 threads : 24.4000601768
4 threads : 15.2648839951

So perhaps icc's OpenMP runtime library doesn't keep a thread pool
around? I don't know, I don't use that compiler. As you can see the
sequential OpenMP version is a little slower though. However, if I
tweak the C code (i.e. don't initialize privates to invalid values,
don't compute the index variable but update it directly in the for
loop (there is no step) and remove some other sanity checks) I get the
exact same results.

Now if I remove the OpenMP pragmas but use the code from prange() I
get the exact same result as for range()! So I'm just going to blame
the C compiler and runtime libraries here, I don't think there's much
I can do to improve performance. My suggestion is to try and rewrite
your loop as one big parallel prange and not call a function with a
prange a lot of times. My other suggestion is to use a better
compiler/runtime library.

mark florisson

unread,

Sep 19, 2011, 4:37:04 PM9/19/11

to cython...@googlegroups.com

Also, we have to consider Amdahl's Law, you're doing iteration and
calls from Python and the prange loop is certainly not the only place
where work is taking place. So it's not a very good benchmark, you
should compare only the range() agains only the prange() and ignore
the sequential stuff.

Sturla Molden

unread,

Sep 19, 2011, 5:32:14 PM9/19/11

to cython...@googlegroups.com

Den 19.09.2011 21:38, skrev mark florisson:
> Ah, so you mean to say that you need it because it may not be
> volatile, so the reader may have it in a register. That makes sense,
> thanks.

If you declare a shared variable volatile, OpenMP will do an implicit
flush before each access.

Dag Sverre suggested a private pointer instead of a shared variable.
That could result in a register allocation if the compiler decides the
pointer has no alias in the loop. If we dereference the same pointer a
lot, an an optimizing compiler can use a register temporarily. We would
need to declare the target volatile to avoid that.

A volatile target for a private pointer does not ensure synchronization
with main memory, though. Only a full memory fence (e.g. flushing a
shared variable) will do that.

Sturla

Robert L Cloud

unread,

Sep 19, 2011, 9:48:46 PM9/19/11

to cython-users

Y'all might be interested in this which compares the performance of
Cython to PyOpenCL. Even a sequential Cython program exceeds my
(admittedly poorly implemented) OpenCL program on my machine.

http://www.rcloud.me/2011/09/20/pyopencl-implementation/

Reply all

Reply to author

Forward