https://python.g-node.org/wiki/parallel#parallel_cython
The material is free, so anybody is encouraged to make use of it for
learning purposes, or as the basis for other courses/talks.
Cheers!
--
Francesc Alted
Looks good! Next time you might want to announce it in advance to have
more interested people join the presentation, I was actually not so
far away :).
Thanks!
Note that there's a bug in the slides: "nogil" on a function indicates
that it can safely be called without holding the GIL, but it does not
release the GIL.
Myself I lay the fault strictly on Cython being too complicated here.
(There was a thread a while back where we discussed how we could perhaps
simplify this)..
Dag Sverre
2011/9/16 mark florisson <markflo...@gmail.com>:
Well, unfortunately the places for the school were limited to 30
people, and you need to apply at least several months before. Even
with this, you still need to be selected because we normally receive
way more applications than places we have (this year the ratio has
been 4:1).
At any rate, if you are interested, in future editions we may call you
as a lecturer instead :)
--
Francesc Alted
Ah, that's a good point, thanks! Will try to remember it for the next time.
> Myself I lay the fault strictly on Cython being too complicated here.
>
> (There was a thread a while back where we discussed how we could perhaps
> simplify this)..
Exactly. In fact, I was bit surprised about the cdef function needing
the 'nogil' qualifier, and asked to the list. I don't remember the
outcome of the discussion, but my understanding was that this was not
going to change anytime soon.
--
Francesc Alted
Oh wow, didn't know that. Might be a good idea to get a bigger room
next time. I'm currently at the university of Edinburgh, which has a
HPC department (they have HECTOR, the UK supercomputer). Perhaps they
would also welcome presentations. If there will be any such plan in
the future, I could talk to them.
Actually it is not a problem of small rooms (in fact, the room we are
right now could easily host 2x or 3x more people). The thing is that
we are auto-limiting ourselves in order to better tutorise students.
Anyway, we are still looking for a new host for the next edition of
the school, so if your university would be interested in hosting it,
please tell us about that.
Thanks!
--
Francesc Alted
Ok, for when is that planned? Is there a website with some more
information about this (I think the website you linked only contains
information for this session)?
> --
> Francesc Alted
>
Oh, there is not a special scheduling for the next school, so timing
is flexible. And here it is the official web page of the school:
https://python.g-node.org/wiki/
BTW, the contact person is Tiziano Zito:
http://www.cognition.tu-berlin.de/zito/
If you detect interest in your institution, please contact him directly.
Thanks,
--
Francesc Alted
Ok, I'll ask around, let's see what happens.
And just for completeness, I think that:
for i in prange(N, nogil=True):
some computations here
should be a shortcut of:
with nogil:
for i in prange(N):
some computations here
Is that correct?
--
Francesc Alted
Yes, it actually releases the GIL, so those two are equivalent.
Ah, fine. And just out of curiosity, why is it necessary to release
the GIL before the parallel loop? It is because OpenMP requires the
calling thread to not become blocked? Or it is because some other
reason?
Thanks,
--
Francesc Alted
Well, if you keep the GIL, then you should allow GIL code, which means
you have to grab the GIL in each of the threads, which basically means
the entire point of the parallel loop is undone (it will in fact be
slower, unless you release the GIL inside the block) , so it's
basically to prevent the careless user from doing this possibly
non-obvious mistake.
You can still obtain the GIL in the parallel sections using the 'with
gil' block though, and you may even raise exceptions. Another
advantage of having a nogil prange is that the compiler will prevent
you from using python operations, so you're sure to use the fast
buffer indexing, etc.
Your code will also be better optimized if you don't have
break/continue/return/propagating exceptions, as it doesn't have to
flush a flag every iteration to see if it needs to stop. And basically
every python operation can raise an exception.
The OpenMP threads are registered with Python. I.e. they are
synchronized with the GIL unless you release it.
Sturla
... which means it is legal to use Python objects in an OpenMP thread if
it owns the GIL. Notice that the GIL can be re-acquired using "with gil"
instead of "with nogil".
Sturla
Well, with the cython.parallel support you *have* to release the GIL
(unless you are already in a nogil block, or if you are defined in a
nogil function that is called with or without the gil).
If you use the 'with gil' statement (or if you call a with gil
function), then the OpenMP threads will be registered with Python, but
your threads will run with the GIL released until you obtain it. This
is done to allow propagating exceptions, you don't really have to
worry about these details as a user.
Ok (after reading this several times), that makes sense.
Thanks for your patience,
--
Francesc Alted
What do you mean by "propagating exceptions" in the context of OpenMP?
OpenMP disallows jumps out of a block decorated with a pragma: goto,
setjmp/longjmp, and exceptions (C++ or Windows SEH) are considered evil.
An exception must be caught inside the OpenMP-block that raised it. You
cannot raise an exception from within an OpenMP thread and expect it to
"propagate" nicely out of the prange loop. Thus, how do we catch an
exception in a nogil block? I see numerous fail modes here...
Sturla
Here's a hint: It's normally a better strategy to assume that others are
at or above your own level of expertise, than to boastingly assume that
they are below you. That way you don't risk offending anybody. In
particular when said others have spent weeks implementing a feature.
For an answer, please type
python runtests.py --no-cleanup parallel
and then go look in the C source in BUILD. I assure you there's no
breaks or longjmp's to be found.
Dag Sverre
Sorry, I did not mean to offend you.
I know that Python exceptions are not implemented with jump and stack
unroll statements (unlike C++). But still, it is nice to know which
thread will eventually get the Python exception, and what the other
threads will do.
But I guess I could trust you and Mark to having ensured that it
eventually ends up in the mother thread, so the programmer need not
worry which thread gets the exception at all. And if so, Cython would be
the only language (that I know of) where exception handling actually
works with OpenMP code. Which again means that parallel.prange does not
suffer from the biggest disadvantage of OpenMP, which is no standardized
error handling. A C++ programmer cannot raise an exception inside an
OpenMP parallel block and catch it outside. A C or Fortran programmer
cannot break, goto or longjmp out of an OpenMP parallel block. But if
you made sure that Cython can safely raise a Python exception inside a
parallel loop, and all threads will shut down and exit gracefully (more
or less), that is a major argument for using parallel.prange instead of
OpenMP in C, C++ or Fortran.
I would really like to run a parallel.prange loop inside a "try" block,
and catch an exception if it fails -- without having to worry about
which thread raised or caught the exception.
Sturla
You can break, continue, return and raise exceptions. Currently nested
parallelism is disabled, so it doesn't matter which thread raises the
exception, it re-raises it in the master thread after the parallel
construct. A prange block is exited by jumping to the end of the loop
body and by then skipping any subsequent iterations, a parallel
section by jumping to the end and then "falling off" the end of the
block. Of course you may use any combination of try/except/finally
(try/finally works in nogil mode, try/except doesn't).
Basically, you can break in multiple ways out of a parallel/prange
section, but it will always prefer exceptions over return or break
(e.g. if you raise at the same time in one thread as you return in
other thread). You can use try/except all you want around any such
section, and there's also nogil try/finally of course. If you raise
multiple exceptions you will simply get the first one (i.e., it's
undefined which one you get as the executed order of the iterations is
undefined).
BTW, this also works with nested parallelism, where the "mother
thread" is simply the master thread of the inner parallel section,
i.e. the encountering thread. It's gcc that was the problem.
Yes, as you say, other threads are more or less gracefully shut down. I
don't remember the exact rules if the documentation doesn't say, it was
all Mark's work.
I think each thread (except threads that raises) will complete its
current loop iteration, and that if multiple threads raise an exception
then all but the first exception will be ignored. But I'm not totally
sure without checking the docs or generated C sources (yes, I'm lazy too
:-) ).
But I'm sure that eventually all threads are exited and the first
exception raised ends up in the scope and thread that invoked the prange
in the first place.
Dag Sverre
Thanks for the explanation :-)
This kind of coding style would be impossible with C++ and OpenMP. There
we need to write error handling the same way you explained that Cython
generates C. I have used OpenMP with C++ and Fortran 95 at lot, and the
way I must do error handling always annoys me. This sounds much better.
I cannot recall ever having used a nested parallel block. But it is
useful for authours of libraries like MKL and ACML of course (but they
are not using Cython).
Sturla
OpenMP puts an implicit barrier synchronization at the end of a parallel
block. So when the exception is caught outside the prange loop, we can
be sure that all threads have shut down (it doesn't matter how).
If you don't make sure the other threads shut down, and just let them
complete their iterations, the program would just wait on the implicit
barrier, possibly for a very long time. This would be easy to notice. A
test case could just use a timer and see how long it takes for the
exception to be caught.
One can use an atomic read/write on a shared variable as a stop flag in
a nogil context, but it would often be inefficient to check it on every
iteration of prange. It's one of those things that could be controlled
with a keyword argument to prange. And every GIL aquirement ("with gil")
should preferably check if another thread has raised an exception.
Grabbing the GIL has a big overhead anyway, so it does not matter if we
check a stop flag as well.
Sturla
That's what it does, it writes to a shared flag and flushes it. Every
iteration does the check to see if it needs to skip the loop body. I'm
not sure you want another option for that, I think it's
micro-optimization that is too hard to explain concisely, I don't
think many people would be interest in that (unless you can provide
benchmarks that show that the check is actually slow).
> It's one of those things that could be controlled with
> a keyword argument to prange. And every GIL aquirement ("with gil") should
> preferably check if another thread has raised an exception. Grabbing the GIL
> has a big overhead anyway, so it does not matter if we check a stop flag as
> well.
>
> Sturla
>
>
We could additionally do that, have more of those checkpoints. It
would be very easy to implement.
The flush pragma provides a full memory fence, and will force a
synchronization to the main memory. But the other threads might still
have a local copy, until they flush as well.
If a particular code will work depends on:
1. The compiler -- the flag could be kept in a register.
2. The OpenMP implementation -- how shared objects are stored, how a
flush is implemented.
3. The hardware -- a multi-core CPU with shared cache might behave
differently from multiple processors with private caches.
The flag must be flushed before it is checked, which is the expensive
part I was taking about. Otherwise you have no guarrantee that the value
you read is consistent with the value stored in memory.
See e.g. this example:
http://msdn.microsoft.com/en-us/library/sz9sd6et%28v=vs.80%29.aspx
Sturla
Yes, all threads flush the variable at the end of their loop body each
iteration. So how much of an impact are we talking here? We can't rely
on just with gil functions/statements as checkpoints (they may not
even be present), so "every iteration" is the plan we chose. If you
can provide a benchmark that shows we need an option to disable the
check, I'd be happy to implement it (note: this check is only in place
if you break/return/propagate an exception).
And please, before you try to teach me more things I already know,
could you inspect the generated code? It already took all these
considerations you've been listing into account. So if you find that
we can actually generate better code after inspecting it, we're open
to suggestions.
A full memory fence (Interlocked*) on x86 takes between 150 and 250 CPU
cycles, as I understand it. That should be comparable to about four or
five floating point additions. With full contention on a 4 core CPU, it
could e.g. be in the order of 16-20 floating point additions per
iteration of the loop. It might not be noticed, but it depends on the
tightness of the prange loop.
Sturla
Indeed there seems to be quite some overhead, it's nearly twice as
fast without the flush for a loop that doesn't do anything. But if you
need to do something simple then you wouldn't be using
break/return/acquiring the GIL I assume (so the check wouldn't be
there), and otherwise I'm assuming the flush is negligible compared to
the work done in the loop, so I'm not sure if this is a big deal.
With a 'with gil' block in there the speed difference is about 0.3%.
Yes. If there's a better way of getting the same semantics, of course
one should do it. But the "manual switch" is simply to avoid raising
exceptions and code things the old-fashioned way, if the loop is very
tight and yet is able to fail.
Dag Sverre
Yes, this is one of many things that makes prange much better than a
simple transliteration of OpenMP into Cython. Note that all this extra
code is only used iff there's a (possibly exception raising) with gil
block or other non-standard exit. If flushing on every iteration is an
issue, I imagine we could flush only when it's written to or on every
Nth iteration otherwise.
- Robert
Well, you can't do it after it's being written to only, as you don't
know when it will be written to, that's why you check the flag :)
Every Nth iteration is possible though. I'm still unconvinced it's an
issue at all though.
Yeah. I meant a thread flushes immediately it writes.
> Every Nth iteration is possible though.
But of course one would want to avoid %N, that can be expensive too.
Probably best to do a counter down to 0 and then check.
> I'm still unconvinced it's an
> issue at all though.
Your timings seem to indicate it's not, at least on your platform. And
much, *much,* better than forcing users to worry about the
restrictions in breaking out of a loop themselves.
- Robert
Right, it does that, but OpenMP requires that other threads (that need
to read it) should also flush the variable. So both the writer and the
readers need to flush.
Yes, I understand that. (Flush is a bad name, should be "sync" or
something.) I meant the writer flushes, the readers flush every Nth
iteration (if it is a performance issue to flush every iteration).
- Robert
Ah, ok. Indeed, it wouldn't make sense to flush your write any time
later than immediately.
The specification of OpenMP seems to indicate that a shared variable
should be consistent across all threads after a flush. But in practice
it works like Mark says, that also reader threads must flush to get a
consistent view of the variable.
In practice, a flush synchronizes a shared variable with RAM, but might
not affect the view that other threads have of the shared variable,
countrary to making the views consistent across all threads. The OpenMP
docs does not indicate that reader threads must flush, a flush should
only be needed to commit a write, but it might be that the specification
was impossible or too awkward to implement.
Sturla
Actually, Cython doesn't use Python's range here either (as the bounds
and index variable are typed, it can turn this into a C for loop).
> The solution would be to use
> prange on the x loop and release the gil, but from my limited parallel
> programming experience, it would be better to decompose with larger
> blocks(the y loop).
>
> Can I release the gil with the outer loop and use prange, and then
> reclaim the gil with the inner loop? If even possible, might there be
> significant overhead by doing this?
>
> What would be the recommended decomposition for this particular
> problem.
Put the prange on the outer loop, and it should just work.
> This function must be called many thousands of times until
> my system converges.
You may also want to consider making your solve function into a cpdef
function, for faster calling.
> Lastly, thanks for all the great work with Cython.
Thanks.
> I greatly
> appreciate it as a bridge between python and high performance and with
> the new parallel features, it somewhat complicates the arguments for
> multi language programming. I mean, in the project i'm associated
> with, we are using Fortran for the heavy lifting, but aren't even
> using shared memory programming, only distributed. We are constantly
> reevaluating our methods and now with trivially easy shared memory
> programming coupled with python's expressivity, it may sway the
> balance away from using Fortran except for interfacing with some
> legacy codes.
I think you're not the only one in that boat...
- Robert
It should possibly work, if we don't need to synchronize reads and
writes. (I don't think we do.) But we would need the volatile qualifier,
just in case someone has a super smart compiler. If we do this,
synchronization with RAM should only happen when the cache line gets
dirty (i.e. after a write).
Sturla
That's not how I read it (I quote from the OpenMP 3.0 spec):
"The flush operation provides a guarantee of consistency between a
thread’s temporary view and memory. Therefore, the flush operation can
be used to guarantee that a value written to a variable by one thread
may be read by a second thread. To accomplish this, the programmer
must ensure that the second thread has not written to the variable
since its last flush of the variable, and that the following sequence
of events happens in the specified order:
1. The value is written to the variable by the first thread.
2. The variable is flushed by the first thread.
3. The variable is flushed by the second thread.
4. The value is read from the variable by the second thread."
Won't volatile force a read from memory every time you try to read the variable?
Won't volatile force a read from memory every time you try to read the variable?
I thought false sharing meant that you access data that never changes,
but your cache line gets flushed because other (unrelated) data
happens to be in the same cache line and got written to by another
party.
> If two threads access the same variable simultaneously, it can result in a
> garbage value being written or read. But here there is only two possible
> values, 0 or 1, so I don't think it matters. The worst thing that can happen
> is that the thread reading garbage competes one iteration in surplus, which
> is not very relevant.
>
> Sturla
>
>
>
We also assign other values, basically 1, 2, or 3 (the default is 0,
i.e. "don't do anything). Could sig_atomic_t help here? Although I
think the standard just specifies it to be atomic with respect to
signal handlers.
Anyway, thanks for your explanation, I knew most of it, but it's good
to get a complete picture. So if an OpenMP flush ensures that
everything is synchronized, then what exactly does a flush for reading
do? Does it use this __synch__ to invalidate the cache (for what
reason, the hardware should have already done that)?
Cool, thanks for sharing :)
That's surprising, I'll see if I can look into that tonight. There is
definitely some overhead to the OpenMP as it is generic with respect
to start/stop/step. In OpenMP you can't have an empty loop and you
don't know which way you're iterating (ascending or descending, i.e.,
whether you need the sign '<' or '>'), so what it does is it computes
the amount of steps it will have to take, and then computes the index
in the loop body.
I see that iterations over range() without a step are optimized, but
range() with a step isn't. If this performance difference can be
attributed to the index computation, then we could do the same thing
as range for prange.
There's also the initialization of privates to invalid values each iteration.
It might also be that due to OpenMP the C compiler can't make certain
optimizations. That's just an assumption, I'll have a look tonight.
One should also take into account the O(1) overhead of entering the loop
and spawning a thread etc. It's a good idea to increase the number of
iterations and see if the difference gets smaller...
Dag Sverre
Yeah, but his runs take at least ~40 seconds.
Then leave it as it is :-)
> So if an OpenMP flush ensures that everything is synchronized, then
> what exactly does a flush for reading do?
I doesn't. We have to flush before reading.
Sturla
Ah, so you mean to say that you need it because it may not be
volatile, so the reader may have it in a register. That makes sense,
thanks.
I didn't actually look very well at your code last time, but you're
calling the function with the prange lots of times, so there is always
communication overhead even if OpenMP is using a thread pool. My
results were not quite as bad though:
range() : 34.2953708172
1 thread : 43.5137341022
2 threads : 24.4000601768
4 threads : 15.2648839951
So perhaps icc's OpenMP runtime library doesn't keep a thread pool
around? I don't know, I don't use that compiler. As you can see the
sequential OpenMP version is a little slower though. However, if I
tweak the C code (i.e. don't initialize privates to invalid values,
don't compute the index variable but update it directly in the for
loop (there is no step) and remove some other sanity checks) I get the
exact same results.
Now if I remove the OpenMP pragmas but use the code from prange() I
get the exact same result as for range()! So I'm just going to blame
the C compiler and runtime libraries here, I don't think there's much
I can do to improve performance. My suggestion is to try and rewrite
your loop as one big parallel prange and not call a function with a
prange a lot of times. My other suggestion is to use a better
compiler/runtime library.
Also, we have to consider Amdahl's Law, you're doing iteration and
calls from Python and the prange loop is certainly not the only place
where work is taking place. So it's not a very good benchmark, you
should compare only the range() agains only the prange() and ignore
the sequential stuff.
If you declare a shared variable volatile, OpenMP will do an implicit
flush before each access.
Dag Sverre suggested a private pointer instead of a shared variable.
That could result in a register allocation if the compiler decides the
pointer has no alias in the loop. If we dereference the same pointer a
lot, an an optimizing compiler can use a register temporarily. We would
need to declare the target volatile to avoid that.
A volatile target for a private pointer does not ensure synchronization
with main memory, though. Only a full memory fence (e.g. flushing a
shared variable) will do that.
Sturla