Detect at runtime if OpenMP is possible ...

Jerome Kieffer

Jun 30, 2015, 8:08:42 AM

to cython...@googlegroups.com

Dear Cythoners,

I am trying to write efficient code that runs everywhere, using OpenMP
whenever possible. Serial code is often faster than parallel code on a
single core, so I would like to select the code path at _runtime_.

Do you think it would be possible (and useful) to obtain the number of
available threads without linking to OpenMP?

References to "threadsavailable" are commented out in several places
in the Cython source code. Could you tell me why it is hidden?
There must be a reason.

Thanks in advance. Best regards

--
Jérôme Kieffer
tel +33 476 882 445

Sturla Molden

Jun 30, 2015, 8:41:05 AM

to cython...@googlegroups.com

Jerome Kieffer <goo...@terre-adelie.org> wrote:

> I try to write efficient code running everywhere ... using OpenMP
> whenever possible. Often serial code is faster than parallel code on a
> single core so I would like to select the path at _runtime_.

Any decent OpenMP runtime will do that automatically.

> Do you think it would be possible/interesting to have the number of
> threads available without linking to OpenMP ?

No. If you use prange in Cython you must link to OpenMP. In that case you
can query OpenMP for the number of threads with omp_get_num_threads().
Note that the return value will typically be 1 outside a parallel block.
This also means that outside a prange block in Cython there is just one
thread.

If you use threading.Thread instead of prange you can create as many
threads as your computer allows. That is typically limited by the amount of
RAM. multiprocessing.cpu_count() will give you the number of processors.
You don't need to query for the number of threads because you presumably
know how many you have spawned.
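
As a rough illustration of selecting the code path at runtime from Python, the sketch below uses multiprocessing.cpu_count() as mentioned above. The function name pick_code_path, the n_cpus override and the min_items_per_thread threshold are hypothetical and would need tuning for a real workload.

```python
import multiprocessing


def pick_code_path(n_items, n_cpus=None, min_items_per_thread=10000):
    """Return "serial" or "parallel" for a workload of n_items elements.

    n_cpus can be passed explicitly (useful for testing); by default it is
    taken from multiprocessing.cpu_count().
    """
    if n_cpus is None:
        n_cpus = multiprocessing.cpu_count()
    # On a single core, or for small inputs, threading overhead usually
    # outweighs any gain, so fall back to the serial implementation.
    if n_cpus < 2 or n_items < 2 * min_items_per_thread:
        return "serial"
    return "parallel"
```

The returned string would then be used to dispatch to either a serial or a threaded implementation of the same routine.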

Sturla

Jerome Kieffer

Jun 30, 2015, 12:25:25 PM

to cython...@googlegroups.com

On Tue, 30 Jun 2015 12:40:57 +0000 (UTC)
Sturla Molden <sturla...@gmail.com> wrote:

> Jerome Kieffer <goo...@terre-adelie.org> wrote:
>
> > I try to write efficient code running everywhere ... using OpenMP
> > whenever possible. Often serial code is faster than parallel code on a
> > single core so I would like to select the path at _runtime_.
>
> Any decent OpenMP runtime will do that automatically.

As a developer on Linux I agree ... but it fails on MacOSX (using xcode5/6).
... does it mean MacOSX is not decent ?

> > Do you think it would be possible/interesting to have the number of
> > threads available without linking to OpenMP ?

> No. If you use prange in Cython you must link to OpenMP.

Cython's prange works well without linking to OpenMP ... it is just the serial range.
It would be nice if the code knew at compile time whether the compiler is OpenMP-capable or not.

IFDEF _OPENMP:
    include "ext_omp.pxi"
ELSE:
    include "ext_nomp.pxi"

Whereas if linking to the library fails, the build simply crashes, which is not a good "user experience".
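
One way to avoid such a hard build failure is to probe the compiler before building the extension. The sketch below is only an assumption about how a setup.py could do this: the function name compiler_supports_openmp, the default compiler name "cc" and the probe program are hypothetical, and "-fopenmp" is simply the GCC-style flag.

```python
import os
import shutil
import subprocess
import tempfile

# Tiny C program that only compiles and links when OpenMP is really usable.
PROBE_SOURCE = """
#include <omp.h>
int main(void) { return omp_get_max_threads() > 0 ? 0 : 1; }
"""


def compiler_supports_openmp(cc="cc", flag="-fopenmp"):
    """Return True if `cc` can compile and link PROBE_SOURCE with `flag`."""
    if shutil.which(cc) is None:
        return False  # no such compiler on PATH
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "probe.c")
        exe = os.path.join(tmp, "probe")
        with open(src, "w") as handle:
            handle.write(PROBE_SOURCE)
        try:
            result = subprocess.run(
                [cc, flag, src, "-o", exe],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except OSError:
            return False  # compiler could not be started
        return result.returncode == 0
```

A build script could call this once and only add the OpenMP flags (and the OpenMP variant of the code) when it returns True.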

> In that case you
> can query OpenMP for the number of threads, which is omp_get_num_threads().

This is typically the code which fails (to link) on MacOSX.

> If you use threading.Thread instead of prange you can create as many
> threads as your computer allows. That is typically limited by the amount of
> RAM. multiprocessing.cpu_count() will give you the number of
> processors. You dont need to query for the number of threads because
> you presumably know how many you have spawned.

MacBooks have 2 (or 4) cores, but OpenMP may not be usable.

Thanks for your hint.
Cheers,

Sturla Molden

Jun 30, 2015, 5:27:17 PM

to cython...@googlegroups.com

Jerome Kieffer <goo...@terre-adelie.org> wrote:

>> Any decent OpenMP runtime will do that automatically.
>
> As a developer on Linux I agree ... but it fails on MacOSX (using xcode5/6).
> ... does it mean MacOSX is not decent ?

On which compiler?

Clang shipped with Xcode does not support OpenMP. Apple wants us to use GCD
instead.

Intel and GCC do have OpenMP, as they do on Linux and Windows.

> Macbooks have 2(4) cores but OpenMP may not be useable.

As I said, it depends on the compiler. I have GCC 4.9 (from the gfortran
wiki) and Intel C++, in addition to Xcode and clang, and OpenMP works fine
with both of these compilers.

Compilers from Apple do not have OpenMP, though. This is presumably
because Apple has invested heavily in two competing technologies (GCD and
OpenCL, the latter is not just for GPU), and wants us to use those on Mac
OS X and iOS. As it happens, GCD performs better than OpenMP and TBB even
in Intel's own demonstrations, but unfortunately Cython cannot use it.

Sturla

Sturla Molden

Jun 30, 2015, 6:34:24 PM

to cython...@googlegroups.com

Jerome Kieffer <goo...@terre-adelie.org> wrote:

> Cython's prange works well without linking to OpenMP ... it is just the serial range.
> It would be nice that at compile time the code knows if the compiler is
> OpenMP capable or not.

The symbol _OPENMP is defined if OpenMP is available, which you can query
from Cython by calling a C helper function:

int has_openmp()
{
#ifdef _OPENMP
    return 1;
#else
    return 0;
#endif
}

You still have to know if your compiler supports OpenMP, including what
compile flags and linker flags to use.

The C code cannot know if the compiler is "OpenMP capable", though. The
symbol will not be defined if the code is not compiled with -fopenmp,
-openmp or /openmp (or whatever your compiler wants). The compiler can
still be OpenMP capable, but _OPENMP will not be defined.
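
To make the point about compiler-specific flags concrete, here is a minimal sketch of a lookup helper that a build script might use. The flags are the ones named above; the table keys ("gcc", "intel", "msvc") and the function name openmp_flags are hypothetical, and the assumption that MSVC needs no separate linker flag should be checked against your toolchain.

```python
# Hypothetical lookup table: compiler family -> (compile flags, link flags).
OPENMP_FLAGS = {
    "gcc": (["-fopenmp"], ["-fopenmp"]),
    "intel": (["-openmp"], ["-openmp"]),
    "msvc": (["/openmp"], []),  # assumption: no extra linker flag needed
}


def openmp_flags(compiler_family):
    """Return (compile_args, link_args); empty lists mean a serial build."""
    compile_args, link_args = OPENMP_FLAGS.get(compiler_family, ([], []))
    # Return copies so callers cannot modify the table by accident.
    return list(compile_args), list(link_args)
```

An unknown compiler family yields two empty lists, so _OPENMP stays undefined and the generated code falls back to the serial loop.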

> IFDEF _OPENMP:
> include "ext_omp.pxi"
> ELSE:
> include "ext_nomp.pxi"

Here you are probably confusing Cython defines with C preprocessor
defines. You cannot IFDEF a C preprocessor symbol like _OPENMP in Cython.

Sturla

Stefan Behnel

Jul 1, 2015, 1:06:44 AM

to cython...@googlegroups.com

Sturla Molden wrote on 30.06.2015 at 23:27:
> Compilers from Apple does not have OpenMP, though. This is presumably
> because Apple has invested heavily in two competing technologies (GCD and
> OpenCL, the latter is not just for GPU), and wants us to use those on Mac
> OS X and iOS. As it happens, GCD performs better than OpenMP and TBB even
> in Intel's own demonstrations, but unfortunately Cython cannot use it.

It doesn't really look like that's carved in stone, though. If someone came
up with a GCD implementation for prange loops, and maybe also a bit of
abstraction for things like finding out the current number of threads etc.,
this could be added.

BTW, I had to look up GCD, which does not mean Greatest Common Divisor in
this context. ;)

https://en.wikipedia.org/wiki/Grand_Central_Dispatch

From the examples, it's not clear to me how this would compete with OpenMP,
but I haven't tried it in any way.

Stefan

Jerome Kieffer

Jul 1, 2015, 3:14:09 AM

to cython...@googlegroups.com

On Tue, 30 Jun 2015 21:27:00 +0000 (UTC)
Sturla Molden <sturla...@gmail.com> wrote:

> Jerome Kieffer <goo...@terre-adelie.org> wrote:
>
> >> Any decent OpenMP runtime will do that automatically.
> >
> > As a developer on Linux I agree ... but it fails on MacOSX (using xcode5/6).
> > ... does it mean MacOSX is not decent ?
>
> On which compiler?
>
> Clang shipped with Xcode does not support OpenMP. Apple wants us to use GCD
> instead.

This is another issue: I do not develop code for myself but for other people.
PyPI only allows binary wheels for Windows (by the way, with limited
authentication & signatures ... security issues ahead!).
So MacOSX users have to install a compiler on their computer to build
the code; the easiest is Xcode, which is also the most natural one in a
MacOSX environment.

> Intel and GCC do have OpenMP, as they do on Linux and Windows.

I know all that. I simply do not want to impose a compiler; I want to
use each platform's natural one, to limit the trouble for end users, who already have to suffer through the compilation step:
MacOSX: Xcode
Windows: msvc
Linux: gcc

> As I said, it depends on the compiler. I have GCC 4.9 (from the gfortran
> wiki) and Intel C++, in addition to Xcode and clang, and OpenMP works fine
> with both of these compilers.

This is the reason why Fortran has been banned from all our projects:
the lack of a "native" compiler on Windows and MacOSX.

> Compilers from Apple does not have OpenMP, though. This is presumably
> because Apple has invested heavily in two competing technologies (GCD and
> OpenCL, the latter is not just for GPU), and wants us to use those on Mac
> OS X and iOS. As it happens, GCD performs better than OpenMP and TBB even
> in Intel's own demonstrations, but unfortunately Cython cannot use it.

My code makes heavy use of both Cython and OpenCL... but OpenCL is not
always the fastest solution to develop with. Moreover, I provide a
Cython fallback when OpenCL does not work, not the other way around.

By the way, thanks for your answer in the other mail, I will try right away.

Sturla Molden

Jul 1, 2015, 3:49:39 AM

to cython...@googlegroups.com

Stefan Behnel <stef...@behnel.de> wrote:

> It doesn't really look like that's carved in stone, though. If someone came
> up with a GCD implementation for prange loops, and maybe also a bit of
> abstraction for things like finding out the current number of threads etc.,
> this could be added.

It should be doable to implement prange on top of GCD.

The GCD threadpool is managed by the kernel. It cannot be controlled and
you cannot know (or should not care) how many threads it has. It is a
global resource in the operating system. The way we schedule work is to
enqueue work tasks, but we cannot control (or even know) how many threads
are in the kernel threadpool. In OpenMP the whole threadpool is implemented
in userspace, which means it can make more sense to manually control how
big it should be.

The GCD is in fact a threadpool associated with the kqueue, which is rather
important because it means it also provides us with I/O completion
ports. :D

> From the examples, it's not clear to me how this would compete with OpenMP,
> but I haven't tried it in any way.

I have used both. Here is a tl;dr comparison:

In principle they are very similar. Both consist of an extension to C which
implements closures and a thread pool. (That is, a parallel block in OpenMP
is a closure, though this might not be obvious at first sight.) With GCD a
syntax extension to C is used to define a closure (or an anonymous block if
you like to call it that), but the threadpool is not a syntax extension.

There are certain differences. In GCD you have to manage the workload
yourself. In OpenMP we have a 'schedule' clause. In GCD serial queues are
used instead of critical sections for synchronization.

As for ease of use they are about the same, but it takes some time to get
used to GCD. Code written to use OpenMP is slightly easier to read. OpenMP
pragmas are in principle non-intrusive and the same code will compile
without OpenMP. If you want conditional compilation with GCD it can be a
bit messy because the source code is affected. In both cases we can
parallelize loops without having to restructure or refactor the code,
because the body of the loop can be put in a closure (which is what
"#pragma omp parallel for" will do).

GCD also provides functions for parallel asynchronous I/O, by combining the
GCD threadpool and kqueue. Similar to I/O completion ports on Windows, the
asynch I/O functions in GCD report when an I/O operation is completed, not
when a file descriptor is ready. OpenMP has no facilities for parallel I/O.

GCD can scale better than OpenMP in several ways. One is the highly
efficient threadpool. Only 13 instructions (in assembly) are required to
enqueue a task and execute the task on the GCD threadpool. This overhead is
tiny compared to most OpenMP implementations (GNU, Intel, Microsoft), as
well as Intel TBB. If we are doing floating point computations, 13 non-fpu
instructions are so insignificant that they can be ignored almost
completely. With GCD we can in this case enqueue every single iteration of
a loop as an independent task, and the overhead is still likely to be
insignificant. Using "schedule(dynamic, 1)" is in comparison not a
good idea with current OpenMP implementations.

Another way in which GCD can scale better is the explicit task scheduling.
Because it is done manually we can better fit the task scheduling to the
problem. The smaller threadpool overhead also allows us to enqueue smaller
chunks for the same amount of overhead, which can result in better load
balancing.
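
The idea of explicit chunked scheduling can be sketched outside GCD as well. The example below is only an analogy in Python, using concurrent.futures.ThreadPoolExecutor as a stand-in for a thread pool; it is not GCD and says nothing about its overhead. The names chunked and parallel_sum_of_squares, and the choice of workload, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(n_items, chunk_size):
    """Yield (start, stop) index pairs covering range(n_items)."""
    for start in range(0, n_items, chunk_size):
        yield start, min(start + chunk_size, n_items)


def parallel_sum_of_squares(n_items, chunk_size, max_workers=4):
    """Enqueue one task per chunk and combine the partial results."""

    def work(bounds):
        start, stop = bounds
        return sum(i * i for i in range(start, stop))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Smaller chunks mean more tasks and better load balancing,
        # at the price of more per-task overhead.
        return sum(pool.map(work, chunked(n_items, chunk_size)))
```

Because of the interpreter lock, pure-Python work like this will not run faster with threads; the sketch only shows how the chunk size controls the granularity of the enqueued tasks.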

When I/O is involved there is no competition, because with OpenMP we have
to homebrew an ad hoc solution on top of whatever facilities the OS
provides. If we use OpenMP and IOCP on Windows, there will actually be two
threadpools involved, thus double overhead, and this double threadpool
design might not be good for cache use. Using OpenMP with epoll or kqueue
is awkward, particularly kqueue. With GCD we have one single thread pool,
managed by the kernel like IOCP. It serves both parallel task execution and
parallel I/O, and it uses the highest performance I/O facility on the
platform (kqueue on Mac and FreeBSD).

GCD is open source. The closure extension is integrated in clang, GCC and
Intel C++. The threadpool (libdispatch) is a free library. It is currently
available on Mac, iOS and FreeBSD. On FreeBSD we currently need to
recompile the kernel to use libdispatch with kqueue. There is also a port
to Linux (libxdispatch) which uses epoll instead of kqueue. There was an
attempt to port libdispatch to Windows, but I am not sure of its current
state. Apple has a Windows port of libdispatch in iTunes, but they have not
made it open source. The lack of publicly available Windows support for GCD
limits its usefulness. It is also a problem that Apple's official
libdispatch does not support Linux (including Android).

OpenMP works with Fortran and was actually designed with Fortran in mind.
GCD can be used from Fortran, but there will be no closures so do loops
cannot be parallelized without code refactoring.

Sturla

Sturla Molden

Jul 1, 2015, 4:16:28 AM

to cython...@googlegroups.com

Jerome Kieffer <goo...@terre-adelie.org> wrote:

> MacOSX: Xcode
> Windows: msvc
> Linux: gcc

Xcode is an IDE, not a compiler. You can use gcc or icc with Xcode as well.
If you mean clang, which is shipped with Xcode, it does not have OpenMP in
Apple's build.

(But Intel has made an OpenMP compiler for clang, you just need to build
clang from source.)

> This is the reason why Fortran has been banned from all our projects:
> the lack of "native" compiler on windows and MacOSX.

To what extent are GNU, IBM, HP, Sun/Oracle, Absoft, NAG, Lahey, Portland
or Intel compilers less native than Microsoft or clang compilers? There is
also a Microsoft Fortran compiler, but it is currently owned by HP
(Compaq).

Anyhow, Fortran can be a PITA, so I understand why you might not want it.

Sturla

Sturla Molden

Jul 1, 2015, 4:23:29 AM

to cython...@googlegroups.com

Sturla Molden <sturla...@gmail.com> wrote:

> The GCD threadpool is managed by the kernel. It cannot be controlled and
> you cannot know (or should not care) how many threads it has. It is a
> global resource in the operating system.

This, by the way, is why we cannot dynamically create a parallel queue in
GCD. We can dynamically create serial queues, but there is one global
parallel queue. This global parallel queue is not just related to kqueue,
in fact it *is* kqueue.

(This is hidden by the API though, we don't need to know how to use kqueue
to use GCD.)

Sturla

Jerome Kieffer

Jul 1, 2015, 11:04:55 AM

to cython...@googlegroups.com

On Tue, 30 Jun 2015 22:34:12 +0000 (UTC)
Sturla Molden <sturla...@gmail.com> wrote:

> Here you probably are confusing Cython defines with and C preprocessor
> defines. You cannot IFDEF a C preprocessor symbol like _OPENMP in Cython.

That's the point. Thanks for opening my eyes.

So the solution is to have a switch at the cythonization level, like a
define, and to set a "compile_time_env" via cythonize. I did not find how
to do this via the command line (cython).

One reference:
http://stackoverflow.com/questions/27273302/cython-conditional-compile-based-on-external-value-given-via-setuptools

and for the record:
https://github.com/kif/pyFAI/issues/214
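
The compile_time_env switch described above could look roughly like the following setup.py fragment. This is a sketch under assumptions: HAVE_OPENMP, the module name "ext" and the file "ext.pyx" are hypothetical, and "-fopenmp" is the GCC-style flag; only the compile_time_env keyword of cythonize is taken from the discussion.

```python
def openmp_build_settings(use_openmp):
    """Return (cythonize keyword arguments, extra compile/link flags)."""
    # HAVE_OPENMP is a hypothetical compile-time name. The .pyx file would
    # test it with:
    #     IF HAVE_OPENMP:
    #         include "ext_omp.pxi"
    #     ELSE:
    #         include "ext_nomp.pxi"
    cythonize_kwargs = {"compile_time_env": {"HAVE_OPENMP": bool(use_openmp)}}
    flags = ["-fopenmp"] if use_openmp else []
    return cythonize_kwargs, flags


def build_extensions(use_openmp):
    """Create the extension list to pass to setup(ext_modules=...)."""
    # Imports are deferred so the helper above works without Cython installed.
    from setuptools import Extension
    from Cython.Build import cythonize

    kwargs, flags = openmp_build_settings(use_openmp)
    ext = Extension(
        "ext",        # hypothetical module name
        ["ext.pyx"],  # hypothetical source file
        extra_compile_args=flags,
        extra_link_args=flags,
    )
    return cythonize([ext], **kwargs)
```

The value of use_openmp still has to come from somewhere, for example a command line option or a platform check in setup.py.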

Stefan Behnel

Jul 1, 2015, 11:43:38 AM

to cython...@googlegroups.com

Jerome Kieffer wrote on 30.06.2015 at 14:08:
> I try to write efficient code running everywhere ... using OpenMP
> whenever possible. Often serial code is faster than parallel code on a
> single core so I would like to select the path at _runtime_.
>
> Do you think it would be possible/interesting to have the number of
> threads available without linking to OpenMP ?

All internal references to OpenMP code in the C code that Cython generates
are protected by the _OPENMP macro, so you get a serialised loop when it's
compiled without OpenMP support.

In order to find out at runtime if it's compiled with or without OpenMP,
you can use a C header file like this, say, "openmp_fallback.h":

"""
#ifndef _OPENMP
#define omp_get_num_threads() 1
#define omp_in_parallel() 0
#endif
"""

and then do

"""
cdef extern from "openmp_fallback.h":
    pass

cimport openmp
print(openmp.omp_get_num_threads())
print(openmp.omp_in_parallel())
"""

I think it would be nice if "cython.parallel" provided some aliases for these.

Stefan

Jerome Kieffer

Jul 2, 2015, 7:31:45 AM

to cython...@googlegroups.com

Hi Stefan,

Your solution is definitely more elegant than any of the others proposed.
I tried it, but it does not compile. The generated C code contains:

#include "openmp_fallback.h"
#include "omp.h"
[...]
#ifdef _OPENMP
#include <omp.h>
#endif

If I remove the second line, it compiles on a Mac (specifically with the clang version provided by Apple, with OpenMP disabled).
We are almost there ... It would be wonderful to have something like this available from cython.parallel.

Cheers,
