Stefan Behnel <stef...@behnel.de> wrote:
> It doesn't really look like that's carved in stone, though. If someone came
> up with a GCD implementation for prange loops, and maybe also a bit of
> abstraction for things like finding out the current number of threads etc.,
> this could be added.
It should be doable to implement prange on top of GCD.
The GCD threadpool is managed by the kernel. It is a global resource in the
operating system, so we cannot control (and should not care) how many
threads it has. The way we schedule work is to enqueue work tasks. In
OpenMP the whole threadpool is implemented in userspace, which means it can
make more sense to manually control how big it should be.
GCD is in fact a threadpool associated with the kqueue, which is rather
important because it means it also provides us with I/O completion
ports. :D
> From the examples, it's not clear to me how this would compete with OpenMP,
> but I haven't tried it in any way.
I have used both. Here is a tl;dr comparison:
In principle they are very similar. Both consist of an extension to C which
implements closures and a thread pool. (That is, a parallel block in OpenMP
is a closure, though this might not be obvious at first sight.) With GCD a
syntax extension to C is used to define a closure (or an anonymous block if
you like to call it that), but the threadpool is not a syntax extension.
There are certain differences. In GCD you have to manage the workload
yourself. In OpenMP we have a 'schedule' clause. In GCD serial queues are
used instead of critical sections for synchronization.
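
To make the synchronization difference concrete, here is a minimal sketch of the OpenMP side. The function name count_above is made up for illustration. The shared counter is guarded by a critical section; with GCD the same update would instead be submitted to a serial queue.

```c
#include <assert.h>
#include <stddef.h>

/* Count how many values exceed a threshold (hypothetical example).
   The shared counter is protected by an OpenMP critical section. */
size_t count_above(const double *x, size_t n, double threshold)
{
    size_t count = 0;   /* shared between threads */
    long i;             /* loop index, private to each thread */

    assert(x != NULL || n == 0);

    #pragma omp parallel for
    for (i = 0; i < (long)n; i++) {
        if (x[i] > threshold) {
            #pragma omp critical
            {
                count++;   /* only one thread at a time runs this */
            }
        }
    }
    return count;
}
```

Without OpenMP the pragmas are ignored and the loop simply runs serially. For a plain counter like this a reduction clause would be the better choice; the critical section is used here only to show the construct.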
As for ease of use they are about the same, but it takes some time to get
used to GCD. Code written to use OpenMP is slightly easier to read. OpenMP
pragmas are in principle non-intrusive and the same code will compile
without OpenMP. If you want conditional compilation with GCD it can be a
bit messy because the source code is affected. In both cases we can
parallelize loops without having to restructure or refactor the code,
because the body of the loop can be put in a closure (which is what
"#pragma omp parallel for" will do).
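
As a rough sketch of what the conditional compilation looks like, here is the same loop written for both. The function name square_all and the macro USE_GCD are made up for this example; dispatch_apply and dispatch_get_global_queue are the libdispatch calls. Note how the GCD branch changes the source, while the OpenMP branch is just the serial loop with a pragma on top.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__APPLE__) && defined(__BLOCKS__)
#include <dispatch/dispatch.h>
#define USE_GCD 1   /* hypothetical macro, only for this sketch */
#endif

/* out[i] = in[i] * in[i] for all i (hypothetical example) */
void square_all(const double *in, double *out, size_t n)
{
    assert((in != NULL && out != NULL) || n == 0);
#ifdef USE_GCD
    /* GCD: the loop body becomes a block, one task per iteration */
    dispatch_apply(n,
        dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0),
        ^(size_t i) {
            out[i] = in[i] * in[i];
        });
#else
    /* OpenMP: the pragma is ignored if OpenMP is not enabled */
    long i;
    #pragma omp parallel for
    for (i = 0; i < (long)n; i++) {
        out[i] = in[i] * in[i];
    }
#endif
}
```

In both branches the loop body itself is untouched; it is only wrapped differently.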
GCD also provides functions for parallel asynchronous I/O, by combining the
GCD threadpool and kqueue. Similar to I/O completion ports on Windows, the
asynchronous I/O functions in GCD report when an I/O operation is completed, not
when a file descriptor is ready. OpenMP has no facilities for parallel I/O.
GCD can scale better than OpenMP in several ways. One is the highly
efficient threadpool. Only 13 instructions (in assembly) are required to
enqueue a task and execute the task on the GCD threadpool. This overhead is
tiny compared to most OpenMP implementations (GNU, Intel, Microsoft), as
well as Intel TBB. If we are doing floating point computations, 13 non-fpu
instructions are so insignificant that they can be ignored almost
completely. With GCD we can in this case enqueue every single iteration of
a loop as an independent task, and the overhead is still likely to be
insignificant. Using "schedule(dynamic, 1)" is in comparison not a
good idea with current OpenMP implementations.
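
For reference, this is roughly what the two scheduling choices look like in OpenMP. The work function, the two drivers and the chunk size of 64 are made up for illustration; which chunk size pays off depends on the implementation and the workload, so take this as a sketch and not a benchmark.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical work item whose cost grows with i, so that the
   iterations are unbalanced. Returns 0 + 1 + ... + i. */
static double work(long i)
{
    double acc = 0.0;
    long k;
    for (k = 0; k <= i; k++)
        acc += (double)k;
    return acc;
}

/* Every iteration is handed out as its own unit of work. */
void run_fine_grained(double *out, size_t n)
{
    long i;
    assert(out != NULL || n == 0);
    #pragma omp parallel for schedule(dynamic, 1)
    for (i = 0; i < (long)n; i++)
        out[i] = work(i);
}

/* Iterations are handed out 64 at a time to amortize the overhead. */
void run_chunked(double *out, size_t n)
{
    long i;
    assert(out != NULL || n == 0);
    #pragma omp parallel for schedule(dynamic, 64)
    for (i = 0; i < (long)n; i++)
        out[i] = work(i);
}
```

Both functions compute the same result. They differ only in how often a thread has to go back to the runtime for more work, which is where the per-task overhead comes in.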
Another way in which GCD can scale better is the explicit task scheduling.
Because it is done manually we can better fit the task scheduling to the
problem. The smaller threadpool overhead also allows us to enqueue smaller
chunks for the same amount of overhead, which can result in better load
balancing.
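
A minimal sketch of what I mean by doing the scheduling by hand: we pick the chunk size, compute the chunk boundaries ourselves, and hand each chunk over as one task. The names chunk_ctx, process_chunk and scale_in_chunks are made up for this example. The task function has the void (*)(void *, size_t) shape that dispatch_apply_f expects; an OpenMP loop over chunks stands in for the GCD queue here so the sketch compiles anywhere.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical context shared by all tasks. */
typedef struct {
    const double *in;
    double *out;
    size_t n;       /* total number of elements */
    size_t chunk;   /* elements per task, chosen by us */
} chunk_ctx;

/* One task: process chunk number idx. */
static void process_chunk(void *context, size_t idx)
{
    chunk_ctx *c = (chunk_ctx *)context;
    size_t begin = idx * c->chunk;
    size_t end = begin + c->chunk;
    size_t i;
    if (end > c->n)
        end = c->n;             /* last chunk may be short */
    for (i = begin; i < end; i++)
        c->out[i] = 2.0 * c->in[i];
}

/* Splits the range into chunks and runs one task per chunk.
   Returns the number of tasks that were created. */
size_t scale_in_chunks(const double *in, double *out, size_t n, size_t chunk)
{
    chunk_ctx c;
    size_t nchunks;
    long j;

    assert(chunk > 0);
    c.in = in;
    c.out = out;
    c.n = n;
    c.chunk = chunk;
    nchunks = (n + chunk - 1) / chunk;   /* round up */

    /* Stand-in for enqueueing nchunks tasks on a GCD queue. */
    #pragma omp parallel for schedule(dynamic, 1)
    for (j = 0; j < (long)nchunks; j++)
        process_chunk(&c, (size_t)j);

    return nchunks;
}
```

With libdispatch the loop over j would presumably be replaced by a single call along the lines of dispatch_apply_f(nchunks, queue, &c, process_chunk). The chunk size is then the knob for trading overhead against load balancing.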
When I/O is involved there is no competition, because with OpenMP we have
to homebrew an ad hoc solution on top of whatever facilities the OS
provides. If we use OpenMP and IOCP on Windows, there will actually be two
threadpools involved, thus double overhead, and this double threadpool
design might not be good for cache use. Using OpenMP with epoll or kqueue
is awkward, particularly kqueue. With GCD we have one single thread pool,
managed by the kernel like IOCP. It serves both parallel task execution and
parallel I/O, and it uses the highest performance I/O facility on the
platform (kqueue on Mac and FreeBSD).
GCD is open source. The closure extension is integrated in clang, Apple's GCC
and Intel C++. The threadpool (libdispatch) is a free library. It is currently
available on Mac, iOS and FreeBSD. On FreeBSD we currently need to
recompile the kernel to use libdispatch with kqueue. There is also a port
to Linux (libxdispatch) which uses epoll instead of kqueue. There was an
attempt to port libdispatch to Windows, but I am not sure of its current
state. Apple has a Windows port of libdispatch in iTunes, but they have not
made it open source. The lack of publicly available Windows support for GCD
limits its usefulness. It is also a problem that Apple's official
libdispatch does not support Linux (including Android).
OpenMP works with Fortran and was actually designed with Fortran in mind.
GCD can be used from Fortran, but there will be no closures, so do loops
cannot be parallelized without code refactoring.
Sturla