OpenCL vs OpenMP


Laurie

Mar 22, 2010, 7:16:29 PM
to FDS and Smokeview Discussions
Guys,
Are there currently any plans to implement parallelism using CUDA or
OpenCL?

Surely implementing OpenMP at this point is a waste of time and
resources, considering you can already do the same thing with MPI?
GPUs perform the calculations much faster than CPUs, as Hendrik
demonstrated in his CUDA experiment (http://www.qfds.de/), which in
turn drives hardware costs down for the end user.

We're looking at expanding our cluster, and I'm not keen on outlaying
tens of thousands of dollars only to have a CUDA or OpenCL
implementation released in the next 6-12 months.

dr_jfloyd

Mar 22, 2010, 9:19:25 PM
to FDS and Smokeview Discussions
OpenMP /= MPI
OpenMP: shared memory; fast communication (because memory is shared),
but limited in problem size, since all threads share one memory space.
MPI: distributed memory; slower communication (compared to OpenMP),
but much higher limits on problem size.

Even within FDS:
OpenMP: parallelizes global loops within a single mesh.
MPI: parallelizes on a mesh-by-mesh basis.

The development team has no current plans to support GPUs. None of us
use CUDA and we have no plans to "waste time and resources" to recode
the tens of thousands of lines of code that make up FDS.

Laurie

Mar 23, 2010, 12:55:44 AM
to FDS and Smokeview Discussions
That's all I needed to hear. Thanks.


rmcdermo

Mar 23, 2010, 7:28:34 AM
to FDS and Smokeview Discussions
It would be good to hear Christian Rogsch chime in on this. I agree
with Jason that we do not plan to rewrite thousands of lines of code.
However, during Christian's visit last week he explained that
eventually (I don't know the time frame) the graphics processors may
be treated in much the same way as OpenMP: with simple compiler
directives.

F-Sim.de

Mar 23, 2010, 7:54:51 AM
to FDS and Smokeview Discussions
Just for information:
As far as I know, there is currently just one compiler that supports a
simple way of running Fortran code on the GPU (via NVIDIA's CUDA):
Portland Group has released a Fortran compiler with CUDA support,
where you just have to tag loops in the Fortran code with directives.

An example:
http://www.pgroup.com/resources/accel.htm

NVIDIA's CUDA Fortran page:
http://www.nvidia.com/object/cuda_fortran.html

I doubt that this works without problems right at the moment, but it
nevertheless sounds pretty interesting.

Boris

Christian Rogsch

Mar 23, 2010, 3:55:30 PM
to FDS and Smokeview Discussions
All,

this is a very interesting discussion, so I will follow up on Randy's
post with some comments.

Please note that when I write "CUDA", I mean using a GPU (Graphics
Processing Unit, i.e. the graphics card) from a programming language;
other possibilities are AMD Stream (or FireStream), or whatever they
offer.

First, OpenCL is not a "miracle cure" that automatically uses the full
potential of a GPU; it is only a language for programming the GPU,
nothing else. It is no problem to write OpenCL code that runs as a
serial program on a GPU; the trick is to make it run in parallel.
That is exactly what I am currently doing with OpenMP.

OpenMP is only fast (i.e. actually uses all cores/processors of the
PC) when the code has a truly parallel structure, which means that all
calculations must be independent of each other; in FDS this means that
we must be able to compute each value for each cell independently of
all other cells. At the moment this is done only in some parts of the
code, not throughout.

The future plan is to restructure the code so that it works as a truly
parallel version, which will give a very large speed-up when using
multiple cores with OpenMP (in other words: change the subroutines so
that all values are computed independently). This is the first and
basic step towards all other parallelization approaches. Once this is
done, we can think about "CUDA"...

As Boris wrote in his post, there is one compiler from PGI that allows
you to "use" the GPU with compiler directives, similar to OpenMP. I
have been monitoring the development of this PGI compiler since they
offered the first beta version of their "Accelerator" compiler. At the
moment this compiler is not ready to handle a code like FDS... There
are some problems that must first be solved by PGI (or NVIDIA, because
they created the CUDA model) before it is ready for use with FDS. One
problem is that the PGI compiler cannot call subroutines located in
other files, because there is no linker for this case. If I understand
it correctly, the PGI compiler only translates from Fortran to C, and
the C code is then compiled; the compiler directives like !$acc ...
guide that translation. But to use these compiler directives, you have
to ensure that your code is fully parallelized, as written above.

The next "hard" thing is to ensure that the whole program runs on the
GPU. If some parts of the code run only on your local core (because of
the programming style), you lose all of the speed benefit, because it
is very time-consuming to copy data from the host/core to the GPU and
back. The "trick" is to use OpenMP so that only the output is done by
a single thread on the local core, while all computations are done on
the GPU; then you only have to copy data from the GPU back to the
local core. This construct must be built with OpenMP: one thread
drives the FDS calculations performed on the GPU, while another thread
writes the output on your local machine. But this means you have to
ensure that one thread waits until the other has finished (you cannot
write output if your values have not yet been calculated).

Furthermore, you have to think about the memory limitations of the GPU
(or Tesla card, FireGL, or whatever). If you have, say, 4 GB of memory
on your card, you can "only" calculate about 4 million cells. If you
need more cells, you have to add another card, but this means you have
to transfer data from one card to another (ghost cells, which is what
MPI currently does with multiple meshes). And remember: data transfer
to or from the card is time-consuming... If the code runs only on your
local CPUs (cores), the transfer cost is very small (depending on the
architecture). In short: buy more RAM, calculate more cells (and RAM
is currently cheaper than a Tesla card...).

If you look at the examples given in the PGI forum, manual, etc., you
will find that they are very big loops that look complicated but are
not, which means they easily get a good speed-up. If FDS were just one
big loop of 400 lines of code, it would be no problem to use this
compiler from PGI, but that is not actually the case.

So, as a short summary, the plans are as follows:
1.) "Reorganize" the FDS code so that all subroutines run in parallel,
with no (or only very few) parts of the code that must run serially.
2.) Implement a "model" in which output is done in parallel with the
calculation: one thread for output, all other threads for calculation.
3.) Wait until the PGI compiler is "ready to use" with FDS. When it is
ready, try it...

Another important point is to ensure that FDS stays independent of
hardware and software, which means that FDS has to run on all
platforms. If we develop only for CUDA, we have the problem that this
development does not follow any open standard, such as OpenMP. In
other words: you can switch compilers, e.g. from Intel to Sun, and
because both implement the OpenMP 3.0 standard you still get a working
version of FDS. If we used the compiler directives from PGI, we could
only use the PGI compiler, no other, which means we would not be
independent. If adding the PGI compiler directives were only a "one
week job", that would be OK, but if it is more difficult, you have to
think about the future...

It is also important to have a very fast code. I participated in a
workshop where two people from Intel tried to "tune" applications. One
of the participants brought a CG solver (or some similar solver, I do
not remember in detail) to this workshop, one version in Fortran and
one in C++. Only with several modifications to the C++ code and the
compiler flags was this participant able to get the C++ version to run
as fast as the Fortran code did without any modifications. And that
was only a very simple program... Now think about FDS... That is why
we have no plan to change the language from Fortran to C++ (or OpenCL
or C or whatever).

I hope this post explains my current point of view and why we are
doing it this way, with OpenMP and Fortran.

One short comment at the end: do you remember the Cell processor
("PlayStation")? When it was released, it seemed to be the "miracle
cure" for parallel programming and speed-up, but what is its status
now? So, where will GPUs be in 2 years? Note that AMD has just
released a 12-core processor (ca. 1000 Euro per processor)... and in 2
years, maybe a 24-core processor will follow...

Further comments are very welcome! Please do not hesitate to post! It
is important for me (and everyone else) to know how users think about
this topic.

Kind Regards,
Christian Rogsch


Emanuele Gissi

Mar 24, 2010, 7:13:21 AM
to FDS and Smokeview Discussions


Thank you very much indeed for the very clear explanation.
Having the code follow open standards is clearly the far-sighted route
to take.
Emanuele

Laurie

Mar 25, 2010, 1:53:55 AM
to FDS and Smokeview Discussions
Thanks for the informative post, Christian; it's exactly the answer I
was after.

And it's good to hear you are trying to keep it open and
compiler-independent; it's great being able to use the
gfortran/gcc/openmpi wrappers to compile under Linux!

Laurie Odgers

Mar 25, 2010, 2:00:36 AM
to fds...@googlegroups.com
One more thing, though: regarding the PlayStation 3 Cell processor, the US Air Force is using a few thousand of them in a cluster right now!

http://arstechnica.com/security/news/2009/11/sony-still-subsidizing-us-supercomputer-efforts.ars

Laurie Odgers
Office Manager
Fire Check Consultants
PO Box 7017, Brendale Qld 4500.
Mobile: 0438 175 070
Phone: (07) 3205 2370
Fax: (07) 3889 8566
http://www.firecheck.com.au
FIRE SAFETY THROUGH INNOVATIVE DESIGN

