Surely implementing OpenMP at this point is a waste of time and
resources, considering you can do the same thing with MPI currently?
GPUs perform the calculations much faster than CPUs, as was shown by
Hendrik in his CUDA experiment (http://www.qfds.de/), which in turn
drives hardware costs down for the end user.
We're looking at expanding our cluster, and I'm not keen on outlaying
tens of thousands of dollars only to have a CUDA or OpenCL
implementation be released in the next 6-12 months.
On Mar 23, 11:19 am, dr_jfloyd <drjfl...@gmail.com> wrote:
> Open MP /= Open MPI
> Open MP: shared memory, fast communication (due to shared memory) but
> limited size since memory is shared
> Open MPI: distributed memory, slow communication (compared to Open
> MP), but much higher limits in size
>
> even within FDS
> Open MP: parallel for global loops within a single mesh
> Open MPI: parallel on a mesh by mesh basis
>
> The development team has no current plans to support GPUs. None of us
> use CUDA and we have no plans to "waste time and resources" to recode
> the tens of thousands of lines of code that make up FDS.
>
> On Mar 22, 7:16 pm, Laurie <laurieodger...@gmail.com> wrote:
>
> > Guys,
> > Are there any plans currently to implement parallelism using CUDA or
> > OpenCL?
>
> > Surely implementing OpenMP at this point is a waste of time and
> > resources considering you can do the same thing with MPI currently?
> > GPUs perform the calculations much faster than CPUs as was proven by
> > > Hendrik in his CUDA experiment (http://www.qfds.de/); which in turn
As an example, PGI's "Accelerator" compiler:
http://www.pgroup.com/resources/accel.htm
and the CUDA Fortran page from NVIDIA:
http://www.nvidia.com/object/cuda_fortran.html
I doubt that right at the moment this will work without problems, but
nevertheless it sounds pretty interesting.
Boris
All,
this is a very interesting discussion, so I will follow up on Randy's
post with some comments.
Please note that when I write CUDA, I mean using a GPU (Graphics
Processing Unit, i.e. a graphics card) through a programming language;
other possibilities are AMD Stream (or FireStream), or whatever else
is offered.
First, OpenCL is not a "miracle cure" that unlocks the full potential
of a GPU; it is only a language for programming the GPU, nothing else.
It is no problem to write OpenCL code that runs as a serial program on
a GPU; the "trick" is to make it run in parallel. That is exactly what
I am currently doing with OpenMP.
OpenMP is only fast (i.e. actually uses all cores/processors of a PC)
when the code has a "real" parallel structure, meaning that all
calculations must be independent of each other; in FDS this means that
we can calculate each value for each cell independently of all other
cells. At the moment this is true only for some parts of the code, not
for the whole code.
The future plan is to change the code so that it works as a "real"
parallel version; then we will get a very large speed-up using
multiple cores with OpenMP (in other words: change the subroutines so
that all values are computed independently). This is the first and
basic step for all other parallelism approaches. Once this is done,
we can think about "CUDA"...
As Boris wrote in his post, there is one compiler from PGI which
allows you to "use" the GPU with compiler directives, similar to
OpenMP. I have been monitoring the development of this PGI compiler
since they offered the first beta version of their "Accelerator"
compiler. At the moment this compiler is not ready to "manage" a code
like FDS... There are some problems which must first be solved by PGI
(or NVIDIA, because they created the CUDA model) before it is ready to
use with FDS. One problem is that the PGI compiler can't call
subroutines located in other files, because there is no linker for
this. If I understand it correctly, the PGI compiler only translates
from FORTRAN to C, and then the C code is compiled. The compiler
directives like !$acc ... are used to guide that translation. But to
use these compiler directives, you have to ensure that your code is
fully parallelized, as written above.
The next "hard" thing is to ensure that the full program runs on the
GPU. If some parts of the code run only on your local core (because of
the programming style), you will lose all the speed benefits, because
it is very time-consuming to copy data from the host/core to the GPU
or back. Here the "trick" is to use OpenMP in such a way that only the
output is done by a single thread on the local core; all computations
are done on the GPU, so you only have to copy data from the GPU to the
local core. This "construct" must be built with OpenMP: you use e.g.
one thread for the FDS calculations done by the GPU, while the other
thread writes the output on your local machine. But this means you
have to ensure that one thread waits until the other has finished (you
cannot write output if your values are not yet calculated).
Furthermore, you have to think about the memory limitations of the GPU
(or Tesla card, FireGL, or whatever). If you have e.g. 4 GB of memory
on your card, you can "only" calculate about 4 million cells. If you
need more cells, you have to use an additional card. But this means
you have to transfer data from one card to another (ghost cells, which
is what MPI currently does with multiple meshes). And remember: data
transfer to or from the card is time-consuming... If the code runs
only on your local CPUs (cores), there is only a very small transfer
cost (depending on the architecture). In other words: buy more RAM,
calculate more cells (and RAM is currently cheaper than a Tesla
card...)
If you look at the examples given in the PGI forum, manual, etc., you
will find that they are very big loops which look complicated but
aren't, which means they easily get a good speed-up. If FDS were just
"one big loop" of 400 lines of code, it would be no problem to use
this PGI compiler, but that is not the case.
So, as a short summary, the plans are as follows:
1.) "Reorganize" the FDS code so that all subroutines run in parallel,
with no (or only very few) parts of the code that must run in serial.
2.) Implement a "model" where the output is done in parallel to the
calculation: one thread for output, all other threads for calculation.
3.) Wait until PGI is "ready to use" with FDS. If it is ready, try
it...
Another important thing is to ensure that FDS stays independent of
hardware and software, which means that FDS has to run on all
platforms. If we develop only for CUDA, we have the problem that this
development does not follow any open standard, unlike OpenMP. In other
words: you can change your compiler, e.g. from Intel to Sun; both
implement the OpenMP 3.0 standard, so you will get a running version
of FDS. If we use the compiler directives from PGI, we can only use
the PGI compiler and no other, which means we are not independent. If
adding the PGI compiler directives is only a "one week job", then it's
OK, but if it's more difficult, you have to think about the future...
It is also important to have a very fast code. I participated in a
workshop where two people from Intel tried to "tune" applications. One
of the participants brought a CG solver (or some similar solver, I do
not remember the details) to this workshop, one version in Fortran and
one in C++. Only with some modifications of the C++ code and the right
compiler flags was this participant able to get the C++ code to run as
fast as the unmodified FORTRAN code. And that was only a very simple
program... Now think about FDS... That's why we have no plan to change
the language from FORTRAN to C++ (or OpenCL or C or whatever).
I hope this post explains my current point of view and why we are
doing it this way with OpenMP and FORTRAN.
One short comment at the end: do you remember the Cell processor
("PlayStation")? When it was released, it seemed to be the "miracle
cure" for parallel programming and speed-up, but what is its status
now? So, what will GPUs look like in 2 years? Please note that AMD has
just released a 12-core processor (ca. 1000 Euro per processor)... and
in 2 years, maybe a 24-core processor will follow...
Further comments are very welcome! Please do not hesitate to post! It
is important for me (and all other people) to know how users think
about this topic.
Kind Regards,
Christian Rogsch
On 23 Mar, 20:55, Christian Rogsch <rog...@uni-wuppertal.de> wrote:
> All,
>
> this is a very interesting discussion, thus I will follow Randy's post
> to write some comments.
> ...
Thank you very much indeed for the very clear explanation.
Keeping the code aligned with open standards is clearly the
far-sighted route to follow.
Emanuele
And good to hear you are trying to keep it open/compiler-independent -
it's great being able to use the gfortran/gcc/openmpi wrappers to
compile under Linux!