Fwd: accULL

Ruymán Reyes

May 17, 2013, 4:59:04 AM
to acc...@googlegroups.com

Background:
Ricardo is trying to install release 0.2 on his computer, but his platform is AMD-only. Could someone from the "packaging department" (Juanjo, Lucas) please explain how to set up the parameters?

On Thu, May 16, 2013 at 3:31 PM, Ricardo Nobre <rjfnob...@gmail.com> wrote:
Hi Ruyman,

In my computer I only have an AMD GPU (HD 7750).
The OpenCL implementation I have installed (under /opt/AMDAPP) is the AMD-APP-SDK-v2.8-lnx64.
Both the "lib" and "include" folders are inside "/opt/AMDAPP".

How should I set up the env-parameters.sh file, keeping in mind that I
don't have CUDA? (My GPU only supports OpenCL.)

By default the file looks as follows:
# CUDA and OpenCL PATH
export PATH=/usr/local/cuda/bin/:$PATH
export CUDADIR=/usr/local/cuda
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/lib:$LD_LIBRARY_PATH
export CPPFLAGS="-I/usr/local/cuda/include":$CPPFLAGS
export LDFLAGS="-L/usr/local/cuda/lib":$LDFLAGS
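
My best guess is to drop the CUDA lines and point everything at my AMD
APP install instead, something like the following. This is untested,
and the lib subdirectory name (x86_64) is only an assumption based on
the layout of my 64-bit SDK:

# OpenCL-only (AMD APP) paths -- my untested guess
export LD_LIBRARY_PATH=/opt/AMDAPP/lib/x86_64:$LD_LIBRARY_PATH
export CPPFLAGS="-I/opt/AMDAPP/include":$CPPFLAGS
export LDFLAGS="-L/opt/AMDAPP/lib/x86_64":$LDFLAGS

I also don't know whether the accULL scripts still expect a
CUDA-specific variable such as CUDADIR to be set, so I'd appreciate
confirmation.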


Best regards,
Ricardo


2013/5/14 Ricardo Nobre <rjfnob...@gmail.com>
Hi Ruyman, 

Thanks for your quick and complete answer!

Do you think collapse(3) would lead to significantly better performance compared with collapse(2)?
Why?

Does your source-to-source compiler perform automatic loop tiling in order to improve thread- and data-level parallelism (i.e., good use of SIMD units)?
Or do I have to set tiling parameters through pragmas?


Regards,
Ricardo


2013/5/13 Ruymán Reyes <rre...@ull.es>
Hi Ricardo,

Thank you for your interest in accULL.
accULL is an experimental implementation of OpenACC, so you may
experience some problems with complex codes. Release 0.2 is currently
available and works reasonably well. A new release (0.3) is expected
soon (<1 month), with many fixes for large codes and a more
user-friendly interface.
The major drawback of our implementation is that it will always
attempt to generate a GPU kernel (CUDA and/or OpenCL), whether or not
the code is really parallelizable (i.e., the independent clause is
always implicit).
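
To illustrate, a hypothetical loop like this one carries a dependence
between iterations, yet accULL would still generate a kernel for it
(and produce wrong results), exactly as if independent had been
specified:

/* Iteration i reads a[i-1], so the iterations are not */
/* independent; accULL will offload this loop anyway.  */
#pragma acc kernels loop
for (i = 1; i < n; i++)
    a[i] = a[i-1] + 1;
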
Other than that, it is a good (and free) approach to OpenACC with
support for many of the 1.0 features.
Up-to-date details are in http://accull.wordpress.com/


With respect to performance, depending on the code you may or may not
get better performance. In simpler codes, where the PGI or CAPS
compilers cannot perform complex code transformations, accULL achieves
the same or better performance thanks to lower overhead and better
scheduling of loop iterations. However, there are no polyhedral
transformations in accULL, so CAPS and particularly PGI will
outperform it in situations where those are an advantage.

Bear in mind that accULL produces clean CUDA/OpenCL source code, so
it is suitable for further manual optimisation. This is not the case
with PGI/CAPS.
The initial idea of the framework was to ease the development effort
rather than completely replace the CUDA and OpenCL languages.


> Can you suggest the pragmas to use for best performance with accULL
> so I can test it myself?
>
You will need to change the addressing from multi-dimensional [][]
indexing to flat, bare-pointer addressing. This will help other
OpenACC implementations as well, in particular CAPS.
For example (not tested!):

val = obstacles[i*n2*n3 + j*n3 + k];
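
(This assumes obstacles is, or can be made, a flat array in row-major
n1 x n2 x n3 order, e.g. allocated as:

double *obstacles = (double *) malloc(n1 * n2 * n3 * sizeof(double));

Adjust the strides if your dimensions map differently onto i, j, k.)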

With respect to the directive, I think that:

#pragma acc kernels loop collapse(2) copy(potential[0:n1*n2*n3]) \
                         copyin(obstacles[0:n1*n2*n3]) private(acc, val)

should work.

It will also help if you declare acc and val within the loop, as here
(assuming double):

#pragma acc kernels loop collapse(2) copy(potential[0:n1*n2*n3]) \
                         copyin(obstacles[0:n1*n2*n3])
for (i = 1; i < (X - 1); i++) {
  for (j = 1; j < (Y - 1); j++) {
    for (k = 1; k < (Z - 1); k++) {
      double val = obstacles[i][j][k];
      double acc = potential[i-1][j][k] + potential[i+1][j][k] +
                   potential[i][j-1][k] + potential[i][j+1][k] +
                   potential[i][j][k-1] + potential[i][j][k+1];
      potential[i][j][k] = acc * (1.0 / 6.0);  /* note: 1/6 in integer arithmetic would be 0 */
    }
  }
}
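
Combining both changes (bare-pointer addressing plus the inner
declarations), the nest would look roughly as follows - again
untested, and assuming a row-major n1 x n2 x n3 layout:

#pragma acc kernels loop collapse(2) copy(potential[0:n1*n2*n3]) \
                         copyin(obstacles[0:n1*n2*n3])
for (i = 1; i < (X - 1); i++) {
  for (j = 1; j < (Y - 1); j++) {
    for (k = 1; k < (Z - 1); k++) {
      double val = obstacles[i*n2*n3 + j*n3 + k];
      double acc = potential[(i-1)*n2*n3 + j*n3 + k] + potential[(i+1)*n2*n3 + j*n3 + k] +
                   potential[i*n2*n3 + (j-1)*n3 + k] + potential[i*n2*n3 + (j+1)*n3 + k] +
                   potential[i*n2*n3 + j*n3 + (k-1)] + potential[i*n2*n3 + j*n3 + (k+1)];
      potential[i*n2*n3 + j*n3 + k] = acc * (1.0 / 6.0);
    }
  }
}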


In this case, collapse(3) would work on CUDA but not on OpenCL, due
to current implementation limitations. If you think you'll need this,
I can try to push the developers to include it in the next release -
not sure if they'll make it in time.
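
For reference, the only change for the three-level version would be
the clause itself, over the same triple nest:

#pragma acc kernels loop collapse(3) copy(potential[0:n1*n2*n3]) \
                         copyin(obstacles[0:n1*n2*n3])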

Best regards,

    Ruyman Reyes,

>
> Best regards,
> Ricardo


