GPU performance (ocl mode)

Maxim Yurkin

Jul 3, 2024, 9:49:25 AM

to adda-d...@googlegroups.com

Dear colleagues,

Let me share with you the tests of the current `ocl` mode in ADDA (including that with `OCL_BLAS`) on various GPUs (vs.
`seq` mode on different CPUs). This has been done together with Michel Gross. Look inside the attached file for details.
While the `ocl` mode of ADDA is definitely not completely mature, these data show how to squeeze maximum performance out
of the current version of ADDA, and what the realistic expectations of GPU acceleration are. The general conclusions are:

1) the main bottleneck is the 3D FFT rather than moving data to and from the GPU
2) `OCL_BLAS` helps a lot for fast GPUs, because it accelerates the BLAS operations (and not because it removes the
memory transfers)
3) for fast GPUs, the bottleneck is related to memory bandwidth (for the 3D FFT calculation) rather than pure
computational power (TFLOPs). Thus, switching to single precision (https://github.com/adda-team/adda/issues/119) is not
expected to provide huge gains (factors of up to 64, based on TFLOPs values for some GPUs) but rather an acceleration
close to a factor of two (based on memory bandwidth).
4) there exist other issues (https://github.com/adda-team/adda/issues/226, https://github.com/adda-team/adda/issues/248)
that may cause a major drop in performance for some problems.
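
To illustrate the reasoning in point 3, here is a minimal roofline-style sketch in Python. Everything in it is an
assumption made for illustration and not taken from the attached file or from ADDA itself: the function name
`fft3d_time_estimate`, the hardware numbers (a hypothetical GPU with a 64:1 ratio of single- to double-precision peak
rates), the common 5 n log2(n) flop-count estimate for a complex FFT, and the assumption that a 3D FFT makes three
read-and-write passes over the array.

```python
import math

def fft3d_time_estimate(n, bytes_per_complex, peak_flops, bandwidth, passes=3):
    """Rough lower bound (seconds) on one n*n*n complex 3D FFT.

    Returns (estimate, compute_time, memory_time), where the estimate is
    the larger of the two limits (a simple roofline model).
    """
    points = n ** 3
    # Common flop-count estimate for a complex FFT of `points` elements.
    flops = 5.0 * points * math.log2(points)
    # Assumed traffic: each pass reads and writes the whole array once.
    traffic = 2.0 * passes * points * bytes_per_complex
    t_compute = flops / peak_flops
    t_memory = traffic / bandwidth
    return max(t_compute, t_memory), t_compute, t_memory

# Hypothetical GPU (illustrative values only, not a measured device).
BANDWIDTH = 500e9   # bytes/s
PEAK_DP = 1e12      # double-precision flop/s
PEAK_SP = 64e12     # single-precision flop/s (64x the double-precision rate)

n = 256
t_dp, c_dp, m_dp = fft3d_time_estimate(n, 16, PEAK_DP, BANDWIDTH)  # complex double: 16 bytes
t_sp, c_sp, m_sp = fft3d_time_estimate(n, 8, PEAK_SP, BANDWIDTH)   # complex float: 8 bytes

speedup = t_dp / t_sp
print(f"double: compute {c_dp:.2e} s, memory {m_dp:.2e} s")
print(f"single: compute {c_sp:.2e} s, memory {m_sp:.2e} s")
print(f"ratio of peak rates: {PEAK_SP / PEAK_DP:.0f}, estimated speedup: {speedup:.2f}")
```

With these assumed numbers the memory time exceeds the compute time in both precisions, so the estimated gain from
single precision comes only from halving the bytes moved, that is, a factor of two rather than the ratio of peak rates.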

As a side note, we have never seriously considered CUDA, so as not to be limited to Nvidia GPUs. Still, CUDA FFT routines
showed themselves to be about 1.5 times faster than clFFT (in a limited number of tests). I guess that a systematic
comparison of the two has already been performed by others.

If you have access to a modern GPU and wish to play with it, you're more than welcome to submit your performance results
in reply to this message.

Maxim.