ADDA - GPU FP32 vs FP64

Michael Kottas

Aug 30, 2017, 1:09:02 PM
to ADDA questions and answers
Dear all,

I am currently exploring the possibility of running ADDA on a GPU instead of the CPU. The following point is stated in the OpenCL wiki of the ADDA project:

The implementation works on modern GPUs at current state. The main limitation is the required support of double precision floating point operations.

This is backed up by Huntemann et al. 2011. The same article also notes that:

We expect a total speedup up to a factor of 20/40 at DP/SP with small code customizations on a more recent GPU which has full native support of DP calculations like e.g., the Nvidia Tesla C2050/C2070.

I have run the latest version of ADDA (Feb 2017) on different GPUs, with the following results:

- nVidia M60 --> time: 7508 s (2× GM204 chips, 7365 GFLOPS FP32, 230.1 GFLOPS FP64)
- nVidia K80 --> time: 10130 s (2× GK210 chips, 5591 GFLOPS FP32, 1864 GFLOPS FP64)
- nVidia GTX960 --> time: 12526 s (GM204 chip, 2308 GFLOPS FP32, 72.1 GFLOPS FP64)
- AMD R7 370 --> time: 29533 s (Pitcairn Pro, 1977 GFLOPS FP32, 124.8 GFLOPS FP64)

It is evident that the FP64 (DP) performance is completely irrelevant to the overall performance of the ADDA algorithm. Judging from the above results, the best solution for running ADDA is currently the nVidia GTX1080 Ti, since the AMD cards seem to fall behind considerably despite similar FP32 and FP64 performance.

My questions are:

- How is the accuracy of the result impacted by the GPU runs (due to SP)?
- Is there a way to take advantage of the DP performance of Tesla GPUs, as is stated in Huntemann et al. 2011, in order to speed up the calculations significantly?
- Is there a way to utilize two or more GPUs on one motherboard, in order to speed up the process?

Best regards,

Michael Kottas
National Observatory of Athens

Maxim Yurkin

Aug 31, 2017, 5:11:06 PM
to adda-d...@googlegroups.com, Michael Kottas
Dear Michael,

Thanks for your interest in the OpenCL mode of ADDA and for testing it on different GPUs. Could you please additionally provide the command line used to execute ADDA and, ideally, the resulting log files? The latter contain detailed timing information at the end, which may be relevant for our discussion.

As a first wild guess, you may be spending a major fraction of the simulation time on the scattered fields, which are not GPU-accelerated; see https://github.com/adda-team/adda/issues/226 . In that case the time would be determined by the CPU used, which is probably not the same for all runs, is it? In general, one should look first at "Internal fields" or "matvec products" in the timing.

But even then, by default only the matrix-vector product is accelerated by the GPU, and copying the input to GPU memory (and the output back) is required at each iteration. We have noticed previously that on fast GPUs this copying may become noticeable compared with the computational part (which mainly uses the clFFT library). Nvidia cards are usually faster at this copying, which may explain the difference. We have an experimental option OCL_BLAS in the Makefile (it works only with the bicg iterative solver) that may help with this issue, but it is not documented (see https://github.com/adda-team/adda/commit/3466ed78faa8d4b116eea025906a4e9accf742e4 ), so try it if you are ready to experiment a little bit.
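For illustration, trying it could look like the following (a sketch only: the exact way to enable OCL_BLAS may differ, so check the Makefile itself; treat the OPTIONS assignment below as a guess):

cd adda/src                                           # in the source tree
make ocl OPTIONS="OCL_BLAS"                           # guess: enable the experimental compile-time option
ocl/adda_ocl -iter bicg -grid 128 -m 1.2 0 -size 8    # OCL_BLAS works only with the bicg solver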

Also, in terms of the performance of different GPUs, it is a good idea to test the clFFT library directly. There may even be some benchmarks at the corresponding website. The point is that this performance may be very different from the raw GFLOPS figures that you mentioned.

By the way, the current versions of the ADDA wikis are at https://github.com/adda-team/adda/tree/wiki, although they are also a bit outdated.

And coming back to your specific questions:

- How is the accuracy of the result impacted by the GPU runs (due to SP)?
For most cases using single precision should not introduce significant errors (mind that the convergence threshold of the iterative solver is only 1e-5). However, the current version of ADDA doesn't have a simple option to use SP (see https://github.com/adda-team/adda/issues/119), so currently the GPU is always used in double precision (even if that is slow). Even then we have noticed some inaccuracies (at least on simple gaming GPUs), probably due to some shortcuts in the arithmetic functions. But they are also usually negligible.
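If you want to quantify the effect for your particular problem, a simple check (a sketch; the grid, size, and binary paths are just placeholders) is to run the same small problem with the CPU and GPU versions and compare the outputs:

seq/adda -grid 64 -m 1.2 0 -size 5 -dir run_cpu
ocl/adda_ocl -grid 64 -m 1.2 0 -size 5 -dir run_gpu
diff run_cpu/CrossSec-Y run_gpu/CrossSec-Y    # Cext/Qext should agree to many digits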


- Is there a way to take advantage of the DP performance of Tesla GPUs, as is stated in Huntemann et al. 2011, in order to speed up the calculations significantly?
It should be possible, but see all the issues discussed above.


- Is there a way to utilize two or more GPUs on one motherboard, in order to speed up the process?

Not at the moment. There is a separate issue for that, https://github.com/adda-team/adda/issues/185 , but it seems rather complicated to implement. It may be possible to run two instances of ADDA in parallel (two independent runs), each choosing a different GPU. But that is probably not what you want.
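For example, the two independent runs could look like this (assuming the -gpu command-line option for selecting the OpenCL device; please check its exact name in the manual, and the -dir names are arbitrary):

adda_ocl -gpu 0 -grid 128 -m 1.2 0 -size 8 -dir run_gpu0 &
adda_ocl -gpu 1 -grid 128 -m 1.2 0 -size 8 -dir run_gpu1 &
wait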

Maxim.

P.S. This answer has been forwarded to your e-mail address for your convenience. However, if you want to continue the discussion, please reply to the group's e-mail. You are also advised to check the corresponding discussion thread at http://groups.google.com/group/adda-discuss and/or subscribe to it to receive automatic notifications, since other answers/comments will not necessarily be forwarded to your e-mail.

Michael Kottas

Sep 4, 2017, 10:29:38 AM
to ADDA questions and answers, mike....@gmail.com
Dear Maxim,

Thank you for the prompt reply and your insightful suggestions. Your guess is 100% correct: the "internal fields" take up more than 95% of the total time.

However, the CPU for the AMD R7 370 run is far superior to the CPU for the nVidia GTX960. Shouldn't the R7 370 time then be lower than the GTX960's, if the CPU assumption holds true? Regarding memory bandwidth, the metrics also seem to favor the R7 370 (179.2 GB/s vs 112 GB/s for the GTX960), which does not translate into real-world performance.

At this point, I can only assume that the reason behind the 2x performance of the GTX960 over the R7 370 may be some hardware acceleration for a set of operations that nVidia added to the GTX960's chip.

Hopefully, the attached log files will shed some light on this mystery.

Best regards,
Michael
log_gpu_r7_370.txt
log_gpu_gtx_960.txt
log_gpu_k80.txt
log_gpu_m60.txt

Davide Ori

Sep 4, 2017, 2:42:38 PM
to ADDA questions and answers, mike....@gmail.com
Hi Michael.

First, I wouldn't say your E3 is far superior to the i5 CPU. They are of similar generations with similar clock speeds. In this particular situation you are not taking advantage of multithreading, so the doubled thread count does not give you any advantage.

Second, the E3 has ECC memory support. That can cause some overhead in memory copy operations, leaving it behind the i5 in overall performance.
The bottleneck in memory copies is not the GPU memory bandwidth that you reported, but rather the PCIe communication between the GPU and the motherboard, and the CPU memory writes, which are affected by RAM performance and the ECC overhead.
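If you want to test this directly, a generic OpenCL micro-benchmark such as clpeak (https://github.com/krrishnarraj/clpeak) reports host-to-device transfer bandwidth separately from compute throughput, so you could run it on both systems:

clpeak    # compare the "Transfer bandwidth" sections of the two systems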

Lastly, I would turn my attention to the logs. The R7 log shows more than double the number of iterations of the other files. It is not that the 960 is 2 times faster; it seems the R7 is doing twice the work for whatever reason.
It also seems that your computations are really unstable: the numerical method has a really hard time converging.
It appears that all of the additional work put on the R7 system's shoulders is due to the y-polarization state. If you want a fairer comparison, I would suggest rerunning the tests using just the x-polarization state, which required roughly 2800 iterations on every system.
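If I remember correctly, ADDA solves for a single incident polarization when the symmetry of the scattering problem is enforced, so something like the following might do the trick (this is an assumption on my part, so please double-check the -sym option in the manual):

adda_ocl -sym enf <the rest of your original options>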

Davide

Maxim Yurkin

Sep 28, 2017, 6:04:36 AM
to ADDA questions and answers
One more duplicate to get this message into the correct thread. Sorry for that.

<=============>

Michael, here are a few notes based on the log files.

First, my guess was actually completely incorrect, since "internal fields" is (mostly) the GPU time, while "scattered fields" is purely CPU. So your performance is determined by the particular GPUs (and their communication speed with the CPU).

Then, you are (mostly) using version 1.3b4, which is the latest release, but there have been a few changes since then. So, since you are interested in the best GPU performance, I recommend that you test the current source (download it from GitHub and compile it yourself), which is labeled 1.3b6. Moreover, the run for the K80 was done with version 1.2, which is significantly outdated with respect to GPU performance (so please at least rerun it with the same version as the rest).
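In case it helps, here is a sketch of getting and building the current source (assuming a standard Linux toolchain with OpenCL and clFFT available; see the compiling instructions in the wiki for details, and note that the paths of the resulting binaries may differ):

git clone https://github.com/adda-team/adda.git
cd adda/src
make ocl
ocl/adda_ocl -V    # should report version 1.3b6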

Comparing the three Nvidia cards, the overall performance seems logical, except that the huge FP64 performance of the K80 doesn't help. This is either due to the old version of ADDA (see above) or due to the fact that clFFT is not only about FLOPS but also involves some memory copying and indirect addressing inside the GPU (as I mentioned above; better try to find some benchmarks at their website). The other good thing is that the calculations for the M60 and GTX960 agree within machine precision (judging from the iteration residuals). For the K80 there are some minor differences, but that is expected due to the older version.

Concerning the AMD card, the major issue (as Davide noted) is the larger number of iterations. Looking at the progress of the iteration residuals, one can see that they slowly grow from machine-precision levels (the last shown digits), which can be due to some minor differences in the hardware implementation of FP64 on AMD (I am not sure whether that is always detrimental). For the Y-polarization the differences in residuals are mostly within 10%, and the number of iterations to converge is almost the same (2889 vs. 2819). Unfortunately, for the X-polarization the stagnation on AMD starts at iteration 1708 (when the residual is 4.7e-3, not that bad already) and increases the number of iterations a lot. We have encountered similar issues before with different GPUs, but they cause problems only when the iteration count is already large (convergence is slow). One workaround for a specific problematic run is to try a different iterative solver. Also, the final calculated quantities (e.g., Qext or the Mueller matrix) are usually very close, but please check that.

Another issue with the AMD card is the GPU memory. The log states that the R7 370 has only 2 GB of RAM (is that correct?), while ADDA occupies 2.02 GB of GPU RAM at its peak. Maybe the latter figure is a bit inaccurate (although I do not see any reason for that; it should have caused errors during memory allocation), but it indicates a potential problem anyway. Working at (or above) the available memory (especially if the same GPU is also used by the OS for some mundane tasks) can cause memory corruption (as far as I remember, most gaming GPUs do not guarantee that such corruption won't happen), leading to the larger iteration count described above. Moreover, it can also slow down the calculation due to memory buffering; even clFFT may automatically choose a slower but less memory-consuming method in this case.

To conclude, I recommend the following to objectively test the performance of ADDA on different GPUs (a small script sketch follows the list).
1) Use the same version, preferably the latest source from GitHub.
2) Take a smaller problem, both in terms of memory and iteration count. The latter is easily adjusted by decreasing the refractive index. Or you can use a simple command line, like
adda_ocl -grid 128 -m 1.2 0 -size 8 -ntheta 10
(the last option makes the time for scattered fields negligible), and you may increase the size to 20 to get more iterations (to decrease the effect of initialization).
3) Compare not only the runtimes, but also the calculated quantities and log files (including iteration counts and residuals) to pinpoint artifacts.
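For instance, a small script along these lines can collect the numbers (a sketch; the log inspection at the end is generic, so adjust it to the actual layout of your log files):

for size in 8 20; do
  out=bench_size$size
  ocl/adda_ocl -grid 128 -m 1.2 0 -size $size -ntheta 10 -dir $out
  tail -n 20 $out/log    # the detailed timing is at the end of the log
done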

I would appreciate it if you shared your further benchmarks in this direction with us.

Best regards,
Maxim

Michael Kottas

Oct 2, 2017, 11:15:26 AM
to ADDA questions and answers
Hello Davide,

Thank you for pointing out that the iterative solver gets stuck on the AMD R7 370. Maxim has provided an answer regarding this phenomenon (the total on-board memory was marginally insufficient).

Maxim has also clarified that the "internal fields" reflect GPU performance, but it is worth mentioning for future reference that the E3 is about twice as fast as the i5 when running ADDA with MPI. This is because the Xeon E3 v5 performs twice as many FLOPS per cycle as the i5, despite having the same number of physical cores and a similar clock. The same holds for the Xeon E3 v5 compared to the Core i7.

Furthermore, the E3 system does indeed use ECC RAM, but I do not think it bottlenecks the system in any perceivable way. There are numerous tests out there that debunk this ECC "myth".

Michael

Michael Kottas

Oct 2, 2017, 11:31:32 AM
to ADDA questions and answers
Hello Maxim,

Thank you very much for all your comments and suggestions.

Despite having downloaded the latest ADDA version (v1.3b6), I was running it on Windows due to better GPU driver availability. Strangely, the win64 binaries are still at v1.3b4. I have repeated the tests on Linux (Ubuntu), and the log files now identify the version as v1.3b6. Unfortunately, no ADDA version seems to utilize the FP64 GPU performance (which keeps the original question relevant).

You were absolutely right to blame the available on-board GPU memory for the AMD's poor performance. At smaller size parameters, the AMD card keeps up perfectly with the nVidia cards as far as FP32 performance is concerned. The calculated quantities are also very similar between AMD and nVidia (depending on the iterative solver).

Best regards,
Michael

Maxim Yurkin

Oct 4, 2017, 1:39:51 AM
to adda-d...@googlegroups.com
Michael, I will try to draw some intermediate conclusions (at least from the ADDA side). We can't really answer at this time how efficiently ADDA utilizes the FP64 GPU performance. One way is to estimate the theoretical number of flops performed by ADDA, divide it by the runtime, and then compare the result with the GPU peak performance (we have never done that).
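As a very rough sketch of such an estimate (it counts only the 3D FFTs in the matrix-vector product and assumes the usual grid doubling for the FFT-based convolution; all numbers are approximate):

$N = 256^3 \approx 1.7\times10^7$ points (doubled grid for -grid 128)
one complex 3D FFT $\approx 5N\log_2 N \approx 2\times10^9$ flops
one matvec $\approx$ 6 FFTs (3 field components, forward and back) $\approx 1.2\times10^{10}$ flops

Dividing the last number by the time per iteration from the log gives the achieved flop rate, which can then be compared with the card's FP64 peak.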

A second approach is the comparison of different cards, but here I do not see complete data. The GTX960 and M60 times scale approximately with FP64 performance (what is strange here is that one is double the same chip as the other, yet the performance differs by a factor of three), and the AMD card is somewhere near both in time and in FP64 performance (but I do not see its latest runtime). The largest remaining question is the K80 card: you seem to imply that even with the latest version the times are not as small as expected (again, it would be nice to see the recent data). If that is the case, it may be that ADDA can employ only so much of the FP64 GPU performance, given all the other limitations (e.g., memory transfers).

Overall, I think that, apart from pinpointing some bugs, the reasonable questions are what the ADDA performance is on different cards (potentially to predict the best cost-to-value card for such a task) and how we can improve it.
The first question can be answered by a concise table of benchmark data (with the same version and moderate memory requirements, as discussed above and in previous messages). Do you have such a table? A possible skeleton is sketched below.
The second question doesn't have a simple answer. With the current version of ADDA we are limited to trying OCL_BLAS (and the bicg iterative solver, see my first answer in this thread). Have you tried that?
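For instance, the skeleton could look like this (the times are just the mixed-version numbers from earlier in this thread, so all of them need to be refreshed with one and the same ADDA version):

GPU     | FP32/FP64, GFLOPS | ADDA version | wall time, s | iterations (Y+X)
M60     | 7365 / 230.1      | 1.3b4        | 7508         | ?
K80     | 5591 / 1864       | 1.2          | 10130        | ?
GTX960  | 2308 / 72.1       | 1.3b4        | 12526        | ?
R7 370  | 1977 / 124.8      | 1.3b4        | 29533        | ?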

Apart from that, ADDA is probably only as good as the underlying clFFT, so benchmarking this library directly (or finding some results on the web) seems to be a good idea.

Maxim.
