OCL Wall Time


Ismail Ibrahim

Aug 29, 2023, 12:00:00 PM
to ADDA questions and answers
I recently set up a Linux system and successfully compiled all three versions of ADDA (OCL, SEQ, and MPI). All of the recommended libraries (clFFT, clBLAS, FFTW3, etc.) were installed as well. My question is: has anyone had issues where the OCL version does not run faster than the sequential one? For some reason the wall times are almost exactly the same. This computer is running a Quadro P6000, so it should compute faster than an i7-3770. The MPI version with 4 cores currently runs the fastest, but OCL should in theory be faster. I can confirm that ADDA detects the GPU when running (screenshot attached). I tested it on complex shapes (custom-built cell models). If anyone has a suggestion, I would appreciate it! I am not sure if I forgot to install something or if I need different drivers for the GPU (I did update them).

Screenshot from 2023-08-29 11-56-07.png

Ismail Ibrahim

Aug 29, 2023, 1:22:52 PM
to ADDA questions and answers
For additional context, the wall time for this particular model was about 6,400 seconds for OCL, 2,500 seconds for MPI (with n = 4), and 6,800 seconds for the sequential version.

Maxim Yurkin

Aug 29, 2023, 5:47:28 PM
to adda-d...@googlegroups.com, Ismail Ibrahim
Dear Ismail,

First, I do not think that changing any libraries will help. Usually, if one forgets something, ADDA either does not compile or fails somehow during the run. I also agree that some visible acceleration is expected for your hardware, although I am not aware of anybody testing this specific GPU, and the acceleration will always be much smaller than one could hope based on raw flops. Moreover, this GPU is not efficient for double-precision calculations (its theoretical double-precision performance is 1/32 of the single-precision one), and ADDA does not yet support single-precision calculations. Still, the double-precision speed of this GPU is substantial.

Second, here are a couple of ideas to understand what is going on:
1) look at the timing section of the log file,
2) try some simple shapes, like '-shape box'

So please share the above results (maybe complete log files); I will then try to find an explanation.
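As a quick aid for step 1, the timing comparison across log files can be scripted. Here is a minimal Python sketch; the sample text, labels, and layout of the timing block are assumptions for illustration (the exact format may differ between ADDA versions), so the regex may need adjusting against a real log:

```python
import re

# Hypothetical excerpt from the timing block of an ADDA log file;
# numbers and labels are placeholders, not real measurements.
sample_log = """
Total wall time:     567.8
  Internal fields:   123.4
"""

def read_timing(log_text, label):
    """Return the seconds reported for a timing label, or None if absent."""
    match = re.search(rf"{re.escape(label)}:\s*([\d.]+)", log_text)
    return float(match.group(1)) if match else None

internal = read_timing(sample_log, "Internal fields")
total = read_timing(sample_log, "Total wall time")
print(f"iterative solver share of wall time: {internal / total:.1%}")
# → iterative solver share of wall time: 21.7%
```

Comparing this share across the seq, mpi, and ocl logs shows at a glance which stage each mode actually spends its time in.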

Maxim.

P.S. This answer has been forwarded to your e-mail address for your convenience. However, if you want to continue the discussion please reply to the group's e-mail. You are also advised to check the corresponding discussion thread at http://groups.google.com/group/adda-discuss and/or subscribe to it to receive automatic notifications, since other answers/comments will not be necessarily forwarded to your e-mail.
--
You received this message because you are subscribed to the Google Groups "ADDA questions and answers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to adda-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/adda-discuss/50657183-f727-49bf-aa36-0ffe1b2c0708n%40googlegroups.com.


Ismail Ibrahim

Aug 30, 2023, 11:43:42 AM
to ADDA questions and answers
Dear Dr. Yurkin,
Thank you for the prompt response. I have run a quick test of a simple shape (box) for all three ADDA versions. The speeds came out approximately the same. I will attach the log files. Perhaps I can send the log files of the two-spheres geometry file I was testing earlier?

Speed test box.zip

Ismail Ibrahim

Aug 30, 2023, 11:56:21 AM
to ADDA questions and answers
Additionally, here are the log files for a simulation run on two spheres. The two spheres were created using MATLAB code, with the output being a .dat geometry file. The scat_param file is also included.

log files.rar

Maxim Yurkin

Aug 30, 2023, 6:38:43 PM
to adda-d...@googlegroups.com
Dear Ismail,

The box tests do not help, since two of them reduce to a single dipole (due to size << wavelength and fixed dpl), while the third one does not have '-size 10' in the command line and is hence different. And for all of them, most of the time is spent on initialization.

However, the log files for the two spheres do offer an explanation. Usually, the bottleneck of a DDA simulation is the iterative solution of a (huge) linear system. In particular, all discussions of scaling with the number of dipoles, and of various improvements, focus on this part. It corresponds to "Internal fields" in the timing block of the log file. In your case it is 818, 442, and 57 s for the seq, mpi, and ocl modes, respectively. So ocl does indeed show more than a 10-fold acceleration (limited by the double precision, as I mentioned before). The logs also show that the parallel scaling is not that good (an acceleration of about 2 for 4 processes), probably limited by memory-access speed. But in your case this part is not the bottleneck; the calculation of scattered fields on a large grid is. Hence, you suffer from the known issue https://github.com/adda-team/adda/issues/226 - the calculation of these scattered fields is not GPU-accelerated. Until this issue is solved, the only workaround is to use a smaller number of scattering angles (if that is feasible for your task).
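The diagnosis above can be checked directly from the two sets of numbers quoted in this thread (total wall times from the earlier message, "Internal fields" times from the two-spheres logs); a small Python sketch of the arithmetic:

```python
# Timings quoted in this thread, in seconds.
wall_time = {"seq": 6800, "mpi4": 2500, "ocl": 6400}  # total wall time
internal = {"seq": 818, "mpi4": 442, "ocl": 57}       # "Internal fields" (iterative solver)

for mode in ("mpi4", "ocl"):
    solver_speedup = internal["seq"] / internal[mode]
    wall_speedup = wall_time["seq"] / wall_time[mode]
    print(f"{mode}: solver x{solver_speedup:.1f}, wall x{wall_speedup:.1f}")
# → mpi4: solver x1.9, wall x2.7
# → ocl: solver x14.4, wall x1.1
```

The contrast makes the point: ocl accelerates the iterative solver by more than 14x, yet the total wall time barely moves, because the non-accelerated scattered-field calculation dominates the run.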

Overall, it is a pity that you cannot use the full power of your GPU for ADDA simulations. After the above issue and https://github.com/adda-team/adda/issues/119 are solved, it may provide more than a 100-fold acceleration in comparison with the seq version. But I cannot give any estimate of when that will happen (volunteers are welcome).

Maxim.



On 30.08.2023 18:56, Ismail Ibrahim wrote:
Additionally here is the log files for a simulation run on two spheres. The two spheres were created using a MATLAB code with the output being a geometry file.dat. The scat_param file is also included. 


Ismail Ibrahim

Sep 4, 2023, 1:59:49 PM
to ADDA questions and answers
Dear Dr. Yurkin,

Thank you very much for the explanation. I am glad nothing is wrong with the GPU and that I did not compile things incorrectly. The long wall time is not a huge problem; as mentioned, I was mostly curious about the difference in run times, so I am glad there is an explanation for it.
