Unexpectedly Long Time of Epsilon Inverse Computation


Rui Liu

Jun 17, 2025, 1:18:02 PM
to BerkeleyGW Help
Dear BerkeleyGW Team,

I am currently running BerkeleyGW on Lawrencium/Einsteinium using four H100 GPUs. The calculation completes successfully, but the Epsilon Inverse step takes unexpectedly long, roughly two to three times longer than the Valence Block computation. Interestingly, I do not encounter this issue when running the same script on Perlmutter. For reference, I have attached my run scripts (https://docs.google.com/document/d/1yw1281GnntVcp6sQqqCmXqiZd24qdChMOwSb00xGMfA/edit?usp=sharing) and the corresponding output logs (https://docs.google.com/document/d/19WjJxmJQ50DiK8_pfDK-A1F5RQHgJQYpZLZ20_ZlcMA/edit?usp=sharing). Could you provide any insight into why this discrepancy occurs and how I might resolve it?

Additionally, to enable GPU support on Einsteinium, I manually built several required libraries instead of using the existing modules. The versions I compiled are as follows:

Open MPI 4.1.6 
HDF5 1.14.3 
FFTW 3.3.10 
LAPACK 3.12.1 
ScaLAPACK 2.2.2 

Since there was no existing BerkeleyGW compilation configuration for Einsteinium, I adapted the one from Perlmutter (see: https://github.com/csruiliu/dcgm-profiling/blob/main/berkeleyGW/lrc-arch-gpu.mk). If you notice anything potentially problematic with my build or configuration, please let me know. 

As I'm not an expert with BerkeleyGW, it's quite possible that I've misconfigured or incorrectly set something up. I would greatly appreciate any comments, suggestions, or recommendations you might have. Thank you very much for your support and assistance.


Mauro Del Ben

Jun 17, 2025, 2:29:23 PM
to Rui Liu, BerkeleyGW Help
Hi Rui,

The inversion step (LU decomposition + triangular inversion) is performed by the ScaLAPACK implementation, so there is not much we can do other than make sure you use the most performant compiled version, usually provided as a module by the staff managing the cluster. That is most likely the difference compared to the Perlmutter run. Also, you are running on a single MPI rank, and the matrix being inverted is ~11k x 11k, so it is quite a lot of work for a single CPU; if your BLAS is not threaded, it can take a significant amount of time.
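For intuition, the LU-decomposition-plus-triangular-inversion pattern described above can be sketched in plain NumPy. This is a hypothetical single-process illustration of the getrf + getri idea, not BerkeleyGW's actual call path (which goes through ScaLAPACK's distributed routines on a block-cyclic matrix):

```python
import numpy as np

def lu_invert(a):
    """Invert a dense matrix via LU factorization with partial pivoting,
    mirroring the getrf + getri pattern used by (Sca)LAPACK."""
    n = a.shape[0]
    lu = np.array(a, dtype=float)
    piv = np.arange(n)                      # row permutation: P A = L U

    # "getrf" step: in-place LU with partial pivoting; L (unit diagonal)
    # is stored below the diagonal of `lu`, U on and above it.
    for k in range(n - 1):
        p = k + int(np.argmax(np.abs(lu[k:, k])))
        if p != k:
            lu[[k, p]] = lu[[p, k]]
            piv[[k, p]] = piv[[p, k]]
        lu[k + 1:, k] /= lu[k, k]
        lu[k + 1:, k + 1:] -= np.outer(lu[k + 1:, k], lu[k, k + 1:])

    # "getri" step: solve A X = I one column at a time with the
    # triangular factors (forward substitution for L, back for U).
    inv = np.empty((n, n))
    for j in range(n):
        y = (piv == j).astype(float)        # permuted unit vector P e_j
        for i in range(n):                  # forward solve: L y' = P e_j
            y[i] -= lu[i, :i] @ y[:i]
        for i in range(n - 1, -1, -1):      # back solve: U x = y'
            y[i] = (y[i] - lu[i, i + 1:] @ y[i + 1:]) / lu[i, i]
        inv[:, j] = y
    return inv
```

Both steps are O(n^3), so an ~11k x 11k matrix is on the order of 10^12 floating-point operations; that is why a single unthreaded CPU rank is slow here, while a threaded BLAS or distributed ScaLAPACK spreads the work across cores or ranks.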

Other than that, the GPU seems to be doing the right thing; good to see the Donkey performing well on H100, and thanks for testing ;)

Best

-M



Rui Liu

Jun 17, 2025, 4:04:37 PM
to Mauro Del Ben, BerkeleyGW Help
Hi Mauro,

Thanks for your quick response and insightful comments. Just to clarify: the inversion step runs on the CPU rather than the GPU?

Best,
Rui