batched v. 3.15.9 crashing for h2/graphene


Yasmine Al-Hamdani

Nov 27, 2022, 7:58:17 PM
to qmc...@googlegroups.com
Dear QMCPACK developers,

I'm running the batched version of QMCPACK (v. 3.15.9) on Summit and have successfully run the diamond example available online. Using a modified Nexus script, I also successfully ran a test calculation on a small hydrogen molecule (scf-nscf-conv-opt). However, using the same script (for test purposes) and changing only the physical system to H2/graphene (with the same unit cell and ccECP pseudopotentials as my H2 calculation), the "opt" calculation crashes (see attached output, error log, and input).

As everything except the physical system is the same between the two calculations, I wonder if it is a memory allocation problem? It is not clear to me how memory should be distributed across the GPU and CPU cores on a Summit node.

I tried a number of things, including running with the same machine settings and executable as available in:
/gpfs/alpine/mat151/world-shared/opt/qmcpack/release-3.15.0/qmcpack-gpu-summit.sub
It also crashed in the same way.
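On the memory-distribution question above: Summit nodes have six V100 GPUs, and QMCPACK is typically launched with one MPI rank per GPU via resource sets. A minimal sketch of such a launch line follows; it is illustrative only (not the contents of the .sub script referenced above), and the resource counts and input file name are assumptions:

```shell
#!/bin/bash
# Illustrative Summit resource layout: one resource set per GPU,
# 1 MPI rank (-a 1), 7 CPU cores (-c 7), and 1 GPU (-g 1) per set,
# 6 resource sets per node.
NODES=1
RS_PER_NODE=6
CMD="jsrun -n $((NODES * RS_PER_NODE)) -a 1 -c 7 -g 1 qmcpack opt.in.xml"
# Print the launch line rather than executing it, since jsrun only
# exists inside a Summit batch job.
echo "$CMD"
```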

I'm still wrapping my head around how the tags in the new version work, so apologies in advance if it's a simple error on my part!

All the best,

Yasmine




nexus_h2gr_h2script.py
qmcpack-opt-h2gr-crashes.tar.gz

Yasmine Al-Hamdani

Nov 30, 2022, 12:01:03 PM
to qmcpack
A further note: I realised that in the files I sent, the real version of qmcpack is being called. I normally use qmcpack_complex (and ran it once more now to check), and it crashes in the same manner.

Cheers,

Yasmine

Ye Luo

Nov 30, 2022, 12:03:38 PM
to Yasmine Al-Hamdani, qmcpack
I will try to reproduce the failure on my end and see if I can pinpoint the issue.
Ye
===================
Ye Luo, Ph.D.
Computational Science Division & Leadership Computing Facility
Argonne National Laboratory



Yasmine Al-Hamdani

Dec 19, 2022, 6:32:57 AM
to qmcpack
Dear Ye,

Is there any update on this? Even knowing whether or not you get the error I see would be useful, as would hearing of any basic mistakes you have spotted.
Thanks in advance.

Best wishes,

Yasmine 

Ye Luo

Dec 23, 2022, 9:59:58 PM
to Yasmine Al-Hamdani, qmcpack
Could you put your orbital hdf5 file in the world-shared work space (https://docs.olcf.ornl.gov/data/index.html) of your project cph005 on Summit and point me to it?
That way, I can skip the orbital-generation step and move straight to troubleshooting qmcpack.
Ye
===================
Ye Luo, Ph.D.
Computational Science Division & Leadership Computing Facility
Argonne National Laboratory


Yasmine Al-Hamdani

Jan 4, 2023, 5:45:59 AM
to qmcpack
Dear Ye Luo,

I have put the scf and nscf calculations including the hdf5 orbital in the directory: /gpfs/alpine/cph005/world-shared/yasmine-h2gr/runs/
You will find directories pwNSCF_h2gr and pwSCF_h2gr. 

The directory with the relevant hdf5 file should be:
/gpfs/alpine/cph005/world-shared/yasmine-h2gr/runs/pwNSCF_h2gr/pwscf_output/pwscf.save 

Thanks in advance.

Best wishes,

Yasmine

Ye Luo

Jan 23, 2023, 7:04:43 PM
to qmcpack
Could you try `/gpfs/alpine/mat151/world-shared/opt/qmcpack/develop-20230118`?
If your run still fails, please grant read permission on all the files via
chmod -R o+r /gpfs/alpine/cph005/world-shared/yasmine-h2gr/runs/pwNSCF_h2gr/
so I can use your h5 orbital file directly.
Ye

Yasmine Al-Hamdani

Apr 3, 2023, 10:19:22 AM
to qmcpack
Dear Ye,

Thanks for your effort and time. We were able to find a workaround for the issue, although I should note:
1. We tried running with the executable you shared, but we hit the same issue.
2. By reducing the meshfactor (to poor, non-production-quality values), we saw that the calculations were able to proceed.
3. We tried the gpusharing option, but it made no difference as far as we could see.
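For reference, the meshfactor in point 2 is an attribute of the B-spline orbital set in the QMCPACK input XML. A hedged fragment follows; the href, tilematrix, sposet name, size, and meshfactor value are placeholders for illustration, not taken from the actual inputs in this thread:

```xml
<!-- Placeholder values: meshfactor < 1.0 coarsens the real-space
     B-spline mesh, trading orbital accuracy for GPU memory. -->
<sposet_builder type="bspline" href="pwscf.pwscf.h5"
                tilematrix="1 0 0 0 1 0 0 0 1" twistnum="0"
                source="ion0" meshfactor="0.5">
  <sposet type="bspline" name="spo_ud" size="8"/>
</sposet_builder>
```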

Noting that it seemed to be a memory bottleneck, we saw that even if we wanted to work on only one twist, the .h5 wavefunction file (computed on the full set of k-points in Quantum ESPRESSO) was too big. We therefore found that computing the orbitals in independent nscf calculations for each k-point, starting from the scf calculation of the wavefunction with the whole k-point set, produced small enough .h5 files. We also had to limit the plane-wave cut-off in the pwscf calculations to keep the wavefunction file small enough.
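The per-k-point splitting described above can be sketched as follows. This is a minimal illustration only: the template contents, function name, and k-point list are hypothetical, not taken from the actual scripts or systems in this thread.

```python
# Sketch: generate one independent nscf input per k-point by appending
# an explicit single-k-point K_POINTS card to a shared pw.x template.
# (Hypothetical names and values -- for illustration only.)

def make_nscf_inputs(template, kpoints):
    """Return one pw.x input string per k-point in `kpoints`."""
    inputs = []
    for kx, ky, kz in kpoints:
        # Quantum ESPRESSO explicit k-point card: count, then kx ky kz weight.
        card = "K_POINTS crystal\n1\n{:.8f} {:.8f} {:.8f} 1.0\n".format(kx, ky, kz)
        inputs.append(template + card)
    return inputs

# Shared nscf template (everything except the K_POINTS card).
template = "&control\n   calculation = 'nscf'\n/\n"
# Example 2x2x1 grid in crystal coordinates (illustrative values).
kpts = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.5, 0.0), (0.5, 0.5, 0.0)]
nscf_inputs = make_nscf_inputs(template, kpts)
# Each string in nscf_inputs would be written to its own directory and run
# as an independent nscf job, then converted to .h5 with pw2qmcpack.x,
# yielding one small wavefunction file per k-point.
```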

With that protocol, we were able to run production level calculations for other (bigger) systems.

Cheers,

Yasmine