Dear QMCPACK developers,
I am trying to run spin-orbit coupling QMC calculations on Polaris using the batched drivers of QMCPACK. I am using the precompiled executable in: /soft/applications/qmcpack/develop-20241118//build_polaris_Clang18_offload_cuda_cplx/bin
I am also using the job script provided in /soft/applications/qmcpack/develop-20241118/qmcpack-polaris.job.
Despite lowering the memory requirements significantly, the job fails after it starts the VMC accumulation.
I am able to run this calculation on Baseline (OLCF local CPU cluster) with no issues.
Here are the last few lines of the output file with meshfactor=1.0, using the debug queue on Polaris:
<code>
=========================================================
Start VMCBatched
File Root dmc.s000
=========================================================
==============================================================
--- Memory usage report : VMCBatched before initialization ---
==============================================================
Available memory on node 0, free + buffers : 178134 MiB
Memory footprint by rank 0 on node 0 : 61632 MiB
Device memory allocated via OpenMP offload : 0 MiB
Device memory allocated via CUDA allocator : 0 MiB
Free memory on the default device : 38648 MiB
==============================================================
VMCBatched Driver running with
total_walkers = 512
walkers_per_rank = [128(x4)]
num_crowds = 8
on rank 0, walkers_per_crowd = [16(x8)]
steps = 1
blocks = 5
===================================================================
--- Memory usage report : VMCBatched after initialLogEvaluation ---
===================================================================
Available memory on node 0, free + buffers : 176221 MiB
Memory footprint by rank 0 on node 0 : 62045 MiB
Device memory allocated via OpenMP offload : 127 MiB
Device memory allocated via CUDA allocator : 0 MiB
Free memory on the default device : 38366 MiB
===================================================================
</code>
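For reference, the two memory reports above can be compared with a short script. This is only a sketch: the helper names `parse_report` and `report_delta` are my own (not part of QMCPACK), and it assumes the report lines keep the `label : value MiB` layout shown in the output.

```python
import re

# Memory report lines copied from the VMCBatched output above.
BEFORE = """\
Available memory on node 0, free + buffers : 178134 MiB
Memory footprint by rank 0 on node 0 : 61632 MiB
Device memory allocated via OpenMP offload : 0 MiB
Device memory allocated via CUDA allocator : 0 MiB
Free memory on the default device : 38648 MiB
"""
AFTER = """\
Available memory on node 0, free + buffers : 176221 MiB
Memory footprint by rank 0 on node 0 : 62045 MiB
Device memory allocated via OpenMP offload : 127 MiB
Device memory allocated via CUDA allocator : 0 MiB
Free memory on the default device : 38366 MiB
"""

# Matches "some label : 12345 MiB" (assumed layout of the report lines).
REPORT_LINE = re.compile(r"^\s*(.+?)\s*:\s*(\d+)\s*MiB\s*$")


def parse_report(text):
    """Return {label: MiB} for every line that looks like 'label : N MiB'."""
    report = {}
    for line in text.splitlines():
        match = REPORT_LINE.match(line)
        if match:
            report[match.group(1)] = int(match.group(2))
    return report


def report_delta(before, after):
    """Return the change in MiB for every label present in both reports."""
    b, a = parse_report(before), parse_report(after)
    return {label: a[label] - b[label] for label in b if label in a}


deltas = report_delta(BEFORE, AFTER)
for label, change in deltas.items():
    print(f"{label}: {change:+d} MiB")
```

Between the two reports, the host footprint of rank 0 grows by 413 MiB, 127 MiB is allocated through OpenMP offload, and the free memory on the default device drops by 282 MiB, to 38366 MiB.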
I have tried reducing the meshfactor of the splines, but it had no effect on the outcome.
The dmc.err file contains the following:
<code>
Lmod is automatically replacing "nvhpc/23.9" with "gcc-native/12.3".
Due to MODULEPATH changes, the following have been reloaded:
1) cray-mpich/8.1.28
QMCPACK ERROR Primitive cell ion 0 vs supercell ion 0 atomic number not matching: 0 vs 75
QMCPACK ERROR Primitive cell ion 1 vs supercell ion 1 atomic number not matching: 0 vs 17
QMCPACK ERROR Primitive cell ion 2 vs supercell ion 2 atomic number not matching: 0 vs 17
QMCPACK ERROR Primitive cell ion 3 vs supercell ion 3 atomic number not matching: 0 vs 17
QMCPACK ERROR Primitive cell ion 4 vs supercell ion 4 atomic number not matching: 0 vs 17
QMCPACK ERROR Primitive cell ion 5 vs supercell ion 5 atomic number not matching: 0 vs 6
QMCPACK ERROR Primitive cell ion 6 vs supercell ion 6 atomic number not matching: 0 vs 6
QMCPACK ERROR Primitive cell ion 7 vs supercell ion 7 atomic number not matching: 0 vs 6
QMCPACK ERROR Primitive cell ion 8 vs supercell ion 11 atomic number not matching: 0 vs 1
QMCPACK ERROR Primitive cell ion 9 vs supercell ion 8 atomic number not matching: 0 vs 6
QMCPACK ERROR Primitive cell ion 10 vs supercell ion 12 atomic number not matching: 0 vs 1
QMCPACK ERROR Primitive cell ion 11 vs supercell ion 9 atomic number not matching: 0 vs 6
QMCPACK ERROR Primitive cell ion 12 vs supercell ion 13 atomic number not matching: 0 vs 1
QMCPACK ERROR Primitive cell ion 13 vs supercell ion 10 atomic number not matching: 0 vs 6
QMCPACK ERROR Primitive cell ion 14 vs supercell ion 14 atomic number not matching: 0 vs 1
QMCPACK ERROR Primitive cell ion 15 vs supercell ion 15 atomic number not matching: 0 vs 16
QMCPACK ERROR Primitive cell ion 16 vs supercell ion 16 atomic number not matching: 0 vs 16
x3005c0s13b0n0.hsn.cm.polaris.alcf.anl.gov: rank 2 died from signal 11 and dumped core
</code>
The manual and the workshop materials say that the primitive/supercell ERROR lines are expected, because the file converted using convertpw4qmc does not contain the ionic species information. I get the same errors on Baseline, but they do not affect the calculation there.
My input and output files are attached.
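To separate those expected messages from the actual failure, the error file can be filtered with a short script. This is a minimal sketch: the `unexpected_lines` helper and the regular expression are my own, written against the message format shown in dmc.err, and are not part of QMCPACK.

```python
import re

# Matches the primitive/supercell mismatch messages seen in dmc.err.
# The pattern is an assumption based on the lines shown in this post.
EXPECTED = re.compile(
    r"^QMCPACK ERROR Primitive cell ion \d+ vs supercell ion \d+ "
    r"atomic number not matching: \d+ vs \d+\s*$"
)


def unexpected_lines(lines):
    """Return the non-empty lines that are not expected mismatch messages."""
    kept = []
    for line in lines:
        stripped = line.strip()
        if stripped and not EXPECTED.match(stripped):
            kept.append(stripped)
    return kept


# A few lines copied from the dmc.err shown in this post.
SAMPLE = """\
Lmod is automatically replacing "nvhpc/23.9" with "gcc-native/12.3".
QMCPACK ERROR Primitive cell ion 0 vs supercell ion 0 atomic number not matching: 0 vs 75
QMCPACK ERROR Primitive cell ion 8 vs supercell ion 11 atomic number not matching: 0 vs 1
x3005c0s13b0n0.hsn.cm.polaris.alcf.anl.gov: rank 2 died from signal 11 and dumped core
"""

remaining = unexpected_lines(SAMPLE.splitlines())
for line in remaining:
    print(line)
```

Applied to the full dmc.err (for example via `unexpected_lines(open("dmc.err"))`), this leaves only the module-loading notices and the `rank 2 died from signal 11` line.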
Thank you,
Kayahan