Job failure in Polaris with no user feedback

saritas...@gmail.com

Jan 15, 2025, 1:17:00 PM
to qmcpack
Dear QMCPACK developers, 

I am trying to run spin-orbit coupling QMC calculations on Polaris using the batched drivers of QMCPACK. I am using the precompiled executable in: /soft/applications/qmcpack/develop-20241118//build_polaris_Clang18_offload_cuda_cplx/bin

I am also using the job script provided in /soft/applications/qmcpack/develop-20241118/qmcpack-polaris.job. 

Despite significantly lowering the memory requirements, the job fails after it starts the VMC accumulation. 

I am able to run this calculation on Baseline (OLCF local CPU cluster) with no issues. 

Here are the last few lines of the output file with meshfactor=1.0, using the debug queue on Polaris: 

<code>
=========================================================
  Start VMCBatched
  File Root dmc.s000
=========================================================
==============================================================
--- Memory usage report : VMCBatched before initialization ---
==============================================================
Available memory on node 0, free + buffers :  178134 MiB
Memory footprint by rank 0 on node 0       :   61632 MiB
Device memory allocated via OpenMP offload :       0 MiB
Device memory allocated via CUDA allocator :       0 MiB
Free memory on the default device          :   38648 MiB
==============================================================
VMCBatched Driver running with
             total_walkers     = 512
             walkers_per_rank  = [128(x4)]
             num_crowds        = 8
  on rank 0, walkers_per_crowd = [16(x8)]

                         steps = 1
                        blocks = 5

===================================================================
--- Memory usage report : VMCBatched after initialLogEvaluation ---
===================================================================
Available memory on node 0, free + buffers :  176221 MiB
Memory footprint by rank 0 on node 0       :   62045 MiB
Device memory allocated via OpenMP offload :     127 MiB
Device memory allocated via CUDA allocator :       0 MiB
Free memory on the default device          :   38366 MiB
===================================================================
</code>

I have tried reducing the meshfactor of the splines, but it had no effect on the outcome. 
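
For reference, the change between the two memory reports above can be worked out with a short script. This is only a minimal sketch and not part of QMCPACK: the helper name `parse_report` is hypothetical, and the regular expression assumes the `label : N MiB` layout shown in the output above.

```python
import re

def parse_report(text):
    """Map each 'label : N MiB' line of a memory usage report to its value in MiB."""
    values = {}
    for line in text.splitlines():
        m = re.match(r"^(.*?)\s*:\s*(\d+) MiB\s*$", line)
        if m:
            values[m.group(1).strip()] = int(m.group(2))
    return values

# Values copied from the two reports quoted above.
before = parse_report("""\
Available memory on node 0, free + buffers :  178134 MiB
Memory footprint by rank 0 on node 0       :   61632 MiB
Device memory allocated via OpenMP offload :       0 MiB
Device memory allocated via CUDA allocator :       0 MiB
Free memory on the default device          :   38648 MiB
""")
after = parse_report("""\
Available memory on node 0, free + buffers :  176221 MiB
Memory footprint by rank 0 on node 0       :   62045 MiB
Device memory allocated via OpenMP offload :     127 MiB
Device memory allocated via CUDA allocator :       0 MiB
Free memory on the default device          :   38366 MiB
""")

# Change in each quantity between the two reports, in MiB.
delta = {label: after[label] - before[label] for label in before}
for label, change in delta.items():
    print(f"{label}: {change:+d} MiB")
```

With these numbers the rank 0 host footprint grows by 413 MiB and the free device memory drops by 282 MiB, leaving 38366 MiB free on the device at the last report. That only describes the state at the last report, not what happens afterwards.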

The dmc.err file contains the following:

<code>
Lmod is automatically replacing "nvhpc/23.9" with "gcc-native/12.3".


Due to MODULEPATH changes, the following have been reloaded:
  1) cray-mpich/8.1.28

QMCPACK ERROR Primitive cell ion 0 vs supercell ion 0 atomic number not matching: 0 vs 75
QMCPACK ERROR Primitive cell ion 1 vs supercell ion 1 atomic number not matching: 0 vs 17
QMCPACK ERROR Primitive cell ion 2 vs supercell ion 2 atomic number not matching: 0 vs 17
QMCPACK ERROR Primitive cell ion 3 vs supercell ion 3 atomic number not matching: 0 vs 17
QMCPACK ERROR Primitive cell ion 4 vs supercell ion 4 atomic number not matching: 0 vs 17
QMCPACK ERROR Primitive cell ion 5 vs supercell ion 5 atomic number not matching: 0 vs 6
QMCPACK ERROR Primitive cell ion 6 vs supercell ion 6 atomic number not matching: 0 vs 6
QMCPACK ERROR Primitive cell ion 7 vs supercell ion 7 atomic number not matching: 0 vs 6
QMCPACK ERROR Primitive cell ion 8 vs supercell ion 11 atomic number not matching: 0 vs 1
QMCPACK ERROR Primitive cell ion 9 vs supercell ion 8 atomic number not matching: 0 vs 6
QMCPACK ERROR Primitive cell ion 10 vs supercell ion 12 atomic number not matching: 0 vs 1
QMCPACK ERROR Primitive cell ion 11 vs supercell ion 9 atomic number not matching: 0 vs 6
QMCPACK ERROR Primitive cell ion 12 vs supercell ion 13 atomic number not matching: 0 vs 1
QMCPACK ERROR Primitive cell ion 13 vs supercell ion 10 atomic number not matching: 0 vs 6
QMCPACK ERROR Primitive cell ion 14 vs supercell ion 14 atomic number not matching: 0 vs 1
QMCPACK ERROR Primitive cell ion 15 vs supercell ion 15 atomic number not matching: 0 vs 16
QMCPACK ERROR Primitive cell ion 16 vs supercell ion 16 atomic number not matching: 0 vs 16
x3005c0s13b0n0.hsn.cm.polaris.alcf.anl.gov: rank 2 died from signal 11 and dumped core
</code>

The manual and the workshop materials say that the primitive/supercell ERROR lines are expected, because the file converted with convertpw4qmc does not contain the ionic species information. I get the same errors on Baseline, where they do not affect the calculation. 
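
To separate those expected lines from anything else in dmc.err, a small filter along these lines can help. It is a minimal sketch, not a QMCPACK tool: the function name `unexpected_lines` is hypothetical, the pattern is written to match the message text quoted above, and treating that message as harmless is an assumption based on the manual's description.

```python
import re

# Pattern for the ion-mismatch message quoted above. Treating it as
# "expected" is an assumption taken from the manual's description.
EXPECTED = re.compile(
    r"^QMCPACK ERROR Primitive cell ion \d+ vs supercell ion \d+ "
    r"atomic number not matching: \d+ vs \d+$"
)

def unexpected_lines(err_text):
    """Return the non-empty lines that are not the expected ion-mismatch message."""
    stripped = (line.strip() for line in err_text.splitlines())
    return [line for line in stripped if line and not EXPECTED.match(line)]

# Three lines copied from the dmc.err contents shown above.
sample = """\
QMCPACK ERROR Primitive cell ion 0 vs supercell ion 0 atomic number not matching: 0 vs 75
QMCPACK ERROR Primitive cell ion 16 vs supercell ion 16 atomic number not matching: 0 vs 16
x3005c0s13b0n0.hsn.cm.polaris.alcf.anl.gov: rank 2 died from signal 11 and dumped core
"""

for line in unexpected_lines(sample):
    print(line)
```

On the sample, only the `rank 2 died from signal 11` line remains. On the full file the Lmod module messages would also pass through, since the filter drops only the ion-mismatch lines.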

My input and output files are attached. 

Thank you,
Kayahan
soc_baseline.zip
soc_error_polaris.zip

Paul R. C. Kent

Jan 15, 2025, 1:32:25 PM
to qmcpack
Please can you create an issue on GitHub with this info? It is easiest to track and discuss problems there. https://github.com/QMCPACK/qmcpack/issues

saritas...@gmail.com

Jan 15, 2025, 2:13:47 PM
to qmcpack
Thank you Paul, just submitted the issue. 

saritas...@gmail.com

Jan 15, 2025, 2:14:08 PM
to qmcpack