Problem with qmcpack v 4.1.0 using GPUs


Andrea Zen

Dec 15, 2025, 8:50:00 PM
to qmcpack
Hi,
I am observing an issue with qmcpack v 4.1.0 using GPUs on the Booster partition of the Leonardo cluster at CINECA (https://docs.hpc.cineca.it/hpc/leonardo.html#system-architecture).
I am trying to understand whether the problem comes from my compilation, the way I run the code, or something else.

Let me explain the problem. I was running the GPU version and noticed it was slower than I expected. So I took a small system (a water-methane complex) and ran it with both the GPU and the CPU-only builds of the code, and found that the CPU-only version is much faster, despite using the same resources and not employing the GPUs.
I am attaching the outputs of the two calculations.
Both runs used two nodes of the Booster partition, which has 4 GPUs and 32 CPU cores per node, so I used 4 MPI tasks per node (8 in total) and 8 OpenMP threads per MPI task; a sketch of the job layout is below.
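
A minimal sketch of the job script (the exact SLURM directives and the input file name here are placeholders, not my actual submission script):

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:4

export OMP_NUM_THREADS=8
srun qmcpack dmc.xml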

The timings are:
GPU:
Timer         Inclusive_time  Exclusive_time  Calls       Time_per_call
Total          517.1148     2.5066              1     517.114793363
  DMCBatched   190.5414   190.5414              1     190.541434383
  Startup        0.1267     0.1267              1       0.126677281
  VMCBatched   323.9401   323.9401              1     323.940080429

CPU-only:
Timer         Inclusive_time  Exclusive_time  Calls       Time_per_call
Total          147.9119     0.0558              1     147.911921762
  DMCBatched    82.2949    82.2949              1      82.294926232
  Startup        0.1188     0.1188              1       0.118750077
  VMCBatched    65.4425    65.4425              1      65.442487005

As you can see, CPU-only is way faster.

This is how I compiled the GPU version: 

module load cmake   # 4.1.2
module load ninja
module load gcc/12.2.0
module load cuda/12.2
module load openmpi/4.1.6--gcc--12.2.0-cuda-12.2
module load fftw/3.3.10--openmpi--4.1.6--gcc--12.2.0-spack0.22
module load hdf5/1.14.3--openmpi--4.1.6--gcc--12.2.0-spack0.22
module load boost/1.85.0--openmpi--4.1.6--gcc--12.2.0
module load openblas/0.3.26--gcc--12.2.0
cmake -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx \
  -DQMC_COMPLEX=OFF -DQMC_MIXED_PRECISION=OFF \
  -DQMC_GPU="cuda" -DQMC_GPU_ARCHS=sm_80 \
  ../qmcpack-4.1.0
make -j 32

The CPU-only version was compiled with:
cmake -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx \
  -DQMC_COMPLEX=OFF -DQMC_MIXED_PRECISION=OFF \
  ../qmcpack-4.1.0
make -j 32

Can somebody help me with this?

Best,
Andrea Zen



dmc_CPUonly.out
dmc_GPU.out

Andrea Zen

Dec 15, 2025, 8:59:14 PM
to qmcpack
Please find attached the inputs to reproduce the calculations.
TEST_DMC_TEMPLATE_long.zip

Ye Luo

Dec 15, 2025, 9:08:49 PM
to Andrea Zen, qmcpack
Please remove -DQMC_GPU="cuda".
Our CMake will set CUDA and OpenMP for you.
When you enforce CUDA only, the OpenMP offload feature is turned off. You can check in the qmcpack printout whether specific features are on or off.
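For example, something along these lines (the exact wording of the feature lines in the printout is an assumption):

grep -i -E "offload|cuda" dmc_GPU.out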
Ye

On Dec 15, 2025, at 7:50 PM, Andrea Zen <zen.an...@gmail.com> wrote:

  -DQMC_GPU="cuda"

Andrea Zen

Dec 16, 2025, 12:53:17 AM
to qmcpack
Hi Ye,
I removed the flag and tried recompiling, but the build fails with the following error:

/leonardo_scratch/fast/EUHPC_R04_130/azen/qmcpack-4.1.0_real/src/config.h:39:29: error: array section does not have mappable type in 'map' clause
   39 |   #define PRAGMA_OFFLOAD(x) _Pragma(x)
      |                             ^~~~~~~
/leonardo_scratch/fast/EUHPC_R04_130/azen/qmcpack-4.1.0/src/Particle/SoaDistanceTableABOMPTarget.h:144:36: note: in expansion of macro 'PRAGMA_OFFLOAD'
  144 |   ~SoaDistanceTableABOMPTarget() { PRAGMA_OFFLOAD("omp target exit data map(delete : this[:1])") }
      |                                    ^~~~~~~~~~~~~~
/leonardo_scratch/fast/EUHPC_R04_130/azen/qmcpack-4.1.0/src/Particle/DistanceTable.h:41:29: note: static field 'qmcplusplus::DistanceTable::DIM' is not mappable
   41 |   static constexpr unsigned DIM = OHMMS_DIM;
      |                             ^~~
/leonardo/prod/spack/06/install/0.22/linux-rhel8-icelake/gcc-8.5.0/gcc-12.2.0-lkcazt4letxjj4s7nlhzryoyivevsatz/lib/gcc/x86_64-pc-linux-gnu/12.2.0/../../../../include/c++/12.2.0/bits/basic_string.h:139:33: note: static field 'std::__cxx11::basic_string<char>::npos' is not mappable
  139 |       static const size_type    npos = static_cast<size_type>(-1);
      |                                 ^~~~
/leonardo_scratch/fast/EUHPC_R04_130/azen/qmcpack-4.1.0/src/Platforms/CPU/SIMD/Mallocator.hpp:32:27: note: static field 'qmcplusplus::Mallocator<double, 64>::alignment' is not mappable
   32 |   static constexpr size_t alignment = ALIGN;
      |                           ^~~~~~~~~
/leonardo_scratch/fast/EUHPC_R04_130/azen/qmcpack-4.1.0/src/Platforms/CPU/SIMD/Mallocator.hpp:32:27: note: static field 'qmcplusplus::Mallocator<double, 64>::alignment' is not mappable
/leonardo_scratch/fast/EUHPC_R04_130/azen/qmcpack-4.1.0/src/Platforms/CPU/SIMD/Mallocator.hpp:32:27: note: static field 'qmcplusplus::Mallocator<double, 64>::alignment' is not mappable
/leonardo_scratch/fast/EUHPC_R04_130/azen/qmcpack-4.1.0/src/Platforms/CPU/SIMD/Mallocator.hpp:32:27: note: static field 'qmcplusplus::Mallocator<double, 64>::alignment' is not mappable
[ 18%] Linking CXX static library libcontainer_testing.a
make[2]: *** [src/Particle/CMakeFiles/qmcparticle_omptarget.dir/build.make:76: src/Particle/CMakeFiles/qmcparticle_omptarget.dir/createDistanceTableAAOMPTarget.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
[ 18%] Built target container_testing
make[2]: *** [src/Particle/CMakeFiles/qmcparticle_omptarget.dir/build.make:90: src/Particle/CMakeFiles/qmcparticle_omptarget.dir/createDistanceTableABOMPTarget.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:4020: src/Particle/CMakeFiles/qmcparticle_omptarget.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[ 18%] Linking CXX static library libformic_utils.a
[ 18%] Linking CXX executable ../../../../bin/ppconvert
[ 18%] Built target formic_utils
[ 18%] Built target ppconvert
[ 18%] Linking CXX executable ../../bin/convertpw4qmc
[ 18%] Built target convertpw4qmc
[ 18%] Linking CXX static library libcatch_main_no_mpi.a
[ 18%] Built target catch_main_no_mpi
[ 18%] Linking CXX static library libcatch_main.a
[ 18%] Built target catch_main
make: *** [Makefile:146: all] Error 2

Ye Luo

Dec 16, 2025, 1:03:45 AM
to Andrea Zen, qmcpack
Please use llvm 17 or above. I guess you missed the multiple CMake warnings that gcc is not good for OpenMP offload.
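
A minimal sketch of the adjusted build (the llvm module name on Leonardo, and using Open MPI's OMPI_CC/OMPI_CXX overrides to point the mpicc/mpicxx wrappers at clang, are assumptions):

module load llvm   # needs version 17 or above, built with NVPTX offload support
export OMPI_CC=clang OMPI_CXX=clang++
cmake -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx \
  -DQMC_COMPLEX=OFF -DQMC_MIXED_PRECISION=OFF \
  -DQMC_GPU_ARCHS=sm_80 \
  ../qmcpack-4.1.0
make -j 32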
Ye


Paul R. C. Kent

Dec 16, 2025, 1:14:57 AM
to qmcpack
From the module names, it looks like you are building with gcc. For performant GPU offload you need a recent version of the llvm compiler (newer is better). This is a bit hidden in e.g. config/build_alcf_polaris_Clang.sh, but the recipe for the Perlmutter machine, config/build_nersc_perlmutter_Clang.sh, is clearer. These are both A100 machines (the same 4x A100 per node as Leonardo), so good speedups should be realizable.