Problem with qmcpack v 4.1.0 using GPUs


Andrea Zen

Dec 15, 2025, 8:50:00 PM
to qmcpack
Hi,
I am observing an issue with qmcpack v 4.1.0 using GPUs on the Booster partition of the Leonardo cluster at CINECA (https://docs.hpc.cineca.it/hpc/leonardo.html#system-architecture).
I am trying to understand whether the problem comes from my compilation, the way I run the code, or something else.

Let me explain the problem. I was running the GPU version and noticed it was slower than I expected. So I took a small system (a water-methane complex) and ran it with both the GPU and the CPU-only builds of the code, and found that the CPU-only version is much faster, despite using the same resources and not employing the GPUs.
I am attaching the outputs of the two calculations.
Both runs used two nodes of the Booster partition, which has 4 GPUs and 32 CPU cores per node, so I used 4 MPI tasks per node (8 in total) and 8 OpenMP threads per MPI task; a sketch of the job layout is below.
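
A minimal sketch of the job script (the exact SLURM directives and the input file name here are placeholders, not my actual submission script):

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:4

export OMP_NUM_THREADS=8
srun qmcpack dmc.xml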

The timings are:
GPU:
Timer         Inclusive_time  Exclusive_time  Calls       Time_per_call
Total          517.1148     2.5066              1     517.114793363
  DMCBatched   190.5414   190.5414              1     190.541434383
  Startup        0.1267     0.1267              1       0.126677281
  VMCBatched   323.9401   323.9401              1     323.940080429

CPU-only:
Timer         Inclusive_time  Exclusive_time  Calls       Time_per_call
Total          147.9119     0.0558              1     147.911921762
  DMCBatched    82.2949    82.2949              1      82.294926232
  Startup        0.1188     0.1188              1       0.118750077
  VMCBatched    65.4425    65.4425              1      65.442487005

As you can see, CPU-only is way faster.

This is how I compiled the GPU version: 

module load cmake   # 4.1.2
module load ninja
module load gcc/12.2.0
module load cuda/12.2
module load openmpi/4.1.6--gcc--12.2.0-cuda-12.2
module load fftw/3.3.10--openmpi--4.1.6--gcc--12.2.0-spack0.22
module load hdf5/1.14.3--openmpi--4.1.6--gcc--12.2.0-spack0.22
module load boost/1.85.0--openmpi--4.1.6--gcc--12.2.0
module load openblas/0.3.26--gcc--12.2.0
cmake -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx \
  -DQMC_COMPLEX=OFF -DQMC_MIXED_PRECISION=OFF \
  -DQMC_GPU="cuda" -DQMC_GPU_ARCHS=sm_80 \
  ../qmcpack-4.1.0
make -j 32

The CPU-only version was compiled with:
cmake -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx \
  -DQMC_COMPLEX=OFF -DQMC_MIXED_PRECISION=OFF \
  ../qmcpack-4.1.0
make -j 32

Can somebody help me with this?

Best,
Andrea Zen



dmc_CPUonly.out
dmc_GPU.out

Andrea Zen

Dec 15, 2025, 8:59:14 PM
to qmcpack
Please find attached the inputs to reproduce the calculations.
TEST_DMC_TEMPLATE_long.zip

Ye Luo

Dec 15, 2025, 9:08:49 PM
to Andrea Zen, qmcpack
Please remove -DQMC_GPU="cuda".
Our CMake will set CUDA and OpenMP for you.
When you enforce CUDA only, the OpenMP offload feature is turned off. You can check in the qmcpack printout whether specific features are on or off.
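For example, something along these lines (the exact wording of the feature lines in the printout is an assumption):

grep -i -E "offload|cuda" dmc_GPU.out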
Ye

On Dec 15, 2025, at 7:50 PM, Andrea Zen <zen.an...@gmail.com> wrote:

  -DQMC_GPU="cuda"

Andrea Zen

Dec 16, 2025, 12:53:17 AM
to qmcpack
Hi Ye,
I removed the flag and tried recompiling, but the build fails with the following error:

/leonardo_scratch/fast/EUHPC_R04_130/azen/qmcpack-4.1.0_real/src/config.h:39:29: error: array section does not have mappable type in 'map' clause
   39 |   #define PRAGMA_OFFLOAD(x) _Pragma(x)
      |                             ^~~~~~~
/leonardo_scratch/fast/EUHPC_R04_130/azen/qmcpack-4.1.0/src/Particle/SoaDistanceTableABOMPTarget.h:144:36: note: in expansion of macro 'PRAGMA_OFFLOAD'
  144 |   ~SoaDistanceTableABOMPTarget() { PRAGMA_OFFLOAD("omp target exit data map(delete : this[:1])") }
      |                                    ^~~~~~~~~~~~~~
/leonardo_scratch/fast/EUHPC_R04_130/azen/qmcpack-4.1.0/src/Particle/DistanceTable.h:41:29: note: static field 'qmcplusplus::DistanceTable::DIM' is not mappable
   41 |   static constexpr unsigned DIM = OHMMS_DIM;
      |                             ^~~
/leonardo/prod/spack/06/install/0.22/linux-rhel8-icelake/gcc-8.5.0/gcc-12.2.0-lkcazt4letxjj4s7nlhzryoyivevsatz/lib/gcc/x86_64-pc-linux-gnu/12.2.0/../../../../include/c++/12.2.0/bits/basic_string.h:139:33: note: static field 'std::__cxx11::basic_string<char>::npos' is not mappable
  139 |       static const size_type    npos = static_cast<size_type>(-1);
      |                                 ^~~~
/leonardo_scratch/fast/EUHPC_R04_130/azen/qmcpack-4.1.0/src/Platforms/CPU/SIMD/Mallocator.hpp:32:27: note: static field 'qmcplusplus::Mallocator<double, 64>::alignment' is not mappable
   32 |   static constexpr size_t alignment = ALIGN;
      |                           ^~~~~~~~~
/leonardo_scratch/fast/EUHPC_R04_130/azen/qmcpack-4.1.0/src/Platforms/CPU/SIMD/Mallocator.hpp:32:27: note: static field 'qmcplusplus::Mallocator<double, 64>::alignment' is not mappable
/leonardo_scratch/fast/EUHPC_R04_130/azen/qmcpack-4.1.0/src/Platforms/CPU/SIMD/Mallocator.hpp:32:27: note: static field 'qmcplusplus::Mallocator<double, 64>::alignment' is not mappable
/leonardo_scratch/fast/EUHPC_R04_130/azen/qmcpack-4.1.0/src/Platforms/CPU/SIMD/Mallocator.hpp:32:27: note: static field 'qmcplusplus::Mallocator<double, 64>::alignment' is not mappable
[ 18%] Linking CXX static library libcontainer_testing.a
make[2]: *** [src/Particle/CMakeFiles/qmcparticle_omptarget.dir/build.make:76: src/Particle/CMakeFiles/qmcparticle_omptarget.dir/createDistanceTableAAOMPTarget.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
[ 18%] Built target container_testing
make[2]: *** [src/Particle/CMakeFiles/qmcparticle_omptarget.dir/build.make:90: src/Particle/CMakeFiles/qmcparticle_omptarget.dir/createDistanceTableABOMPTarget.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:4020: src/Particle/CMakeFiles/qmcparticle_omptarget.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[ 18%] Linking CXX static library libformic_utils.a
[ 18%] Linking CXX executable ../../../../bin/ppconvert
[ 18%] Built target formic_utils
[ 18%] Built target ppconvert
[ 18%] Linking CXX executable ../../bin/convertpw4qmc
[ 18%] Built target convertpw4qmc
[ 18%] Linking CXX static library libcatch_main_no_mpi.a
[ 18%] Built target catch_main_no_mpi
[ 18%] Linking CXX static library libcatch_main.a
[ 18%] Built target catch_main
make: *** [Makefile:146: all] Error 2

Ye Luo

Dec 16, 2025, 1:03:45 AM
to Andrea Zen, qmcpack
Please use llvm 17 or above. I guess you missed the multiple CMake warnings that gcc is not good for OpenMP offload.
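
A minimal sketch of the adjusted build (the llvm module name on Leonardo, and using Open MPI's OMPI_CC/OMPI_CXX overrides to point the mpicc/mpicxx wrappers at clang, are assumptions):

module load llvm   # needs version 17 or above, built with NVPTX offload support
export OMPI_CC=clang OMPI_CXX=clang++
cmake -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx \
  -DQMC_COMPLEX=OFF -DQMC_MIXED_PRECISION=OFF \
  -DQMC_GPU_ARCHS=sm_80 \
  ../qmcpack-4.1.0
make -j 32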
Ye


Paul R. C. Kent

Dec 16, 2025, 1:14:57 AM
to qmcpack
From the module names, it looks like you are building with gcc. For performant GPU offload you need a recent version of the llvm compiler (newer is better). This is a bit hidden in e.g. config/build_alcf_polaris_Clang.sh, but the recipe for the Perlmutter machine, config/build_nersc_perlmutter_Clang.sh, is clearer. These are both A100 machines (the same 4x A100 per node as Leonardo), so good speedups should be realizable.