Running CUDA-enabled CP2K on multiple nodes with aprun/HPC

Ada Sedova

Sep 12, 2017, 5:27:12 PM

to cp2k

Hi,

I just built CP2K 4.1 using DBCSR and CUDA_PW on a Cray HPC system with one K20X GPU per node. A single aprun call always seems to launch three aprun processes, as can be seen with ps; one of them always stops very quickly while the others keep running. I am testing in an interactive qsub session, but this looks like nonstandard behavior and may be problematic in a batch setting. At any rate, it does not look like correct MPI behavior, based on other programs I have run.

I wonder if the build was not completely successful? I'm testing with H2O-32.inp from the tests/benchmark directory but this also happened with C.inp.

Thanks,

Ada

Andreas Glöss

Sep 13, 2017, 11:12:51 AM

to cp2k

Dear Ada,

Can you please post the aprun line and the resources requested for the interactive reservation? Also check that 'ldd [your_executable]' shows the same library paths on the compile node and the compute node.
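
A minimal sketch of such an ldd comparison, assuming a POSIX shell with sed and sort available. The function name and the file names in the usage comments are placeholders, not taken from this thread. It strips the load addresses, which normally differ between runs, so that only the library paths are compared:

```shell
#!/bin/sh
# Strip the trailing load address, e.g. " (0x00007f...)", from each line of
# ldd output and sort the result so two listings can be compared with diff.
normalize_ldd() {
    sed -e 's/ *(0x[0-9a-fA-F]*)$//' | sort
}

# Hypothetical usage (file names are placeholders):
#   ldd ./cp2k.psmp | normalize_ldd > ldd_compile_node.txt   # on the compile node
#   ldd ./cp2k.psmp | normalize_ldd > ldd_compute_node.txt   # inside the job session
#   diff ldd_compile_node.txt ldd_compute_node.txt && echo "library paths match"
```

If diff prints nothing, the two environments resolve the same libraries.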

Best regards,
Andreas

Ada Sedova

Sep 13, 2017, 12:42:28 PM

to cp...@googlegroups.com

Thanks, Andreas.

Here are the various interactive requests:

qsub -I -A stf006 -l walltime=00:40:00 -l nodes=1
qsub -I -A stf006 -l walltime=00:40:00 -l nodes=2
qsub -I -A stf006 -l walltime=00:40:00 -l nodes=3
qsub -I -A stf006 -l walltime=00:40:00 -l nodes=4

Here is an example of the environment setup and the aprun call, run from the work directory:

module load cudatoolkit
module load fftw
module swap PrgEnv-pgi/5.2.82 PrgEnv-gnu
export CRAY_CUDA_PROXY=1
aprun -n 3 /ccs/home/adaa/cp2k-titan-gnu4.9.3/cp2k/exe/titan_gnu_cuda/cp2k.psmp -i H2O-32.inp -o H2O-32.out &

(I also tried various other values for -n, with and without export OMP_NUM_THREADS=8, as in the CUDA profiling part of the website.)
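
For trying different combinations of -n and OMP_NUM_THREADS, a small helper can keep the rank and thread counts consistent. This is only a sketch: the function name build_aprun_cmd is made up for illustration, and the value of 16 cores per node is an assumption that should be checked against the actual node type. It uses the aprun options -n (total ranks), -N (ranks per node) and -d (cores reserved per rank), and it prints the command rather than running it:

```shell
#!/bin/sh
# Print (not run) the commands for a hybrid MPI/OpenMP launch.
#   $1 = number of nodes, $2 = MPI ranks per node, $3 = OpenMP threads per rank
#   remaining arguments = executable and its options
# CORES_PER_NODE=16 is an assumption; override it for other node types.
CORES_PER_NODE=${CORES_PER_NODE:-16}

build_aprun_cmd() {
    nodes=$1; ranks_per_node=$2; threads=$3
    shift 3
    # Refuse layouts that ask for more cores than one node provides.
    if [ $((ranks_per_node * threads)) -gt "$CORES_PER_NODE" ]; then
        echo "error: ${ranks_per_node} ranks x ${threads} threads exceeds ${CORES_PER_NODE} cores per node" >&2
        return 1
    fi
    echo "export OMP_NUM_THREADS=${threads}"
    # -n: total ranks, -N: ranks per node, -d: cores reserved per rank
    echo "aprun -n $((nodes * ranks_per_node)) -N ${ranks_per_node} -d ${threads} $*"
}

# Example: 2 nodes, 2 ranks per node, 8 threads per rank
build_aprun_cmd 2 2 8 ./cp2k.psmp -i H2O-32.inp -o H2O-32.out
```

Printing the command first makes it easy to review the layout before pasting it into the interactive session.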

The library paths that ldd reports for the executable are identical inside a qsub session on the work node and on the compile node, except for the hexadecimal addresses.

By the way, I also built 4.1 on Eos, another Cray but without GPUs, as psmp without CUDA, and there are no problems running on multiple nodes.

Thanks again,

Ada

Andreas Glöss

Sep 14, 2017, 1:44:18 AM

to cp2k

Dear Ada,

I can't see any obvious mistake, and here at CSCS/PizDaint we only have pure SLURM (srun), so let's proceed systematically:

1) Swap the order of the module swap/load commands, in case there is a mistake in Cray's modules, like this:

module swap PrgEnv-pgi/5.2.82 PrgEnv-gnu
module load fftw
module load cudatoolkit

Then distclean and recompile CP2K.

2) Start an interactive session for just one node with:

qsub -I -A stf006 -l nodes=1,walltime=00:40:00

3) Start CP2K using 4 MPI ranks and 1 OpenMP thread per rank with:

module swap PrgEnv-pgi/5.2.82 PrgEnv-gnu
module load cudatoolkit
module load fftw
export CRAY_CUDA_PROXY=1
export OMP_NUM_THREADS=1
aprun -n 4 /ccs/home/adaa/cp2k-titan-gnu4.9.3/cp2k/exe/titan_gnu_cuda/cp2k.psmp -i H2O-32.inp -o H2O-32.out

4) Use ps to determine the number of cp2k.psmp processes running (not aprun processes) with:

ps -ef | grep cp2k.psmp

What's the outcome - 4 or more?
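
A plain grep also matches the grep process itself and any aprun line whose arguments contain the executable path, which can inflate the count. The following sketch counts only lines whose command is the cp2k.psmp binary. It assumes the default "ps -ef" column layout, where the command starts at field 8, and that the ranks are visible to ps on the node where it is run; the function name is made up for illustration:

```shell
#!/bin/sh
# Count lines of "ps -ef"-style input whose command (field 8) is the
# cp2k.psmp binary itself. Lines for grep or for the aprun launcher have a
# different command in field 8 and are therefore not counted.
count_cp2k_procs() {
    awk '
        {
            # Take the basename of the command path in field 8.
            n = split($8, parts, "/")
            if (parts[n] == "cp2k.psmp") count++
        }
        END { print count + 0 }
    '
}

# Hypothetical usage on a live system:
#   ps -ef | count_cp2k_procs
```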

Best regards,
Andreas

Ada Sedova

Sep 14, 2017, 11:27:33 AM

to cp...@googlegroups.com

Hi,

I get 3 cp2k processes when I try this, plus the grep process. I also got the "[1]+ Stopped aprun -n 4 (etc.)" message again with this configuration, while the job keeps running. The same thing happened with another program yesterday, so it may just be something about interactive mode on Titan with multiple processes and the GPU that I had not noticed before. I will do some more research. The jobs seem to complete correctly: so far, 10 steps of H2O-32 finish in about 28 minutes on 1 node and about 18 minutes on 2 nodes. I may need to tweak the number of processes I request, the threading, and so on. Any advice on this would be great.
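
To put the two timings in perspective, the speedup and parallel efficiency can be worked out directly. This is a small sketch using awk for the arithmetic; the function name is made up, and the inputs are the approximate times quoted above:

```shell
#!/bin/sh
# Speedup and parallel efficiency from two wall-clock times.
#   $1 = time on the baseline node count, $2 = baseline node count
#   $3 = time on the larger node count,   $4 = larger node count
scaling() {
    awk -v t1="$1" -v n1="$2" -v t2="$3" -v n2="$4" 'BEGIN {
        speedup = t1 / t2                   # how much faster the larger run is
        efficiency = speedup / (n2 / n1)    # fraction of ideal linear scaling
        printf "speedup %.2f, efficiency %.0f%%\n", speedup, 100 * efficiency
    }'
}

# Approximate timings quoted above: 28 min on 1 node, 18 min on 2 nodes
scaling 28 1 18 2
# prints: speedup 1.56, efficiency 78%
```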

I was also wondering how to get maximal acceleration for CP2K. Does libcusmm take the place of libsmm, or should I still rebuild with libsmm as well? What about ELPA and the other optional libraries that increase performance? I am mostly concerned with linear-scaling DFT for AIMD, but I also care about dispersion corrections, and I don't know how well the two work together. I also did not build with libsci_acc, which I probably should do. Can you tell me everything that should be included in the build to get the best-scaling DFT-MD?

Thanks so much,

Ada
