CP2K scaling with Intel ONEAPI MPI + ethernet


Tat

Jan 31, 2022, 11:37:03 AM
to cp2k
Dear all,
we are trying to improve the poor scaling of CP2K we are seeing on a Linux cluster with several physical nodes: runs on 2 or more nodes are significantly slower than on a single node.
Each node has a 32-core Xeon Silver processor with hyperthreading, and the nodes are connected via Gigabit Ethernet. The runs use the parameters suggested by the plan.sh script, i.e.

for 1 node:
mpirun -np 16 -genv I_MPI_PIN_DOMAIN=auto -genv I_MPI_PIN_ORDER=bunch -genv OMP_PLACES=threads -genv OMP_PROC_BIND=SPREAD -genv OMP_NUM_THREADS=4 ~/cp2k-8.2/exe/Linux-x86-64-intelx/cp2k.psmp job.inp

for 2 nodes:
mpirun -r ssh -perhost 16 -host linux1,linux2 -genv I_MPI_PIN_DOMAIN=auto -genv I_MPI_PIN_ORDER=bunch -genv OMP_PLACES=threads -genv OMP_PROC_BIND=SPREAD -genv OMP_NUM_THREADS=4 ~/cp2k-8.2/exe/Linux-x86-64-intelx/cp2k.psmp job.inp

The CP2K psmp binary was compiled with Intel oneAPI mpiifort 2021.3.0.

What could be done to improve the performance? Can network communication or SSH cause the bottleneck? 
Any suggestions or references would be much appreciated.
Thanks & regards,

Attila

abin

Apr 25, 2022, 6:41:03 AM
to cp2k
Try switching to an InfiniBand network.
Just forget about Ethernet adapters if you are running heavy MPI workloads.
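
As a quick first check (a minimal sketch, assuming the libfabric-based Intel MPI runtime that ships with oneAPI), you can ask the runtime which fabric provider it actually picked; over plain Gigabit Ethernet it will typically fall back to the tcp provider:

# print the selected libfabric provider and process pinning at startup
mpirun -np 2 -perhost 1 -host linux1,linux2 -genv I_MPI_DEBUG=5 ~/cp2k-8.2/exe/Linux-x86-64-intelx/cp2k.psmp job.inp

# with an InfiniBand HCA installed, a faster provider can be requested explicitly
# (illustrative only; the available providers depend on your hardware and libfabric build)
mpirun -np 2 -perhost 1 -host linux1,linux2 -genv FI_PROVIDER=verbs ~/cp2k-8.2/exe/Linux-x86-64-intelx/cp2k.psmp job.inp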

hf.p...@gmail.com

May 7, 2022, 3:00:49 PM
to cp2k
As abin suggested, try using an HPC fabric such as InfiniBand or Omni-Path. The main point is that latency over Ethernet is too high for the communication patterns common in MD applications. Note that it is mostly about latency, not transfer bandwidth (even though bandwidth is what gets advertised, e.g. "100G" or 200/400G). I ran similar experiments in the past with CP2K and QE, and strong-scaling performance dropped off quickly (often two nodes already needed a longer time to solution than a single node). Of course this depends on the workload, but reductions and all-to-all communication are generally the bottleneck.
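
If you want to quantify this on your setup, the Intel MPI Benchmarks (IMB-MPI1, shipped with Intel MPI; the exact path or module setup may differ on your installation) let you measure small-message latency and all-to-all times between your two nodes, e.g.:

# 2 ranks, one per node: small-message PingPong latency between linux1 and linux2
mpirun -np 2 -perhost 1 -host linux1,linux2 IMB-MPI1 PingPong

# 32 ranks across both nodes: Alltoall timings, the pattern that typically limits CP2K/QE scaling
mpirun -np 32 -perhost 16 -host linux1,linux2 IMB-MPI1 Alltoall

Gigabit Ethernet PingPong latencies are typically several tens of microseconds, versus a few microseconds on InfiniBand, which is where the strong-scaling loss comes from.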