Dear all,
we are trying to improve the poor scaling of CP2K that we see on a Linux cluster with several physical nodes: runs on 2 or more nodes are significantly slower than runs on a single node.
The nodes have 32-core Xeon Silver processors with hyperthreading and are connected by Gigabit Ethernet. The runs use the parameters provided by the plan.sh script, i.e.
for 1 node:
mpirun -np 16 -genv I_MPI_PIN_DOMAIN=auto -genv I_MPI_PIN_ORDER=bunch -genv OMP_PLACES=threads -genv OMP_PROC_BIND=SPREAD -genv OMP_NUM_THREADS=4 ~/cp2k-8.2/exe/Linux-x86-64-intelx/cp2k.psmp job.inp
for 2 nodes:
mpirun -r ssh -perhost 16 -host linux1,linux2 -genv I_MPI_PIN_DOMAIN=auto -genv I_MPI_PIN_ORDER=bunch -genv OMP_PLACES=threads -genv OMP_PROC_BIND=SPREAD -genv OMP_NUM_THREADS=4 ~/cp2k-8.2/exe/Linux-x86-64-intelx/cp2k.psmp job.inp
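For reference, the per-node thread count these commands request can be tallied as follows. This is only a sketch: it assumes 32 physical cores per node and 2 hardware threads per core (hyperthreading), and the variable names are illustrative.

```shell
#!/bin/sh
# Values taken from the mpirun lines above.
ranks_per_node=16      # -np 16 / -perhost 16
threads_per_rank=4     # OMP_NUM_THREADS=4

# Assumptions about the hardware (not confirmed by any tool output).
cores_per_node=32
hw_threads_per_core=2  # hyperthreading

requested=$((ranks_per_node * threads_per_rank))
available=$((cores_per_node * hw_threads_per_core))

echo "OpenMP threads requested per node: $requested"
echo "hardware threads available per node: $available"
```

Under these assumptions the layout uses every hardware thread on a node, hyperthreads included.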
CP2K (psmp) was compiled with Intel oneAPI mpiifort 2021.3.0.
What could be done to improve the performance? Could network communication or SSH be the bottleneck?
Any suggestions or references would be much appreciated.
Thanks & regards,
Attila