So I finally got decent performance with gfortran, OpenMPI, and OpenBLAS across InfiniBand. Now I find that using OpenMP with
half the number of MPI processes seems to give better performance for the 64-molecule H2O test case. Is that reasonable? To make the popt version I recompiled everything, including BLAS, ScaLAPACK, etc., without -fopenmp.
I find in seconds:
nodes  MPI procs  build  OMP_NUM_THREADS  time (s)
1      16         psmp   1                834
1      16         popt   1                836
2      16         psmp   2                266
2      32         popt   1                430
4      64         popt   1                331
4      32         psmp   2                189
4      64         psmp   4                166
So you see there is no overhead from using the psmp build (compiled with OpenMP) with the thread count set to 1.
Using OpenMP threads improves performance far more than just increasing the number of MPI processes.
This may be because this machine has only 1 GB of memory per core, but even 4 threads is better than 2, so it seems OpenMP
is more efficient than MPI here.
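A quick way to put numbers on this is to turn the timings above into speedups relative to the 1-node run of the same build. This is only a minimal Python sketch: `speedup_table` is a hypothetical helper name of mine (nothing from CP2K), and it assumes the 1-node time is a fair baseline and that every node has the same core count.

```python
# Timings copied as posted: (nodes, MPI procs, build, OMP_NUM_THREADS, seconds)
runs = [
    (1, 16, "psmp", 1, 834),
    (1, 16, "popt", 1, 836),
    (2, 16, "psmp", 2, 266),
    (2, 32, "popt", 1, 430),
    (4, 64, "popt", 1, 331),
    (4, 32, "psmp", 2, 189),
    (4, 64, "psmp", 4, 166),
]


def speedup_table(runs):
    """Return one row per run with speedup and per-node efficiency appended.

    Assumption: the baseline for each build is its fastest 1-node run.
    Efficiency is speedup divided by the number of nodes used.
    """
    baseline = {}
    for nodes, procs, build, threads, secs in runs:
        if nodes == 1:
            baseline[build] = min(secs, baseline.get(build, secs))

    rows = []
    for nodes, procs, build, threads, secs in runs:
        speedup = baseline[build] / secs
        rows.append((nodes, procs, build, threads, secs, speedup, speedup / nodes))
    return rows


if __name__ == "__main__":
    for nodes, procs, build, threads, secs, speedup, eff in speedup_table(runs):
        print(f"{nodes} nodes {procs:3d} procs {build} "
              f"threads={threads} {secs:4d} s  "
              f"speedup={speedup:.2f}  efficiency={eff:.2f}")
```

By this measure the 2-node psmp run is a bit more than 3 times faster than the 1-node run on only twice the hardware. A better-than-linear gain like that would be consistent with the 1-node run being held back by memory, though the timings alone cannot confirm it.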
There is still room for improvement, though. Any ideas on how to squeeze out better performance?
Ron