I am working with an installation of NWChem that does not seem to run as efficiently as I would expect given the hardware.
The cluster consists mostly of dual-socket Xeon E5-2669A v4 nodes (2.40 GHz, 22 cores per socket) with 512 GB of RAM each.
I have already built my own version using ARMCI_NETWORK=MPI-PR rather than OPENIB, with additional variables such as BLAS_SIZE and SCALAPACK set. The build made by the local research computing team uses the following setup:
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export ARMCI_NETWORK=OPENIB
export NWCHEM_TARGET=LINUX64
export MPI_LOC=/usr/mpi/intel/mvapich2-2.3a
export MPI_LIB=$MPI_LOC/lib
export MPI_INCLUDE=$MPI_LOC/include
export LIBMPI="-lmpifort -lmpi"
export MKL64=/nas/longleaf/apps/intel/17.2/intel/compilers_and_libraries_2017/linux/mkl/lib/intel64
export BLASOPT="-L${MKL64} -Wl,--start-group -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -Wl,--end-group"
export LAPACK_LIB=-lmkl_lapack95_lp64
cd $NWCHEM_TOP/src
make nwchem_config
make FC=ifort CC=icc >& make.log &
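For comparison, an MPI-PR variant of the setup above could look roughly like the sketch below. This is only an illustration under stated assumptions: the MKL ScaLAPACK/BLACS library names, the SCALAPACK_SIZE value, and the choice to point LAPACK_LIB at the same ILP64 link line are guesses based on a typical MKL ILP64 setup with an MPICH-compatible MPI such as MVAPICH2, not settings taken from either build.

```shell
# Hypothetical MPI-PR variant of the environment above (a sketch, not a tested recipe).
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export ARMCI_NETWORK=MPI-PR          # instead of OPENIB
export NWCHEM_TARGET=LINUX64
# MPI_LOC, MPI_LIB, MPI_INCLUDE and LIBMPI are assumed unchanged from the setup above.

# Same MKL location as in the original setup.
export MKL64=/nas/longleaf/apps/intel/17.2/intel/compilers_and_libraries_2017/linux/mkl/lib/intel64

# 8-byte integers for BLAS/LAPACK, matching the ILP64 MKL interface layer.
export BLAS_SIZE=8
export BLASOPT="-L${MKL64} -Wl,--start-group -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -Wl,--end-group"
# Assumption: reuse the ILP64 link line for LAPACK so integer sizes stay consistent.
export LAPACK_LIB="${BLASOPT}"

# Assumption: ILP64 ScaLAPACK with the MPICH-ABI BLACS layer shipped with MKL.
export USE_SCALAPACK=y
export SCALAPACK_SIZE=8
export SCALAPACK="-L${MKL64} -lmkl_scalapack_ilp64 -lmkl_blacs_intelmpi_ilp64"
```

The point of the sketch is that every MKL component is linked through the same integer interface (ILP64 here), so BLAS_SIZE and SCALAPACK_SIZE agree with the libraries actually named on the link line. Note also that MPI-PR dedicates one MPI rank per node to communication progress, so one of the ranks on each node does not perform computation.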