Bug report: Massive slowdown/hang when using more than 2 MPI processes

Johannes Stöckelmaier

Jan 21, 2024, 8:40:23 AM
to NWChem Forum
Dear Community!

When I run a calculation of three protein fragments in implicit solvation (please find the attached files below), I receive the following error:

      Screening Tolerance Information
      -------------------------------
          Density screening/tol_rho:  1.00D-11
          AO Gaussian exp screening on grid/accAOfunc:  16
          CD Gaussian exp screening on grid/accCDfunc:  20
          XC Gaussian exp screening on grid/accXCfunc:  20
          Schwarz screening/accCoul:  1.00D-08

{    0,  123}:  On entry to PDLASRT parameter number    9 had an illegal value

The calculation did not crash but remained stuck until the process was killed by SLURM after reaching its time limit. I was using MPI with 127 processes. Further investigation showed that the error message seems to be connected to ScaLAPACK (https://github.com/Reference-ScaLAPACK/scalapack/issues/74).

I further tested the input script on my private workstation. Using mpirun -n (1/2/3/4/5), it proceeds to write out the "Superposition of Atomic Density Guess" block, which should come next.
This is very fast with one or two MPI processes but becomes extremely slow with 3 or more processes.
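For anyone trying to reproduce this, the slowdown can be quantified with a simple timing loop over process counts. This is a minimal sketch; the actual NWChem invocation is commented out as a placeholder, so uncomment it and adjust the binary and input paths for your own installation:

```shell
# Time the same input at increasing MPI process counts to pin down
# where the slowdown starts. The mpirun/nwchem line is a placeholder;
# uncomment and adjust paths for your setup.
for n in 1 2 3 4 5; do
  start=$(date +%s)
  # mpirun -n "$n" nwchem protein_fragments.nw > "run_${n}.log" 2>&1
  end=$(date +%s)
  echo "n=$n elapsed=$((end - start))s"
done
```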

Does anyone have an idea how I can avoid this error/problem? Can I choose not to use ScaLAPACK at runtime, or is it necessary to recompile NWChem using OpenBLAS instead of ScaLAPACK?

Thank you
Johannes
protein_fragments.nw
protein_fragments.log
protein_fragments.pdb

Edoardo Aprà

Jan 23, 2024, 5:10:15 PM
to NWChem Forum
Can you run containers (e.g. apptainer/singularity, docker, or shifter) on the cluster you are using?
I was able to start the SCF and get a few cycles using the apptainer/singularity image.
I would recommend using containers instead of the spack module installation that generated the input file you posted.
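For readers following this suggestion on a SLURM cluster, a container-based job script might look like the sketch below. The module name, image filename, and resource values are all assumptions for illustration; check the nwchem-singularity repository's README for the exact pull command and recommended usage.

```shell
#!/bin/bash
#SBATCH --job-name=protein_fragments
#SBATCH --ntasks=127
#SBATCH --time=24:00:00
# All names below are assumptions: adjust the module name, image file,
# and input path to your cluster and the nwchem-singularity README.
module load apptainer
srun apptainer exec nwchem-dev.ompi41x.ifx.sif nwchem protein_fragments.nw
```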

Johannes Stöckelmaier

Feb 11, 2024, 2:11:54 PM
to NWChem Forum
Dear Mr. Aprà!

Thank you for your answer.

I tried using Singularity containers. Interestingly, the input file linked in my initial post passed, but other similar yet numerically different jobs failed with the same error as mentioned above. This happened to me with a freshly created Singularity container (https://github.com/edoapra/nwchem-singularity/tree/master/nwchem-dev.ompi41x.ifx). I did not see the issue with ScaLAPACK disabled in a normal compilation, but then NWChem was much slower, so that is also not an option for me. I did not test it methodically, but from a gut feeling, the mentioned error happens more often when more processes are used.

Edoardo Aprà

Feb 11, 2024, 2:13:31 PM
to NWChem Forum
I am not 100% sure I understand how you are using NWChem. 
Could you please post the script you are using to run NWChem?
