Error in parallel REST calculations on multiple nodes using PLUMED2.4.2 and Gromacs 2018.1

xianch...@gmail.com

Oct 10, 2018, 12:09:24 PM
to PLUMED users

Hello PLUMED community,

 

I have run into some problems when using PLUMED 2.4.2 in combination with GROMACS 2018.1 for multi-node parallel REST calculations on a membrane–protein system.

(All atoms of the protein are treated as “hot” solute atoms, while the lipids, ions and water molecules are treated as solvent. Our computing resource provides 32 CPUs per node, and we use two replicas for this test.)
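
For reference, this is roughly how we prepare the scaled topology for one replica, following the PLUMED partial_tempering recipe (the file names and the scale value below are only placeholders):

# 1) write out a pre-processed topology with all force-field parameters expanded
gmx_mpi grompp -f grompp.mdp -c conf.gro -p topol.top -pp processed.top
# 2) mark the protein ("hot") atoms by appending "_" to their atom type
#    in the [ atoms ] section of processed.top
# 3) scale the hot-atom interactions; scale = T_ref / T_hot for this replica
plumed partial_tempering 0.95 < processed.top > scaled.top
# 4) build the replica tpr from the scaled topology
gmx_mpi grompp -maxwarn 1 -f grompp.mdp -c conf.gro -p scaled.top -o topol0.tpr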

 

1. When we try to use two nodes for the REST calculation, errors occur. Below is the submission script we used:

#!/bin/bash
#BSUB -n 64
#BSUB -J 295K-insert
#BSUB -q privateq-zw
#BSUB -R "span[ptile=32]"
#BSUB -o %J.out
#BSUB -e %J.err
nrep=2
mpirun -np 64 gmx_mpi mdrun -v -plumed plumed.dat -multi $nrep -replex 100 -nsteps 50000 -hrex -s topol.tpr -reseed 175320


The error message is as follows:

starting mdrun 'DMPC and protein'
50000 steps,    100.0 ps.
starting mdrun 'DMPC and protein'
50000 steps,    100.0 ps.
step 0 imb F 25% pme/F 0.39 imb F 12% pme/F 0.44 step 100, will finish Wed Oct 10 11:08:17 2018
imb F 23% pme/F 0.37
step 200 Turning on dynamic load balancing, because the performance loss due to load imbalance is 8.1 %.
imb F 26% pme/F 0.36
step 200 Turning on dynamic load balancing, because the performance loss due to load imbalance is 10.2 %.
[32:c01n05] unexpected disconnect completion event from [31:c02n06]
Fatal error in MPI_Allreduce: Internal MPI error!, error stack:
MPI_Allreduce(1628)......: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fff0d1fcd7c, count=3, MPI_FLOAT, MPI_SUM, comm=0x84000007) failed
MPIR_Allreduce_impl(1469): fail failed
MPIR_Allreduce_intra(954): fail failed
MPIC_Sendrecv(581).......: fail failed
MPIC_Wait(270)...........: fail failed
PMPIDI_CH3I_Progress(850): fail failed
(unknown)(): Internal MPI error!
[59:c01n05] unexpected disconnect completion event from [27:c02n06]

 

 

2. For testing, we also ran the REST calculation on a single node, and it completed normally. Below is the submission script we used:

#!/bin/bash
#BSUB -n 32
#BSUB -J 295K-insert
#BSUB -q privateq-zw
#BSUB -R "span[ptile=32]"
#BSUB -o %J.out
#BSUB -e %J.err
nrep=2
mpirun -np 32 gmx_mpi mdrun -v -plumed plumed.dat -multi $nrep -replex 100 -nsteps 50000 -hrex -s topol.tpr -reseed 175320


3. We also performed a multi-node parallel REMD calculation, and it completed normally. Below is the submission script we used:

#!/bin/bash
#BSUB -n 64
#BSUB -J 295K-insert
#BSUB -q privateq-zw
#BSUB -R "span[ptile=32]"
#BSUB -o %J.out
#BSUB -e %J.err
nrep=2
mpirun -np 64 gmx_mpi mdrun -v -plumed plumed.dat -multi $nrep -replex 100 -nsteps 50000 -s remd.tpr -reseed 175320

 

 

Thanks in advance for any assistance!

 

xian

Giovanni Bussi

Oct 12, 2018, 11:29:40 AM
to plumed...@googlegroups.com
Can you try a larger replex stride?

It should be a multiple of the neighbor-list update stride.
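
For example, if nstlist is 10 in your .mdp file, something like this keeps the exchange attempts aligned with the neighbor-list updates (same command you used, just with a larger stride):

mpirun -np 64 gmx_mpi mdrun -v -plumed plumed.dat -multi $nrep -nstlist 10 -replex 1000 -hrex -s topol.tpr -reseed 175320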

Giovanni



xianch...@gmail.com

Oct 14, 2018, 12:27:56 PM
to PLUMED users

Dear Giovanni Bussi,

Thanks for your response and suggestion.

We took your advice and set “nstlist” to 10 and “replex” to 1000. REST indeed started running, but it broke down after ~320 ps with the same error as before:

vol 0.84  imb F  2% pme/F 0.57 vol 0.83  imb F  1% pme/F 0.52 [32:c02n02] unexpected disconnect completion event from [31:c02n05]
Fatal error in MPI_Allreduce: Internal MPI error!, error stack:
MPI_Allreduce(1628)......: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fff714661fc, count=3, MPI_FLOAT, MPI_SUM, comm=0x84000007) failed
MPIR_Allreduce_impl(1469): fail failed
MPIR_Allreduce_intra(954): fail failed
MPIC_Sendrecv(581).......: fail failed
MPIC_Wait(270)...........: fail failed
PMPIDI_CH3I_Progress(850): fail failed
(unknown)(): Internal MPI error!

The submission script we use is as follows:

gmx_mpi mdrun -v -plumed plumed.dat -multi 2 -nstlist 10 -replex 1000 -hrex -s topol.tpr -reseed 175320

 

We then increased “replex” to 2000, keeping “nstlist” = 10. The following is the command:

gmx_mpi mdrun -v -plumed plumed.dat -multi 2 -nstlist 10 -replex 2000 -hrex -s topol.tpr -reseed 175320

This time the REST calculation ran longer, but after 2.4 ns it broke down again with the same error:

vol 0.81  imb F  1% pme/F 0.45 step 1208100, will finish Sun Oct 21 10:45:24 2018
vol 0.81  imb F  2% pme/F 0.45 [12:c04n03] unexpected disconnect completion event from [44:c01n05]
Fatal error in PMPI_Bcast: Invalid buffer pointer, error stack:
PMPI_Bcast(2667).........: MPI_Bcast(buf=0x7ffff3ac597c, count=12, MPI_BYTE, root=0, comm=0x84000006) failed
MPIR_Bcast_impl(1804)....: fail failed
MPIR_Bcast(1832).........: fail failed
I_MPIR_Bcast_intra(2056).: Failure during collective
MPIR_Bcast_intra(1670)...: Failure during collective
MPIR_Bcast_intra(1638)...: fail failed
MPIR_Bcast_knomial(2274).: fail failed
MPIC_Recv(419)...........: fail failed
MPIC_Wait(270)...........: fail failed
PMPIDI_CH3I_Progress(850): fail failed
(unknown)(): Internal MPI error!

 

Then we set “nstlist” to 1 and “replex” to 2000. The following is the command:

gmx_mpi mdrun -v -plumed plumed.dat -multi $nrep -nstlist 1 -replex 1000 -hrex -s topol.tpr -reseed 175320

This time the whole 20 ns test run of REST finished normally without breaking down. This setting, however, reduces the computational efficiency significantly. We can’t figure out why the runs break down when “nstlist” > 1, or how to deal with it. Could this be due to some inappropriate settings in our .mdp file (which we attach here as well)?

Thank you in advance for the precious comments and help.


Xian


On Friday, October 12, 2018 at 11:29:40 PM UTC+8, Giovanni Bussi wrote:
grompp.mdp

Giovanni Bussi

Oct 14, 2018, 12:57:31 PM
to plumed...@googlegroups.com
Hi,

there might be some problem with hrex + gromacs 2018 that I am not aware of (I never used gmx 2018).

Maybe you can try NVE (constant volume), or perhaps someone else has experienced this problem.

Giovanni

xianch...@gmail.com

Oct 14, 2018, 9:57:58 PM
to PLUMED users
Dear Professor Bussi:

Thank you for the prompt reply. We would like to go back and try an earlier version of the code. What combination of PLUMED + GROMACS would you recommend we use?

xian

On Monday, October 15, 2018 at 12:57:31 AM UTC+8, Giovanni Bussi wrote:

Giovanni Bussi

Oct 15, 2018, 12:29:24 PM
to plumed...@googlegroups.com
Hi,

in my lab I am still using gmx 5 due to compiler problems with GPUs. However, I have tried gmx 2016 with -hrex without any problem. I never tried gmx 2018, so I would say gmx 2016 should be fine. If you confirm that your system works with gmx 2016 and not with 2018, it would be great if you could send us (even privately) your tpr files so that we can fix gromacs 2018.

In addition, I have to say that I always tend to use NVT (constant volume; sorry, I wrote NVE by mistake).
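
In practice this just means disabling the barostat in the .mdp used for the replicas, e.g. something along these lines (adjust to how the option is written in your file):

grep -i pcoupl grompp.mdp                      # see the current pressure-coupling setting
sed -i 's/^pcoupl .*/pcoupl = no/' grompp.mdp  # switch the run to constant volume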

Giovanni



xianch...@gmail.com

Nov 10, 2018, 10:14:57 AM
to PLUMED users
Dear Professor Bussi:

Thank you very much for your reply, and sorry for taking so long to get back to you. After testing, we found that for the NPT ensemble the multi-node parallel REST calculation runs normally with PLUMED 2.4.2 and gmx 2016.5. However, our previous multi-node parallel REST calculations used PLUMED 2.4.2 and gmx 2018.1, which caused unexplained interruptions during the run.

I have also sent you the tpr files we used in the calculation as attachments. Thanks.

Xian

On Tuesday, October 16, 2018 at 12:29:24 AM UTC+8, Giovanni Bussi wrote:
topol3.tpr
topol0.tpr
topol1.tpr
topol2.tpr

Giovanni Bussi

Nov 12, 2018, 5:52:27 AM
to plumed...@googlegroups.com
Hi,

thanks for your reply. I guess there is a bug in gmx 2018 + hrex.

I have just opened a new issue and will check it when I have time (https://github.com/plumed/plumed2/issues/410).

Giovanni


xianch...@gmail.com

Dec 18, 2018, 10:16:57 AM
to PLUMED users
Dear Professor Bussi:

I am using Gromacs 2016.5 with Plumed 2.4.2 for an hrex calculation. After a restart I found some errors:

vol 0.93  imb F  1% vol 0.97  imb F  1% vol 0.98  imb F  0% vol 0.96  imb F  0% vol 0.96  imb F  0% imb F  2% vol 0.98  imb F  2% imb F  7% imb F  2% vol 0.98  imb F  1% imb F  5% imb F  1% vol 0.94  imb F  1% imb F 13% vol 0.91  imb F  2% imb F  9% vol 0.99  imb F  1% vol 0.97  imb F  0% vol 0.98  imb F  3% vol 0.98  imb F  2% imb F  1% vol 0.96  imb F  1% [108:c02n03] unexpected disconnect completion event from [44:c01n05]
Fatal error in PMPI_Bcast: Invalid buffer pointer, error stack:
PMPI_Bcast(2667).........: MPI_Bcast(buf=0x7fffee3aecfc, count=12, MPI_BYTE, root=0, comm=0x84000002) failed
MPIR_Bcast_impl(1804)....: fail failed
MPIR_Bcast(1832).........: fail failed
I_MPIR_Bcast_intra(2056).: Failure during collective
MPIR_Bcast_intra(1670)...: Failure during collective
MPIR_Bcast_intra(1638)...: fail failed
MPIR_Bcast_knomial(2274).: fail failed
MPIC_Recv(419)...........: fail failed
MPIC_Wait(270)...........: fail failed
PMPIDI_CH3I_Progress(850): fail failed
(unknown)(): Internal MPI error!


In my calculation, I start the NPT simulation with:
mpirun -np 192  gmx_mpi mdrun -v -plumed plumed.dat -multi 24 -nstlist 10 -replex 1000 -nsteps 60000000 -hrex -s topol.tpr

The plumed.dat file is empty.

And then, I restart with:
mpirun -np 192  gmx_mpi mdrun -v -plumed plumed.dat -multi 24 -nstlist 10 -replex 1000 -nsteps 60000000 -hrex -s topol.tpr -cpi state.cpt -append

We can’t figure out why the run breaks down when restarting, or how to deal with it.


Thank you in advance for the precious comments and help.


Xian

    

On Monday, November 12, 2018 at 6:52:27 PM UTC+8, Giovanni Bussi wrote:

Giovanni Bussi

Dec 18, 2018, 10:52:08 AM
to plumed...@googlegroups.com
Hi,

Can you check if the restart files are in sync (same timestep)?
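
For example, something along these lines should show the step and time stored in each replica checkpoint (assuming they are named state0.cpt, state1.cpt, ... as produced with -multi):

for cpt in state*.cpt; do
  echo "== $cpt =="
  gmx_mpi dump -cp "$cpt" | head -n 40   # the header reports the step and time at which the run stopped
done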

In case they are, there is clearly a bug somewhere. Can you confirm that:
1. This happens reproducibly every time you restart
2. This does NOT happen if you use 1 MPI process per replica (I am guessing this from your previous posts)

Thanks

Giovanni


davtya...@gmail.com

Jan 11, 2019, 1:53:34 PM
to PLUMED users
Hello everyone,

I am encountering a different but possibly related problem when trying to run REST simulations of a small peptide (15 residues) using Plumed-2.5.0 + Gromacs/2018.4 and the CHARMM36m force field.

I will attach the scripts that I use to set up the system. It consists of a 15-residue peptide, water and ions. First I do minimization, NVT and NPT equilibration, plus one NPT equilibration where the protein is allowed to move. Then, using the resulting final configuration, I generate the modified topology files for 10 REST replicas.
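
Roughly, the loop that builds the 10 scaled topologies looks like this (I am assuming processed_marked.top is the topology with the hot atoms marked; the effective temperature range and npt.gro are placeholders, and the real steps are in the attached scripts):

nrep=10
tmin=300     # effective temperature of the unscaled replica (K), placeholder
tmax=500     # effective temperature of the hottest replica (K), placeholder
for ((i=0; i<nrep; i++)); do
  # geometric ladder: lambda_i = tmin / t_i with t_i = tmin*(tmax/tmin)^(i/(nrep-1))
  lambda=$(awk -v n=$nrep -v i=$i -v tmin=$tmin -v tmax=$tmax \
           'BEGIN{t=tmin*exp(i/(n-1)*log(tmax/tmin)); print tmin/t}')
  plumed partial_tempering $lambda < processed_marked.top > topol_rest$i.top
  gmx_mpi grompp -maxwarn 1 -f md.mdp -c npt.gro -p topol_rest$i.top -o topol$i.tpr
done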

If I run a regular MD simulation with any of the modified topology files, everything seems to run normally. However, when I try to run a test REST simulation (e.g. using the command below), it only proceeds up to the first exchange attempt.

mpirun -np 120 gmx_mpi mdrun -v -plumed plumed.dat -multi 10 -replex 500 -hrex -nsteps 500000

note: (the plumed.dat file is empty)

At the first exchange attempt, multiple errors of this kind occur and the simulation crashes:

step 500: One or more water molecules can not be settled.
Check for bad contacts and/or reduce the timestep if appropriate.

step 500: One or more water molecules can not be settled.
Check for bad contacts and/or reduce the timestep if appropriate.

Step 500, time 1 (ps)  LINCS WARNING in simulation 3
relative constraint deviation after LINCS:
rms 0.405994, max 0.405994 (between atoms 205 and 206)
bonds that rotated more than 30 degrees:
 atom 1 atom 2  angle  previous, current, constraint length
    205    206   90.0    0.1111   0.1562      0.1111


I am attaching a full log file just in case.

Has anyone encountered this problem, and is it related to the particular Plumed + Gromacs combination that I am using?

Thank you in advance,

Aram
_cmd.sh
cmap_fix_sc.sh
md.mdp
slurm-4741791.out
mark_hot_atoms.sh
processed_marked.top
processed.top