Hello, PLUMED community,
I have run into some problems using PLUMED 2.4.2 in combination with GROMACS 2018.1 for multi-node parallel REST calculations on a membrane–protein system.
(All atoms of the protein are treated as “hot” solute atoms, while the lipids, ions, and water molecules are treated as solvent. Our computing resource has 32 CPUs per node, and we chose two replicas for testing.)
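For context, our solute scaling follows the usual REST2 recipe, where each replica's solute interactions are scaled by λ_i = T_0/T_i. A minimal shell sketch of how the scaling factors can be generated is below; the temperature ladder and replica count are illustrative assumptions, not our production values, and the (commented-out) call to PLUMED's `partial_tempering` tool shows how each scaled topology would then be produced:

```shell
# Sketch: geometric effective-temperature ladder and the corresponding
# REST2 scaling factors lambda_i = tmin / T_i.
# tmin, tmax, and nrep are illustrative assumptions.
tmin=295
tmax=600
nrep=2
for i in $(seq 0 $((nrep - 1))); do
  lambda=$(awk -v tmin="$tmin" -v tmax="$tmax" -v i="$i" -v n="$nrep" \
    'BEGIN { t = tmin * exp(i / (n - 1) * log(tmax / tmin)); printf "%.6f", tmin / t }')
  echo "replica $i: lambda = $lambda"
  # The scaled topology for each replica would then be generated with
  # PLUMED's partial_tempering tool, e.g.:
  #   plumed partial_tempering "$lambda" < processed.top > topol${i}.top
done
```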
1. When we try to run the REST calculation on two nodes, errors occur. Below is the submission script we used:
#!/bin/bash
#BSUB -n 64
#BSUB -J 295K-insert
#BSUB -q privateq-zw
#BSUB -R "span[ptile=32]"
#BSUB -o %J.out
#BSUB -e %J.err
nrep=2
mpirun -np 64 gmx_mpi mdrun -v -plumed plumed.dat -multi $nrep -replex 100 -nsteps 50000 -hrex -s topol.tpr -reseed 175320
The error message is as follows:
starting mdrun 'DMPC and protein'
50000 steps, 100.0 ps.
starting mdrun 'DMPC and protein'
50000 steps, 100.0 ps.
step 0 imb F 25% pme/F 0.39 imb F 12% pme/F 0.44 step 100, will finish Wed Oct 10 11:08:17 2018
imb F 23% pme/F 0.37
step 200 Turning on dynamic load balancing, because the performance loss due to load imbalance is 8.1 %.
imb F 26% pme/F 0.36
step 200 Turning on dynamic load balancing, because the performance loss due to load imbalance is 10.2 %.
[32:c01n05] unexpected disconnect completion event from [31:c02n06]
Fatal error in MPI_Allreduce: Internal MPI error!, error stack:
MPI_Allreduce(1628)......:
MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fff0d1fcd7c,
count=3, MPI_FLOAT, MPI_SUM, comm=0x84000007) failed
MPIR_Allreduce_impl(1469): fail failed
MPIR_Allreduce_intra(954): fail failed
MPIC_Sendrecv(581).......: fail failed
MPIC_Wait(270)...........: fail failed
PMPIDI_CH3I_Progress(850): fail failed
(unknown)(): Internal MPI error!
[59:c01n05] unexpected disconnect completion event from [27:c02n06]
2. For testing, we also ran the REST calculation on a single node, and it completed normally. Below is the submission script we used:
#!/bin/bash
#BSUB -n 32
#BSUB -J 295K-insert
#BSUB -q privateq-zw
#BSUB -R "span[ptile=32]"
#BSUB -o %J.out
#BSUB -e %J.err
nrep=2
mpirun -np 32 gmx_mpi mdrun -v -plumed plumed.dat -multi $nrep -replex 100 -nsteps 50000 -hrex -s topol.tpr -reseed 175320
3. We also performed a multi-node parallel REMD calculation, which completed normally. Below is the submission script we used:
#!/bin/bash
#BSUB -n 64
#BSUB -J 295K-insert
#BSUB -q privateq-zw
#BSUB -R "span[ptile=32]"
#BSUB -o %J.out
#BSUB -e %J.err
nrep=2
mpirun -np 64 gmx_mpi mdrun -v -plumed plumed.dat -multi $nrep -replex 100 -nsteps 50000 -s remd.tpr -reseed 175320
Thanks in advance for any assistance!
xian
--
You received this message because you are subscribed to the Google Groups "PLUMED users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to plumed-users...@googlegroups.com.
To post to this group, send email to plumed...@googlegroups.com.
Visit this group at https://groups.google.com/group/plumed-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/plumed-users/1c767a62-caeb-413c-b988-55124f18772d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Dear Giovanni Bussi,
Thanks for your response and suggestion.
We took your advice and set “nstlist” to 10 and “replex” to 1000. REST indeed started running, but it broke down after ~320 ps with the same error as before:
vol 0.84 imb F 2% pme/F 0.57 vol 0.83 imb F 1% pme/F 0.52 [32:c02n02] unexpected disconnect completion event from [31:c02n05]
Fatal error in MPI_Allreduce: Internal MPI error!, error stack:
MPI_Allreduce(1628)......: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fff714661fc, count=3, MPI_FLOAT, MPI_SUM, comm=0x84000007) failed
MPIR_Allreduce_impl(1469): fail failed
MPIR_Allreduce_intra(954): fail failed
MPIC_Sendrecv(581).......: fail failed
MPIC_Wait(270)...........: fail failed
PMPIDI_CH3I_Progress(850): fail failed
(unknown)(): Internal MPI error!
The mdrun command we used is as follows:
gmx_mpi mdrun -v -plumed plumed.dat -multi 2 -nstlist 10 -replex 1000 -hrex -s topol.tpr -reseed 175320
We then increased “replex” to 2000, keeping “nstlist” = 10. The command was:
gmx_mpi mdrun -v -plumed plumed.dat -multi 2 -nstlist 10 -replex 2000 -hrex -s topol.tpr -reseed 175320
The REST calculation then ran longer, but after 2.4 ns it broke down again, this time with a related MPI error:
vol 0.81 imb F 1% pme/F 0.45 step 1208100, will finish Sun Oct 21 10:45:24 2018
vol 0.81 imb F 2% pme/F 0.45 [12:c04n03] unexpected disconnect completion event from [44:c01n05]
Fatal error in PMPI_Bcast: Invalid buffer pointer, error stack:
PMPI_Bcast(2667).........: MPI_Bcast(buf=0x7ffff3ac597c, count=12, MPI_BYTE, root=0, comm=0x84000006) failed
MPIR_Bcast_impl(1804)....: fail failed
MPIR_Bcast(1832).........: fail failed
I_MPIR_Bcast_intra(2056).: Failure during collective
MPIR_Bcast_intra(1670)...: Failure during collective
MPIR_Bcast_intra(1638)...: fail failed
MPIR_Bcast_knomial(2274).: fail failed
MPIC_Recv(419)...........: fail failed
MPIC_Wait(270)...........: fail failed
PMPIDI_CH3I_Progress(850): fail failed
(unknown)(): Internal MPI error!
We then set “nstlist” to 1 and “replex” to 2000. The command was:
gmx_mpi mdrun -v -plumed plumed.dat -multi $nrep -nstlist 1 -replex 1000 -hrex -s topol.tpr -reseed 175320
This time the whole 20 ns REST test run finished normally, with no break-down. This setting, however, reduces the computational efficiency significantly. We cannot figure out why the break-downs occur when “nstlist” > 1, or how to deal with them. Could this be due to some inappropriate settings in our .mdp file (we attach it here as well)?
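For what it is worth, the .mdp parameters that (in our understanding) interact with the exchange interval are sketched below. The values are illustrative, taken from GROMACS defaults rather than from our production input: with the Verlet scheme, mdrun may retune nstlist at startup unless it is pinned, and GROMACS requires the -replex interval to be a multiple of nstcalcenergy.

```
; sketch of .mdp settings that interact with -replex / -hrex
; (illustrative values; check against your own .mdp)
cutoff-scheme            = Verlet
nstlist                  = 10      ; pinned so mdrun does not retune it
nstcalcenergy            = 100     ; -replex must be a multiple of this
verlet-buffer-tolerance  = 0.005   ; GROMACS default (kJ/mol/ps)
```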
Thank you in advance for your comments and help.
Xian
We also cannot figure out why break-downs occur when performing a restart calculation, or how to deal with them.
Thank you in advance for your comments and help.
Xian
mpirun -np 120 gmx_mpi mdrun -v -plumed plumed.dat -multi 10 -replex 500 -hrex -nsteps 500000
step 500: One or more water molecules can not be settled.
Check for bad contacts and/or reduce the timestep if appropriate.
step 500: One or more water molecules can not be settled.
Check for bad contacts and/or reduce the timestep if appropriate.
Step 500, time 1 (ps) LINCS WARNING in simulation 3
relative constraint deviation after LINCS:
rms 0.405994, max 0.405994 (between atoms 205 and 206)
bonds that rotated more than 30 degrees:
atom 1 atom 2 angle previous, current, constraint length
205 206 90.0 0.1111 0.1562 0.1111