Restarting REST2 calculation ERROR

julia...@rwth-aachen.de

Apr 15, 2020, 9:45:43 AM
to PLUMED users
Dear PLUMED community,

I am trying to restart my REST2 calculation (20 nodes, 480 cores, 16 replicas) with this command: srun gmx_mpi mdrun -plumed plumed.dat -s topol.tpr -multidir rep0 rep1 rep2 rep3 rep4 rep5 rep6 rep7 rep8 rep9 rep10 rep11 rep12 rep13 rep14 rep15 -replex 10000 -hrex -cpi state.cpt -append

and after some calculation steps the simulation crashes with the following error message:

simulation part is not equal for all subsystems
subsystem  0: 4    subsystem  1: 4    subsystem  2: 4    subsystem  3: 4
subsystem  4: 4    subsystem  5: 4    subsystem  6: 4    subsystem  7: 4
subsystem  8: 4    subsystem  9: 3    subsystem 10: 3    subsystem 11: 4
subsystem 12: 3    subsystem 13: 4    subsystem 14: 4    subsystem 15: 4

-------------------------------------------------------
Program:     gmx mdrun, version 2018.3
Source file: src/gromacs/mdlib/main.cpp (line 115)
MPI rank:    0 (out of 480)

Fatal error:
The 16 subsystems are not compatible
-------------------------------------------------------

(the same error block is repeated by MPI ranks 30, 60, 90, 120, 150, 180, 210, 240, 270, 300, 330, 390, 420 and 450)

Looking at the log files, all replicas stopped at the same simulation step. What might be the issue here? Thanks a lot for your help (I am using GROMACS 2018.3 patched with PLUMED).
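For reference, this is roughly how I listed the last checkpoint written by each replica (a rough sketch, assuming each replica directory contains an md.log with the usual "Writing checkpoint, step ..." lines):

    for d in rep{0..15}; do
        printf '%s: ' "$d"
        # last "Writing checkpoint, step N at ..." line in this replica's log
        grep "Writing checkpoint" "$d"/md.log | tail -n 1
    done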

Best,

Benjamin

Giovanni Bussi

Apr 15, 2020, 10:31:41 AM
to plumed...@googlegroups.com
Do all replicas simulate the same number of atoms?

Giovanni



Carlo Camilloni

Apr 15, 2020, 10:33:05 AM
to plumed...@googlegroups.com
There is an issue with the restart from the checkpoint files: it looks like the checkpoints for replicas 9, 10 and 12 are from a different simulation part (3 instead of 4) than the others.
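You can check this by dumping each checkpoint and comparing the recorded simulation part and step, for example with something like this (a rough sketch; gmx dump reads a checkpoint via -cp, though the exact field labels in its output may differ slightly):

    for d in rep{0..15}; do
        echo "== $d =="
        # print the simulation part and step recorded in this replica's checkpoint
        gmx dump -cp "$d"/state.cpt 2>/dev/null | grep -iE "simulation.?part|step"
    done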

Carlo


bejoo...@gmail.com

Apr 15, 2020, 3:36:17 PM
to PLUMED users
Dear both,

all replicas have the same number of atoms. I also talked to an expert in my group, and he thinks the issue occurs when the cluster forcibly terminates the simulation after one day, which is the maximum running time for jobs on our cluster. He suggested I use -maxh 23, so that all checkpoint files end up at the same time stamp. I will let you know if this fixes the issue.
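For completeness, the restart command would then look roughly like this (same command as above, only adding -maxh 23 so that mdrun should write a final checkpoint for all replicas and stop cleanly before the wall-clock limit):

    srun gmx_mpi mdrun -plumed plumed.dat -s topol.tpr \
         -multidir rep0 rep1 rep2 rep3 rep4 rep5 rep6 rep7 \
                   rep8 rep9 rep10 rep11 rep12 rep13 rep14 rep15 \
         -replex 10000 -hrex -cpi state.cpt -append -maxh 23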

Best,

Benjamin
