Dear PLUMED users,
We have been running into an issue while trying to perform PT-MetaD-WTE simulations.
The problem boils down to this:
The simulations run smoothly and the first SLURM job completes. When the next SLURM job picks up from where the previous one stopped, it _sometimes_ crashes with the following error message:
"Fatal error:
The 11 subsystems are not compatible"
We have pinpointed the issue to inconsistencies between the checkpoint files GROMACS writes to disk: some replicas end up a few steps behind the others. This is one example:
rep00/prod.cpt.txt:step = 867786425
rep01/prod.cpt.txt:step = 867786425
rep02/prod.cpt.txt:step = 867786401
rep03/prod.cpt.txt:step = 867786401
rep04/prod.cpt.txt:step = 867786425
rep05/prod.cpt.txt:step = 867786425
rep06/prod.cpt.txt:step = 867786425
rep07/prod.cpt.txt:step = 867786401
rep08/prod.cpt.txt:step = 867786401
rep09/prod.cpt.txt:step = 867786425
rep10/prod.cpt.txt:step = 867786425
It is not always the same replicas that fall behind. Moreover, whenever the problem arises, the same number of steps is also missing from the "prev" checkpoint file GROMACS writes to disk.
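For reference, this is roughly how we compare the step counters across replicas before restarting. It is only a sketch: the function name is ours, and it assumes the text dumps `rep*/prod.cpt.txt` (as in the listing above) have been produced from the binary checkpoints, e.g. with `gmx dump -cp`.

```shell
# check_cpt_steps: verify that all replica checkpoints sit at the same MD step.
# Returns 0 if the "step = ..." counters agree across all rep*/prod.cpt.txt
# dumps, 1 (with a listing of the offending counters) otherwise.
check_cpt_steps() {
    local nsteps
    # Count the number of distinct "step = ..." lines across all replicas
    nsteps=$(grep -h 'step = ' rep*/prod.cpt.txt 2>/dev/null | sort -u | wc -l)
    if [ "$nsteps" -eq 1 ]; then
        echo "checkpoints consistent"
    else
        echo "checkpoint step mismatch across replicas:" >&2
        grep 'step = ' rep*/prod.cpt.txt >&2
        return 1
    fi
}
```

Running this before submitting the continuation job at least tells us in advance whether the restart will hit the "subsystems are not compatible" error.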
We have searched the mailing list for previous mentions of this issue and found a few hits; the suggestion there was to use the GROMACS flag `-maxh` so that the job terminates gracefully before the cluster's scheduler kills it abruptly. This hasn't helped in our case.
This issue is present in 3 systems we are simulating, across two different clusters and with two different versions of GROMACS/PLUMED (gromacs 2021.4 + plumed 2.7.2 and gromacs 2021.5 + plumed 2.8.0).
Importantly, we have simulated the same systems in the past with gromacs 2019.1 + plumed 2.5.2 without issues.
Has anyone experienced this issue, and if yes how did you solve it?
In case it can be of some help in debugging this, our job submission files look like this:
#!/bin/bash
#SBATCH --job-name XXXXX_PT-METAD-WTE
#SBATCH --ntasks=220
#SBATCH --nodes=11
#SBATCH --ntasks-per-node=20
#SBATCH --cpus-per-task=1
#SBATCH --time=48:00:00
#SBATCH --account=XXXXXXX
#SBATCH --partition=compute
# load gromacs module and dependencies
module purge
module load gnu/8
module load intel/18
module load intelmpi/2018
module load openblas/0.3.6/gnu
module load fftw/3.3.9
module load python/3.7.6
module load gromacs/2021.4-plumed-2.7.2
export I_MPI_FABRICS=shm:dapl
srun gmx_mpi mdrun -plumed plumed.dat \
    -multidir rep00 rep01 rep02 rep03 rep04 rep05 rep06 rep07 rep08 rep09 rep10 \
    -replex 100 -deffnm prod -pin on -maxh 47.5
Thank you,
Panagiotis Koukos