Dear Users:
When I run PLUMED hrex2-2.1 with GROMACS 4.6.7, the first job exits cleanly, but every continuation job ends with a segmentation fault.
In this test, I set the wall-clock limit on the node to 10 minutes and ran mdrun with -maxh 0.05 (i.e., 3 minutes). I see the same issue with a wall-clock limit of 1 day and -maxh 23.8, or with a wall-clock limit of 8 hours and -maxh 7.9.
I presume the difference lies in loading the .cpt files (perhaps the next checkpoint still needs to be written at shutdown?). However, for the quick test shown below I set the checkpoint interval to 1 minute (-cpt 1), so there should have been time to clean up.
Finally, note that standard GROMACS replica exchange (no PLUMED), with a 24 h wall-clock limit, -cpt 60, and -maxh 23.8, never produces this error, whether or not I load checkpoint files.
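To make the timing margins explicit, here is a minimal Python sketch of the arithmetic for the three wall-clock / -maxh combinations mentioned above. It uses only the nominal -maxh values; the function name margin_minutes is my own, purely for illustration.

```python
# Each pair is (scheduler wall-clock limit in hours, mdrun -maxh value in hours),
# taken from the three combinations described above.
cases = [
    (10 / 60, 0.05),  # 10-minute limit, -maxh 0.05
    (24.0, 23.8),     # 1-day limit, -maxh 23.8
    (8.0, 7.9),       # 8-hour limit, -maxh 7.9
]


def margin_minutes(limit_h, maxh):
    """Minutes between the nominal -maxh stop time and the scheduler limit."""
    return (limit_h - maxh) * 60.0


for limit_h, maxh in cases:
    print(f"-maxh {maxh}: stops after {maxh * 60:.1f} min, "
          f"{margin_minutes(limit_h, maxh):.1f} min before the limit")
```

In all three cases mdrun has several minutes (roughly 7, 12, and 6) between its nominal stop time and the scheduler limit, so the job is not being killed by the queue.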
Here is the end of the output from the first submission:
...
...
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps.
starting mdrun 'alanine dipeptide in water'
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps.
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps.
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps.
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps.
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps.
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps.
1000000000 steps, 2000000.0 ps.
Step 10930: Run time exceeded 0.050 hours, will terminate the run
Step 10930: Run time exceeded 0.050 hours, will terminate the run
Step 10930: Run time exceeded 0.050 hours, will terminate the run
Step 10930: Run time exceeded 0.050 hours, will terminate the run
Step 10930: Run time exceeded 0.050 hours, will terminate the run
Step 10930: Run time exceeded 0.050 hours, will terminate the run
Step 10930: Run time exceeded 0.050 hours, will terminate the run
Step 10930: Run time exceeded 0.050 hours, will terminate the run
gcq#175: "Pump Up the Volume Along With the Tempo" (Jazzy Jeff)
###############
However, all subsequent submissions end like this (tested on two separate clusters):
...
...
starting mdrun 'alanine dipeptide in water'
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps (continuing from step 85470, 170.9 ps).
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps (continuing from step 85470, 170.9 ps).
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps (continuing from step 85470, 170.9 ps).
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps (continuing from step 85470, 170.9 ps).
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps (continuing from step 85470, 170.9 ps).
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps (continuing from step 85470, 170.9 ps).
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps (continuing from step 85470, 170.9 ps).
1000000000 steps, 2000000.0 ps (continuing from step 85470, 170.9 ps).
Step 96510: Run time exceeded 0.050 hours, will terminate the run
Step 96510: Run time exceeded 0.050 hours, will terminate the run
Step 96510: Run time exceeded 0.050 hours, will terminate the run
Step 96510: Run time exceeded 0.050 hours, will terminate the run
Step 96510: Run time exceeded 0.050 hours, will terminate the run
Step 96510: Run time exceeded 0.050 hours, will terminate the run
Step 96510: Run time exceeded 0.050 hours, will terminate the run
Step 96510: Run time exceeded 0.050 hours, will terminate the run
[r104-n61:21329] *** Process received signal ***
[r104-n61:21329] Signal: Segmentation fault (11)
[r104-n61:21329] Signal code: (128)
[r104-n61:21329] Failing at address: (nil)
[r104-n61:21332] *** Process received signal ***
[r104-n61:21330] *** Process received signal ***
[r104-n61:21330] Signal: Segmentation fault (11)
[r104-n61:21330] Signal code: (128)
[r104-n61:21330] Failing at address: (nil)
[r104-n61:21332] Signal: Segmentation fault (11)
[r104-n61:21332] Signal code: (128)
[r104-n61:21332] Failing at address: (nil)
[r104-n61:21331] *** Process received signal ***
[r104-n61:21334] *** Process received signal ***
[r104-n61:21334] Signal: Segmentation fault (11)
[r104-n61:21334] Signal code: (128)
[r104-n61:21334] Failing at address: (nil)
[r104-n61:21333] *** Process received signal ***
[r104-n61:21333] Signal: Segmentation fault (11)
[r104-n61:21333] Signal code: (128)
[r104-n61:21333] Failing at address: (nil)
[r104-n61:21331] Signal: Segmentation fault (11)
[r104-n61:21331] Signal code: (128)
[r104-n61:21331] Failing at address: (nil)
[r104-n61:21336] *** Process received signal ***
[r104-n61:21336] Signal: Segmentation fault (11)
[r104-n61:21336] Signal code: (128)
[r104-n61:21336] Failing at address: (nil)
[r104-n61:21335] *** Process received signal ***
[r104-n61:21335] Signal: Segmentation fault (11)
[r104-n61:21335] Signal code: (128)
[r104-n61:21335] Failing at address: (nil)
[r104-n61:21332] [ 0] /lib64/libpthread.so.0(+0xf710) [0x2adda86de710]
[r104-n61:21332] [ 1] /lib64/libc.so.6(fwrite+0x4a) [0x2addaa57e99a]
[r104-n61:21332] [ 2] mdrun_mpi(_ZN4PLMD5OFile7llwriteEPKcm+0xfd) [0xc33dcd]
[r104-n61:21332] [ 3] mdrun_mpi(_ZN4PLMD5OFile6printfEPKcz+0x24c) [0xc33a3c]
[r104-n61:21332] [ 4] mdrun_mpi(_ZN4PLMD10PlumedMainD1Ev+0x61b) [0xb2ce2b]
[r104-n61:21332] [ 5] mdrun_mpi(_ZN4PLMD10PlumedMainD0Ev+0xa) [0xb2c7fa]
[r104-n61:21332] [ 6] mdrun_mpi(plumedmain_finalize+0x10) [0xb32380]
[r104-n61:21332] [ 7] mdrun_mpi(cmain+0x2619) [0x4b3089]
[r104-n61:21332] [ 8] mdrun_mpi(main+0x56) [0x4b9b66]
[r104-n61:21332] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd) [0x2addaa535d5d]
[r104-n61:21332] [10] mdrun_mpi() [0x496829]
[r104-n61:21332] *** End of error message ***
[r104-n61:21331] [ 0] /lib64/libpthread.so.0(+0xf710) [0x2b96daa0c710]
[r104-n61:21333] [ 0] /lib64/libpthread.so.0(+0xf710) [0x2abda8ccf710]
[r104-n61:21333] [ 1] /lib64/libc.so.6(fwrite+0x4a) [0x2abdaab6f99a]
[r104-n61:21333] [ 2] mdrun_mpi(_ZN4PLMD5OFile7llwriteEPKcm+0xfd) [0xc33dcd]
[r104-n61:21333] [ 3] mdrun_mpi(_ZN4PLMD5OFile6printfEPKcz+0x24c) [0xc33a3c]
[r104-n61:21333] [ 4] mdrun_mpi(_ZN4PLMD10PlumedMainD1Ev+0x61b) [0xb2ce2b]
[r104-n61:21333] [ 5] mdrun_mpi(_ZN4PLMD10PlumedMainD0Ev+0xa) [0xb2c7fa]
[r104-n61:21333] [ 6] mdrun_mpi(plumedmain_finalize+0x10) [0xb32380]
[r104-n61:21333] [ 7] mdrun_mpi(cmain+0x2619) [0x4b3089]
[r104-n61:21333] [ 8] mdrun_mpi(main+0x56) [0x4b9b66]
[r104-n61:21333] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd) [0x2abdaab26d5d]
[r104-n61:21333] [10] mdrun_mpi() [0x496829]
[r104-n61:21333] *** End of error message ***
[r104-n61:21336] [ 0] /lib64/libpthread.so.0(+0xf710) [0x2b3917522710]
[r104-n61:21336] [ 1] /lib64/libc.so.6(fwrite+0x4a) [0x2b39193c299a]
[r104-n61:21336] [ 2] mdrun_mpi(_ZN4PLMD5OFile7llwriteEPKcm+0xfd) [0xc33dcd]
[r104-n61:21336] [ 3] mdrun_mpi(_ZN4PLMD5OFile6printfEPKcz+0x24c) [0xc33a3c]
[r104-n61:21336] [ 4] mdrun_mpi(_ZN4PLMD10PlumedMainD1Ev+0x61b) [0xb2ce2b]
[r104-n61:21336] [ 5] mdrun_mpi(_ZN4PLMD10PlumedMainD0Ev+0xa) [0xb2c7fa]
[r104-n61:21336] [ 6] mdrun_mpi(plumedmain_finalize+0x10) [0xb32380]
[r104-n61:21336] [ 7] mdrun_mpi(cmain+0x2619) [0x4b3089]
[r104-n61:21336] [ 8] mdrun_mpi(main+0x56) [0x4b9b66]
[r104-n61:21336] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd) [0x2b3919379d5d]
[r104-n61:21336] [10] mdrun_mpi() [0x496829]
[r104-n61:21336] *** End of error message ***
[r104-n61:21334] [ 0] /lib64/libpthread.so.0(+0xf710) [0x2af6477e2710]
[r104-n61:21334] [ 1] /lib64/libc.so.6(fwrite+0x4a) [0x2af64968299a]
[r104-n61:21334] [ 2] mdrun_mpi(_ZN4PLMD5OFile7llwriteEPKcm+0xfd) [0xc33dcd]
[r104-n61:21334] [ 3] mdrun_mpi(_ZN4PLMD5OFile6printfEPKcz+0x24c) [0xc33a3c]
[r104-n61:21334] [ 4] mdrun_mpi(_ZN4PLMD10PlumedMainD1Ev+0x61b) [0xb2ce2b]
[r104-n61:21334] [ 5] mdrun_mpi(_ZN4PLMD10PlumedMainD0Ev+0xa) [0xb2c7fa]
[r104-n61:21334] [ 6] mdrun_mpi(plumedmain_finalize+0x10) [0xb32380]
[r104-n61:21334] [ 7] mdrun_mpi(cmain+0x2619) [0x4b3089]
[r104-n61:21334] [ 8] mdrun_mpi(main+0x56) [0x4b9b66]
[r104-n61:21334] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd) [0x2af649639d5d]
[r104-n61:21334] [10] mdrun_mpi() [0x496829]
[r104-n61:21334] *** End of error message ***
[r104-n61:21330] [ 0] /lib64/libpthread.so.0(+0xf710) [0x2b12ddf16710]
[r104-n61:21330] [ 1] /lib64/libc.so.6(fwrite+0x4a) [0x2b12dfdb699a]
[r104-n61:21330] [ 2] mdrun_mpi(_ZN4PLMD5OFile7llwriteEPKcm+0xfd) [0xc33dcd]
[r104-n61:21330] [ 3] mdrun_mpi(_ZN4PLMD5OFile6printfEPKcz+0x24c) [0xc33a3c]
[r104-n61:21330] [ 4] mdrun_mpi(_ZN4PLMD10PlumedMainD1Ev+0x61b) [0xb2ce2b]
[r104-n61:21330] [ 5] mdrun_mpi(_ZN4PLMD10PlumedMainD0Ev+0xa) [0xb2c7fa]
[r104-n61:21330] [ 6] mdrun_mpi(plumedmain_finalize+0x10) [0xb32380]
[r104-n61:21330] [ 7] mdrun_mpi(cmain+0x2619) [0x4b3089]
[r104-n61:21330] [ 8] mdrun_mpi(main+0x56) [0x4b9b66]
[r104-n61:21330] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd) [0x2b12dfd6dd5d]
[r104-n61:21330] [10] mdrun_mpi() [0x496829]
[r104-n61:21330] *** End of error message ***
[r104-n61:21329] [ 0] /lib64/libpthread.so.0(+0xf710) [0x2b5cbf9eb710]
[r104-n61:21329] [ 1] /lib64/libc.so.6(fwrite+0x4a) [0x2b5cc188b99a]
[r104-n61:21329] [ 2] mdrun_mpi(_ZN4PLMD5OFile7llwriteEPKcm+0xfd) [0xc33dcd]
[r104-n61:21329] [ 3] mdrun_mpi(_ZN4PLMD5OFile6printfEPKcz+0x24c) [0xc33a3c]
[r104-n61:21329] [ 4] mdrun_mpi(_ZN4PLMD10PlumedMainD1Ev+0x61b) [0xb2ce2b]
[r104-n61:21329] [ 5] mdrun_mpi(_ZN4PLMD10PlumedMainD0Ev+0xa) [0xb2c7fa]
[r104-n61:21329] [ 6] mdrun_mpi(plumedmain_finalize+0x10) [0xb32380]
[r104-n61:21329] [ 7] mdrun_mpi(cmain+0x2619) [0x4b3089]
[r104-n61:21329] [ 8] mdrun_mpi(main+0x56) [0x4b9b66]
[r104-n61:21329] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd) [0x2b5cc1842d5d]
[r104-n61:21329] [10] mdrun_mpi() [0x496829]
[r104-n61:21329] *** End of error message ***
[r104-n61:21335] [ 0] /lib64/libpthread.so.0(+0xf710) [0x2ba3cbe9c710]
[r104-n61:21335] [ 1] /lib64/libc.so.6(fwrite+0x4a) [0x2ba3cdd3c99a]
[r104-n61:21335] [ 2] mdrun_mpi(_ZN4PLMD5OFile7llwriteEPKcm+0xfd) [0xc33dcd]
[r104-n61:21335] [ 3] mdrun_mpi(_ZN4PLMD5OFile6printfEPKcz+0x24c) [0xc33a3c]
[r104-n61:21335] [ 4] mdrun_mpi(_ZN4PLMD10PlumedMainD1Ev+0x61b) [0xb2ce2b]
[r104-n61:21335] [ 5] mdrun_mpi(_ZN4PLMD10PlumedMainD0Ev+0xa) [0xb2c7fa]
[r104-n61:21335] [ 6] mdrun_mpi(plumedmain_finalize+0x10) [0xb32380]
[r104-n61:21335] [ 7] mdrun_mpi(cmain+0x2619) [0x4b3089]
[r104-n61:21335] [ 8] mdrun_mpi(main+0x56) [0x4b9b66]
[r104-n61:21335] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd) [0x2ba3cdcf3d5d]
[r104-n61:21335] [10] mdrun_mpi() [0x496829]
[r104-n61:21335] *** End of error message ***
[r104-n61:21331] [ 1] /lib64/libc.so.6(fwrite+0x4a) [0x2b96dc8ac99a]
[r104-n61:21331] [ 2] mdrun_mpi(_ZN4PLMD5OFile7llwriteEPKcm+0xfd) [0xc33dcd]
[r104-n61:21331] [ 3] mdrun_mpi(_ZN4PLMD5OFile6printfEPKcz+0x24c) [0xc33a3c]
[r104-n61:21331] [ 4] mdrun_mpi(_ZN4PLMD10PlumedMainD1Ev+0x61b) [0xb2ce2b]
[r104-n61:21331] [ 5] mdrun_mpi(_ZN4PLMD10PlumedMainD0Ev+0xa) [0xb2c7fa]
[r104-n61:21331] [ 6] mdrun_mpi(plumedmain_finalize+0x10) [0xb32380]
[r104-n61:21331] [ 7] mdrun_mpi(cmain+0x2619) [0x4b3089]
[r104-n61:21331] [ 8] mdrun_mpi(main+0x56) [0x4b9b66]
[r104-n61:21331] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd) [0x2b96dc863d5d]
[r104-n61:21331] [10] mdrun_mpi() [0x496829]
[r104-n61:21331] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 21329 on node r104-n61 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
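Because the backtraces from the eight ranks are interleaved, here is a minimal Python sketch that groups the frames by PID and prints readable C++ names. Only an excerpt of the trace above is embedded in LOG. The demangled names are written out by hand from the Itanium C++ ABI mangling rules (c++filt should give the same result), and FRAME, DEMANGLED, and stacks_by_pid are names I made up for this illustration.

```python
import re
from collections import defaultdict

# Excerpt of the trace above: frames 1-6 for two ranks, interleaved.
LOG = """\
[r104-n61:21332] [ 1] /lib64/libc.so.6(fwrite+0x4a) [0x2addaa57e99a]
[r104-n61:21333] [ 1] /lib64/libc.so.6(fwrite+0x4a) [0x2abdaab6f99a]
[r104-n61:21332] [ 2] mdrun_mpi(_ZN4PLMD5OFile7llwriteEPKcm+0xfd) [0xc33dcd]
[r104-n61:21332] [ 3] mdrun_mpi(_ZN4PLMD5OFile6printfEPKcz+0x24c) [0xc33a3c]
[r104-n61:21333] [ 2] mdrun_mpi(_ZN4PLMD5OFile7llwriteEPKcm+0xfd) [0xc33dcd]
[r104-n61:21333] [ 3] mdrun_mpi(_ZN4PLMD5OFile6printfEPKcz+0x24c) [0xc33a3c]
[r104-n61:21332] [ 4] mdrun_mpi(_ZN4PLMD10PlumedMainD1Ev+0x61b) [0xb2ce2b]
[r104-n61:21332] [ 5] mdrun_mpi(_ZN4PLMD10PlumedMainD0Ev+0xa) [0xb2c7fa]
[r104-n61:21333] [ 4] mdrun_mpi(_ZN4PLMD10PlumedMainD1Ev+0x61b) [0xb2ce2b]
[r104-n61:21333] [ 5] mdrun_mpi(_ZN4PLMD10PlumedMainD0Ev+0xa) [0xb2c7fa]
[r104-n61:21332] [ 6] mdrun_mpi(plumedmain_finalize+0x10) [0xb32380]
[r104-n61:21333] [ 6] mdrun_mpi(plumedmain_finalize+0x10) [0xb32380]
"""

# Hand-written demangling table (assumption: matches what c++filt prints).
DEMANGLED = {
    "_ZN4PLMD5OFile7llwriteEPKcm": "PLMD::OFile::llwrite(char const*, unsigned long)",
    "_ZN4PLMD5OFile6printfEPKcz": "PLMD::OFile::printf(char const*, ...)",
    "_ZN4PLMD10PlumedMainD1Ev": "PLMD::PlumedMain::~PlumedMain()",
    "_ZN4PLMD10PlumedMainD0Ev": "PLMD::PlumedMain::~PlumedMain()",
}

# Captures: PID, frame index, symbol name (empty if the frame has none).
FRAME = re.compile(
    r"\[[^:\]]+:(\d+)\] \[\s*(\d+)\] \S+?\(([^+)]*)(?:\+0x[0-9a-f]+)?\)"
)


def stacks_by_pid(log):
    """Return {pid: [symbol, ...]} with frames ordered innermost first."""
    frames = defaultdict(dict)
    for pid, idx, sym in FRAME.findall(log):
        frames[pid][int(idx)] = DEMANGLED.get(sym, sym)
    return {pid: [f[i] for i in sorted(f)] for pid, f in frames.items()}


for pid, stack in stacks_by_pid(LOG).items():
    print(pid, " <- ".join(stack))
```

Read this way, every rank fails along the same path: plumedmain_finalize invokes the PLMD::PlumedMain destructor, which calls PLMD::OFile::printf and then PLMD::OFile::llwrite, and the fault occurs inside fwrite.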