plumed hrex2-2.1 with gromacs 4.6.7 doesn't exit cleanly when loading .cpt files


Chris Neale

Jan 14, 2015, 1:25:58 PM
to plumed...@googlegroups.com
Dear Users:

When I run plumed hrex2-2.1 with gromacs 4.6.7, it exits cleanly on the first job, but all continuation jobs hit an error at the end.

In this test, I set the maximum wallclock time on the node to 10 minutes and ran mdrun with -maxh 0.05 (i.e., 3 minutes). Note that I see the same issue with a wallclock limit of 1 day and -maxh 23.8, or, alternatively, with a wallclock limit of 8 hours and -maxh 7.9.

I presume the difference lies in the loading of the .cpt files (perhaps some issue with the next checkpoint still needing to be written? Although for the quick test shown below I set the checkpoint interval to 1 minute, i.e. -cpt 1, so there should have been time to clean up).
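
For reference, a continuation submission in this test is launched with something like the following (just a sketch: the replica count, exchange interval, and file names here are placeholders rather than my exact values):

mpirun -np 8 mdrun_mpi -deffnm md -multi 8 -replex 500 -hrex -plumed plumed.dat -cpi md.cpt -maxh 0.05 -cpt 1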

Finally, note that with standard gromacs replica exchange (no plumed), a walltime limit of 24 h, -cpt 60, and -maxh 23.8, there is never any such error, regardless of whether or not I load checkpoint files.

Here is the end of the output from the first submission:

...
...
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps.
starting mdrun 'alanine dipeptide in water'
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps.
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps.
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps.
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps.
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps.
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps.
1000000000 steps, 2000000.0 ps.

Step 10930: Run time exceeded 0.050 hours, will terminate the run

Step 10930: Run time exceeded 0.050 hours, will terminate the run

Step 10930: Run time exceeded 0.050 hours, will terminate the run

Step 10930: Run time exceeded 0.050 hours, will terminate the run

Step 10930: Run time exceeded 0.050 hours, will terminate the run

Step 10930: Run time exceeded 0.050 hours, will terminate the run

Step 10930: Run time exceeded 0.050 hours, will terminate the run

Step 10930: Run time exceeded 0.050 hours, will terminate the run

gcq#175: "Pump Up the Volume Along With the Tempo" (Jazzy Jeff)


###############


However, all subsequent submissions end like this (tested on two separate clusters):

...
...
starting mdrun 'alanine dipeptide in water'
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps (continuing from step 85470,    170.9 ps).
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps (continuing from step 85470,    170.9 ps).
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps (continuing from step 85470,    170.9 ps).
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps (continuing from step 85470,    170.9 ps).
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps (continuing from step 85470,    170.9 ps).
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps (continuing from step 85470,    170.9 ps).
starting mdrun 'alanine dipeptide in water'
1000000000 steps, 2000000.0 ps (continuing from step 85470,    170.9 ps).
1000000000 steps, 2000000.0 ps (continuing from step 85470,    170.9 ps).

Step 96510: Run time exceeded 0.050 hours, will terminate the run

Step 96510: Run time exceeded 0.050 hours, will terminate the run

Step 96510: Run time exceeded 0.050 hours, will terminate the run

Step 96510: Run time exceeded 0.050 hours, will terminate the run

Step 96510: Run time exceeded 0.050 hours, will terminate the run

Step 96510: Run time exceeded 0.050 hours, will terminate the run

Step 96510: Run time exceeded 0.050 hours, will terminate the run

Step 96510: Run time exceeded 0.050 hours, will terminate the run
[r104-n61:21329] *** Process received signal ***
[r104-n61:21329] Signal: Segmentation fault (11)
[r104-n61:21329] Signal code:  (128)
[r104-n61:21329] Failing at address: (nil)
[r104-n61:21332] *** Process received signal ***
[r104-n61:21330] *** Process received signal ***
[r104-n61:21330] Signal: Segmentation fault (11)
[r104-n61:21330] Signal code:  (128)
[r104-n61:21330] Failing at address: (nil)
[r104-n61:21332] Signal: Segmentation fault (11)
[r104-n61:21332] Signal code:  (128)
[r104-n61:21332] Failing at address: (nil)
[r104-n61:21331] *** Process received signal ***
[r104-n61:21334] *** Process received signal ***
[r104-n61:21334] Signal: Segmentation fault (11)
[r104-n61:21334] Signal code:  (128)
[r104-n61:21334] Failing at address: (nil)
[r104-n61:21333] *** Process received signal ***
[r104-n61:21333] Signal: Segmentation fault (11)
[r104-n61:21333] Signal code:  (128)
[r104-n61:21333] Failing at address: (nil)
[r104-n61:21331] Signal: Segmentation fault (11)
[r104-n61:21331] Signal code:  (128)
[r104-n61:21331] Failing at address: (nil)
[r104-n61:21336] *** Process received signal ***
[r104-n61:21336] Signal: Segmentation fault (11)
[r104-n61:21336] Signal code:  (128)
[r104-n61:21336] Failing at address: (nil)
[r104-n61:21335] *** Process received signal ***
[r104-n61:21335] Signal: Segmentation fault (11)
[r104-n61:21335] Signal code:  (128)
[r104-n61:21335] Failing at address: (nil)
[r104-n61:21332] [ 0] /lib64/libpthread.so.0(+0xf710) [0x2adda86de710]
[r104-n61:21332] [ 1] /lib64/libc.so.6(fwrite+0x4a) [0x2addaa57e99a]
[r104-n61:21332] [ 2] mdrun_mpi(_ZN4PLMD5OFile7llwriteEPKcm+0xfd) [0xc33dcd]
[r104-n61:21332] [ 3] mdrun_mpi(_ZN4PLMD5OFile6printfEPKcz+0x24c) [0xc33a3c]
[r104-n61:21332] [ 4] mdrun_mpi(_ZN4PLMD10PlumedMainD1Ev+0x61b) [0xb2ce2b]
[r104-n61:21332] [ 5] mdrun_mpi(_ZN4PLMD10PlumedMainD0Ev+0xa) [0xb2c7fa]
[r104-n61:21332] [ 6] mdrun_mpi(plumedmain_finalize+0x10) [0xb32380]
[r104-n61:21332] [ 7] mdrun_mpi(cmain+0x2619) [0x4b3089]
[r104-n61:21332] [ 8] mdrun_mpi(main+0x56) [0x4b9b66]
[r104-n61:21332] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd) [0x2addaa535d5d]
[r104-n61:21332] [10] mdrun_mpi() [0x496829]
[r104-n61:21332] *** End of error message ***
[r104-n61:21331] [ 0] /lib64/libpthread.so.0(+0xf710) [0x2b96daa0c710]
[r104-n61:21333] [ 0] /lib64/libpthread.so.0(+0xf710) [0x2abda8ccf710]
[r104-n61:21333] [ 1] /lib64/libc.so.6(fwrite+0x4a) [0x2abdaab6f99a]
[r104-n61:21333] [ 2] mdrun_mpi(_ZN4PLMD5OFile7llwriteEPKcm+0xfd) [0xc33dcd]
[r104-n61:21333] [ 3] mdrun_mpi(_ZN4PLMD5OFile6printfEPKcz+0x24c) [0xc33a3c]
[r104-n61:21333] [ 4] mdrun_mpi(_ZN4PLMD10PlumedMainD1Ev+0x61b) [0xb2ce2b]
[r104-n61:21333] [ 5] mdrun_mpi(_ZN4PLMD10PlumedMainD0Ev+0xa) [0xb2c7fa]
[r104-n61:21333] [ 6] mdrun_mpi(plumedmain_finalize+0x10) [0xb32380]
[r104-n61:21333] [ 7] mdrun_mpi(cmain+0x2619) [0x4b3089]
[r104-n61:21333] [ 8] mdrun_mpi(main+0x56) [0x4b9b66]
[r104-n61:21333] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd) [0x2abdaab26d5d]
[r104-n61:21333] [10] mdrun_mpi() [0x496829]
[r104-n61:21333] *** End of error message ***
[r104-n61:21336] [ 0] /lib64/libpthread.so.0(+0xf710) [0x2b3917522710]
[r104-n61:21336] [ 1] /lib64/libc.so.6(fwrite+0x4a) [0x2b39193c299a]
[r104-n61:21336] [ 2] mdrun_mpi(_ZN4PLMD5OFile7llwriteEPKcm+0xfd) [0xc33dcd]
[r104-n61:21336] [ 3] mdrun_mpi(_ZN4PLMD5OFile6printfEPKcz+0x24c) [0xc33a3c]
[r104-n61:21336] [ 4] mdrun_mpi(_ZN4PLMD10PlumedMainD1Ev+0x61b) [0xb2ce2b]
[r104-n61:21336] [ 5] mdrun_mpi(_ZN4PLMD10PlumedMainD0Ev+0xa) [0xb2c7fa]
[r104-n61:21336] [ 6] mdrun_mpi(plumedmain_finalize+0x10) [0xb32380]
[r104-n61:21336] [ 7] mdrun_mpi(cmain+0x2619) [0x4b3089]
[r104-n61:21336] [ 8] mdrun_mpi(main+0x56) [0x4b9b66]
[r104-n61:21336] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd) [0x2b3919379d5d]
[r104-n61:21336] [10] mdrun_mpi() [0x496829]
[r104-n61:21336] *** End of error message ***
[r104-n61:21334] [ 0] /lib64/libpthread.so.0(+0xf710) [0x2af6477e2710]
[r104-n61:21334] [ 1] /lib64/libc.so.6(fwrite+0x4a) [0x2af64968299a]
[r104-n61:21334] [ 2] mdrun_mpi(_ZN4PLMD5OFile7llwriteEPKcm+0xfd) [0xc33dcd]
[r104-n61:21334] [ 3] mdrun_mpi(_ZN4PLMD5OFile6printfEPKcz+0x24c) [0xc33a3c]
[r104-n61:21334] [ 4] mdrun_mpi(_ZN4PLMD10PlumedMainD1Ev+0x61b) [0xb2ce2b]
[r104-n61:21334] [ 5] mdrun_mpi(_ZN4PLMD10PlumedMainD0Ev+0xa) [0xb2c7fa]
[r104-n61:21334] [ 6] mdrun_mpi(plumedmain_finalize+0x10) [0xb32380]
[r104-n61:21334] [ 7] mdrun_mpi(cmain+0x2619) [0x4b3089]
[r104-n61:21334] [ 8] mdrun_mpi(main+0x56) [0x4b9b66]
[r104-n61:21334] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd) [0x2af649639d5d]
[r104-n61:21334] [10] mdrun_mpi() [0x496829]
[r104-n61:21334] *** End of error message ***
[r104-n61:21330] [ 0] /lib64/libpthread.so.0(+0xf710) [0x2b12ddf16710]
[r104-n61:21330] [ 1] /lib64/libc.so.6(fwrite+0x4a) [0x2b12dfdb699a]
[r104-n61:21330] [ 2] mdrun_mpi(_ZN4PLMD5OFile7llwriteEPKcm+0xfd) [0xc33dcd]
[r104-n61:21330] [ 3] mdrun_mpi(_ZN4PLMD5OFile6printfEPKcz+0x24c) [0xc33a3c]
[r104-n61:21330] [ 4] mdrun_mpi(_ZN4PLMD10PlumedMainD1Ev+0x61b) [0xb2ce2b]
[r104-n61:21330] [ 5] mdrun_mpi(_ZN4PLMD10PlumedMainD0Ev+0xa) [0xb2c7fa]
[r104-n61:21330] [ 6] mdrun_mpi(plumedmain_finalize+0x10) [0xb32380]
[r104-n61:21330] [ 7] mdrun_mpi(cmain+0x2619) [0x4b3089]
[r104-n61:21330] [ 8] mdrun_mpi(main+0x56) [0x4b9b66]
[r104-n61:21330] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd) [0x2b12dfd6dd5d]
[r104-n61:21330] [10] mdrun_mpi() [0x496829]
[r104-n61:21330] *** End of error message ***
[r104-n61:21329] [ 0] /lib64/libpthread.so.0(+0xf710) [0x2b5cbf9eb710]
[r104-n61:21329] [ 1] /lib64/libc.so.6(fwrite+0x4a) [0x2b5cc188b99a]
[r104-n61:21329] [ 2] mdrun_mpi(_ZN4PLMD5OFile7llwriteEPKcm+0xfd) [0xc33dcd]
[r104-n61:21329] [ 3] mdrun_mpi(_ZN4PLMD5OFile6printfEPKcz+0x24c) [0xc33a3c]
[r104-n61:21329] [ 4] mdrun_mpi(_ZN4PLMD10PlumedMainD1Ev+0x61b) [0xb2ce2b]
[r104-n61:21329] [ 5] mdrun_mpi(_ZN4PLMD10PlumedMainD0Ev+0xa) [0xb2c7fa]
[r104-n61:21329] [ 6] mdrun_mpi(plumedmain_finalize+0x10) [0xb32380]
[r104-n61:21329] [ 7] mdrun_mpi(cmain+0x2619) [0x4b3089]
[r104-n61:21329] [ 8] mdrun_mpi(main+0x56) [0x4b9b66]
[r104-n61:21329] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd) [0x2b5cc1842d5d]
[r104-n61:21329] [10] mdrun_mpi() [0x496829]
[r104-n61:21329] *** End of error message ***
[r104-n61:21335] [ 0] /lib64/libpthread.so.0(+0xf710) [0x2ba3cbe9c710]
[r104-n61:21335] [ 1] /lib64/libc.so.6(fwrite+0x4a) [0x2ba3cdd3c99a]
[r104-n61:21335] [ 2] mdrun_mpi(_ZN4PLMD5OFile7llwriteEPKcm+0xfd) [0xc33dcd]
[r104-n61:21335] [ 3] mdrun_mpi(_ZN4PLMD5OFile6printfEPKcz+0x24c) [0xc33a3c]
[r104-n61:21335] [ 4] mdrun_mpi(_ZN4PLMD10PlumedMainD1Ev+0x61b) [0xb2ce2b]
[r104-n61:21335] [ 5] mdrun_mpi(_ZN4PLMD10PlumedMainD0Ev+0xa) [0xb2c7fa]
[r104-n61:21335] [ 6] mdrun_mpi(plumedmain_finalize+0x10) [0xb32380]
[r104-n61:21335] [ 7] mdrun_mpi(cmain+0x2619) [0x4b3089]
[r104-n61:21335] [ 8] mdrun_mpi(main+0x56) [0x4b9b66]
[r104-n61:21335] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd) [0x2ba3cdcf3d5d]
[r104-n61:21335] [10] mdrun_mpi() [0x496829]
[r104-n61:21335] *** End of error message ***
[r104-n61:21331] [ 1] /lib64/libc.so.6(fwrite+0x4a) [0x2b96dc8ac99a]
[r104-n61:21331] [ 2] mdrun_mpi(_ZN4PLMD5OFile7llwriteEPKcm+0xfd) [0xc33dcd]
[r104-n61:21331] [ 3] mdrun_mpi(_ZN4PLMD5OFile6printfEPKcz+0x24c) [0xc33a3c]
[r104-n61:21331] [ 4] mdrun_mpi(_ZN4PLMD10PlumedMainD1Ev+0x61b) [0xb2ce2b]
[r104-n61:21331] [ 5] mdrun_mpi(_ZN4PLMD10PlumedMainD0Ev+0xa) [0xb2c7fa]
[r104-n61:21331] [ 6] mdrun_mpi(plumedmain_finalize+0x10) [0xb32380]
[r104-n61:21331] [ 7] mdrun_mpi(cmain+0x2619) [0x4b3089]
[r104-n61:21331] [ 8] mdrun_mpi(main+0x56) [0x4b9b66]
[r104-n61:21331] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd) [0x2b96dc863d5d]
[r104-n61:21331] [10] mdrun_mpi() [0x496829]
[r104-n61:21331] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 21329 on node r104-n61 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Giovanni Bussi

Jan 16, 2015, 2:09:27 AM
to plumed...@googlegroups.com
Hi Chris,

Could you check with your inputs whether the same thing happens with:

1. mdrun with -plumed but without -hrex
2. mdrun with -plumed using the standard (non-hrex) patch (in case you have it installed on your system)

This would be very useful to me in fixing the problem.
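
Concretely (just a sketch; replica count, exchange interval, and file names are placeholders), both tests would use the same continuation command line, the only differences being whether the -hrex flag is present and which patch mdrun_mpi was built against:

mpirun -np 8 mdrun_mpi -multi 8 -replex 500 -plumed plumed.dat -cpi md.cpt -maxh 0.05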

Thanks!

Giovanni



Chris Neale

Jan 16, 2015, 4:11:26 PM
to plumed...@googlegroups.com
Dear Giovanni:

I get the same error at shutdown in both cases you listed. For your suggested test #2, I compiled standard plumed-2.1.1.

I also ran one other test: using standard plumed-2.1.1, still under MPI, but without replica exchange (i.e., removing the -multi and -replex flags from mdrun_mpi) -- this time there were no errors on shutdown.
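
That extra test was launched along these lines (a sketch; file names are placeholders):

mpirun -np 8 mdrun_mpi -plumed plumed.dat -cpi md.cpt -maxh 0.05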

In case it matters, my plumed.dat (passed via the mdrun -plumed flag) exists but is empty. Also, here is my .mdp file:

constraints = all-bonds
lincs-iter =  1
lincs-order =  6
constraint_algorithm =  lincs
integrator = sd
dt = 0.002 
tinit = 0
nsteps = 1000000000
nstcomm = 1
nstxout = 1000000000
nstvout = 1000000000
nstfout = 1000000000
nstxtcout = 25000
nstenergy = 25000
nstlist = 10
nstlog=0 ; reduce log file size
ns_type = grid
rlist = 1.0
rvdw = 1.0
rcoulomb = 1.0
coulombtype = PME
ewald-rtol = 1e-5
optimize_fft = yes
fourierspacing = 0.12
fourier_nx = 0
fourier_ny = 0
fourier_nz = 0
pme_order = 4
tc_grps = System
tau_t = 1.0
ld_seed = -1
ref_t = 300
gen_temp = 300
gen_vel = yes
unconstrained_start = no
gen_seed = -1
Pcoupl = no


Thank you,
Chris.

Giovanni Bussi

Jan 21, 2015, 2:47:02 AM
to plumed...@googlegroups.com
Hi Chris,

I tried on my laptop (Mac). It seems that there are problems also with plain gromacs 4.6.7 (no plumed patch). Namely, when using -cpi, -replex, and -maxh together, the simulation does not terminate correctly. In that case no error is shown, but the simulation nevertheless does not quit.
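
My test was along these lines (a sketch; binary name, replica count, exchange interval, and file names are placeholders):

mpirun -np 8 mdrun_mpi -multi 8 -replex 500 -cpi md.cpt -maxh 0.05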

Can you confirm that?

If this is the case, then the problem might just be plumed complaining about the non-clean exit, which should be harmless.

Thanks!

Giovanni

Sent from my iPhone

Chris Neale

Jan 21, 2015, 1:04:23 PM
to plumed...@googlegroups.com
Dear Giovanni:

What exactly do you mean when you say that gromacs doesn't complete correctly? On some clusters I find that

mpirun -np 8 mdrun_mpi 
echo finished

will never get to the "echo finished" although gromacs indicates that it has completed.

However, if I run it like this, then it seems to work:

mpirun -np 8 mdrun_mpi &   # launch mpirun in the background
wait                       # block until the backgrounded mpirun exits
echo finished

I always figured it was an mpirun issue, but maybe you are right that it is something about gromacs.

I'm happy to do more testing; I just want to know exactly what to look for.

Thank you,
Chris.