Reloading a restart file fails if some molecule properties are removed before writing the restart


julien

Jan 21, 2013, 6:52:37 AM
to sire-de...@googlegroups.com
Hi Chris,

I am working on a script that creates several temporary molecule properties (buffered coordinates of past snapshots from a GPU simulation).
However, it is not necessary to save all the buffered coordinates in the restart file. It would be good to remove these properties before writing a restart, to keep the size of the restart reasonable for protein simulations, or to deal with simulations where a large number of coordinates have been buffered.

The code to remove buffered coordinates looks like this:

--> end of the main block of the script

    (...)
    print "Simulation took %d s " % ( s2 - s1)
    print "Potential energy = ", system.energy(), " Kinetic energy = ", moves[0].kineticEnergy()

    if buffer_freq > 0:
        system = clearBuffers(system)

    print "Saving restart"
    Sire.Stream.save( [system, moves], restart_file )

and here is the relevant subroutine

def clearBuffers( system ):
    # Remove every "buffered_*" property from each molecule in the system

    print "Clearing buffers..."

    mols = system[MGName("all")].molecules()
    molnums = mols.molNums()

    for molnum in molnums:
        mol = mols.molecule(molnum).molecule()
        molprops = mol.propertyKeys()
        print molprops
        editmol = mol.edit()
        for molprop in molprops:
            # Only the temporary buffered coordinates are dropped
            if molprop.startsWith("buffered_"):
                print "Removing property %s " % molprop
                editmol.removeProperty( PropertyName(molprop) )
        # Commit the edited molecule and push it back into the system
        mol = editmol.commit()
        system.update(mol)

    return system


*** This is a sample output with the call to clearBuffers() commented out, starting from scratch

julien@node006:~/projects/nautilus/hydrationstudy/purewater/simulation0/run$ python siregpumd-openmm.py 1
 ### Starting script on unknown ###
New run. Loading input and creating restart
Applying flexibility and zmatrix templates...
Creating force fields...
Setting up moves...
Created a MD move that uses OpenMM for all molecules on GPU 1
Generated random seed number 510365
Saving restart
Loading required Sire Python modules...............Done!
Loaded a restart file on wich we have performed 0 moves.
Setting up moves...
Created a MD move that uses OpenMM for all molecules on GPU 1
Generated random seed number 123301
There are 3217 atoms in the group
Setup took 2 s
Potential energy =  -8773.96 kcal mol-1  Kinetic energy =  2134.89 kcal mol-1
Running MD simulation

Cycle =  1

 Time to write coordinates 1532 ms

Cycle =  2

 Time to write coordinates 1707 ms
Simulation took 36 s
Potential energy =  -8853.75 kcal mol-1  Kinetic energy =  1454.94 kcal mol-1
Saving restart
julien@node006:~/projects/nautilus/hydrationstudy/purewater/simulation0/run$ ll -lth
total 8.5M
-rw-r--r-- 1 julien michel 3.0M Jan 21 11:35 traj000000001.dcd
-rw-r--r-- 1 julien michel 4.7M Jan 21 11:35 sim_restart.s3
-rw-r--r-- 1 julien michel  123 Jan 21 11:35 moves.1.dat
-rw-r--r-- 1 julien michel  16K Jan 21 11:35 .siregpumd-openmm.py.swp
drwxr-xr-x 2 julien michel 4.0K Jan 21 11:35 ./
-rwxr-xr-x 1 julien michel  26K Jan 21 11:35 siregpumd-openmm.py*
-rw-r--r-- 1 julien michel 488K Jan 21 11:28 SYSTEM.top
-rw-r--r-- 1 julien michel 230K Jan 21 11:28 SYSTEM.crd
drwxr-xr-x 3 julien michel   67 Jan 21 11:15 ../
-rw-r--r-- 1 julien michel 1015 Jan 21 10:39 gpujob.sh
julien@node006:~/projects/nautilus/hydrationstudy/purewater/simulation0/run$

*** Note that the size of the restart is 4.7MB

*** Restarting the job works

julien@node006:~/projects/nautilus/hydrationstudy/purewater/simulation0/run$ python siregpumd-openmm.py 1
 ### Starting script on unknown ###
Loading required Sire Python modules...............Done!
Loaded a restart file on wich we have performed 40000 moves.
Setting up moves...
Created a MD move that uses OpenMM for all molecules on GPU 1
Generated random seed number 681199
There are 3217 atoms in the group
Setup took 1 s
Potential energy =  -8853.75 kcal mol-1  Kinetic energy =  1454.94 kcal mol-1
Running MD simulation

Cycle =  1

 Time to write coordinates 1072 ms

Cycle =  2

 Time to write coordinates 4044 ms
Simulation took 39 s
Potential energy =  -8932.4 kcal mol-1  Kinetic energy =  1415.62 kcal mol-1
Saving restart
julien@node006:~/projects/nautilus/hydrationstudy/purewater/simulation0/run$ ll -lth
total 16M
-rw-r--r-- 1 julien michel 3.0M Jan 21 11:38 traj000000002.dcd
-rw-r--r-- 1 julien michel 8.3M Jan 21 11:38 sim_restart.s3
-rw-r--r-- 1 julien michel  123 Jan 21 11:38 moves.2.dat
drwxr-xr-x 2 julien michel 4.0K Jan 21 11:38 ./
-rw-r--r-- 1 julien michel  40K Jan 21 11:38 .siregpumd-openmm.py.swp
-rwxr-xr-x 1 julien michel  26K Jan 21 11:38 siregpumd-openmm.py*
-rw-r--r-- 1 julien michel 3.0M Jan 21 11:35 traj000000001.dcd
-rw-r--r-- 1 julien michel  123 Jan 21 11:35 moves.1.dat
-rw-r--r-- 1 julien michel 488K Jan 21 11:28 SYSTEM.top
-rw-r--r-- 1 julien michel 230K Jan 21 11:28 SYSTEM.crd
drwxr-xr-x 3 julien michel   67 Jan 21 11:15 ../
-rw-r--r-- 1 julien michel 1015 Jan 21 10:39 gpujob.sh

*** Now the size of the restart file has grown to 8.3MB
*** After a third iteration, the size of the restart is 8.4 MB
*** After a fourth iteration, the size of the restart is 8.3 MB


*** Now, starting from scratch, but calling clearBuffers() before saving the restart
julien@node006:~/projects/nautilus/hydrationstudy/purewater/simulation0/run/nobufferinrestart$ python siregpumd-openmm.py 1 ; ll -lth
 ### Starting script on unknown ###
New run. Loading input and creating restart
Applying flexibility and zmatrix templates...
Creating force fields...
Setting up moves...
Created a MD move that uses OpenMM for all molecules on GPU 1
Generated random seed number 489545
Saving restart
Loading required Sire Python modules...............Done!
Loaded a restart file on wich we have performed 0 moves.
Setting up moves...
Created a MD move that uses OpenMM for all molecules on GPU 1
Generated random seed number 104768
There are 3217 atoms in the group
Setup took 2 s
Potential energy =  -8773.96 kcal mol-1  Kinetic energy =  2159.78 kcal mol-1
Running MD simulation

Cycle =  1

 Time to write coordinates 1468 ms

Cycle =  2

 Time to write coordinates 2804 ms
Simulation took 37 s
Potential energy =  -8783.95 kcal mol-1  Kinetic energy =  1429.98 kcal mol-1
Clearing buffers...
Saving restart
total 8.7M
-rw-r--r-- 1 julien michel 3.0M Jan 21 11:44 traj000000001.dcd
-rw-r--r-- 1 julien michel 5.0M Jan 21 11:44 sim_restart.s3
-rw-r--r-- 1 julien michel  123 Jan 21 11:44 moves.1.dat
drwxr-xr-x 2 julien michel  145 Jan 21 11:44 ./
-rwxr-xr-x 1 julien michel  26K Jan 21 11:43 siregpumd-openmm.py*
drwxr-xr-x 4 julien michel   83 Jan 21 11:43 ../
-rw-r--r-- 1 julien michel 488K Jan 21 11:28 SYSTEM.top
-rw-r--r-- 1 julien michel 230K Jan 21 11:28 SYSTEM.crd
-rw-r--r-- 1 julien michel 1015 Jan 21 10:39 gpujob.sh

* The size of the restart has actually increased!

* And if I try to restart a job, I get the following error:

julien@node006:~/projects/nautilus/hydrationstudy/purewater/simulation0/run/nobufferinrestart$ python siregpumd-openmm.py 1 ; ll -lth
 ### Starting script on unknown ###
Loading required Sire Python modules...............Done!
Traceback (most recent call last):
  File "siregpumd-openmm.py", line 744, in <module>
    system, moves = Sire.Stream.load( restart_file )
  File "/home/common/sire-dev/lib/python2.6/site-packages/Sire/Stream/__init__.py", line 48, in load
    return _pvt_load(data)
UserWarning: Exception 'SireFF::missing_function' thrown by the thread 'master'.
There are no internal functions for the molecule! The molecule will have an internal energy of zero!
Thrown from FILE: /home/julien/software/siredev/corelib/src/libs/SireMM/internalparameters.cpp, LINE: 2966, FUNCTION: void SireMM::InternalParameters::updateState()
__Backtrace__
(  0) /home/common/sire-dev/lib/libSireError.so.0 ([0x7f2141f4f48b] ++0x2b)
  -- SireError::getBackTrace()

(  1) /home/common/sire-dev/lib/libSireError.so.0 ([0x7f2141f4c7a7] ++0x97)
  -- SireError::exception::exception(QString, QString)

/home/common/sire-dev/lib/libSireMM.so.0(+0x2c448e) [0x7f213afba48e]
(  3) /home/common/sire-dev/lib/libSireMM.so.0 ([0x7f213af9c2c8] ++0x4b8)
  -- SireMM::InternalParameters::updateState()

(  4) /home/common/sire-dev/lib/libSireMM.so.0 ([0x7f213afacda5] ++0x2c5)
  -- operator>>(QDataStream&, SireMM::InternalParameters&)
(...)

Now my questions :-)


1) Why does the size of the restart file increase after the second iteration?

2) Why is the restart file larger after removing several molecule properties from the system?

3) Why does removing some molecule properties cause subsequent loading of the restart file to fail?

If it would help, I can send you a tarball with the input and script. It might be a fair bit of work to run, though: to run the OpenMM code with my latest branch you will also need a very recent version of OpenMM (I am using revision 3537) from the developers' svn repository.
 

Christopher Woods

Jan 21, 2013, 9:31:00 AM
to Sire Developers

  Hi Julien,

Good description of what you are doing ;-). To answer the questions:

1) Why does the size of the restart file increase after the second iteration?

The "moves" object also contains copies of all of the molecules. The "sampler" object in Moves keeps a copy of the most recent version of the MoleculeGroup that the move object sampled. This is to increase efficiency at runtime, and normally doesn't increase the size of the restart file, because all data that is the same is deduplicated: the molecules in the "sampler" in the "moves" are the same as the molecules in the "system", so only one copy of each molecule is saved in the restart file. In your case, you are editing the molecules in the "system" before saving the restart file. As the molecules in "system" now don't match the molecules in "sampler"/"moves", two copies of each molecule are saved into the restart file, which is why the file is larger.
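
As a rough illustration of the de-duplication argument above, here is a minimal sketch that uses Python's standard pickle module rather than Sire. The helper functions, property names and sizes are hypothetical stand-ins, and pickle shares objects by identity within one stream, which is only an analogy for how Sire's streaming shares identical data.

```python
import pickle


def saved_size(objects):
    """Return the number of bytes needed to pickle `objects` as one stream."""
    return len(pickle.dumps(objects, protocol=2))


def make_molecule(index):
    """Hypothetical stand-in for a molecule: a dict of named properties."""
    return {
        "coordinates": [float(index)] * 30,
        "buffered_coordinates_1": [float(index)] * 300,
    }


def without_buffers(molecule):
    """Return an edited copy of `molecule` with the buffered_* properties removed."""
    return {
        name: list(value)
        for name, value in molecule.items()
        if not name.startswith("buffered_")
    }


molecules = [make_molecule(i) for i in range(100)]
system = list(molecules)   # stand-in for "system"
sampler = list(molecules)  # stand-in for the group held by "moves"/"sampler"

# Both containers refer to the same molecules, so each is written only once.
size_shared = saved_size([system, sampler])

# Edit the molecules in "system" only: the two containers no longer match.
system = [without_buffers(mol) for mol in system]
size_diverged = saved_size([system, sampler])

# Saving the edited system on its own drops the old versions entirely.
size_system_only = saved_size([system])

print("shared molecules:   %d bytes" % size_shared)
print("diverged molecules: %d bytes" % size_diverged)
print("system only:        %d bytes" % size_system_only)
```

In this toy model the combined size goes up once the two containers stop sharing molecules, and saving the edited system alone gives the smallest size.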

A good test to see if this is the right diagnosis is to save the "system" and "moves" objects separately, e.g. Sire.Stream.save( system, "sim_system.s3" ), Sire.Stream.save( moves, "sim_moves.s3" ). Then see if the sizes of the sim_system.s3 files from the two versions of the script are different.
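
Once the two files have been written, their sizes could be compared with something like the following. It relies only on os.path from the standard library; report_sizes is a hypothetical helper, and the file names are the ones from the example above.

```python
import os


def report_sizes(paths):
    """Return a dict mapping each existing file in `paths` to its size in bytes."""
    sizes = {}
    for path in paths:
        if os.path.isfile(path):
            sizes[path] = os.path.getsize(path)
    return sizes


# File names as suggested above; files that are missing are simply skipped.
for name, size in sorted(report_sizes(["sim_system.s3", "sim_moves.s3"]).items()):
    print("%-16s %10.1f KB" % (name, size / 1024.0))
```

Running this in the output directory of each version of the script gives two sets of numbers to compare side by side.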

To fix this, you can clear out all molecules from the "moves" object before you save, e.g. by calling sampler.setGroup( new_group ). Alternatively, don't save the moves object, but instead delete it at the end of each iteration and recreate it at the start. You can save move statistics by just writing out the statistics to a file at the end of each iteration.
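
If the moves object is no longer saved, the statistics could be written out separately at the end of each iteration. A minimal sketch using the standard json module is below; the function names and the contents of the stats dictionary are hypothetical, and how the statistics are obtained from the moves object is not shown.

```python
import json


def write_move_stats(path, stats):
    """Write a dict of move statistics to `path` as human-readable JSON."""
    with open(path, "w") as handle:
        json.dump(stats, handle, indent=2, sort_keys=True)


def read_move_stats(path):
    """Read back a dict of move statistics previously written to `path`."""
    with open(path) as handle:
        return json.load(handle)


# Example call (hypothetical keys and file name):
# write_move_stats("moves_stats.json", {"cycle": 2, "nmoves": 40000})
```

The moves object can then be recreated from scratch at the start of the next iteration, with the statistics file carrying whatever bookkeeping is needed between runs.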

2) Why is the restart file larger after removing several molecule properties from the system?

For the same reason as above: you have removed them from the molecules in "system", but not from "moves", so the restart file now contains both the old and the new versions of the molecules.

3) Why does removing some molecule properties cause subsequent loading of the restart file to fail?

This is confusing and I think you have exposed a bug. The code is failing because "InternalParameters::updateState()" is called when the InternalParameters object that represents the internal bond, angle and dihedral parameters of a molecule is loaded from the restart file, and this object has got into a strange state. "updateState" sees that the object is in a weird state (there are no parameters for the molecule) and so it raises the exception that you saw. I am surprised that this was not caught sooner in the code. Could you get Sire to recalculate the energy of the system after you remove the buffered properties, e.g. with system.mustNowRecalculateFromScratch() and system.energy()? Check that the potential energy is the same as before the buffered properties were removed. I suspect that calculating the energy will trigger the exception: if the internal parameters are missing for the molecule, the problem should be detected and the exception thrown at that point.

  Cheers,

  Christopher

julien

Jan 21, 2013, 5:37:24 PM
to sire-de...@googlegroups.com
Hi Chris, 

Thanks for the clear explanation of 1) & 2). The easiest solution is not to stream moves, as I need to recreate it every time a restart is loaded anyway (the GPU device on which the move runs potentially changes every time the job is resubmitted).

For 3) I have done the tests you suggested, i.e.:

    print "Simulation took %d s " % ( s2 - s1)
    print "Potential energy = ", system.energy(), " Kinetic energy = ", moves[0].kineticEnergy()

    if buffer_freq > 0:
        system = clearBuffers(system)

    system.mustNowRecalculateFromScratch()
    print system.energy()

    print "Saving restart"
    Sire.Stream.save( [system], restart_file )

* The output does not suggest the energy of the system has changed

julien@node005:~/projects/nautilus/hydrationstudy/purewater/simulation0/run/newversion$ python siregpumd-openmm.py 1
 ### Starting script on unknown ### 
New run. Loading input and creating restart
Applying flexibility and zmatrix templates...
Creating force fields... 
Saving restart
Loading required Sire Python modules...............Done!
Loaded a restart file 
Setting up moves...
Created a MD move that uses OpenMM for all molecules on GPU 1 
Generated random seed number 430642 
There are 3217 atoms in the group 
Setup took 3 s 
Potential energy =  -8773.96 kcal mol-1  Kinetic energy =  2120.59 kcal mol-1
Running MD simulation 

Cycle =  1 

 Time to write coordinates 1810 ms 

Cycle =  2 

 Time to write coordinates 1631 ms 
Simulation took 38 s 
Potential energy =  -8917.81 kcal mol-1  Kinetic energy =  1458.68 kcal mol-1
Clearing buffers...
-8917.81 kcal mol-1
Saving restart


* However... surprise, surprise! The restart file is now only 690 KB:

julien@node005:~/projects/nautilus/hydrationstudy/purewater/simulation0/run/newversion$ ll
total 4460
drwxr-xr-x 2 julien michel     145 Jan 21 22:18 ./
drwxr-xr-x 5 julien michel      69 Jan 21 22:11 ../
-rw-r--r-- 1 julien michel    1015 Jan 21 22:08 gpujob.sh
-rw-r--r-- 1 julien michel     123 Jan 21 22:18 moves.1.dat
-rw-r--r-- 1 julien michel  688526 Jan 21 22:18 sim_restart.s3
-rwxr-xr-x 1 julien michel   26338 Jan 21 22:17 siregpumd-openmm.py*
-rw-r--r-- 1 julien michel  235017 Jan 21 22:08 SYSTEM.crd
-rw-r--r-- 1 julien michel  499573 Jan 21 22:08 SYSTEM.top
-rw-r--r-- 1 julien michel 3094996 Jan 21 22:18 traj000000001.dcd

* And jobs can be restarted

julien@node005:~/projects/nautilus/hydrationstudy/purewater/simulation0/run/newversion$ python siregpumd-openmm.py 1
 ### Starting script on unknown ### 
Loading required Sire Python modules...............Done!
Loaded a restart file 
Setting up moves...
Created a MD move that uses OpenMM for all molecules on GPU 1 
Generated random seed number 779581 
There are 3217 atoms in the group 
Setup took 0 s 
Potential energy =  -8917.81 kcal mol-1  Kinetic energy =  1458.68 kcal mol-1
Running MD simulation 

Cycle =  1 

 Time to write coordinates 1653 ms 

Cycle =  2 

 Time to write coordinates 2008 ms 
Simulation took 38 s 
Potential energy =  -8974.5 kcal mol-1  Kinetic energy =  1470 kcal mol-1
Clearing buffers...
-8974.5 kcal mol-1
Saving restart

julien@node005:~/projects/nautilus/hydrationstudy/purewater/simulation0/run/newversion$ ll -lth
total 7.4M
-rw-r--r-- 1 julien michel 3.0M Jan 21 22:23 traj000000002.dcd
-rw-r--r-- 1 julien michel 673K Jan 21 22:23 sim_restart.s3
-rw-r--r-- 1 julien michel  123 Jan 21 22:23 moves.2.dat
drwxr-xr-x 2 julien michel 4.0K Jan 21 22:22 ./
-rw-r--r-- 1 julien michel 3.0M Jan 21 22:18 traj000000001.dcd
-rw-r--r-- 1 julien michel  123 Jan 21 22:18 moves.1.dat
-rwxr-xr-x 1 julien michel  26K Jan 21 22:17 siregpumd-openmm.py*
drwxr-xr-x 5 julien michel   69 Jan 21 22:11 ../
-rw-r--r-- 1 julien michel 1015 Jan 21 22:08 gpujob.sh
-rw-r--r-- 1 julien michel 230K Jan 21 22:08 SYSTEM.crd
-rw-r--r-- 1 julien michel 488K Jan 21 22:08 SYSTEM.top
 


* So it looks like there is a bug indeed, and a workaround is to call 

system.mustNowRecalculateFromScratch()

before writing the restart. 

Thanks again


Christopher Woods

Jan 22, 2013, 5:44:28 AM
to Sire Developers

  Hi Julien,

Excellent news that the restart file is smaller. It is nice to know that Sire is behaving semi-predictably ;-)

The crash on re-read is definitely a bug, caused by the InternalParameters object getting into a weird state after the molecules were updated. Calling "mustNowRecalculateFromScratch" has reset the state, which fixes the problem. These bugs are hard to find and fix because they are related to the optimisations used to speed up MC energy calculations when only parts of the system are changed. "mustNowRecalculateFromScratch" resets the state so that the energy is calculated fresh, and is a good way to debug energy leaks and weird state errors like the one you experienced.
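
To illustrate the kind of optimisation involved, here is a toy model in plain Python, not Sire's actual implementation. It caches one energy term per molecule and only recomputes the terms marked as changed; the class, its method names and the mark_dirty flag (used to simulate a bookkeeping slip) are all hypothetical. Note that in the case above the reported energy was unaffected and the stale state only showed up when the restart was reloaded, so this only sketches the general failure mode.

```python
class CachedEnergy(object):
    """Toy model: total energy is a sum of per-molecule terms, cached per molecule."""

    def __init__(self, values):
        self.values = dict(values)     # molecule id -> stand-in "coordinate"
        self.cache = {}                # molecule id -> cached energy term
        self.dirty = set(self.values)  # ids whose cached term is out of date

    def _term(self, value):
        # Stand-in for a real energy function
        return value * value

    def update(self, mol_id, value, mark_dirty=True):
        """Change one molecule; mark_dirty=False simulates a bookkeeping bug."""
        self.values[mol_id] = value
        if mark_dirty:
            self.dirty.add(mol_id)

    def must_now_recalculate_from_scratch(self):
        """Throw away all cached terms so the next energy() starts fresh."""
        self.cache.clear()
        self.dirty = set(self.values)

    def energy(self):
        """Recompute only the dirty terms, then sum the cache."""
        for mol_id in self.dirty:
            self.cache[mol_id] = self._term(self.values[mol_id])
        self.dirty.clear()
        return sum(self.cache.values())


model = CachedEnergy({1: 1.0, 2: 2.0})
first = model.energy()

model.update(1, 3.0, mark_dirty=False)  # the change is not recorded
stale = model.energy()                  # still uses the old cached term

model.must_now_recalculate_from_scratch()
fresh = model.energy()                  # all terms recomputed

print("first=%s stale=%s fresh=%s" % (first, stale, fresh))
```

The reset is the blunt but reliable option: it gives up the speed-up for one evaluation in exchange for a state that is known to be consistent.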

  Best wishes,

  Christopher
