Multi-walker metadynamics on LAMMPS: errors with sum_hills and restart functionality


Pranav Sharma

Sep 2, 2021, 10:34:15 PM
to PLUMED users

Hello,

I am currently running multi-walker well-tempered metadynamics with the PLUMED plug-in for LAMMPS on a supercomputing cluster. It is a two-collective-variable system with 224 walkers, all reading and writing hills to their respective HILLS.# files in the parent directory. For additional context, I am running 12 of these 224-walker metadynamics simulations on 12 slightly different versions of the system. An example input file is shown below:


"""

RESTART 

UNITS LENGTH=A ENERGY=eV


# distance between COM beads

d1: DISTANCE ATOMS=1475,1476

c: COORDINATION GROUPA=616,627,638,649,660,671,682,693,704,715,726,737 GROUPB=1353,1364,1375,1386,1397,1408,1419,1430,1441,1452,1463,1474 SWITCH={GAUSSIAN R_0=6.0581 D_0=3.4}


# Walls free energy - position

UPPER_WALLS ARG=d1 AT=75 KAPPA=10 EXP=2 EPS=1 OFFSET=0 LABEL=uwalld1

LOWER_WALLS ARG=d1 AT=2 KAPPA=10 EXP=2 EPS=1 OFFSET=0 LABEL=lwalld1

UPPER_WALLS ARG=c AT=12 KAPPA=10 EXP=2 EPS=1 OFFSET=0 LABEL=uwallc

LOWER_WALLS ARG=c AT=0 KAPPA=10 EXP=2 EPS=1 OFFSET=0 LABEL=lwallc


METAD ...

  LABEL=restraint

  ARG=d1,c SIGMA=0.5,0.5 HEIGHT=0.01 PACE=500
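  # a Gaussian of initial HEIGHT (in eV, per UNITS above) is deposited every PACE=500 steps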

  GRID_MIN=0,0 GRID_MAX=80,12.5

  BIASFACTOR=6 TEMP=300

  WALKERS_N=224

  WALKERS_ID=1

  WALKERS_DIR=../

  WALKERS_RSTRIDE=1000 
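  # stride (in MD steps) at which this walker re-reads the other walkers' hills files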

... METAD


PRINT ARG=d1,c,restraint.bias STRIDE=500 FILE=COLVAR

"""


All 224 walkers in each of the 12 systems use essentially the same input file. I was able to run 11 of the 12 systems for six days of wall time (in 48-hour increments), leaving each walker with a hills file of roughly 150 MB. To post-process with PLUMED's sum_hills facility, I wrote a script that concatenates all of a system's hills files into a single ~30-33 GB file (the concatenation step is sketched below, after the error output). For each of the 11 working systems, even though I allocate an entire core to the sum_hills command on the supercomputer and give it 48 hours to run, every job aborts after about 40 minutes with the same output/error:


PLUMED: PLUMED is starting

PLUMED: Version: 2.8.0-dev (git: 9b35296e4) compiled on Aug  5 2021 at 09:41:35

PLUMED: Please cite these papers when using PLUMED [1][2]

PLUMED: For further information see the PLUMED web page at http://www.plumed.org

PLUMED: Root: /home1/08278/pbs12/plumed2/build/lib/plumed

PLUMED: For installed feature, see /home1/08278/pbs12/plumed2/build/lib/plumed/src/config/config.txt

PLUMED: Molecular dynamics engine: 

PLUMED: Precision of reals: 8

PLUMED: Running over 1 node

PLUMED: Number of threads: 1

PLUMED: Cache line size: 512

PLUMED: Number of atoms: 1

PLUMED: File suffix: 

PLUMED: Timestep: 0.000000

PLUMED: KbT has not been set by the MD engine

PLUMED: It should be set by hand where needed

PLUMED: Relevant bibliography:

PLUMED:   [1] The PLUMED consortium, Nat. Methods 16, 670 (2019)

PLUMED:   [2] Tribello, Bonomi, Branduardi, Camilloni, and Bussi, Comput. Phys. Commun. 185, 604 (2014)

PLUMED: Please read and cite where appropriate!

PLUMED: Finished setup

PLUMED: Action FAKE

PLUMED:   with label d1

PLUMED: Action FAKE

PLUMED:   with label c

PLUMED: Action FUNCSUMHILLS

PLUMED:   with label @2

PLUMED:   with arguments d1 c

PLUMED:   Output format is %14.9f

PLUMED:   hillsfile  : sumHills.txt

PLUMED:   Doing only one integration: no stride 

PLUMED:   mintozero: bias/histogram will be translated to have the minimum value equal to zero

PLUMED:   output file for fes/bias  is :  whole.dat

PLUMED: 

PLUMED:   Now calculating...

PLUMED: 

PLUMED:   reading hills: 

PLUMED:   doing serialread 

PLUMED:   opening file sumHills.txt


WARNING: IFile closed in the middle of reading. seems strange!


./submit0.sh: line 8: 74678 Aborted                 /home1/08278/pbs12/plumed2/build/bin/plumed sum_hills --hills sumHills.txt --mintozero --kt 0.0258724832 --min 0,0 --max 80,4.5 --bin 801,46 --outfile whole.dat

terminate called after throwing an instance of 'PLMD::Plumed::std_bad_alloc'


PLUMED:                                              Cycles        Total      Average      Minimum      Maximum

PLUMED: 0 Summing hills                                    1  2505.670629  2505.670629  2505.670629 2505.670629



Some of the systems output the following instead:


PLUMED:                                               Cycles        Total      Average      Minimum      Maximum

PLUMED:                                                    1     0.001051     0.001051     0.001051     0.001051


None of my sum_hills commands returned a free-energy surface. I also tried reducing the file size by two-thirds, running sum_hills only on the hills from the first 48-hour run (a file of roughly 10 GB); I received similar errors and, again, no free-energy surfaces.
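
For reference, the concatenation step mentioned above is essentially the following (a minimal sketch; it assumes walkers numbered 0-223 with the per-walker files in the parent directory, as in the input above, and the exact paths are illustrative):

"""
#!/bin/bash
# Concatenate the 224 per-walker hills files into the single file
# passed to sum_hills. The repeated "#! FIELDS ..." header lines from
# each per-walker file are simply left in place.
rm -f sumHills.txt
for i in $(seq 0 223); do
    cat ../HILLS.$i >> sumHills.txt
done
"""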


Could you shed some light on why my sum_hills jobs might be crashing even though they start, read at least the header of the aggregate hills file (the output above recognizes my c and d1 CVs), and have a dedicated core at their disposal for many hours? Is it simply the large file size?


Related to this issue, I am curious how to pick the rate of hill deposition for our walkers; as shown in the sample input file above, we use PACE=500 with WALKERS_RSTRIDE=1000. My first priority is to reach convergence quickly in terms of wall time, meaning I would like to need only a couple more 48-hour restarts. For context, some of these are fairly complex systems that took 2-3 months to converge to sensible values with other metadynamics codes (albeit with about a tenth as many walkers, less efficient code, and slower machines).


My second priority is to minimize the storage footprint of our HILLS files, both because of somewhat limited hard-drive space and because the file size may be what is breaking our sum_hills jobs. This reduction in storage might be achieved by increasing WALKERS_RSTRIDE from 1000 to, say, 2000. However, I don't want to do this if it doubles the wall time to convergence. Is there any advice or guidance you can give on this trade-off? Does increasing the rate of hill deposition always lead to faster convergence and shorter wall time, or is there some optimal middle point?
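
(For scale, a rough estimate from the numbers in this message: each HILLS line in our two-CV multivariate format has 8 columns and comes to roughly 160 bytes, and each walker deposits on the order of 260,000 hills per 48-hour increment, i.e. ~40 MB per walker per increment. Across 224 walkers that is ~9 GB per 48 hours, or roughly 27-30 GB over three increments, which matches the concatenated file sizes above; halving the deposition rate should therefore roughly halve the storage.)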



Lastly, I have one final question. As I mentioned earlier, 11 of the 12 systems restarted successfully (twice). The remaining system, however, failed every time I restarted the full set, with the following error:


...

PLUMED:   Restarting from ..//HILLS.204:      259034 Gaussians read

PLUMED:   Restarting from ..//HILLS.205:      259434 Gaussians read

PLUMED:   Restarting from ..//HILLS.206:      258622 Gaussians read

PLUMED:   Restarting from ..//HILLS.207:      259020 Gaussians read

PLUMED:   Restarting from ..//HILLS.208:

PLUMED: 

PLUMED: ################################################################################

PLUMED: 

PLUMED: 

PLUMED: +++ PLUMED error

PLUMED: +++ at IFile.cpp:207, function PLMD::IFile &PLMD::IFile::scanField()

PLUMED: +++ assertion failed: fields[i].read

PLUMED: +++ message follows +++

PLUMED: field multivariate was not read: all the fields need to be read otherwise you could miss important infos

PLUMED: 

PLUMED: ################################################################################

PLUMED: 


For some reason, out of the 224 HILLS files, HILLS.208 causes all the other walkers to crash every time I attempt to restart the 224-walker set. Interestingly, the HILLS.208 walker itself keeps going: while the remaining 223 walker files have sizes indicative of 48 hours of running wall time (~40 MB), HILLS.208 is around 120 MB and was being appended to for all ~144 hours of running. I tried the fix proposed in this thread (https://groups.google.com/g/plumed-users/c/0VZexz8tD2c?pli=1) and in another support post: removing any rows that do not have the full 8 entries (in our case) from HILLS.208 as well as from the other 223 files. I did this before both restarts, and it did not help; walker 208 still kept going while the others crashed.
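
(The row cleanup I applied was essentially the following; a minimal sketch, assuming header lines start with '#' and valid data rows have exactly 8 whitespace-separated fields:)

"""
#!/bin/bash
# Strip truncated data rows (anything without exactly 8 fields) from
# every walker's hills file, keeping the "#! FIELDS"/"#! SET" headers.
for f in ../HILLS.*; do
    awk '/^#/ || NF == 8' "$f" > "$f.clean" && mv "$f.clean" "$f"
done
"""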


Any help on any of these issues would be greatly appreciated. Thank you for reading through this message, and please let me know if any other information is needed.


Sincerely,

Pranav Sharma






