Hello,
I am currently running multi-walker well-tempered metadynamics with the PLUMED plug-in for LAMMPS on a supercomputing cluster. It is a two-collective-variable system with 224 walkers, all reading/writing hills to their respective HILLS.# files in the parent directory. For additional context, I am running 12 of these 224-walker metadynamics simulations, each with a slightly different version of the system. Below is an example input file:
"""
RESTART
UNITS LENGTH=A ENERGY=eV
#distance between COM beads
d1: DISTANCE ATOMS=1475,1476
c: COORDINATION GROUPA=616,627,638,649,660,671,682,693,704,715,726,737 GROUPB=1353,1364,1375,1386,1397,1408,1419,1430,1441,1452,1463,1474 SWITCH={GAUSSIAN R_0=6.0581 D_0=3.4}
# Walls free energy - position
UPPER_WALLS ARG=d1 AT=75 KAPPA=10 EXP=2 EPS=1 OFFSET=0 LABEL=uwalld1
LOWER_WALLS ARG=d1 AT=2 KAPPA=10 EXP=2 EPS=1 OFFSET=0 LABEL=lwalld1
UPPER_WALLS ARG=c AT=12 KAPPA=10 EXP=2 EPS=1 OFFSET=0 LABEL=uwallc
LOWER_WALLS ARG=c AT=0 KAPPA=10 EXP=2 EPS=1 OFFSET=0 LABEL=lwallc
METAD ...
LABEL=restraint
ARG=d1,c SIGMA=0.5,0.5 HEIGHT=0.01 PACE=500
GRID_MIN=0,0 GRID_MAX=80,12.5
BIASFACTOR=6 TEMP=300
WALKERS_N=224
WALKERS_ID=1
WALKERS_DIR=../
WALKERS_RSTRIDE=1000
... METAD
PRINT ARG=d1,c,restraint.bias STRIDE=500 FILE=COLVAR
"""
All 224 walkers in each of the 12 systems have fairly similar input files. I was able to run 11 of these 12 systems for six days of wall time (in increments of 48 hours), after which each walker's HILLS file is around 150,000 KB. To postprocess with PLUMED's sum_hills facility, I wrote a script that concatenates all of these HILLS files into one (sketched after the error output below), giving a ~30-33 GB file for each system. For each of the 11 working systems, when I run sum_hills, even though I allocate an entire core to the command on the supercomputer and give it 48 hours to run, after about 40 minutes every sum_hills command returns the same output/error:
PLUMED: PLUMED is starting
PLUMED: Version: 2.8.0-dev (git: 9b35296e4) compiled on Aug 5 2021 at 09:41:35
PLUMED: Please cite these papers when using PLUMED [1][2]
PLUMED: For further information see the PLUMED web page at http://www.plumed.org
PLUMED: Root: /home1/08278/pbs12/plumed2/build/lib/plumed
PLUMED: For installed feature, see /home1/08278/pbs12/plumed2/build/lib/plumed/src/config/config.txt
PLUMED: Molecular dynamics engine:
PLUMED: Precision of reals: 8
PLUMED: Running over 1 node
PLUMED: Number of threads: 1
PLUMED: Cache line size: 512
PLUMED: Number of atoms: 1
PLUMED: File suffix:
PLUMED: Timestep: 0.000000
PLUMED: KbT has not been set by the MD engine
PLUMED: It should be set by hand where needed
PLUMED: Relevant bibliography:
PLUMED: [1] The PLUMED consortium, Nat. Methods 16, 670 (2019)
PLUMED: [2] Tribello, Bonomi, Branduardi, Camilloni, and Bussi, Comput. Phys. Commun. 185, 604 (2014)
PLUMED: Please read and cite where appropriate!
PLUMED: Finished setup
PLUMED: Action FAKE
PLUMED: with label d1
PLUMED: Action FAKE
PLUMED: with label c
PLUMED: Action FUNCSUMHILLS
PLUMED: with label @2
PLUMED: with arguments d1 c
PLUMED: Output format is %14.9f
PLUMED: hillsfile : sumHills.txt
PLUMED: Doing only one integration: no stride
PLUMED: mintozero: bias/histogram will be translated to have the minimum value equal to zero
PLUMED: output file for fes/bias is : whole.dat
PLUMED:
PLUMED: Now calculating...
PLUMED:
PLUMED: reading hills:
PLUMED: doing serialread
PLUMED: opening file sumHills.txt
WARNING: IFile closed in the middle of reading. seems strange!
./submit0.sh: line 8: 74678 Aborted /home1/08278/pbs12/plumed2/build/bin/plumed sum_hills --hills sumHills.txt --mintozero --kt 0.0258724832 --min 0,0 --max 80,4.5 --bin 801,46 --outfile whole.dat
terminate called after throwing an instance of 'PLMD::Plumed::std_bad_alloc'
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 0 Summing hills 1 2505.670629 2505.670629 2505.670629 2505.670629
Some of the systems instead output the following:
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 0.001051 0.001051 0.001051 0.001051
None of my sum_hills commands returned a free-energy surface. I also tried reducing the file size by two-thirds, running sum_hills only on the hills from the first 48-hour run (a roughly 10 GB file per system); I received similar errors and no free-energy surfaces.
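For reference, the concatenation step is essentially equivalent to something like the following (a minimal sketch, not my exact script; it assumes the first walker writes ../HILLS.0 and relies on PLUMED header lines starting with "#!"):
"""
#!/bin/bash
# Rough sketch of the concatenation step (assumes the first walker's file is ../HILLS.0).
# Keep one copy of the header comments, then append only the data lines of every walker.
grep '^#!' ../HILLS.0 > sumHills.txt
for f in ../HILLS.*; do
    grep -v '^#!' "$f" >> sumHills.txt
done
"""
The resulting sumHills.txt is what I pass to sum_hills in the command shown in the error output above.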
Could you shed some light on why my sum_hills jobs are crashing, even though they do start, read at least the header of the aggregate HILLS file (the output recognizes my c and d1 CVs), and have a single core at their disposal for many hours? Is it because of the large file sizes?
Related to this issue, I was curious about how to pick the rate of hill deposition for our walkers; as shown in the sample input file above, WALKERS_RSTRIDE=1000. My first priority is to converge quickly in terms of wall time, meaning that I want to need only a couple more 48-hour restarts to reach convergence. As a note, some of these are fairly complex systems that took 2-3 months to converge to sensible values with other metadynamics codes (albeit with about a tenth as many walkers, less efficient code, and slower hardware).
My second priority is to minimize the storage taken up by our HILLS files, both because of somewhat limited disk space and because the file size might be what is causing issues with our sum_hills command. This reduction in storage might be achieved by increasing WALKERS_RSTRIDE from 1000 to, say, 2000. However, I don't want to do this if it means doubling the wall time to convergence. Is there any advice/guidance you can give on this? Does increasing the rate of hill deposition always lead to faster convergence and decreased wall time, or is there some optimal middle point?
Lastly, one final question. As I mentioned earlier, 11 of the 12 systems restarted successfully (twice). However, each time I restarted the others, I could not get the remaining system to restart, because of the following error:
...
PLUMED: Restarting from ..//HILLS.204: 259034 Gaussians read
PLUMED: Restarting from ..//HILLS.205: 259434 Gaussians read
PLUMED: Restarting from ..//HILLS.206: 258622 Gaussians read
PLUMED: Restarting from ..//HILLS.207: 259020 Gaussians read
PLUMED: Restarting from ..//HILLS.208:
PLUMED:
PLUMED: ################################################################################
PLUMED:
PLUMED:
PLUMED: +++ PLUMED error
PLUMED: +++ at IFile.cpp:207, function PLMD::IFile &PLMD::IFile::scanField()
PLUMED: +++ assertion failed: fields[i].read
PLUMED: +++ message follows +++
PLUMED: field multivariate was not read: all the fields need to be read otherwise you could miss important infos
PLUMED:
PLUMED: ################################################################################
PLUMED:
For some reason, out of the 224 HILLS files, HILLS.208 causes all of the other walkers to crash every time I attempt to restart this 224-walker set. An interesting facet of this is that the walker writing HILLS.208 seems to keep going: while the remaining 223 files have sizes consistent with 48 hours of running wall time (~40,000 KB), HILLS.208 is around 120,000 KB and was still being appended to over all ~144 hours of running. I tried the solution proposed here (https://groups.google.com/g/plumed-users/c/0VZexz8tD2c?pli=1) and in another support post: removing any rows without 8 entries (in our case) from HILLS.208 as well as from the other 223 files. I did this before both restarts, but it did not help; walker 208 still keeps running while all of the others crash.
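For reference, that clean-up amounted to something like the following (a rough sketch, not the exact script; the real version loops over all 224 files, and the count of 8 matches the number of columns in our HILLS files):
"""
#!/bin/bash
# Keep header comments and any data line with exactly 8 columns; drop truncated lines.
awk '/^#!/ || NF == 8' ../HILLS.208 > HILLS.208.clean
mv HILLS.208.clean ../HILLS.208
"""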
Any help on any of these issues would be greatly appreciated. Thank you for reading through this message, and please let me know if you need any other information.
Sincerely,
Pranav Sharma