haMSMs with progress coordinates?

185 views
Skip to first unread message

Hayden Scheiber

unread,
Nov 7, 2023, 6:48:26 PM11/7/23
to westpa-users
Hello WESTPA community,

I'm interested in implementing history augmented Markov State Modeling (haMSM) to calculate rate constants from transient data. I am a bit confused by the documentation on the haMSM module for WESTPA. The tutorial mentions that

"dimensionality reduction is highly system-specific, so no general procedure is distributed with the plugin. Instead, the user is required to define a function which takes in an array of full-atomic coordinates of shape (n_segments, n_atoms, 3), perform the desired dimensionality reduction, and then return the reduced coordinates in an array of shape (n_segments, n_features)."

I am studying protein/protein binding with a 2-dimensional progress coordinate. The pcoords are (1) protein-protein distance and (2) solvent accessible surface area of the binding site. Would it not make sense to use my already-computed progress coordinates as the reduced coordinates for haMSM? Or do I need more features for effective clustering? The docs seems to suggest using either PCA or VAMP on the relative coordinates for the reduced coordinates.

Thanks,
Hayden

Daniel Zuckerman

unread,
Nov 8, 2023, 6:50:57 PM11/8/23
to westpa-users
Hayden, that's a very reasonable question.

My suggestion would be to use a more complete set of coordinates for clustering (which can also include your progress coordinates).  The reason is that, although we all try to build the most intuitive progress coordinates, there is no guarantee that that those coordinates are ideal for separating kinetically distinct regions of configuration space.  And what are the chances to perform that task well with only two coordinates?

More generally, the haMSM will provide the true rate only if trajectories are steady-state distributed within clusters (bins/states).  Because typical WE simulations do not reach steady state (a flux vs time graph should tell you), there is benefit to using smaller clusters based on transient data.  See https://pubs.acs.org/doi/10.1021/acs.jctc.0c00273

You can also restart from a calculated haMSM steady state estimate.  See https://pubs.acs.org/doi/10.1021/acs.jctc.1c01154

As always, I would suggest using multiple runs to check the reliability of any calculations you do.  

I would be more than happy to discuss this and review your findings in a data club.

--Dan

From: westpa...@googlegroups.com <westpa...@googlegroups.com> on behalf of Hayden Scheiber <hayden...@gmail.com>
Sent: Tuesday, November 7, 2023 3:48 PM
To: westpa-users <westpa...@googlegroups.com>
Subject: [EXTERNAL] [westpa-users] haMSMs with progress coordinates?
 
--
You received this message because you are subscribed to the Google Groups "westpa-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to westpa-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/westpa-users/b01c6f80-791c-4a35-b03c-c9fc654ca5b5n%40googlegroups.com.

Hayden Scheiber

unread,
Nov 8, 2023, 8:47:34 PM11/8/23
to westpa-users
Thank you Dan, that clears things up. I also sent you a direct email regarding the data club.
I suppose in my case, the strategy is to throw together a mixture of features describing each protein's configuration separately (PCA or VAMP on the C-alpha atoms seems appropriate, how how many dimensions to reduce by?), as well as features that describe the orientation and contact of the two proteins relative to each other. I have a few collective variables in mind for those.


Some technical questions I have regarding use of the haMSM module in WESTPA:
  • I save my protein coordinates as trajectories to per iteration HDF5 files, while the restart files (also saved to per iteration HDF5 files) contain full system data. Does the haMSM module play nicely with the HDF5 framework? If so, how to choose the struct_filetype field for the haMSM plugin in west.cfg when using the HDF5 framework?
  • The tutorial 7.7 example provides a restart_overrides.py file with two functions, processCoordinates and reduceCoordinates, yet the westpa docs only mention the need for processCoordinates. Are both functions needed?
  • It is more efficient for me to compute all of the reduced coordinates needed for haMSM during the progress coordinates calculation stage (don't need to re-load coordinates into MDAnalysis/mdtraj) and save these as auxiliary data sets. Is it possible to skip the processCoordinates function and just pull reduced coordinates from westpa aux datasets for haMSM analysis? How would this be done?

Thanks,
Hayden


Hayden Scheiber

unread,
Nov 14, 2023, 1:09:26 PM11/14/23
to westpa-users
Another technical question I have regarding the haMSM module. In the haMSM documentation, there is the following paragraph:

"Also in this file, west.data.data_refs.basis_state MUST point to $WEST_SIM_ROOT/{basis_state.auxref} and not a subdirectory if restarts are being used. This is because when the plugin initiates a restart, start_state references in $WEST_SIM_ROOT/restartXX/start_states.txt are set relative to $WEST_SIM_ROOT. All basis/start state references are defined relative to west.data.data_refs.basis_state, so if that points to a subdirectory of $WEST_SIM_ROOT, those paths will not be accurate."

Is this still the case? Does this mean I need to have all of my basis states folders, and bstates.txt, directly in $WEST_SIM_ROOT? My simulation begins with ~100 basis states, and this seems like an odd design choice to force them into the outermost directory where other westpa files and scripts also reside.

Best,
Hayden

Jeremy Leung

unread,
Nov 15, 2023, 1:24:16 PM11/15/23
to westpa-users
Hi Hayden,
  • I save my protein coordinates as trajectories to per iteration HDF5 files, while the restart files (also saved to per iteration HDF5 files) contain full system data. Does the haMSM module play nicely with the HDF5 framework? If so, how to choose the struct_filetype field for the haMSM plugin in west.cfg when using the HDF5 framework?
I believe there were efforts to streamline the process but wasn't implemented. You might have to manually crawl/create your own 'coord' auxdata set, which the rest of the plugin will use to build the MSM. msm_we relies on the 'coord' aux dataset in the main h5 file. If that dataset exists, the plugin will skip saving for that iteration.
  • The tutorial 7.7 example provides a restart_overrides.py file with two functions, processCoordinates and reduceCoordinates, yet the westpa docs only mention the need for processCoordinates. Are both functions needed?
`reduceCoordinates` is used when trying to copy the (xyz) coordinates into the west.h5 as the 'coord' auxdata. Say you don't need the water coordinates, you can remove them using this function. processCoordinates are used when featurizing, which is during the MSM building process. Using reduceCoordinates is highly encouraged because loading unnecessary coordinates will drastically increase RAM requirements when building the MSM, which is already very high due to the large amounts of data needed.

  • It is more efficient for me to compute all of the reduced coordinates needed for haMSM during the progress coordinates calculation stage (don't need to re-load coordinates into MDAnalysis/mdtraj) and save these as auxiliary data sets. Is it possible to skip the processCoordinates function and just pull reduced coordinates from westpa aux datasets for haMSM analysis? How would this be done?
That is exactly how the whole process works. It reads the coordinates from the 'coord' auxdata set for featurization. You can include your already featurized datasets here. In processCoordinates, just return the coordinates as is.
  • For the Restarting plugin folder structure:
It is still the case. It's because the restarting plugin has a specific file structure which is build upon $WEST_SIM_ROOT . If you have a lot of bstates, you can actually circumvent it by directly appending your path to bstates.txt auxref column.  For example, if you do the following, $WEST_STRUCT_DATA_REF will now point to $WEST_SIM_ROOT/bstates/{01..100}, where you have your 100 bstate folders in $WEST_SIM_ROOT/bstates.

Say in west.cfg you have the following:
  west:
    data:
      data_refs:
        basis_state: $WEST_SIM_ROOT

My $WEST_SIM_ROOT/bstates.txt  file (which you can put whever, just make sure you point to the right path when calling w_init):
b01 0.25 bstates/01
b02 0.25 bstates/02
b03 0.25 bstates/03
b04 0.25 bstates/04
...

Hope that helps!

-- JL

Hayden Scheiber

unread,
Nov 15, 2023, 5:03:02 PM11/15/23
to westpa-users
Hi Jeremy,

Thanks for the detailed response, this is really helpful! 
I had abandoned using the per-iteration HDF5 framework with haMSMs because it seemed like I needed to load coordinates into msm_we module from trajectory files directly. I will try saving the "coord" aux dataset within the same script as the progress coordinate generation for westpa. If my understanding is correct, this will allow the haMSM module to skip running reduceCoordinates. Then I can make processCoordinates simply pass the input coord array to the output without modification.

But then that begs the question, will the haMSM restart module be able to generate new start states while using the per-iteration HDF5 framework? In my case, restart files are full-precision gromacs trajectory (.trr) files that are stored within per-iteration HDF5 files. How does the haMSM restart module pull out start states?

Cheers,
Hayden

Leung, Jeremy

unread,
Nov 15, 2023, 5:19:18 PM11/15/23
to westpa...@googlegroups.com
Hi Hayden,

I recall (it's been a while since I used the restarting plugin to the full extent) but it actually writes pdb files (or other file types) from the coord dataset. https://github.com/westpa/westpa/blob/fd595ec58f9b5331b7ddf4e4dca6799ffb7ae6bd/src/westpa/westext/hamsm_restarting/restart_driver.py#L1029

I guess that's a reason to keep the water in your coord dataset (or somehow re-solvate the system in gen_istates).


--JL

--
Jeremy M. G. Leung
PhD Candidate, Chemistry
Graduate Student Researcher, Chemistry (Chong Lab)
University of Pittsburgh | 219 Parkman Avenue, Pittsburgh, PA 15260
jml...@pitt.edu | [He, Him, His]


Sent: Wednesday, November 15, 2023 5:03:05 PM
To: westpa-users <westpa...@googlegroups.com>
Subject: Re: [westpa-users] haMSMs with progress coordinates?

Hayden Scheiber

unread,
Nov 15, 2023, 7:10:25 PM11/15/23
to westpa-users
Hi Jeremy,

That is unfortunate. It would be much simpler to retrieve the restart file(s) for the corresponding start states. 
Then I could just "continue" the simulations without needing to generate new istates from them.

So now I will need to either: 
  1. Save all atom coordinates, including solvent molecules, to the "coord" aux dataset (west.h5 will become huge). Then have the haMSM module write these coords into pdb files representing start states, then convert these to gro files and add velocities in gen_istates. I would also separately save the processed features to be passed to the haMSM module into a separate aux dataset, and load them with processCoordinates.
  2. Save just the protein coordinates to the "coord" aux data (still a decently large dataset), then load these coords, solvate and relax the system with gen_istates (slow and inefficient). As with (1), I would also need to separately save any features I want to pass to the haMSM module into a separate aux dataset and load these with processCoordinates.
  3. Somehow modify restart_driver.prepare_new_we to load trajectory restart files (in my case just a seg.trr file) rather than generate a pdb file based on the "coord" dataset. If I go this route I can just save the haMSM features directly to the "coord" aux data set, and I don't need to use gen_istates.

I would love to use option 3 as it would be most efficient. Any suggestions on how I could implement this modification?

Thanks,
Hayden

Leung, Jeremy

unread,
Nov 16, 2023, 10:59:25 AM11/16/23
to westpa...@googlegroups.com
Hi Hayden,

I know it's a great idea. I also suggested it a few years back but John Russo spent a lot of time and effort on this already. :) It's non-trivial to implement on different file types and folder structures, especially with the goal of being very modular!

2 Ways: 

1) John Russo worked on something that grabs directly from HDF5 Framework. If allows users to "Allow loading basis-states directly from an HDF5 file". Obviously it's a closed PR so I don't know how well it worked.


2) Drop the complete "automated" nature of the restarting plugin and deal with all the model building/restarting yourself. This was what I did to get the functionality I needed. Grab the H5 files and you can follow the msm_we user guide on how to build your own haMSM in a jupyter notebook. Then restart the simulations yourself. MSM-building is a very trial-and-error process anyways.

You will have to write your own code to prep the restart files, but you can base it off of John Russo's code in src/westpa/westext/hamsm_restarting/restart_driver.py. 

Hope this helps.

-- JL

---

Jeremy M. G. Leung
PhD Candidate, Chemistry
Graduate Student Researcher, Chemistry (Chong Lab)
University of Pittsburgh | 219 Parkman Avenue, Pittsburgh, PA 15260
jml...@pitt.edu | [He, Him, His]

Hayden Scheiber

unread,
Nov 16, 2023, 1:16:43 PM11/16/23
to westpa-users
Hi Jeremy,

Thanks for the reply. I will go with your second option and build up the functionality I need myself. This will help me better understand the details under the hood anyway. Thanks for all your help on this!

Cheers,
Hayden

Hayden Scheiber

unread,
Nov 22, 2023, 8:09:31 PM11/22/23
to westpa-users
Hello Jeremy,

I have began working with the msm_we python library directly, rather than the more restricted haMSM plugin for westpa. I've managed to build my own haMSM model from my westpa dataset (I actually forked the msm_we repo and modified the code to directly import features from the coord aux dataset), and now would like to use my haMSM model to generate a new set of start states, approximating the steady state distribution from my haMSM.

As you mentioned, I can follow John Russo's code in src/westpa/westext/hamsm_restarting/restart_driver.py to achieve this. However, I am a bit stuck. In the RestartDriver class, the initialization function has an input, the sim_manager from a previously completed WESTPA run. How do I pull this sim_manager object from a completed WESTPA simulation, outside of the context of a WESTPA plugin? Can I just load it from the west.h5 file somehow?

Thanks,
Hayden

Lillian Chong

unread,
Nov 28, 2023, 1:29:11 PM11/28/23
to westpa...@googlegroups.com
Hi Hayden,

I hope you had a nice Thanksgiving! We are just getting back into things after the Thanksgiving break. 
Since your question is getting into WESTPA development, Jeremy is going to follow up with you online.

As an open-source software project, we welcome code contributions to WESTPA from our community of users:

All the best,
Lillian

Reply all
Reply to author
Forward
0 new messages