reading in subsets of data from trajectories

1,046 views
Skip to first unread message

Rob Arbon

unread,
Jun 17, 2021, 1:11:01 PM6/17/21
to MDnalysis discussion
Hi folks, 

Loving MDAnalysis but I do have a question.   I would like to have control over how trajectories are read in. I would like to do something like: 

u = Universe(topology, trajectories, start=200, stop=1000, skip=10)

So for each trajectory we only read in from the 200th frame, stop at frame 1000 and only keep every 10th frame.

Additionally, I would also like to keep only a subset of coordinates, e.g., chain A and B.

Ultimately I would like to use this with PCA and the RDF functions.

Can anyone explain how I might do this?

Many thanks,

Rob

Richard Gowers

unread,
Jun 17, 2021, 1:21:12 PM6/17/21
to Mdnalysis-Discussion
Hi Rob,

Welcome to the mailing list.  By default, all frames aren't read in, and instead only one frame of coordinates is ever "active" at one time.  Which frame is currently active is manipulated using the "u.trajectory" attribute, so to do your desired slice would be something like:

u = mda.Universe(..., ...)
ag = u.select_atoms('something')
for ts in u.trajectory[200:1000:10]:
    print(ag.positions)  # <-- this will be iterating over the frames you've described in your message

For PCA and RDF functions, we have analysis classes that handle that, these will take the desired atomgroups and also accept a slice of the trajectory to the ".run()" method, so something like:

u = mda.Universe(..., ...)
ag1 = u.select_atoms('x')
ag2 = u.select_atoms('y')
rdf
= InterRDF(ag1, ag2) rdf.run(start=200, stop=1000, step=10)

https://docs.mdanalysis.org/stable/documentation_pages/analysis/rdf.html#average-radial-distribution-function

https://docs.mdanalysis.org/stable/documentation_pages/analysis/pca.html

There isn't currently a way to only read in certain selections of a Universe.  One way to achieve this would be to read in the whole Universe and then write out your subsection and reread these files.

Hopefully that's enough to get you started :)
Richard

--
You received this message because you are subscribed to the Google Groups "MDnalysis discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mdnalysis-discus...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/mdnalysis-discussion/548716c0-745b-4f98-b3a6-f8c789988822n%40googlegroups.com.

Irfan Alibay

unread,
Jun 17, 2021, 1:25:35 PM6/17/21
to MDnalysis discussion
Just as a follow-up to Richard. the materials from our recent workshop go through trajectory manipulation, and might be of interest here: https://github.com/MDAnalysis/WorkshopPrace2021/tree/main/Day1-Session2-Practical

- Irfan

Rob Arbon

unread,
Jun 17, 2021, 1:32:10 PM6/17/21
to MDnalysis discussion
Thanks for that! 

That almost does what I want.  I have  say, 10 trajectories, and I would like to skip the first 200 frames of each trajectory.  I see that the universe object concatenates trajectories - is there a way of getting them to be treated as independent?

Thanks,

Rob

Irfan Alibay

unread,
Jun 17, 2021, 1:41:32 PM6/17/21
to MDnalysis discussion
I'm not 100% sure if I'm interpreting the question properly, but if you want to just have independent trajectories, the best way to do this is just to create a separate Universe object for each trajectory.

Rob Arbon

unread,
Jun 17, 2021, 2:56:02 PM6/17/21
to MDnalysis discussion
Ah ok.  I'm talking about the situation where you might have a lot of trajectories, all indepedent random velocity seeds, but they start from the same crystal structure. Thus you want to ignore the first few hundred frames as they're all too similar.  But you want all the trajectories in the same universe because you'd like to compute, e.g., RDF using all of the data.

Oliver Beckstein

unread,
Jun 17, 2021, 3:54:51 PM6/17/21
to mdnalysis-discussion

On Jun 17, 2021, at 10:20 AM, Richard Gowers <richard...@gmail.com> wrote:

There isn't currently a way to only read in certain selections of a Universe.  One way to achieve this would be to read in the whole Universe and then write out your subsection and reread these files.

What Richard said covers 98% of normal use cases: Don’t worry about reading the whole length about the trajectory (because MDAnalysis allows you to limit the part of the trajectory that you’re iterating over by slicing or start/stop/step arguments for analysis classes) and don’t worry about only loading part of the system (because you use selections to indicate what to operate on).


For the remaining 2% of use cases there are some advanced approaches to only reading parts of a trajectory:

1) MemoryReader

This will read the data into memory. If you have enough memory, this will speed up analysis if you’re running multiple analyses on the same trajectory. 

- Unlike the case where you read each frame from disk, skipping frames matters for the MemoryReader so you can specify start/stop/step

- You can also generate a subsystem with your in-memory trajectory:


2) XTC or TRR (GROMACS) trajectory: sub kwarg

If you have a GROMACS XTC or TRR trajectory there’s the sub kwarg that allows you to only read a subset of data, see https://docs.mdanalysis.org/stable/documentation_pages/coordinates/XTC.html and https://docs.mdanalysis.org/stable/documentation_pages/coordinates/TRR.html ; if you want to use it and don’t find the documentation on this feature  particularly clear (because it is not…) just ask here.


There’s no general `sub` kwarg for other coordinate formats. It only exists for XTC/TRR because it was added a long time ago to support some special downstream application and we never removed it again. 

If there’s widespread interest to have this functionality more generally supported then discuss it here on the mailinglist together with some use cases that would motivate such a feature.

Oliver


--
Oliver Beckstein (he/his/him)


MDAnalysis – a NumFOCUS fiscally sponsored project




Oliver Beckstein

unread,
Jun 17, 2021, 5:23:50 PM6/17/21
to mdnalysis-discussion


On Jun 17, 2021, at 11:56 AM, Rob Arbon <robert...@gmail.com> wrote:

Ah ok.  I'm talking about the situation where you might have a lot of trajectories, all indepedent random velocity seeds, but they start from the same crystal structure. Thus you want to ignore the first few hundred frames as they're all too similar.  But you want all the trajectories in the same universe because you'd like to compute, e.g., RDF using all of the data. 

Currently you’d generate a separate universe for each trajectory, collect the RDFs separately, and then average them in post processing (with correct weights). This would give you error bars, too.

What you suggest is not currently possible with the RDF (or any) analysis class but for your own code you can easily slice a trajectory with an array of frames (“fancy indexing”) that masks out the initial equilibration phase. You just have to know the lengths of your individual trajectories.

There’s currently some discussion about how to handle ensembles at PR https://github.com/MDAnalysis/mdanalysis/pull/2664 — feel free to chime in.

Rob Arbon

unread,
Jun 18, 2021, 3:50:13 AM6/18/21
to MDnalysis discussion
thanks for this - the sub argument could be useful actually.  But the time slicing is the most important thing. 

Rob Arbon

unread,
Jun 18, 2021, 3:55:09 AM6/18/21
to MDnalysis discussion
Yes, ok, that sounds reasonble.  So I need to map the first 200 frames of each trajectory to their indices in the concatenated verison - and save as a new universe?  Or is there a way of doing that 'in place'? 

I'll have a look at the PR in a few weeks - I'm tied up with some time sensitive stuff at the moment, but I'm keen to contribute.

Your suggestion about separate universes is what I'm doing at the moment. 

Thanks.

Oliver Beckstein

unread,
Jun 21, 2021, 9:43:48 PM6/21/21
to mdnalysis-...@googlegroups.com


> Am 6/18/21 um 00:55 schrieb Rob Arbon <robert...@gmail.com>:
>
> Yes, ok, that sounds reasonble. So I need to map the first 200 frames of each trajectory to their indices in the concatenated verison -

When you iterate over a trajectory you can do fancy indexing. You can use a list of frame numbers or Boolean arrays. For simplicity, let’s say each trajectory has 10 frames and you want to discard the first two. Then make a Boolean array like

keep = np.ones(10, dtype=bool)
keep[:2]=False
production_frames = np.concatenate(10*keep.tolist())
for ts in u.trajectory[production_frames]:
Do_stuff()

I didn’t test the code and it’s kind of ugly but I hope you get the idea.

> and save as a new universe? Or is there a way of doing that 'in place'?

You don’t change the universe, you just slice the trajectory. This approach won’t work with typical analysis functions because they are limited to start/stop/step. (However, open an issue to suggest that analysis functions can also take an opt argument to take an array for indexing.)

Oliver
Reply all
Reply to author
Forward
0 new messages