Accessing and looping over the .xtc file

Bala subramanian

unread,

Jul 12, 2012, 7:27:07 AM7/12/12

to mdnalysis-...@googlegroups.com

Friends my aim is to do distance measurement of each water atom with each protein atom and do some analysis. So basically I have to loop over each water coordinates with each protein coordinates. I have a gromacs .xtc file to work with. I tried the looping just for a single frame with two different codes that have attached as text file.

In brief, i tried to access the co-ordinates using both MDAnalysis.Universe and libxdrfile.openxtc routines and i found that the execution time for the looping with libxdrfile.readxtc routine is faster

1) I feel using MDAnalysis.Universe routine is more convenient than libxdrfile routines as with one code i can work on any trajectory. If you have any suggestion to improve the efficiency of looping in code 1, It would be of great help.

2) As far as using libxdrfile, i have several queries as below.

a) In the following, why we create XTC object and then pass it to read_xtc method. What is the role of XTC, is it something like a file handle.

XTC = xdrfile_open(xtc, 'r')

status,step,time,prec = read_xtc(XTC, box, x)

b) how i can read only specific frames using the "read_xtc" method. Suppose i have 5000 frames, i need to skip first 500 frames, how can i do the same ?

Thanks,

Bala

code.py

francesco oteri

unread,

Jul 12, 2012, 7:58:06 AM7/12/12

to mdnalysis-...@googlegroups.com

Hi bala,

can you use a single frame trajectory as test case?

I suspect that in the first case the system loads the entire trajectory transforming it in a list (or making it manageble as a list), while in the second case you load the trajectory frame by frame.

As test case 2, you can try to run multiple times the code 1 with an increasing number of steps: what I expect, if the program loads the entire trajectory at the beginning, is a linear trend with the form a+bx where:

a = time to load the trajectory (fix)

b = time to process a frame inside the loop

x = frame count

Francesco

2012/7/12 Bala subramanian <bala.bi...@gmail.com>

--
You received this message because you are subscribed to the Google Groups "MDnalysis discussion" group.
To post to this group, send email to mdnalysis-...@googlegroups.com.
To unsubscribe from this group, send email to mdnalysis-discus...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/mdnalysis-discussion?hl=en.

--
Cordiali saluti, Dr.Oteri Francesco

Bala subramanian

unread,

Jul 12, 2012, 8:43:09 AM7/12/12

to mdnalysis-...@googlegroups.com

Hi Oteri,

Thanks for the suggestion. I created a trajectory of single frame trajectory and then ran the looping with both the codes for three times as cpu time may depend on other processes running on the machine. The timings (in seconds) are as follows.

code1: 117.85, 123.17, 120.25 , code2: 55.97, 54.01, 52.26

C. Balasubramanian

francesco oteri

unread,

Jul 12, 2012, 8:50:01 AM7/12/12

to mdnalysis-...@googlegroups.com

So,

for sure there is some overhead in the code independent by the trajectory lenght.

Developers for sure know how to solve the problem, is it?

Tyler Reddy

unread,

Jul 12, 2012, 12:36:17 PM7/12/12

to mdnalysis-...@googlegroups.com

There was some discussion about improving the way that XTC files are read because it does take a while to load a sizeable trajectory into a Universe object. I routinely have code that waits several minutes to load a trajectory. I haven't messed around with the lower-level function described here though.

Oliver Beckstein

unread,

Jul 12, 2012, 1:21:37 PM7/12/12

to mdnalysis-...@googlegroups.com

On 12 Jul, 2012, at 09:36, Tyler Reddy wrote:

> There was some discussion about improving the way that XTC files are read because it does take a while to load a sizeable trajectory into a Universe object.

As soon as you try to get the length of a XTC trajectory (u.trajectory.numframes [1] or len(u.trajectory)) we have to read the WHOLE trajectory – XTC does not store the length, so the only way to currently get it is to read it frame by frame. (DCDs store the length in the header but they can be wrong if your simulation ended prematurely...).

As soon as you use any kind of slicing (u.trajectory[5000::100]) we have to know the total trajectory length. Furthermore, XTC cannot do random access to frames, so we have to slowly read-forward.

Bottomline:

1) simple frame-by-frame processing of XTC through the trajectory interface "should" not be slower than using xdrlib

for ts in u.trajectory:
analyze(ts)
...

2) As soon as you do things such as

u.trajectory[123]

for ts in u.trajectory[10::5]
...

things will incur a startup penalty (reading the FULL trajectory once) and might also be slower than expected because we can only get to individual frames by forward-reading through the FULL trajectory.

[1] http://packages.python.org/MDAnalysis/documentation_pages/coordinates/XTC.html?highlight=xtc#MDAnalysis.coordinates.xdrfile.XTC.XTCReader.numframes

> 2) As far as using libxdrfile, i have several queries as below.
>
> a) In the following, why we create XTC object and then pass it to read_xtc method. What is the role of XTC, is it something like a file handle.

yes

>
>
> XTC = xdrfile_open(xtc, 'r')
>
> status,step,time,prec = read_xtc(XTC, box, x)
> b) how i can read only specific frames using the "read_xtc" method. Suppose i have 5000 frames, i need to skip first 500 frames, how can i do the same ?

only by forwarding – improvements are welcome and there was a thread on the devel mailing list about some ideas related to better XTC reading. You can also file a feature request through the Issue Tracker, but please be as a specific as possible.

Oliver

Oliver Beckstein * orbe...@gmx.net
skype: orbeckst * orbe...@gmail.com

Naveen Michaud-Agrawal

unread,

Jul 12, 2012, 1:52:06 PM7/12/12

to mdnalysis-...@googlegroups.com

The issue with the xtc format is that it is a compressed format using
reduced precision for the coordinates (see
http://manual.gromacs.org/online/xtc.html).

There are a number of other tricks used to save space. For example,
atoms close in sequence are usually close in space (especially water),
so the xtc format does a distance comparison between sequential atoms
and if the difference is small enough only a delta is stored (and
often times the delta can be further compressed to only take a single
machine variable). As a result, each frame can take a different amount
of space in the trajectory and there is no way to have random access.
Unfortunately the developers of the format never incorporated an index
into the timesteps.

Naveen

-----------------------------------
Naveen Michaud-Agrawal

Bala subramanian

unread,

Jul 12, 2012, 1:54:36 PM7/12/12

to mdnalysis-...@googlegroups.com

Hi oliver,

Thank you for your comments. I used the attached code to calculate the distance of each water oxygen to each protein atom and then correct the distance to account for water that has crossed the box edge ( i used an unimaged trajectory here). The trajectory has just 1 frame, 5886 water oxygen and 1440 protein atoms. It takes about 367.27 seconds (~6min) for the looping. Wud you suggest some faster way to this distance calculation. For instance, is there way to use KDTtree to obtain the distance vector.

Thanks,

Bala

code.py

Oliver Beckstein

unread,

Jul 12, 2012, 5:44:21 PM7/12/12

to mdnalysis-...@googlegroups.com

On 12 Jul, 2012, at 10:54, Bala subramanian wrote:

> Thank you for your comments. I used the attached code to calculate the distance of each water oxygen to each protein atom and then correct the distance to account for water that has crossed the box edge ( i used an unimaged trajectory here). The trajectory has just 1 frame, 5886 water oxygen and 1440 protein atoms. It takes about 367.27 seconds (~6min) for the looping. Wud you suggest some faster way to this distance calculation.

The distance_array http://packages.python.org/MDAnalysis/documentation_pages/analysis/distances.html?highlight=distance_array#MDAnalysis.analysis.distances.distance_array will certainly be faster and can deal with simple orthorhombic boxes.

> For instance, is there way to use KDTtree to obtain the distance vector.

The KD-tree module does not have support for periodic boundaries. I'm not sure if you can get the distances after a "list_search" – the docs http://packages.python.org/MDAnalysis/documentation_pages/KDTree/KDTree.html#MDAnalysis.KDTree.KDTree.KDTree.list_search say "radii NOT available". I certainly never used it in such a way. One has to look at the code itself to find out what one could do.

Bala subramanian

unread,

Jul 13, 2012, 10:17:56 AM7/13/12

to mdnalysis-...@googlegroups.com

Hi oliver,

I am updating on the same issue so that it can be useful for future users.

I made the distance measurement of between two set of atoms (5886 water and 1440 protein atoms) by using the 1) normal looping over water and protein atoms and 2) using the distant_array routine.

In the first case, the cpu time for execution is : 235.73s while for the second case it is: 0.08s. (The code is attached). This is really a significant difference. Thank you oliver and others for your comments and suggestions.

Bala

--
You received this message because you are subscribed to the Google Groups "MDnalysis discussion" group.
To post to this group, send email to mdnalysis-...@googlegroups.com.
To unsubscribe from this group, send email to mdnalysis-discus...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/mdnalysis-discussion?hl=en.