Issue 127 in mdanalysis: Index file for xtc trajectory

mdana...@googlecode.com

unread,

Feb 4, 2013, 8:14:24 PM2/4/13

to mdnalys...@googlegroups.com

Status: New
Owner: ----
Labels: Type-Defect Priority-Medium

New issue 127 by francesc...@gmail.com: Index file for xtc trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

Hi Oliver,
In order to decrease the time needed to process an .xtc trajecotry,
it would be nice creating an index file for storing the position of each
frame. Such a file could permit the random access to each frame and,
indirectly, it will permit to know the number of frame.

Francesco

mdana...@googlecode.com

unread,

Feb 7, 2013, 6:25:49 PM2/7/13

to mdnalys...@googlegroups.com

Updates:
Labels: -Type-Defect Type-Enhancement

Comment #1 on issue 127 by orbeckst: Index file for xtc trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

I agree with you (and anyone else agreeing should star this issue!).

For future reference:

* Previous discussion (including a link to LOOS):
https://groups.google.com/d/msg/mdnalysis-discussion/PK42NMcm8a0/jCI86vcxbjAJ

* Tsjerk Wassenaar's Python implementation of reversing a XTC:
http://permalink.gmane.org/gmane.science.biology.gromacs.user/41141 (or
archived at http://www.webcitation.org/6EGTvnaAS )

Manuel Nuno Melo

unread,

Apr 18, 2013, 11:15:52 AM4/18/13

to mdnalys...@googlegroups.com, codesite...@google.com, mdana...@googlecode.com

Hi all,

Unaware of this ongoing discussion (and those in the related threads), and feeling the need to speed up XTC access, I hacked my way around implementing most of the stuff being discussed here or in the linked discussions (no on-disk cache/checksum, though).
I'll be glad to join in on any efforts to this end.

Basically my approach thus far was:
1- expose xdr_seek and xdr_tell functions via libxdrfile, to do what their name suggests. Implement corresponding _seek and _tell functions in core.py.

2- expose a quick_read_xtc function via libxdrfile that returns a tuple (status, frametime, frameoffset). This function reads the header, and, if the number of atoms is greater than 9, then reads the framesize stored a couple of bytes down the line. It then uses that value to seek to the beginning of the following frame.
If there are fewer than 9 atoms, there's no compression and the frame size only depends on the atom number. It then skips to the next frame immediately after reading the header.

3- reimplement _read_trj_numframes to iteratively call libxdrfile.quick_read_xtc and store the frame times and offsets. (it then repositions the stream back to where it was before this iteration).

4- reimplement __getitem__ using the index constructed by _read_trj_numframes and _seek for quick seeking.

Using quick_read_xtc I get a pretty good speedup in calculating the total number of frames (~700x compared to the naïve plodding, on a 180MB, 1980-frame trajectory). However, a script doing just this was still 5x slower than GMX's native trjconv (1.5 vs. 0.3s). I first thought this was due to the iterative read taking place between C and Python (an improvement will definitely be to do it all in C and pass the entire time/offset lists at the end). However, timing just the index generation under IPython shows it to be around 12ms, which means the whole module import overhead is the most expensive part and this code might actually be close to optimum speed.
(On this same note, caching the index on-disk using a checksum might also be quite expensive since in that case the entire file will actually have to be read for checksum calculation.)

I have not yet bothered doing the same for TRR files, which should be straightforward.

Accessing >2GB files seeks to work fine. I made sure offsets are stored as Python's long or numpy's int64, but can't vouch for portability away from my Linux 64-bit system. The interfacing in other cases (32bit systems come to mind) might be more complicated and I really have little clue what to cast and to which types.

Lastly, my implementation may be largely incompatible with the code style used for other formats. I have not checked those at all (is there already an index object for other formats?). I could definitely use some guidance on merging my code in, if you're interested in it.

I appreciate feedback!
Manuel

Oliver Beckstein

unread,

Apr 18, 2013, 12:02:06 PM4/18/13

to mdnalys...@googlegroups.com

Hi Manuel,

Your approach sounds like what we're looking for.

I suppose the most straightforward way to start working on integrating it would start with you publishing a clone of the current development branch on google code that contains your modifications (http://code.google.com/p/mdanalysis/wiki/DistributedDevelopment). Anyone interested can then pull your changes and look over your code. This will also allow us to consider how to integrate it – that's hard to do without looking at the code. It will also make it easy to create a feature-fastxtc branch in the main repository, which will eventually be merged into develop (http://code.google.com/p/mdanalysis/wiki/DevelopmentWorkflow).

An important consideration is to make sure that the new code does not break old scripts, i.e. the existing test cases should still pass. We would also want new test cases that exercie the new code specifically (http://code.google.com/p/mdanalysis/wiki/ContributingCode).

However, given that we're working towards a 0.8 release that could in principle contain backwards-incompatble changes (although we'd still like to avoid them) and given the large impact that your changes could make I would be quite willing to consider drastic changes to the existing code: i.e. we'll try to make this work!

If you're willing to work on this I'll make you the owner of this issue report.

I'm sure that you'll have many fans if you can make this work.

Oliver

> --
> You received this message because you are subscribed to the Google Groups "MDnalysis-devel" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to mdnalysis-dev...@googlegroups.com.
> To post to this group, send email to mdnalys...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msg/mdnalysis-devel/-/i90t5l2Mth0J.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>

--
Oliver Beckstein * orbe...@gmx.net
skype: orbeckst * orbe...@gmail.com

Manuel Nuno Melo

unread,

Apr 18, 2013, 12:13:05 PM4/18/13

to mdnalys...@googlegroups.com

Great,

I'll look into the distributed development documentation and get working on publishing my code.

Come the fandom!
Manel

mdana...@googlecode.com

unread,

Apr 18, 2013, 4:53:28 PM4/18/13

to mdnalys...@googlegroups.com

Updates:
Status: Started
Owner: manuel.n...@gmail.com

Comment #2 on issue 127 by orbeckst: Index file for xtc trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

I am happy to say that Manuel wants to integrate his fast XTC reader code
into the published code base; see
https://groups.google.com/d/msg/mdnalysis-devel/szaMMcvY3Ik/i90t5l2Mth0J
for the discussion.

Please add further comments, links to branches etc to the Issue 127 report
so that the discussion is all in one place.

-- Oliver

--
You received this message because this project is configured to send all
issue notifications to this address.
You may adjust your notification preferences at:
https://code.google.com/hosting/settings

mdana...@googlecode.com

unread,

Apr 18, 2013, 9:32:39 PM4/18/13

to mdnalys...@googlegroups.com

Comment #3 on issue 127 by manuel.n...@gmail.com: Index file for xtc
trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

Just a note on some tests I was carrying out on big trajectories (14Gb,
~90000 frames):

Indexing a trajectory this large took me 2 minutes over NFS and Gigabit
Ethernet (all my benchmarking was done after flushing the NFS cache). My
current implementation quickly seeks to the next frame via a function in
the C-layer, which then passes back the frame offset and time to Python.
The list that'll become the index is then built from these values on the
Python side.

Iterating through the trajectory fully in the C layer is 2x faster, but in
the test I did I was just seeking, not storing offsets nor frame times. I
think it's worth going the all-in-C route but we'll probably need to use
vectors --or other self resizing container-- to store the offsets since we
don't know at the beginning how many there will be.
I don't know if it's worth the hassle to also index frame times; besides,
leaving them out makes seeking a bit faster since we don't need to stop and
read the header.

Finally, comparison to GMX's trjconv proved to be very unfair. I was using
trjconv -dump 13481700 -b 13481700 [...] to seek to, and dump the last
frame, which it did in an amazing 1.2 seconds. Digging through the code (in
libxdrf.c) I found out that the implementation for the -b flag actually
performs a binary search till it finds the requested start time.
This approach for random frame access is, of course, rendered useless by a
frame offset index; still it might be worth considering it for cases where
indexing may be undesirable.

Manel

mdana...@googlecode.com

unread,

Apr 24, 2013, 8:25:47 AM4/24/13

to mdnalys...@googlegroups.com

Comment #4 on issue 127 by manuel.n...@gmail.com: Index file for xtc
trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

Ok, come and get it at https://code.google.com/r/manuelnunomelo-xtcindex/

I went ahead and implemented the whole indexing in C. The index is then
memory copied and returned as a numpy array, instead of frame-by-frame. I
used Py_INCREF on that array prior to returning it to Python, but am not
sure whether that's required.

I guess we could avoid the memory copying and have a numpy array point
directly at the offset data, but I don't know how to deal with freeing the
dynamically allocated memory in that case, upon garbage collection. Please
double check that my approach is ok.

And please bear in mind my newbiness when it comes to C (and git).

As a side note, I'm working on an extension to MDAnalysis by which a
trajectory is scanned in parallel (using multiprocessing). If you think
it'd be compatible with it I could implement something similar at the
_read_trj_numframes level to have parallel indexing.

Cheers,

mdana...@googlecode.com

unread,

Apr 24, 2013, 1:51:19 PM4/24/13

to mdnalys...@googlegroups.com

Comment #5 on issue 127 by orbeckst: Index file for xtc trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

Manuel,

can you please check the UnitTest suite?

nosetests test_coordinates:XTCReader

seems to be ok but

nosetests test_libxdrfile.py

Has one error (see below).

Also run a full

nosetest test_coordinates.py

just to ensure that nothing else is broken.

Once you've fixed the unit tests and/or your code I will pull again and
push to a feature branch feature-xtcindex in the main repository. We can
then look more closely at the code (e.g. using the code review in gcode).

Thanks,
Oliver

--------

.....E
======================================================================
ERROR: test_numframes (MDAnalysisTests.test_libxdrfile.TestXTC)
----------------------------------------------------------------------
Traceback (most recent call last):

File "/Volumes/Data/oliver/Biop/Library/python/mdanalysis/testsuite/MDAnalysisTests/test_libxdrfile.py",
line 62, in test_numframes
numframes = xdr.read_xtc_numframes(XTC)

File "/Volumes/Data/oliver/Biop/Library/python/mdanalysis/package/MDAnalysis/coordinates/xdrfile/libxdrfile.py",
line 414, in read_xtc_numframes
return _libxdrfile.read_xtc_numframes(*args)
TypeError: in method 'read_xtc_numframes', argument 1 of type 'XDRFILE *'

mdana...@googlecode.com

unread,

Apr 24, 2013, 11:57:17 PM4/24/13

to mdnalys...@googlegroups.com

Comment #6 on issue 127 by manuel.n...@gmail.com: Index file for xtc
trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

Please try again, Oliver.

I made the code comply with all the tests, or modified the tests that no
longer applied (_read_trj_numframes now returns a tuple with numframes and
the offsets array).

I am a bit wary of the use of seek/tell. I specifically casted everything
offset related as int64_t; please check if this is ok (maybe it should be
off_t instead).

I extended the indexing approach to TRRs. Doing that I noticed a bug caused
when TRRs have frames without coordinates: the coordinates form the
previous frame will be reused (ok) and re-corrected for unit difference
(not ok!). That actually applies to any property (coordinates, velocities
or forces) that is not present every frame. I fixed it by passing back from
the trr reader a set of flags indicating what has been read and should be
scaled.

I await feedback,
Manel

mdana...@googlecode.com

unread,

Apr 26, 2013, 2:16:22 AM4/26/13

to mdnalys...@googlegroups.com

Comment #7 on issue 127 by orbeckst: Index file for xtc trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

Hi Manel and anyone else willing to work on this,

I pushed a new branch 'feature-xtcindex' for people to iron out remaining
issues with the great new xdr lib – it looks very promising.

Folks, please look at the code and comment through the code review feature
on the commits in
https://code.google.com/p/mdanalysis/source/list?name=feature-xtcindex (see
Manel's request for feedback above). Once it looks ok we'll merge it into
develop (you can show support by using the +1 in the code review.)

Also, at the moment, 34 tests FAIL that used to pass before. They all have
an ERROR due 'ValueError: too many values to unpack' and many are related
to TRR (see below). Can someone please look at this, probably a single fix
will make them all go away (or 32 fixes to the test cases...).

Thanks,
Oliver

Failed tests (excluding KnownFailures):

ERROR: test_get_velocities
(MDAnalysisTests.test_atomgroup.TestAtomGroupVelocities)
ERROR: test_set_velocities
(MDAnalysisTests.test_atomgroup.TestAtomGroupVelocities)
ERROR: test_velocities
(MDAnalysisTests.test_atomgroup.TestAtomGroupVelocities)
ERROR: test_set_all_format_tuples
(MDAnalysisTests.test_coordinates.TestChainReaderFormats)
ERROR: test_TRR2NCDF (MDAnalysisTests.test_coordinates.TestNCDFWriter)
ERROR: test_coordinates
(MDAnalysisTests.test_coordinates.TestTRRNoConversion)
ERROR: test_Writer (MDAnalysisTests.test_coordinates.TestTRRReader)
ERROR: test_coordinates (MDAnalysisTests.test_coordinates.TestTRRReader)
ERROR: test_dt (MDAnalysisTests.test_coordinates.TestTRRReader)
ERROR: test_flag_convert_gromacs_lengths
(MDAnalysisTests.test_coordinates.TestTRRReader)
ERROR: test_frame (MDAnalysisTests.test_coordinates.TestTRRReader)
ERROR: test_get_Writer (MDAnalysisTests.test_coordinates.TestTRRReader)
ERROR: test_jump_lastframe_xdrtrj
(MDAnalysisTests.test_coordinates.TestTRRReader)
ERROR: test_jump_xdrtrj (MDAnalysisTests.test_coordinates.TestTRRReader)
ERROR: test_next_xdrtrj (MDAnalysisTests.test_coordinates.TestTRRReader)
ERROR: test_reverse_xdrtrj (MDAnalysisTests.test_coordinates.TestTRRReader)
ERROR: test_rewind_xdrtrj (MDAnalysisTests.test_coordinates.TestTRRReader)
ERROR: test_slice_xdrtrj (MDAnalysisTests.test_coordinates.TestTRRReader)
ERROR: test_time (MDAnalysisTests.test_coordinates.TestTRRReader)
ERROR: test_totaltime (MDAnalysisTests.test_coordinates.TestTRRReader)
ERROR: Test that xtc/trr unitcell is read correctly (Issue 34)
ERROR: test_velocities (MDAnalysisTests.test_coordinates.TestTRRReader)
ERROR: test_volume (MDAnalysisTests.test_coordinates.TestTRRReader)
ERROR: test_load_new_raises_ValueError
(MDAnalysisTests.test_coordinates.TestTRRReader_Sub)
ERROR: load solvated trajectory into universe with unsolvated protein.
ERROR: test_timestep_not_modified_by_writer
(MDAnalysisTests.test_coordinates.TestTRRWriter)
ERROR: test_velocities (MDAnalysisTests.test_coordinates.TestTRRWriter)
ERROR: Test writing Gromacs trajectories (Issue 38)
ERROR: Test writing Gromacs trajectories from AMBER NCDF (Issue 117)
ERROR: test_single_frame_CRD
(MDAnalysisTests.test_coordinates.TestTRRWriterSingleFrame)
ERROR: test_single_frame_GRO
(MDAnalysisTests.test_coordinates.TestTRRWriterSingleFrame)
ERROR: test_single_frame_PDB
(MDAnalysisTests.test_coordinates.TestTRRWriterSingleFrame)
ERROR: testForces (MDAnalysisTests.test_velocities_forces.TestTRRForces)
ERROR: testForces
(MDAnalysisTests.test_velocities_forces.TestTRRForcesNativeUnits)

mdana...@googlecode.com

unread,

Apr 26, 2013, 8:00:26 PM4/26/13

to mdnalys...@googlegroups.com

Comment #8 on issue 127 by manuel.n...@gmail.com: Index file for xtc
trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

Hi,

The errors are indeed solvable by fixing only two points in the code
(libxdrfile.read_trr now returns three extra boolean values in the tuple).
How should I go about adding this to the code? Can I git push to the
feature-xtcindex branch?

I'm also not quite sure where to comment on code changes. The merge ended
up capturing my previous changes to my own code and it isn't very clear at
each revision what changed relative to the preexisting code. I apologize if
I should have done things differently.

Cheers,
Manel

mdana...@googlecode.com

unread,

Apr 26, 2013, 8:49:12 PM4/26/13

to mdnalys...@googlegroups.com

Comment #9 on issue 127 by orbeckst: Index file for xtc trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

Hi Manel,

> The errors are indeed solvable by fixing only two points in the code
> (libxdrfile.read_trr now returns three extra boolean values in the
> tuple). How should I go about adding this to the code? Can I git push to
> the feature-xtcindex branch?

In your own clone:

git remote add mdanalysis http://code.google.com/p/mdanalysis/
git fetch --all
git merge mdanalysis/feature-xtcindex

That will update your local clone. (See also
https://code.google.com/p/mdanalysis/wiki/DistributedDevelopment ).

Then add your changes and push to your clone at google code. I will pull
the changes from you, run test cases locally, and push it out to the
feature-xtcindex branch. Once everything looks ok and does not immediately
break people's code we'll merge it into develop. Because it's a pretty big
step forward I want to give other developers a chance to play with the
code. (Quite a few of us seem to use the develop branch for serious work so
I don't want to break it lightly...)

> I'm also not quite sure where to comment on code changes. The merge ended
> up capturing my previous changes to my own code and it isn't very clear
> at each revision what changed relative to the preexisting code. I
> apologize if I should have done things differently.

Don't worry; maybe other people might want to comment on the code. One has
to click through the diffs of the individual files and comment on
individual lines. I simply looked at commits that had your name on it.

Thanks a lot for your efforts, it's a very valuable contribution to
MDAnalysis and perhaps even people beyond MDAnalysis might be interested in
using the upgraded xdrlib.

Oliver

mdana...@googlecode.com

unread,

Apr 26, 2013, 9:17:29 PM4/26/13

to mdnalys...@googlegroups.com

Comment #10 on issue 127 by manuel.n...@gmail.com: Index file for xtc
trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

Perfect, thanks for guiding me.

I have updated my clone; you can now pull the changes from it.

Manel

mdana...@googlecode.com

unread,

Apr 26, 2013, 9:42:35 PM4/26/13

to mdnalys...@googlegroups.com

Comment #11 on issue 127 by orbeckst: Index file for xtc trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

Thanks, Manel. Looking good, I pushed the fix to feature-xtcindex.

Developers: Please test it; if I don't hear any complaints over the next
week I will merge it into develop for inclusion into the 0.8 release (and
close this issue).

— Oliver

mdana...@googlecode.com

unread,

Apr 27, 2013, 5:54:52 AM4/27/13

to mdnalys...@googlegroups.com

Comment #12 on issue 127 by manuel.n...@gmail.com: Index file for xtc
trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

It'd be great if someone could test this on large files (>4GB) on a 32-bit
system (don't have any at hand myself), just to check whether there are any
overflow problems with the offset storage/passing.

Manel

mdana...@googlecode.com

unread,

May 6, 2013, 12:24:23 PM5/6/13

to mdnalys...@googlegroups.com

Updates:
Labels: -Priority-Medium Priority-High

Comment #13 on issue 127 by orbeckst: Index file for xtc trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

Manuel,

Your XTC/TRR improvements are awesome as far as I can tell.

We're having a hard time finding 32-bit systems to test on, though, so if
anyone out there is still running vintage boxes (especially Linux), PLEASE
test this so that we can get this code merged ASAP. Some notes on how to
distinguish 32/64 bit O/S are below.

In order to get the development branch do the following (see also
https://code.google.com/p/mdanalysis/wiki/DevelopmentBranch )

git fetch --all
git checkout -b feature-xtcindex origin/feature-xtcindex

Then install, e.g.

cd package
python setup.py develop

You'll then need to find a trajectory with file size greater than 4 GB,
e.g. "big.xtc" and it's "big.pdb":

import MDAnalysis
u = MDAnalysis.Universe("big.pdb", "big.xtc")
print(len(u.trajectory)) # this might take a few minutes but not
hours as before
print(u.trajectory[0])
print(u.trajectory[-10]) # this should be instantaneous
print([(ts.frame, u.trajectory.time) for ts in
u.trajectory[-10:-100:-5]]) # this should work and be fast

Check that the output makes sense. Report the file size (in MiB) and
approximately how long it took to run the above commands. Report if
anything does not work or if the output seems wrong.

Thanks,
Oliver

LINUX

http://www.cyberciti.biz/faq/linux-how-to-find-if-processor-is-64-bit-or-not/

Operating System:

uname -a

If it says x86_64 it is 64 bit, if it's i386, i486, i586, i686 it's 32 bit
O/S.

Processor:

grep flags /proc/cpuinfo

• lm flag means Long mode cpu - 64 bit CPU
• Real mode 16 bit CPU
• Protected Mode is 32-bit CPU

MAC OS X

On Mac OS X: try

sysctl -a hw

(although I don't know what to look for when looking for 32
bit. "hw.cpu64bit_capable: 1" seems to say I have a 64-bit capable cpu)

system_profiler

Maybe look for "64-bit Kernel and Extensions"? Alternatively graphically:
http://support.apple.com/kb/ht3696
http://apple.stackexchange.com/questions/23713/how-do-i-figure-out-if-my-os-is-mac-os-x-32-bit-or-64-bit
also
says that a 64-bit cpu with 32-bit kernel will run 64-bit software.

Thus, you'd probably need a fairly old Mac OS X (old Core processor with
Mac OS X Tiger or something...).

mdana...@googlecode.com

unread,

May 6, 2013, 12:37:47 PM5/6/13

to mdnalys...@googlegroups.com

Comment #14 on issue 127 by sebastie...@gmail.com: Index file for xtc
trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

Well, I don't have any 32-bit linux but I will do a 32-bit chroot to test
that with a nearly-5GB-fat xtc file of mine.
Probably won't have time to do that before tomorrow, though.
Thanks Manuel for your work,

Séb

mdana...@googlecode.com

unread,

May 10, 2013, 3:16:45 AM5/10/13

to mdnalys...@googlegroups.com

Comment #15 on issue 127 by peyser.a...@gmail.com: Index file for xtc
trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

Very fast --- much faster than the format naive indexing I was writing this
week (!!!). One request though: currently in the TrjReader, the offsets are
all obscured with name mangled properties (double underscores).

For 10s of gigabytes, this is reasonable enough, since the indexing takes
about a minute. But if you have some reason to scale up to really large
data sets of 100s or 1000s of gigibytes, being able to index once, and save
and restore, would be much better. That suggests just a single underscore
for the offset member since there is no method interface to access it and
set it.

mdana...@googlecode.com

unread,

May 10, 2013, 6:16:54 AM5/10/13

to mdnalys...@googlegroups.com

Comment #16 on issue 127 by manuel.n...@gmail.com: Index file for xtc
trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

Offset loading/saving routines should be straightforward to implement (but
it's not just accessing the mangled variables: __numframes must also be
set, otherwise a re-indexing might be triggered later). I'll take a look at
it.

I think it's a nice idea, at least for now, to leave it up to the user to
manage the saving/loading of offsets. I think it may be hard to automate
the RightThing for all situations.

Manel

mdana...@googlecode.com

unread,

May 11, 2013, 4:05:05 AM5/11/13

to mdnalys...@googlegroups.com

Comment #17 on issue 127 by orbeckst: Index file for xtc trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

Manel,

if you think it is doable/sensible then you could add some experimental
get/set methods to allow advanced users to manipulate the index. Send me a
pull request if you have changes that should be merged.

I am still waiting for feedback from 32-bit systems. Once that looks good I
will merge into develop.

Btw, note that this issue is now the second-most anticipated feature in
MDAnalysis, judged by the "star" ranking, so you're clearly working on
something that people have been waiting for :-)

Oliver

mdana...@googlecode.com

unread,

May 11, 2013, 9:35:57 PM5/11/13

to mdnalys...@googlegroups.com

Comment #18 on issue 127 by manuel.n...@gmail.com: Index file for xtc
trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

Alex, Oliver,

I added functions 'save_offsets' and 'load_offsets' to the xdr TrjReader
class. Check it at https://code.google.com/r/manuelnunomelo-xtcindex/.

In this form the functions merely write/read offsets to/from files. Should
they also allow the user to directly pass an array? (Seems unnecessary and
not that useful; anyway, as Alex pointed out, the offset array can always
be accessed as _TrjReader__offsets).

I'd still like to know what you think about the possibility of parallel
indexing. I'll upload some code soon to illustrate what I mean.

Manel

mdana...@googlecode.com

unread,

May 12, 2013, 9:45:41 PM5/12/13

to mdnalys...@googlegroups.com

Comment #19 on issue 127 by manuel.n...@gmail.com: Index file for xtc
trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

Oliver (and other developers),

I fixed a memory allocation bug that could crop up when errors occur during
indexing. Be sure to pick up the latest changes from
https://code.google.com/r/manuelnunomelo-xtcindex/.

I also implemented (in another clone:
https://code.google.com/r/manuelnunomelo-parallel-xtcindex/) a parallel
indexer. While it does work fine, speedup is nonexistent, meaning that the
process is IO bound -- which makes sense. I'll keep the clone around should
anyone have any use for it: I can think of some hardware setups where
concurrent reads are possible, therefore benefiting from this kind of
parallelization.

Cheers,

mdana...@googlecode.com

unread,

May 27, 2013, 4:01:44 PM5/27/13

to mdnalys...@googlegroups.com

Comment #20 on issue 127 by sebastie...@gmail.com: Index file for xtc
trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

Hi Manel and other folks,

I finally had enough time to test your code with a 32-bit python and a
5.6GB xtc file.
Unfortunately, the test ended with an error. Here is the complete traceback
(as given by ipython):

In [1]: from MDAnalysis import Universe

In [2]: test = Universe("../../dppc_big.gro","../../dppc_biggest.xtc")
---------------------------------------------------------------------------
IOError Traceback (most recent call last)
<ipython-input-2-bd5f2881232b> in <module>()
----> 1 test = Universe("../../dppc_big.gro","../../dppc_biggest.xtc")

/home/test/mdanalysis/package/MDAnalysis/core/AtomGroup.pyc in
__init__(self, topologyfile, coordinatefile, **kwargs)
1831
1832 # Load coordinates
-> 1833 self.load_new(coordinatefile, **kwargs)
1834
1835 def _build_segments(self):

/home/test/mdanalysis/package/MDAnalysis/core/AtomGroup.pyc in
load_new(self, filename, **kwargs)
1893 # supply number of atoms for readers that cannot do it for
themselves
1894 kwargs['numatoms'] = self.atoms.numberOfAtoms()
-> 1895 self.trajectory = TRJReader(filename, **kwargs) #
unified trajectory API
1896 if self.trajectory.numatoms != self.atoms.numberOfAtoms():
1897 raise ValueError("The topology and %s trajectory files
don't have the same number of atoms!" % self.trajectory.format)

/home/test/mdanalysis/package/MDAnalysis/coordinates/xdrfile/core.pyc in
__init__(self, filename, convert_units, sub, **kwargs)
319 # actual number of atoms in the trr file
320 # first time file is opened, exception should be thrown if
bad file
--> 321 self.__trr_numatoms = self._read_trj_natoms(self.filename)
322
323 # logic for handling sub sections of trr:

/home/test/mdanalysis/package/MDAnalysis/coordinates/xdrfile/core.pyc in
_read_trj_natoms(self, filename)
423 """Generic number-of-atoms extractor with minimum
intelligence. Override if necessary."""
424 if self.format == 'XTC':
--> 425 numatoms = libxdrfile.read_xtc_natoms(filename)
426 elif self.format == 'TRR':
427 numatoms = libxdrfile.read_trr_natoms(filename)

/home/test/mdanalysis/package/MDAnalysis/coordinates/xdrfile/libxdrfile.pyc
in read_xtc_natoms(*args)
408 def read_xtc_natoms(*args):
409 """read_xtc_natoms(fn) -> int"""
--> 410 return _libxdrfile.read_xtc_natoms(*args)
411
412 def read_xtc_numframes(*args):

IOError: [12] Error reading natoms from xtc '../../dppc_biggest.xtc'

I will try to take a look at the code tomorrow.

Cheers,

Séb

mdana...@googlecode.com

unread,

May 27, 2013, 9:28:48 PM5/27/13

to mdnalys...@googlegroups.com

Comment #21 on issue 127 by manuel.n...@gmail.com: Index file for xtc
trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

Hi Sébastien,

Thanks for the trouble checking the 32 bit version.
I was looking through your traceback and it seems you are getting a
file-not-found error.
Whether that is the case or not the raised error is definitely not helpful
since it can be misleading and can't distinguish between the different xdr
errors (it does print out the returned value of [12], corresponding to
file-not-found).
Maybe something along the lines of
"Error opening/reading xtc "filename" (was trying to get the number of
atoms)."
is more debugging-friendly. The alternatives are either to do some checking
for file existence on the python side, or to implement xdr error number
parsing, also on the python side.

Cheers,
Manel

mdana...@googlecode.com

unread,

May 29, 2013, 4:07:41 AM5/29/13

to mdnalys...@googlegroups.com

Comment #22 on issue 127 by sebastie...@gmail.com: Index file for xtc
trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

Hi,

I took some time to dig a little into the code and I found the culprit: for
some reason the C function "fopen" that is called by libxdrfile.c does not
work for big files when the code is compiled by a 32-bit version of Python.
Due to this failure the read_xtc_atoms function raises an IOError as a side
effect even though the file is actually present.
Everything works just fine when compiled with a 64-bit implementation of
Python.
What puzzles me is that a dummy extension (see attached file) works fine
whatever the Python version... anyway, I must admit that I don't have much
experience with 32/64-bit issues.

That being said, I would vote for merging Manel's work into the dev branch
with an additional check on the file's size to avoid the "empty" IOError
and transform it into a more explicit error such as IOError("This file is
too big to opened by this implementation of Python").
This may seem to be a little rough but since there are less and less 32-bit
systems (at least among MD users), I think this would be relatively
harmless.
Any opinions?

Séb

Attachments:
testIO2.pyx 333 bytes

mdana...@googlecode.com

unread,

May 29, 2013, 4:58:35 AM5/29/13

to mdnalys...@googlegroups.com

Comment #23 on issue 127 by manuel.n...@gmail.com: Index file for xtc
trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

I would tend to agree with the restricting of big files to 64-bit systems.
But can we reliably read the file size correctly on 32 bit if it is already
too big?

Manel

mdana...@googlecode.com

unread,

May 29, 2013, 7:34:03 AM5/29/13

to mdnalys...@googlegroups.com

Comment #24 on issue 127 by sebastie...@gmail.com: Index file for xtc
trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

> But can we reliably read the file size correctly on 32 bit if it is
> already too big?

Yep, this is not an issue as python use double:

In [2]: os.path.getsize("dppc_biggest.xtc")
Out[2]: 7803147364L

In [3]: sys.maxint, sys.maxsize
Out[3]: (2147483647, 2147483647)

In [4]: os.path.getsize("dppc_biggest.xtc") > sys.maxint
Out[4]: True

Cheers,

Séb

mdana...@googlecode.com

unread,

May 29, 2013, 12:26:39 PM5/29/13

to mdnalys...@googlegroups.com

Comment #25 on issue 127 by orbeckst: Index file for xtc trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

Hi Manel and Séb,

I agree with implementing Séb's check and merging into devel. Perhaps we
find a better solution later but right now a large number of people will
benefit. (We can open a separate issue for this.)

Séb, can you do the merge including the update of CHANGELOG (don't forget
to add Manel and yourself). Manel should also already be in the AUTHORS
file and the sphinx conf file (as an author), at least on the
feature-xtcindex branch (to which I just pushed Manel's last two commits).
Once the merge is done, close the issue.

Manel, can you add a sentence to the documentation strings explaining what
happens when a XTC/TRR files is read so that users know that the inital lag
time will benefit them greatly later on? Also add the known limitation that
it only works reliably on 64-bit systems.

Should we add documentation on load_offset() and save_offset() or treat it
as undocumented/experimental for the time being? Do we have a UnitTest for
this functionality?

Thanks everyone,
Oliver

mdana...@googlecode.com

unread,

May 29, 2013, 7:38:33 PM5/29/13

to mdnalys...@googlegroups.com

Comment #26 on issue 127 by manuel.n...@gmail.com: Index file for xtc
trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

Hi Oliver,

I don't mind adding documentation/UnitTests for the saving/loading of
offsets (will only be able to do it after next week, though). Are there any
specific guidelines?
I can also add some sanity checking to catch loading errors from when the
wrong offset list is loaded for a particular trajectory.

Also on the documentation topic, the libxdrfile.i has an extensive
documentation header that should be updated, and that seems to be rather
strictly formatted (SWIG requirement?). I'll do my best to conform to the
existing format.

Finally, the xdr library c files all carry the original copyright header.
I'm really not sure how to add the indication of my modifications. On the
plus side, it is my understanding that we can now license the modified c
files under GPL, not the lesser GPL, thus matching the project's license.

Let me know what you think.
Manel

mdana...@googlecode.com

unread,

May 29, 2013, 8:20:42 PM5/29/13

to mdnalys...@googlegroups.com

Comment #27 on issue 127 by orbeckst: Index file for xtc trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

Hi Manel,

Guidelines for UnitTests:

* UnitTests#Writing_test_cases
* You can add your test cases to the test_libxdrfile.py file if you're
testing low-level code or code very specific to the Reader implementation
or test_coordinates.py for higher level constructs.
* Do not add a 4 GB test trajectory to the test files... I think we'll have
to pass on testing this specific case (unless anyone feels differently?)

The text in libxdrfile.i is standard reST (restructured text) + sphinx
constructs http://sphinx-doc.org/, which is the standard for our Python
documentation. In particular, I found that you want to explicitly write out
the docs for the functions.

Add your name to all files that you edited, including the top of
libxdrfile.i and AUTHORS. Add comments to all *.c/*.h files, listing your
name and year and summarize briefly your additions.

I would keep the licence as Lesser GPL so that the code could be used
under the same terms as Gromacs itself. I don't think that it is a problem
to include LGPL code in a GPL2 project. As it is, the whole (modified)
libxdrfile can be used independently of MDAnalysis and some people might
want to make use of it.

Oliver

mdana...@googlecode.com

unread,

Jun 26, 2013, 12:34:03 PM6/26/13

to mdnalys...@googlegroups.com

Comment #28 on issue 127 by sebastie...@gmail.com: Index file for xtc

trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

Hi Manel,

I added the "32-bit" check that will raise the following OSError when
trying to open a file whose size is too big:
OSError: Filesize is too big to be handled by xdrfile library (filesize:
7803.1MB - max size: 2147.5MB)

Of course we can hardly add a 2GB-fat file for testing purpose just for the
sake of this issue so I didn't had any tests to the one you wrote.

I also took the liberty to merge the feature-xtcindex branch with the
develop branch so your code is now officially added to MDAnalysis!

I will remove the feature-xtcindex from the repo as soon as you tell me you
pushed everything. We will then be able to close this issue.

Thanks again for your nice work.

Cheers,

Séb

mdana...@googlecode.com

unread,

Jul 15, 2013, 6:34:05 PM7/15/13

to mdnalys...@googlegroups.com

Updates:
Status: Fixed

Comment #29 on issue 127 by orbeckst: Index file for xtc trajectory
http://code.google.com/p/mdanalysis/issues/detail?id=127

The fast XTC/TRR code was merged a while back and is in the development
branch (and works well — a good reason to run develop!)

Reply all

Reply to author

Forward