improving neighbor searching

francesco oteri

Jun 4, 2012, 8:14:54 PM

to mdnalysis-...@googlegroups.com

Dear MDAnalysis developers,

I recently started using MDAnalysis, in particular the hbond script. Before that I used g_hbond (the GROMACS program), and from a rough comparison g_hbond seems to be nearly 7x faster than hbond.py. To improve performance, are you considering a grid search like the one g_hbond uses, or are there currently no plans to change the neighbor-searching algorithm?

Francesco
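
For readers unfamiliar with the term, grid (cell-list) searching can be sketched roughly as follows. This is only a minimal illustration, not the code used by g_hbond or hbond.py: the function names `pairs_brute_force` and `pairs_grid` are made up for this example, and periodic boundary conditions are ignored. Atoms are binned into cubic cells whose edge equals the cutoff, so each atom is compared only against atoms in its own and the 26 adjacent cells rather than against every other atom.

```python
import math
from collections import defaultdict
from itertools import product


def dist2(a, b):
    """Squared Euclidean distance between two 3D points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def pairs_brute_force(coords, cutoff):
    """All index pairs (i, j), i < j, within cutoff; O(N^2) reference."""
    c2 = cutoff * cutoff
    return {
        (i, j)
        for i in range(len(coords))
        for j in range(i + 1, len(coords))
        if dist2(coords[i], coords[j]) <= c2
    }


def pairs_grid(coords, cutoff):
    """Same result as pairs_brute_force, found with a cell list.

    Assumption: no periodic boundary conditions.
    """
    # Bin every point into a cubic cell with edge length == cutoff.
    cells = defaultdict(list)
    for i, p in enumerate(coords):
        cells[tuple(math.floor(x / cutoff) for x in p)].append(i)

    c2 = cutoff * cutoff
    found = set()
    for (cx, cy, cz), members in cells.items():
        # Points within the cutoff can only sit in this cell or one of
        # its 26 neighbours.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            neighbours = cells.get((cx + dx, cy + dy, cz + dz))
            if not neighbours:
                continue
            for i in members:
                for j in neighbours:
                    if i < j and dist2(coords[i], coords[j]) <= c2:
                        found.add((i, j))
    return found
```

For roughly uniform density the grid version does work proportional to the number of atoms, whereas the brute-force version grows quadratically.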

Tyler Reddy

Jun 4, 2012, 8:36:57 PM

to mdnalysis-...@googlegroups.com

It looks like the current implementation works faster if the second
selection is larger than the first when doing H-bond analysis.

I'm certainly not surprised that the GROMACS code would do it faster.
Is it a 7X difference even with a single .gro file? If you are
benchmarking large .xtc files then I would expect the relative speed
to be influenced by the large amount of time MDAnalysis takes to read
in the trajectory and go through the frames. There's certainly some
interest in implementing random file access and speeding up the .xtc
parsing capabilities, although I haven't gotten around to looking at
this in detail.

Maybe others can comment on the specifics of the current H-bond
algorithm. I haven't looked at that yet.

Tyler

francesco oteri

Jun 4, 2012, 9:03:44 PM

to mdnalysis-...@googlegroups.com

Hi Tyler

2012/6/4 Tyler Reddy <tyler...@bioch.ox.ac.uk>

> It looks like the current implementation works faster if the second
> selection is larger than the first when doing H-bond analysis.

I am using the same selection for both groups (i.e., protein).

> I'm certainly not surprised that the GROMACS code would do it faster.
> Is it a 7X difference even with a single .gro file? If you are
> benchmarking large .xtc files then I would expect the relative speed
> to be influenced by the large amount of time MDAnalysis takes to read
> in the trajectory and go through the frames.

The file is just 34M. I'll try to profile the code to find the bottleneck!

> There's certainly some
> interest in implementing random file access and speeding up the .xtc
> parsing capabilities, although I haven't gotten around to looking at
> this in detail.

I am working on a format that permits random access.
The basic idea is to create a frame index that stores, for each frame, its position
in the .xtc file. Loading it would be optional, and it would let MDAnalysis and other tools
access the files randomly.

Francesco

--
Kind regards, Dr. Oteri Francesco

Oliver Beckstein

Jun 4, 2012, 9:50:10 PM

to mdnalysis-...@googlegroups.com

On 4 Jun, 2012, at 14:03, francesco oteri wrote:

> I am working on a format that permits random access.
> The basic idea is to create a frame index that stores, for each frame, its position
> in the .xtc file. Loading it would be optional, and it would let MDAnalysis and other tools
> access the files randomly.

LOOS from Alan Grossfield's lab http://loos.sourceforge.net/ (which is a rather nice project, by the way) already uses a frame cache for XTC. Basically, when the XTC is read the first time then an index of frames and file offsets is built, which then allows random access, see http://loos.sourceforge.net/classloos_1_1_x_t_c.html for a start.

I would love to have something similar in MDAnalysis but I don't know if there are inherent problems with e.g. big files >2GB etc. I didn't have the time to make a serious attempt.

My idea would have been to store the frame list in memory and also as an additional file on disk (also using the XDR file format that is used for TRR and XTC), together with a checksum that the C library can use to decide if the frame list on disk still corresponds to the XTC on disk. In pseudo code:

# first read the trajectory once:
# 1. builds the frame list if needed
# 2. tells us how many frames are in the trajectory

scanXTC("traj.xtc"):
    framecache = xdr_read(".traj.framecache") or None
    cs = calculate_checksum(XTC)
    if not framecache or framecache.checksum != cs:
        # build a new cache (slow: walks the whole trajectory once)
        framecache = new empty frame list
        for frame in xtc:
            framecache.append(file-offset of frame)
        xdr_write(".traj.framecache", framecache, cs)
    return framecache.numframes

# reading a frame is done by looking up the file offset for the
# frame in the frame list, seeking to the frame, and then reading
# the frame as usual

read_frame_XTC(xtcfile, framecache, frame):
    xtcfile.fseek(framecache[frame])
    return xtcfile.read_frame()

I think that this could all be implemented in the C code of the xtc/trr library ("libxdr") even though I have been writing the pseudo code in an object-oriented manner.
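
For illustration, the frame-cache idea might look roughly like this in Python. It is a sketch under loudly stated assumptions, not an implementation for real XTC files: the record layout (a 4-byte big-endian length followed by the payload) is a hypothetical stand-in for the XTC frame format, the cache is stored as JSON instead of XDR, SHA-1 is just one possible checksum, and all function names are invented for the example.

```python
import hashlib
import json
import os
import struct
import tempfile

# ASSUMPTION: toy trajectory format, NOT real XTC. Each frame is a
# 4-byte big-endian payload length followed by that many bytes.


def file_checksum(path):
    """SHA-1 of the file contents, read in 1 MiB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def scan_offsets(path):
    """Walk the file once and record the byte offset of every frame."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            header = f.read(4)
            if len(header) < 4:  # end of file
                break
            (nbytes,) = struct.unpack(">I", header)
            offsets.append(pos)
            f.seek(nbytes, os.SEEK_CUR)  # skip the payload
    return offsets


def load_or_build_index(traj_path, cache_path):
    """Return frame offsets, reusing the on-disk cache if still valid."""
    cs = file_checksum(traj_path)
    try:
        with open(cache_path) as f:
            cache = json.load(f)
        if cache.get("checksum") == cs:
            return cache["offsets"]
    except (OSError, ValueError):
        pass  # missing or corrupt cache: rebuild below
    offsets = scan_offsets(traj_path)
    with open(cache_path, "w") as f:
        json.dump({"checksum": cs, "offsets": offsets}, f)
    return offsets


def read_frame(traj_path, offsets, i):
    """Random access: seek to frame i and read its payload."""
    with open(traj_path, "rb") as f:
        f.seek(offsets[i])
        (nbytes,) = struct.unpack(">I", f.read(4))
        return f.read(nbytes)
```

Note that checksumming the whole file still reads every byte, so for large trajectories a cheaper validity test (for example file size plus modification time) may be a better trade-off.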

If anyone starts working on this in earnest then please open an issue in the issue tracker to coordinate development. I would assign the issue to whoever seems most eager to work on it :-).

Oliver

--
Oliver Beckstein * orbe...@gmx.net
skype: orbeckst * orbe...@gmail.com

Naveen Michaud-Agrawal

Jun 5, 2012, 12:31:53 AM

to mdnalysis-...@googlegroups.com

> I am working on a format that permits random access.
> The basic idea is to create a frame index that stores, for each frame, its position
> in the .xtc file. Loading it would be optional, and it would let MDAnalysis and other tools
> access the files randomly.

This index should be pretty easy to compute. If I recall correctly,
each frame in the .xtc contains a header which specifies how big the
frame is. I'm surprised that gromacs doesn't have a utility/file
format to create this index.

Naveen Michaud-Agrawal
