Minimum distance from solvent COM

PBurns

unread,

Nov 15, 2013, 7:33:37 PM11/15/13

to mdnalysis-...@googlegroups.com

I'm somewhat new to python and MDAnalysis. I wrote a function to find the minimum distance from a protein for each solvent molecule center of mass (of a particular solvent restype). Can anybody suggest to me a better or faster way to write this function; currently 20,000 frames on a single processor is taking ~3 days.

def get_com_min_dist(universe,ts, restype, target_seltext):

solvent = universe.selectAtoms("resname "+str(restype))

srid=[r.id for r in solvent.residues]

scomcoord=np.ones((len(srid),3))

for i,r in enumerate(srid):

scomcoord[i,:]=universe.selectAtoms("resname "+str(restype)+" and resid "+str(r)).centerOfMass()

scomcoord=numpy.array(scomcoord, dtype=float32)

target = universe.selectAtoms(target_seltext)

box=universe.trajectory.ts.dimensions[:3]

d=distance_array(scomcoord,target.coordinates(),box)

com_min_dist_list=[numpy.amin(x) for x in d]

return com_min_dist_list

Thanks!

Patrick

Oliver Beckstein

unread,

Nov 15, 2013, 8:31:14 PM11/15/13

to mdnalysis-...@googlegroups.com

Hi Patrick,

On 15 Nov, 2013, at 17:33, PBurns wrote:

> I'm somewhat new to python and MDAnalysis. I wrote a function to find the minimum distance from a protein for each solvent molecule center of mass (of a particular solvent restype). Can anybody suggest to me a better or faster way to write this function; currently 20,000 frames on a single processor is taking ~3 days.

Ugh – that's slow... admittedly, MDAnalysis is not necessarily the fastest kid on the block and that kind of analysis is pretty computation heavy (similar to radial distribution functions) but that's just painful.

>
> def get_com_min_dist(universe,ts, restype, target_seltext):
> solvent = universe.selectAtoms("resname "+str(restype))
> srid=[r.id for r in solvent.residues]
> scomcoord=np.ones((len(srid),3))
> for i,r in enumerate(srid):
> scomcoord[i,:]=universe.selectAtoms("resname "+str(restype)+" and resid "+str(r)).centerOfMass()
> scomcoord=numpy.array(scomcoord, dtype=float32)
> target = universe.selectAtoms(target_seltext)
> box=universe.trajectory.ts.dimensions[:3]
> d=distance_array(scomcoord,target.coordinates(),box)
> com_min_dist_list=[numpy.amin(x) for x in d]
> return com_min_dist_list

I assume you're using your function in something like the following way:

for ts in u.trajctory:
com_min_dist_list = get_com_min_dist(universe,ts, restype, target_seltext)
...

The first thing to do is to profile your code and figure out where it spends most of its time (see eg http://docs.python.org/2/library/profile.html ) If it spends a lot of time in anything but distance_array() then you can probably improve substantially on its speed. If you're already limited by distance_array() then see the end of this email.

Then use profiling or something like timeit or ipython's %timeit to check how modifications to your code (hopefully) improve its performance.

A few immediate things that come to my mind:

- move the selections out of the function because your atom groups do not change; provide atomgroups (such as target and solvent) as arguments and not the selection strings; your selections do not change between time steps, do they?

- avoid iterations at all costs; use numpy array operations, in particular use something like:

com_min_dist_list = d.amin(axis=1)

- pre-allocate all arrays for once and only fill them (pass the arrays as args, such as scomcoord and d)

All that would then translate into something like this:

def get_com_min_dist(target, solvent, com_min_dist_list, d, scomcoord, box):
for i in enumerate(solvent):
scomcoord[i,:] = solvent[i].centerOfMass() # can't think of a better way for doing this
distance_array(scomcoord,target.coordinates(),box[:3], result=d) # note result=d
com_min_dist_list[:] = d.min(axis=1) # axis=1 should be correct... just try it out :-)
return com_min_dist_list

# create selection atomgroups once:
solvent = u.selecAtoms("resname {0}".format(restype))
target = u.selecAtoms(target_seltext)

# pre-allocate all arrays
scomcoord=np.empty((solvent.numberOfResidues(), 3), dtype=np.float32)
d = np.empty(solvent.numberOfResidues(), target.numberOfAtoms(), dtype=np.float32)
com_min_dist_list = np.empty(solvent.numberOfResidues(), dtype=np.float32)

for ts in u.trajctory:
get_com_min_dist(target, solvent, com_min_dist_list, d, scomcoord, ts.dimensions)
# do something with your results in com_min_dist_list

I have not tested this code (so it almost certainly contains bugs) but it might be a starting point. The main question is if you're now spending most of your time in distance_array(), which should be the bottle neck. Depending on your array sizes you might see speed-ups with the parallel version http://pythonhosted.org/MDAnalysis/documentation_pages/core/parallel.html#MDAnalysis.core.parallel.distances :

from MDAnalysis.core.parallel.distances import distance_array

Hope that helps,
Oliver

>
>
>
> Thanks!
> Patrick
>

--
Oliver Beckstein * orbe...@gmx.net
skype: orbeckst * orbe...@gmail.com

Tyler Reddy

unread,

Nov 16, 2013, 4:00:36 AM11/16/13

to mdnalysis-...@googlegroups.com

It's also worth noting that you can use the python multiprocessing module with MDAnalysis (https://code.google.com/p/mdanalysis/wiki/Multicore_MDAnalysis) to divide the work over multiple cores. You might, for example, dispatch distance_array() type work on each child process for a subset of the residues / atoms in your system. If each core is only dealing with part of the system and you use small enough parts / core, you can substantially speed up the code.

The caveats... The biggest one is probably physical memory--if you're iterating over frames in a trajectory each core would probably have to open a separate MDAnalysis Universe object for an N x cores increase in memory usage. I think sending numpy arrays of coordinates from the parent process to child processes in each frame would have too much overhead--if it didn't that probably would save a lot of memory.

Also, exceptions typically don't propagate from child processes to the parent in python 2.X so you'd have to test / debug on a single core workflow first. It's a bit of effort to decide on how you divide the work over many cores and how to efficiently combine the data back into the parent process for storage / plotting, but can definitely pay off (especially with a 3 day wall clock performance).

--
You received this message because you are subscribed to the Google Groups "MDnalysis discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mdnalysis-discus...@googlegroups.com.
To post to this group, send email to mdnalysis-...@googlegroups.com.
Visit this group at http://groups.google.com/group/mdnalysis-discussion.
For more options, visit https://groups.google.com/groups/opt_out.

Reply all

Reply to author

Forward