Hi Patrick,
On 15 Nov, 2013, at 17:33, PBurns wrote:
> I'm somewhat new to python and MDAnalysis. I wrote a function to find the minimum distance from a protein for each solvent molecule center of mass (of a particular solvent restype). Can anybody suggest to me a better or faster way to write this function; currently 20,000 frames on a single processor is taking ~3 days.
Ugh – that's slow... admittedly, MDAnalysis is not necessarily the fastest kid on the block and that kind of analysis is pretty computation heavy (similar to radial distribution functions) but that's just painful.
>
> def get_com_min_dist(universe,ts, restype, target_seltext):
> solvent = universe.selectAtoms("resname "+str(restype))
> srid=[
r.id for r in solvent.residues]
> scomcoord=np.ones((len(srid),3))
> for i,r in enumerate(srid):
> scomcoord[i,:]=universe.selectAtoms("resname "+str(restype)+" and resid "+str(r)).centerOfMass()
> scomcoord=numpy.array(scomcoord, dtype=float32)
> target = universe.selectAtoms(target_seltext)
> box=universe.trajectory.ts.dimensions[:3]
> d=distance_array(scomcoord,target.coordinates(),box)
> com_min_dist_list=[numpy.amin(x) for x in d]
> return com_min_dist_list
I assume you're using your function in something like the following way:
for ts in u.trajctory:
com_min_dist_list = get_com_min_dist(universe,ts, restype, target_seltext)
...
The first thing to do is to profile your code and figure out where it spends most of its time (see eg
http://docs.python.org/2/library/profile.html ) If it spends a lot of time in anything but distance_array() then you can probably improve substantially on its speed. If you're already limited by distance_array() then see the end of this email.
Then use profiling or something like timeit or ipython's %timeit to check how modifications to your code (hopefully) improve its performance.
A few immediate things that come to my mind:
- move the selections out of the function because your atom groups do not change; provide atomgroups (such as target and solvent) as arguments and not the selection strings; your selections do not change between time steps, do they?
- avoid iterations at all costs; use numpy array operations, in particular use something like:
com_min_dist_list = d.amin(axis=1)
- pre-allocate all arrays for once and only fill them (pass the arrays as args, such as scomcoord and d)
All that would then translate into something like this:
def get_com_min_dist(target, solvent, com_min_dist_list, d, scomcoord, box):
for i in enumerate(solvent):
scomcoord[i,:] = solvent[i].centerOfMass() # can't think of a better way for doing this
distance_array(scomcoord,target.coordinates(),box[:3], result=d) # note result=d
com_min_dist_list[:] = d.min(axis=1) # axis=1 should be correct... just try it out :-)
return com_min_dist_list
# create selection atomgroups once:
solvent = u.selecAtoms("resname {0}".format(restype))
target = u.selecAtoms(target_seltext)
# pre-allocate all arrays
scomcoord=np.empty((solvent.numberOfResidues(), 3), dtype=np.float32)
d = np.empty(solvent.numberOfResidues(), target.numberOfAtoms(), dtype=np.float32)
com_min_dist_list = np.empty(solvent.numberOfResidues(), dtype=np.float32)
for ts in u.trajctory:
get_com_min_dist(target, solvent, com_min_dist_list, d, scomcoord, ts.dimensions)
# do something with your results in com_min_dist_list
I have not tested this code (so it almost certainly contains bugs) but it might be a starting point. The main question is if you're now spending most of your time in distance_array(), which should be the bottle neck. Depending on your array sizes you might see speed-ups with the parallel version
http://pythonhosted.org/MDAnalysis/documentation_pages/core/parallel.html#MDAnalysis.core.parallel.distances :
from MDAnalysis.core.parallel.distances import distance_array
Hope that helps,
Oliver
>
>
>
> Thanks!
> Patrick
>
--
Oliver Beckstein *
orbe...@gmx.net
skype: orbeckst *
orbe...@gmail.com