residue-residue contact map

313 views
Skip to first unread message

Demian Riccardi

unread,
Jan 21, 2022, 6:32:18 PM1/21/22
to MDnalysis discussion
Hello MDAnalysis wizards,

I wrote a MDA script to accumulate contacts (within 4.5 Angstroms) between residues. I pasted in the business part below; it used the capped_distance method, which is nice and fast (for my system, it returns the list of pairs in around 0.01 s per frame on my ok machine).  I'm writing today to see if there are any tips to speed up the slowest part (noted in the comment below).  It turns out that converting the map of tuples into a unique countable list of tuples takes around a half-second for this system (going from around 50,000 atom-atom pairs down to around 2000 residue-residue contacts). here, I try to make it easier on the eyes:

resid_neighs  = set(
                                 map(
                                       lambda x:                      
                                             tuple(sorted(
                                                [
                                                   protein_group.atoms[x[0]].resid,
                                                   protein_group.atoms[x[1]].resid
                                                 ]
                                   )), neighs))     

Perhaps there is a faster way to do this with other MDA library functions?  or with python; I'm still learning python. (Aside: I went through the contact_matrix example for the salt-bridges, but it doesn't seem to provide a simpler/faster path to the residue-residue contact) 

Thank you for creating this wonderful library!

Demian

business part (add one to residue-residue contact counter if at least one pair of atoms is within 4.5 angstroms):
-----------------------------------------------------
contact_map = Counter()

for ts in u.trajectory:

    # all atom-atom contacts within 4.5 angstroms
    neighs = mda_distances.capped_distance(
        protein_group.positions,
        protein_group.positions,
        4.5,
        return_distances = False )

    # set on sorted tuples removes duplicates (slowest piece)
    resid_neighs  = set(map(lambda x: tuple(sorted([protein_group.atoms[x[0]].resid,protein_group.atoms[x[1]].resid])), neighs))

    for contact in resid_neighs:
        contact_map[contact] += 1

Demian Riccardi

unread,
Jan 22, 2022, 11:34:53 AM1/22/22
to MDnalysis discussion
I was able to get around 10x speed up by pulling the resid lookup into a list outside the trajectory iterator:

contact_map = Counter()
resids = list(map(lambda x: x.resid, protein_group.atoms))


for ts in u.trajectory:
    # all atom-atom contacts within 4.5 angstroms
    neighs = mda_distances.capped_distance(
        protein_group.positions,
        protein_group.positions,
        4.5,
        return_distances = False )

    resid_neighs  = set(map(lambda x: tuple(sorted([resids[x[0]],resids[x[1]]])), neighs))

       
    for contact in resid_neighs:
        contact_map[contact] += 1

Oliver Beckstein

unread,
Jan 23, 2022, 4:26:59 PM1/23/22
to mdnalysis-...@googlegroups.com
Not selecting inside a loop is always a good idea!

I think you can get your resids simply with

resids = protein_group.atoms.resids

Best,
Oliver 

Am 1/22/22 um 09:34 schrieb Demian Riccardi <demianr...@gmail.com>:

I was able to get around 10x speed up by pulling the resid lookup into a list outside the trajectory iterator:
--
You received this message because you are subscribed to the Google Groups "MDnalysis discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mdnalysis-discus...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/mdnalysis-discussion/59072306-20c5-43e7-9c9e-1910417e0effn%40googlegroups.com.

Demian Riccardi

unread,
Jan 24, 2022, 10:35:05 AM1/24/22
to MDnalysis discussion
protein_group.atoms.resids is easier on the eyes!  

switching from numpy.ndarray to list via

resids = list(protein_group.atoms.resids) 

is a little faster in the loop as implemented in the OP.  There may still be a better way to pull out the unique res pairs, but it is fast enough for my purposes now.

Thanks!

Demian

Oliver Beckstein

unread,
Jan 24, 2022, 10:50:41 AM1/24/22
to mdnalysis-discussion

On Jan 24, 2022, at 8:35 AM, Demian Riccardi <demianr...@gmail.com> wrote:

protein_group.atoms.resids is easier on the eyes!  

switching from numpy.ndarray to list via

resids = list(protein_group.atoms.resids) 

You should also be able to copy the array if you prefer numpy arrays – the main point is that resids is a managed attribute so there’s a good chance that it is recomputed every time. I don’t remember if/under which circumstances these managed attributes get cached because there’s always a concern with stale caches. Maybe one of the other devs can shed more light on it.

Oliver

Lily Wang

unread,
Jan 24, 2022, 7:45:27 PM1/24/22
to mdnalysis-...@googlegroups.com
I usually take advantage of indexing numpy arrays when working with distances. On the test data files this code below is about 4x faster than the function using a loop:

u = mda.Universe(TPR, XTC)
protein_group = u.select_atoms("protein")
%timeit original_function()
1.07 s ± 33.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit indexing_function()
257 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

====================

def original_function():

    contact_map = Counter()
    resids = list(map(lambda x: x.resid, protein_group.atoms))

    for ts in u.trajectory:
        # all atom-atom contacts within 4.5 angstroms 
        neighs = mda_distances.capped_distance(
            protein_group.positions,
            protein_group.positions,
            4.5,
            return_distances = False )

        resid_neighs  = set(map(lambda x: tuple(sorted([resids[x[0]],resids[x[1]]])), neighs))

        for contact in resid_neighs:
            contact_map[contact] += 1
    return contact_map

def indexing_function():
    resids = protein_group.resids
    lowest_resid = resids.min()
    n_resids = resids.max() - lowest_resid + 1
    contact_map = np.zeros((n_resids, n_resids), dtype=int)
    
    for ts in u.trajectory:
        pairs = mda_distances.capped_distance(
                protein_group.positions,
                protein_group.positions,
                4.5,
                return_distances = False )
        contact_resids = np.sort(resids[pairs], axis=-1)
        x, y = contact_resids.T - lowest_resid
        contact_map[x, y] += 1

    return {
        (lowest_resid + i, lowest_resid + j) : x
        for i, row in enumerate(contact_map)
        for j, x in enumerate(row[i:], i)
    }


Reply all
Reply to author
Forward
0 new messages