Cluster analysis

azade...@gmail.com

unread,

Jun 11, 2021, 9:12:26 AM6/11/21

to MDnalysis discussion

Dear MDAnalysis users,

I would like to calculate the number of clusters along the trajectory and I was wondering whether there is a useful function for this purpose.

Actually, my system is consisted of a number of similar proteins and I would like to know how many clusters of these proteins are formed during the trajectory and also each cluster is consisted of how many proteins. It is possible that each cluster is formed by proteins attaching to each other like a chain.

Thank you very much in advance for you help.

Best regards,

Azadeh

Lucy Jiménez

unread,

Jun 15, 2021, 2:00:54 AM6/15/21

to mdnalysis-...@googlegroups.com

Hi Azadeh,

Maybe you can take a look at this:

https://userguide.mdanalysis.org/stable/examples/analysis/trajectory_similarity/clustering_ensemble_similarity.html

Best wishes,

Lu

--
You received this message because you are subscribed to the Google Groups "MDnalysis discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mdnalysis-discus...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/mdnalysis-discussion/d01db9a2-67d5-4805-a465-d03f7de5f4c4n%40googlegroups.com.

azade...@gmail.com

unread,

Jun 15, 2021, 12:15:14 PM6/15/21

to MDnalysis discussion

Hi Lu

Thank you very much for the link. That is definitely very useful. I suppose I should use the DBSCAN algorithm. I do not know, however, how to give the center of mass of each protein in the "select" flag of the method.

Furthermore, probably one has to also consider the periodic boundary conditions. I directly used Scikit-learn outside MDAnalysis, and I see that in the frames that one protein is out of the box and enters from the other side, the number of clusters is calculated wrongly.

Best,

Azadeh

Lily Wang

unread,

Jun 15, 2021, 1:02:59 PM6/15/21

to mdnalysis-...@googlegroups.com

Hi Azadeh,

We do not recommend using the DBSCAN class implemented in MDAnalysis due to a bug about not making outliers clear. What you probably want to do is calculate the centre of mass of each protein and the distances between them inside MDAnalysis, then pass the data to scikit-learn. A quick and dirty example is given in the below (untested) code. If you think you will re-use the code and need more structure to it, you could implement it as an AnalysisBase class.

Untested example:

import numpy as np

from MDAnalysis.lib.distances import self_distance_array

from sklearn.cluster import DBSCAN

my_proteins = u.select_atoms("protein")

proteins_split_by_segment = my_proteins.split("segment") # list of atomgroups

def analysis_per_frame():

centers = [protein.center_of_mass() for protein in proteins_split_by_segment]

center_array = np.array(centers)

distance_matrix = self_distance_array(center_array, box=my_proteins.dimensions)

dbscan = DBSCAN(metric="precomputed") # we compute distances in MDAnalysis

all_cluster_labels = dbscan.fit_predict(distance_matrix)

cluster_labels, cluster_counts = np.unique(all_cluster_labels, return_counts=True)

# outliers are labeled -1

if cluster_labels[0] == -1:

# ignore outliers

cluster_labels = cluster_labels[1:]

cluster_counts = cluster_counts[1:]

return cluster_labels, cluster_counts

number_of_clusters_per_frame = []

numbers_of_proteins_in_clusters_per_frame = []

for frame in u.trajectory:

labels, counts = analysis_per_frame()

number_of_clusters_per_frame.append(len(labels))

numbers_of_proteins_in_clusters_per_frame.append(counts)

# number_of_clusters_per_frame will be a list of integers

# so that number_of_clusters_per_frame[3] will give you the number of clusters

# found in the 4th frame

# numbers_of_proteins_in_clusters_per_frame will be a list of lists of integers

# so that numbers_of_proteins_in_clusters_per_frame[3][1] will give you

# the number of proteins in the second cluster of the fourth frame

Cheers,

Lily

To view this discussion on the web visit https://groups.google.com/d/msgid/mdnalysis-discussion/a87edcee-cf5d-483a-bc08-b72fac3ceb9fn%40googlegroups.com.

azade...@gmail.com

unread,

Jun 16, 2021, 11:15:05 AM6/16/21

to MDnalysis discussion

Dear Lily

Thank you very much for your explanations and the code. I appreciate it.

I had to modify it since the DBSCAN does not accept a 1D array. So I used "distance_array" instead of "self_distance_array".

In order to work properly for my case, I had to also add eps=20, min_samples=1 options inside the DBSCAN function.

I should also mention that I use the transformed trajectory by GROMACS, in which all the molecules are considered as a whole. Otherwise the result would be wrong because the PBC is not considered when calculating the COM of the proteins. Actually, I am not sure how this can be considered for a normal trajectory. Perhaps one should use unwrap=True flag in center_of_mass() , but I tried it once in the past and it did not work for me because I had problems with bond parameters. I am actually using MARTINI.

Cheers,

Azadeh

Pablo G Garay

unread,

Jun 22, 2021, 11:40:11 AM6/22/21

to MDnalysis discussion

Dear Azadeh,

I would like to ask to you, how you modify the following line using "distance_array"?:

distance_matrix = distance_array(center_array, box=my_proteins.dimensions)

Because "distance_array" needs a "Reference" and a "Configuration", you use something like this?

distance_matrix = distance_array(center_array, center_array, box=my_proteins.dimensions)

And, for the DBSCAN, you set "eps" based on a distance? or based on trial and error?

Thanks in advance.

Cheers,

Pablo

azade...@gmail.com

unread,

Jun 23, 2021, 8:05:18 AM6/23/21

to MDnalysis discussion

Dear Pablo

Because "distance_array" needs a "Reference" and a "Configuration", you use something like this?

distance_matrix = distance_array(center_array, center_array, box=my_proteins.dimensions)

Yes, this is how I modified it. So you actually give the same array for both arguments. I also gave the box information by "box = u.dimensions". But it should be similar I guess.

And, for the DBSCAN, you set "eps" based on a distance? or based on trial and error?

Yes, this was based on a distance, for which a dimer is formed in my system and therefore the two proteins form a cluster. So this actually defines at which COM cut off distance your objects belong to a cluster.

Cheers,

Azadeh

Pablo G Garay

unread,

Jun 23, 2021, 9:38:12 AM6/23/21

to MDnalysis discussion

Excellent, thanks for your answer!

Reply all

Reply to author

Forward