Clustering on Cytoscape

Charles

Sep 19, 2018, 6:32:29 PM

to cytoscape-helpdesk

Below are the results I obtained by running four clustering algorithms on the same network:

Community Clustering results: Clusters: 12 Average size: 108.75 Maximum size: 175 Minimum size: 5 Modularity: 0.622
FCM results: Clusters: 2 Average size: 881.5 Maximum size: 996 Minimum size: 767 Modularity: 0.602
MCL results: Clusters: 136 Average size: 9.581 Maximum size: 61 Minimum size: 2 Modularity: 0.409
AP results: Clusters: 54 Average size: 24.167 Maximum size: 569 Minimum size: 1 Modularity: 0.37

Why are there such big differences? How should I decide which one to use? Are modularity scores comparable between algorithms, or only between partitions produced by the same algorithm? Kindly advise.

Matthias König

Sep 20, 2018, 8:19:53 AM

to cytoscape-helpdesk

Hi Charles,

In my personal opinion, the results of a clustering algorithm depend strongly on the method used to create the clusters and on the distance measure applied within the clustering. In other words, most of the time you will get different results depending on the method applied (e.g. the linkage).

The main problem with clustering is that you will always get clusters (and most people are happy with whatever a given algorithm gives them), but this by no means implies that these clusters are robust, i.e. reproducible with different methods. Other choices, such as using z-scores or normalizing the rows/columns of your matrix, also play a major role in the clustering results.

Unfortunately, publications only rarely discuss how strongly the results of a cluster analysis depend on the method used. Reported clusters are usually presented as a robust reality, which is often not the case.

I personally use multiple algorithms and distance measures and see which clusters are found robustly, i.e. which members always occur together in a cluster. Then you can at least be sure that your results are fairly robust. When reporting cluster results, make sure to report the clustering method (e.g. the linkage used), the normalization, and the distance measure, so others can reproduce the same clusters from your published data and compare your results with those obtained by other methods.
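The idea of checking which members always end up together can be sketched in a few lines of Python. This is only an illustration, not something from Cytoscape or clusterMaker: the function names and the three example partitions are made up, and each partition is assumed to be a plain mapping from node name to cluster label.

```python
from itertools import combinations

def comembership_counts(partitions):
    """For every node pair, count in how many partitions the pair shares a cluster."""
    # Only nodes present in every partition can be compared.
    nodes = sorted(set.intersection(*(set(p) for p in partitions)))
    counts = {}
    for a, b in combinations(nodes, 2):
        counts[(a, b)] = sum(1 for p in partitions if p[a] == p[b])
    return counts

def robust_pairs(partitions):
    """Return the node pairs that are clustered together in every partition."""
    total = len(partitions)
    return [pair for pair, n in comembership_counts(partitions).items() if n == total]

# Hypothetical cluster assignments (node -> cluster label) from three methods.
# Label values are arbitrary; only "same label or not" matters.
partitions = [
    {"A": 1, "B": 1, "C": 2, "D": 2, "E": 3},
    {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2},
    {"A": 7, "B": 7, "C": 8, "D": 8, "E": 8},
]

print(robust_pairs(partitions))  # [('A', 'B')]
```

Here only A and B co-occur in all three partitions; C and D co-occur in two of three, so whether to call that pair robust depends on the threshold you choose.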

These are the SciPy routines for agglomerative clustering (`scipy.cluster.hierarchy`):

`linkage`(y[, method, metric, optimal_ordering])	Perform hierarchical/agglomerative clustering.
`single`(y)	Perform single/min/nearest linkage on the condensed distance matrix `y`.
`complete`(y)	Perform complete/max/farthest point linkage on a condensed distance matrix.
`average`(y)	Perform average/UPGMA linkage on a condensed distance matrix.
`weighted`(y)	Perform weighted/WPGMA linkage on the condensed distance matrix.
`centroid`(y)	Perform centroid/UPGMC linkage.
`median`(y)	Perform median/WPGMC linkage.
`ward`(y)	Perform Ward’s linkage on a condensed distance matrix.

These are the possible metrics for the distance calculation:

The distance metric to use. The distance function can be ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’.

Good information can be found here:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html#scipy.cluster.hierarchy.linkage
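As a rough sketch of how one might try several linkage methods and distance metrics on the same data with these SciPy routines: the data below is synthetic (two deliberately well-separated groups), the helper `flat_clusters` is a made-up name, and cutting the tree into two clusters is an arbitrary choice for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Made-up data: two well-separated groups of five 3-D points each.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(0.0, 0.1, size=(5, 3)),
    rng.normal(5.0, 0.1, size=(5, 3)),
])

def flat_clusters(data, method, metric, n_clusters=2):
    """Hierarchical clustering cut into at most n_clusters flat clusters."""
    condensed = pdist(data, metric=metric)      # condensed distance matrix
    tree = linkage(condensed, method=method)    # agglomerative merge tree
    return fcluster(tree, t=n_clusters, criterion="maxclust")

# 'centroid', 'median' and 'ward' are left out here: SciPy documents them
# as correctly defined only for Euclidean distances.
results = {
    (method, metric): flat_clusters(data, method, metric)
    for method in ("single", "complete", "average", "weighted")
    for metric in ("euclidean", "cityblock", "chebyshev")
}

for (method, metric), labels in results.items():
    print(f"{method:9s} {metric:10s} {labels.tolist()}")
```

On data this clean every combination should recover the same two groups; on real data the labels will typically start to disagree, which is exactly the effect described above.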



Best,
M

Scooter Morris

Sep 20, 2018, 11:16:11 AM

to cytoscape-helpdesk

Hi Charles,

As Matthias points out, there are lots of things to consider when you are doing clustering, and it's not straightforward to compare across algorithms. In your case, you are partitioning your network based on the edges between your nodes. Here are some things to consider:

Are you using edge weights?
How have you normalized (or converted) the edge weights?
What parameters are you using in the various clustering algorithms?
Do the results make biological sense?

The only "real" way to determine which clustering algorithm works best for you is to compare the results with a biologically meaningful set of knowns (e.g. a "gold standard"). To answer your question directly, though: you can certainly compare the modularity scores, but they may not have any biological meaning. Modularity just gives you a sense of the number of edges within clusters vs. the number of edges between clusters. Beyond that, the best way to see whether things make sense is to look at the nodes and edges in your clusters, or to look at the clusters in the context of the entire network. Generally, I've had pretty good success with MCL, but I sometimes have to play with the granularity parameter a bit.
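To make the "edges within vs. edges between" intuition concrete, here is a minimal sketch of the standard Newman-Girvan modularity for an unweighted, undirected network. It is an assumption that this matches the score reported above; the clustering app may compute it differently (for example, using edge weights). The toy network and the function name are made up for illustration.

```python
from collections import defaultdict

def modularity(edges, membership):
    """Newman-Girvan modularity of a partition of an unweighted, undirected graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    membership: dict mapping node -> cluster label.
    """
    m = len(edges)
    internal = defaultdict(int)  # edges with both endpoints in the cluster
    degree = defaultdict(int)    # sum of node degrees per cluster
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            internal[cu] += 1
    # Fraction of edges inside each cluster minus the fraction expected
    # if edges were placed at random with the same node degrees.
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Toy network: two triangles joined by a single edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

two_clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_cluster = {node: 0 for node in "abcdef"}

print(round(modularity(edges, two_clusters), 4))  # 0.3571
print(round(modularity(edges, one_cluster), 4))   # 0.0
```

The score depends only on the network topology and the partition, which is why it can be compared across algorithms run on the same network, and also why a high score says nothing by itself about biological relevance.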