
Cluster analysis to define biogeographical region


Gabriela Procópio Camacho

Mar 13, 2023, 7:47:52 PM
to Biodiverse Users
Dear Shawn, 

I am exploring a continental-scale data set (distribution data for 1300 species in 10 km cells) and am looking at geographic regionalisation of the data using clustering methods. The main goal is to define zoogeographical regions using this dataset, and I have been running range weighted turnover (RWT), which seems to be the most suitable method. Initially I ran this without a definition query and without selecting which calculations to run for each cluster node, which I believe takes into account the entire area and all the metrics and randomisations generated throughout the pipeline. Now, with results in hand, I'm trying to make sense of them and wondering whether it would make more sense biologically to use only PD or PE for these calculations. Can you point me to any references that can help me with this, or is it standard practice to run this for all the calculations?

Thank you for your input! 


Gabriela Procópio Camacho

Mar 15, 2023, 2:27:56 PM

Gabriela P. Camacho, Ph.D.
Professora e Curadora de Hymenoptera - MZUSP
Pronomes: ela/dela 

T +55 11 2065-8100

Museu de Zoologia da Universidade de São Paulo 

Av. Nazaré, 481 - Ipiranga 

04263-000 São Paulo - SP


Shawn Laffan

Mar 16, 2023, 9:47:30 PM
to, Gabriela Procópio Camacho
Hello Gabriela,

Sorry for the slow response. 

The answers all depend on the research question you are asking, i.e. what is it you want to know about your data. 

If you are looking for an overall understanding of the biotic groups in your data then I would suggest you use all of the data. This enables analyses such as those in González-Orozco et al. (2013, 2014a, 2014b) and Bein et al. (2020). 

The definition query is only needed if you wish to see how a subset of groups are related.  It is often used in tandem with CANAPE analyses to see how the zones of significant endemism relate to each other.  Alternately you might have a global data set but wish to only cluster groups (cells) within a particular subregion. 

In terms of the calculations per node, these can be run (and rerun) after the cluster analysis if you realise you need them. The system will not rerun the clustering, which is usually the longest part of a cluster analysis. The main point is to consider possible circularity in interpretations of patterns. For example, the range weighted turnover effectively tries to maximise the endemism of each cluster node, so one should expect higher endemism scores than from a non-range-weighted analysis. 
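
To make the circularity point concrete, here is a minimal sketch of a range weighted dissimilarity between two cells. It is not Biodiverse's own code: the function name, the Jaccard-style form and the 1/range weighting are assumptions chosen for illustration, and the species ranges are made up.

```python
def range_weighted_turnover(cell_a, cell_b, ranges):
    """Jaccard-style turnover with each species weighted by 1/range.

    cell_a, cell_b: sets of species names present in each cell.
    ranges: dict mapping species name to its range size (number of cells).
    Illustrative formulation only, not taken from Biodiverse itself.
    """
    def weight(species):
        return 1.0 / ranges[species]

    shared = sum(weight(sp) for sp in cell_a & cell_b)
    only_a = sum(weight(sp) for sp in cell_a - cell_b)
    only_b = sum(weight(sp) for sp in cell_b - cell_a)
    total = shared + only_a + only_b
    if total == 0:
        return None  # both cells are empty, turnover is undefined
    return 1.0 - shared / total


# Hypothetical data: sp1 is narrow-ranged, sp2 and sp3 are widespread.
ranges = {"sp1": 2, "sp2": 100, "sp3": 100}
cell_a = {"sp1", "sp2"}
cell_b = {"sp1", "sp3"}

weighted = range_weighted_turnover(cell_a, cell_b, ranges)
unweighted = range_weighted_turnover(cell_a, cell_b, {sp: 1 for sp in ranges})
```

With these made-up values the two cells look very similar under range weighting because the shared narrow-ranged species dominates the score, while the unweighted version treats them as quite different. Cells that share range-restricted species are therefore merged early, which is why endemism scores for the resulting nodes tend to be high.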

If you want to assess the significance of the per-node index scores then you can run a randomisation (it is best not to restart an old one when new calculations have been added as replication becomes harder).  By default the randomisations do not rebuild the tree (in version 4) so this will not take as long as in earlier versions.
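
As a rough illustration of what such a significance assessment involves, the sketch below ranks an observed node score against scores from randomised data. The function name and the numbers are hypothetical, and this is a generic rank-based comparison rather than a description of how Biodiverse stores or reports its results.

```python
def proportion_exceeding(observed, randomised):
    """Proportion of randomised scores that the observed score exceeds.

    Values near 1 suggest the observed score is unusually high,
    values near 0 that it is unusually low.
    """
    if not randomised:
        raise ValueError("need at least one randomised score")
    higher = sum(1 for r in randomised if observed > r)
    return higher / len(randomised)


# Hypothetical endemism score for one node and five randomisation iterations.
observed_score = 5.0
random_scores = [1.0, 2.0, 3.0, 4.0, 6.0]
rank = proportion_exceeding(observed_score, random_scores)
```

In practice one would use many more iterations than this; the small list only keeps the example readable.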

I would not recommend running all possible indices as some of them will take a long time, especially under randomisations.  The PhyloCom indices are a case in point.  We have sped them up substantially in version 4 but there are still slow points for non-ultrametric trees.

Links to references (see also ):
González-Orozco et al. 2013:
González-Orozco et al. 2014a:
González-Orozco et al. 2014b:
Bein et al. 2020:
