Clustering error

Martin Boys

Nov 9, 2022, 8:51:12 PM
to po...@googlegroups.com
Good day everyone

I am in desperate need of your assistance. I am trying to do K-means hierarchical clustering with the aim of constructing a phylogenetic tree, and I run into the error shown below. My dataset consists of 3 populations, and my plan was to cluster all 3 populations combined, but find.clusters() stopped with a K-means error after I chose the number of PCs to retain.

To troubleshoot, I ran the clustering on each population individually and found that only the population named Heirloom triggers the error. I varied the number of PCs as much as I could, but the error remained. Below are the details of the dataset as a genclone object, followed by the scripts I ran, first for all 3 populations together and then for each population individually. The dataset is also attached in CSV format in case it helps to diagnose the problem. Any ideas on how I can solve this issue?

I thank you all wholeheartedly in advance for your assistance.

>martin25

This is a genclone object

-------------------------

Genotype information:

    4 original multilocus genotypes

   55 haploid individuals

   10 codominant loci

Population information:

    1 stratum - Pop

    3 populations defined - Hybrid, Heirloom, Grape

 

##Total Populations

>HYRRRRsm<-martin25

>HYRRRRsmclust<-find.clusters(HYRRRRsm)

Choose the number PCs to retain (>= 1):

>50

Error in kmeans(XU, centers = nbClust[i], iter.max = n.iter, nstart = n.start) :

  more cluster centers than distinct data points.

 

#######Individual Populations

##Hybrid

>HYRRRRsn<-popsub(martin25,"Hybrid")

>HYRRRRsnclust<-find.clusters(HYRRRRsn)

Choose the number PCs to retain (>= 1):

>50

Choose the number of clusters (>=2):

>3

##Grape

>HYRRRRso<-popsub(martin25,"Grape")

>HYRRRRopclust<-find.clusters(HYRRRRso)

Choose the number PCs to retain (>= 1):

>50

Choose the number of clusters (>=2):

>3

##Heirloom

>HYRRRRsp<-popsub(martin25,"Heirloom")

>HYRRRRspclust<-find.clusters(HYRRRRsp)

Choose the number PCs to retain (>= 1):

>50

Error in kmeans(XU, centers = nbClust[i], iter.max = n.iter, nstart = n.start) :

  more cluster centers than distinct data points.



data1.csv

Zhian Kamvar

Nov 10, 2022, 1:13:19 PM
to Martin Boys, po...@googlegroups.com
The key here is the word 'distinct'. The problem is that you have only 4 unique genotypes in your data set, and find.clusters() is attempting to search for up to 6 clusters (see below) from 3 principal components.
You will be able to run your analysis with find.clusters(martin25, max.n.clust = 3), but the results probably will not tell you much more beyond the fact that you have only three dimensions by which to separate your data.
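
The error itself comes from base R's kmeans(), so it can be reproduced without adegenet. The following is a minimal sketch, assuming a made-up matrix `x` (not your data) that mimics 55 individuals carrying only 4 distinct genotypes:

```r
# Hypothetical stand-in for the scores that find.clusters() hands to kmeans():
# 55 rows, but only 4 distinct rows (like 55 individuals with 4 MLGs).
distinct_rows <- rbind(c(0, 0, 0),
                       c(1, 0, 0),
                       c(0, 1, 0),
                       c(0, 0, 1))
x <- distinct_rows[rep(1:4, length.out = 55), ]
n_distinct <- nrow(unique(x))

# Asking for 6 centers fails because there are only 4 distinct points.
res6 <- tryCatch(kmeans(x, centers = 6, nstart = 10),
                 error = function(e) conditionMessage(e))
print(res6)

# Asking for 3 centers stays below the number of distinct points and runs.
res3 <- kmeans(x, centers = 3, nstart = 10)
table(res3$cluster)
```

The number of rows does not matter here; only the number of distinct rows limits how many centers kmeans() can place.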

Moreover, find.clusters has nothing to do with constructing a phylogenetic tree. I am assuming that you are following the tutorial at this point: https://grunwaldlab.github.io/Population_Genetics_in_R/Pop_Structure.html#k-means-hierarchical-clustering.
It is important to think about _why_ you are running the analyses you are running. We introduce that section in the tutorial to show a way to test the hypothesis of panmixia, but with only 4 MLGs in your data, your sample size is not big enough for that test. You can still make the tree using aboot and color your tip labels by population. If you want to find out more about clustering, I recommend reading this tutorial and paying careful attention to its warnings about when to use clustering:
https://github.com/thibautjombart/adegenet/raw/master/tutorials/tutorial-dapc.pdf
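
For the tree itself, a rough sketch of the aboot approach might look like the following. It assumes the poppr and ape packages are installed and uses the nancycats example data that ships with adegenet as a stand-in, since martin25 is not available here. The distance (provesti.dist), the number of bootstrap replicates, and the colour palette are arbitrary choices for illustration, not recommendations for your data.

```r
library("poppr")  # also loads adegenet
library("ape")

data(nancycats)                  # example data, a stand-in for martin25
cats <- popsub(nancycats, 1:3)   # keep three populations for a small tree

set.seed(999)
tree <- aboot(cats, tree = "upgma", distance = provesti.dist, sample = 20,
              showtree = FALSE, quiet = TRUE)

# One colour per population, matched to the order of the tip labels.
pops     <- factor(pop(cats))
cols     <- rainbow(nlevels(pops))
tip_cols <- cols[as.integer(pops)[match(tree$tip.label, indNames(cats))]]

plot.phylo(tree, tip.color = tip_cols, cex = 0.7)
nodelabels(tree$node.label, frame = "none", cex = 0.6, adj = c(1.2, -0.5))
legend("topleft", legend = levels(pops), fill = cols, bty = "n")
```

With your own object, replacing `cats` with martin25 should be the only change needed, assuming its population factor is set as shown in your genclone summary.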

Additionally, I have two useful tricks when you get errors:

1. Do a Google search for the error and the name of the package that the function is in. In this case, if you search for "adegenet more cluster centers than distinct data points", you will find this forum post: https://lists.r-forge.r-project.org/pipermail/adegenet-forum/2012-March/000492.html. In there, Thibaut suggests lowering the "max.n.clust" parameter.
2. Look at the default parameters for the function you are running: https://www.rdocumentation.org/packages/adegenet/versions/2.0.1/topics/find.clusters. Here, max.n.clust is set to round(nInd(x)/10), which gives 6 for your data (round(55/10)). This effectively assumes that you have somewhere between 1 and 10 clusters in your data, but unfortunately, the function was not written in a way to account for that.
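
The arithmetic in the second point can be checked directly in R. This is only a sketch: `n_ind` and `n_mlg` are copied by hand from the genclone summary earlier in the thread, and capping max.n.clust just below the number of distinct genotypes is an assumption that happens to match the value of 3 suggested above, not a rule taken from adegenet.

```r
n_ind <- 55   # haploid individuals reported for martin25
n_mlg <- 4    # original multilocus genotypes reported for martin25

# Default upper bound used by find.clusters(): round(nInd(x) / 10)
default_max <- round(n_ind / 10)
default_max

# kmeans() cannot place more centers than there are distinct points,
# so the default exceeds what the data can support.
default_max > n_mlg

# Hypothetical cap: stay below the number of distinct genotypes.
capped_max <- min(default_max, n_mlg - 1)
capped_max
```

The capped value could then be passed as find.clusters(martin25, max.n.clust = capped_max).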




Martin Boys

Nov 11, 2022, 12:51:20 AM
to Zhian Kamvar, po...@googlegroups.com
Thank you for the detailed explanation. 