Group assignment without prior information

Dominika Bujnakova

Dec 12, 2025, 6:15:29 AM
to geomorph R package

Dear Dean, Mike, and the extended geomorph community,

I hope you are all doing well.

I am currently working on a single species, the wolf, which is generally quite homogeneous in cranial morphology worldwide. Because of this low variation, PCA plots show almost complete overlap among populations, which is expected. LDA, by contrast, highlights considerably more differentiation among some populations, though it requires predefined groups.

In population genetics, tools developed for microsatellites or SNPs can infer population structure, or the best-fitting number of groups given the data, without prior location information; coordinates can then be incorporated to refine the model. I am wondering whether there are analogous methods in morphometrics that could detect structure, essentially grouping individuals directly from shape data, without predefined categories.

I have explored standard options such as PCA-based clustering, bgPCA, LDA, and CVA. I also recognize that morphometric data inherently provides less detail than genetic data, that many structures are homologous despite genetic differences, and that there may be fundamental limitations to detecting structure this way. Nonetheless, I thought it would be worthwhile to ask. If you are aware of any methods or emerging approaches that approximate this kind of unsupervised structure detection in morphometric datasets, I would be very grateful for your insight.

Best regards,
Dominika

Mike Collyer

Dec 12, 2025, 9:55:26 AM
to geomorph R package
Hi Dominika,

You mentioned clustering, which, among the methods you identified, is the only one that does not require a pre-defined number of groups or training data, but that is all that comes to mind. K-means might be overlooked among clustering algorithms because it requires an a priori number of groups, but one could run K-means for 2, 3, 4, ..., groups and use some goodness-of-fit criterion to decide the optimal number (see the sketch below). In my opinion, all clustering methods are inherently flawed if they (as I think most do) use a dispersion criterion to include or exclude observations from putative groups. One nice attribute of LDA is that posterior probabilities for classification are based on generalized distances of observations to groups, meaning that the shape-variable covariances can say as much about group association as, or more than, the proximity of points in the space tangent to shape space (directions of least resistance can be defined in this space). However, covariance matrices are required for generalized distances, so there are limitations with high-dimensional data.
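
For what it's worth, here is a minimal, untested sketch of that K-means-across-K idea in R, assuming "coords" is a p x k x n landmark array (the object names are placeholders):

library(geomorph)
library(cluster)   # for silhouette()

gpa    <- gpagen(coords)            # Procrustes alignment
scores <- gm.prcomp(gpa$coords)$x   # PC scores in tangent space

Ks <- 2:6
d  <- dist(scores)
avg.sil <- sapply(Ks, function(K) {
  km <- kmeans(scores, centers = K, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])   # average silhouette width
})
best.K <- Ks[which.max(avg.sil)]   # the K the criterion favors

Average silhouette width is just one choice of criterion; gap statistics or BIC-type measures would slot in the same way.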

I have long thought about (and occasionally played with) the idea of a K-means clustering algorithm that clusters points based not on distances but on generalized distances. For something like this to work, an algorithm would have to calculate numerous pooled within-group covariance matrices to find an optimal matrix from which to calculate generalized distances to group means, and if this is varied across different numbers of groups, the number of candidate groupings grows combinatorially. (For n observations of p variables and 2 groups, n-choose-2 p x p covariance matrices are possible; for three groups, n-choose-3; etc.) There are probably ways to use machine-learning algorithms to make the task more manageable. If only there were a way to have the time to explore all of these interesting, unexplored data science regions!
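
To make the idea concrete, here is a rough, untested sketch of a single reassignment step, assuming "scores" and a candidate grouping "grp" as above, and that n - g > p so the pooled covariance matrix is invertible (with landmark data you would need to work in a reduced PC space):

pooled.cov <- function(X, grp) {
  # sum the within-group SSCP matrices, then divide by n - g
  W <- Reduce(`+`, lapply(split(as.data.frame(X), grp), function(g)
    crossprod(scale(as.matrix(g), scale = FALSE))))
  W / (nrow(X) - nlevels(factor(grp)))
}

gen.dist.assign <- function(X, grp) {
  Cp    <- pooled.cov(X, grp)
  means <- rowsum(X, grp) / tabulate(factor(grp))
  D2 <- sapply(seq_len(nrow(means)), function(i)
    mahalanobis(X, means[i, ], Cp))   # squared generalized distances
  apply(D2, 1, which.min)             # reassign each point to its nearest mean
}

The hard part, as described above, is not this step but searching the enormous space of candidate groupings that feed it.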

Although I do not know the exact algorithms used, I can appreciate that with microsatellites the problem probably admits a more nearly linear-time computational strategy because of the discrete states the variables can take. Asking a computer to run through a line of values, even thousands of them, and tally the number and locations of matches or mismatches is a pretty easy thing to do. Asking a computer to calculate 1000-choose-8 (about 2.4115 x 10^19) possible 200 x 200 covariance matrices is nearly impossible.
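
You can verify that count in R:

choose(1000, 8)   # ~2.4115e+19 candidate 8-group seed sets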

I don't think there is a quick way to reconcile this computational nightmare. You could attempt to use dispersion-based clustering algorithms and then apply LDA to the results, retaining the cases with the best posterior probabilities of classification, as an ad hoc strategy (sketched below). This would be an assumption-laden approach, however.
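
A sketch of that ad hoc strategy, reusing "scores" and "best.K" from the earlier sketch (again untested, and LDA itself needs more observations per group than variables):

library(MASS)

km  <- kmeans(scores, centers = best.K, nstart = 25)
fit <- lda(scores, grouping = factor(km$cluster), CV = TRUE)   # leave-one-out

# posterior probability each specimen has for its own cluster; low values
# flag assignments that the covariance structure does not support
post.assigned <- fit$posterior[cbind(seq_len(nrow(scores)), km$cluster)]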

Sorry not to have a better perspective on this. Maybe somebody knows of a method that will allow you to disregard everything I just said.

Best,
Mike

