Interpreting group discrimination when p > n: CVA vs bgPCA


Dominika Bujnakova

Sep 22, 2025, 4:29:52 AM
to geomorph R package

Hi all,

I have a question regarding group discrimination tests in morphometrics.

I’m working with a dataset where p > n (many landmark variables; the number of specimens is roughly half of p). In a test for differences between two groups, only ~3% of shape variation is explained by the groups, yet the difference is statistically significant with a large effect size (Z ≈ 4.5–5.5).

  • When I plot PC1 vs PC2, there is strong overlap between groups, and together these two axes explain only about 30% of variation. Overlap remains apparent with PC3 and PC4 combinations.
  • I performed CVA to assess discriminatory power. In the CVA histogram, the groups appear well separated, with >80–90% of specimens correctly classified.
  • In contrast, using bgPCA, the histogram shows overlapping groups, with ~73% correctly classified, which seems more consistent with what is observed in the PCA.

Questions:

  1. Given that bgPCA appears to reflect the observed overlap in PCA, is it more trustworthy than the CVA results, or should the classification accuracy of CVA be prioritized? I understand that CVA will always show better discriminatory power than bgPCA, but which one is generally recommended?
  2. How does p > n influence the results of CVA, bgPCA, or even procD.lm?

Thanks in advance for any guidance!

Domi

Mike Collyer

Sep 22, 2025, 6:19:44 AM
to geomorph-...@googlegroups.com
Hi Domi,

Questions:

  1. Given that bgPCA appears to reflect the observed overlap in PCA, is it more trustworthy than the CVA results, or should the classification accuracy of CVA be prioritized? I understand that CVA will always show better discriminatory power than bgPCA, but which one is generally recommended?
There are a few things to unpack here.  First, the separation of groups in CVA or bgPCA plots is something of an illusion.  The vectors are linear combinations of variables that best separate groups, and because there are so many more variables than observations, there is some combination of variables that appears to make the groups separate.  It is circular reasoning to ask the analysis to show the axes that best separate groups and then assess group differences by their separation on these “biased” axes.  PCA reveals the axes with the most shape variation.  There could be factors other than group differences that explain some of the shape variation.  Groups can differ in shape, but that can be difficult to see without an explicit rotation to show group differences.  This is what bgPCA attempts to do: rotate the data space to axes that best characterize group differences.  But bgPCA can do this too well if p > n, unless the dimensional disparity is accounted for with something like cross-validation.
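The circularity is easy to demonstrate with a quick numerical sketch (in numpy rather than R, purely to illustrate the linear algebra): with p > n, even pure noise with arbitrarily assigned group labels can be "separated" perfectly by some linear combination of variables.

```python
# Sketch with simulated data: when p > n, a discriminant axis can always
# be found that separates the groups, even when no real difference exists.
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                       # p > n, as in the question
X = rng.normal(size=(n, p))         # pure noise: no real group structure
y = np.repeat([1.0, -1.0], n // 2)  # arbitrary two-group labels

# Least-squares "discriminant" axis: because p > n, X almost surely has
# full row rank, so an exact solution to X @ w = y exists.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
scores = X @ w

accuracy = np.mean(np.sign(scores) == y)
print(accuracy)                     # 1.0: perfect "separation" of noise
```

The same mechanism is what makes un-cross-validated CVA or bgPCA plots look convincing when p > n.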

One of the unfortunate things about our discipline is the merging of different analyses into one by name.  There is no classification performed with bgPCA.  It is an eigen decomposition of the fitted values of a linear model (which has groups as an effect), followed by projection of the mean-centered data onto the eigenvectors.  It is similar to redundancy analysis, which performs eigen decomposition on both fitted values and residuals.  CVA is eigen decomposition of a matrix product: the inverse of the residual covariance matrix (from the same linear model used in bgPCA) times the fitted-values covariance matrix (from the same linear model), followed by projection of the mean-centered data onto the eigenvectors.  CVA DOES NOT PERFORM CLASSIFICATION.  However, most software that performs CVA also performs a form of classification, although there are other ways it could be done.
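For concreteness, here is a small numpy sketch of the two decompositions on simulated data (with p < n so the residual covariance is invertible; the group structure, sizes, and effect magnitude are all made up):

```python
# Both analyses start from the same linear model (groups as the effect);
# they differ only in which matrix is eigen-decomposed.
import numpy as np

rng = np.random.default_rng(1)
n, p = 60, 4
groups = np.repeat([0, 1, 2], n // 3)
X = rng.normal(size=(n, p)) + groups[:, None] * 0.8  # shift means by group

Xc = X - X.mean(axis=0)
# Fitted values of the group-effect model are just the group means.
fitted = np.vstack([Xc[groups == g].mean(axis=0) for g in groups])
resid = Xc - fitted

B = fitted.T @ fitted / (n - 1)   # fitted-values ("between") covariance
W = resid.T @ resid / (n - 1)     # residual (within-group) covariance

# bgPCA: eigenvectors of B alone; then project mean-centered data.
bg_vals, bg_vecs = np.linalg.eigh(B)
bg_scores = Xc @ bg_vecs[:, ::-1]

# CVA: eigenvectors of inv(W) @ B -- impossible if W is singular (p > n).
cva_vals, cva_vecs = np.linalg.eig(np.linalg.inv(W) @ B)
cva_scores = Xc @ np.real(cva_vecs)

# With g = 3 groups, at most g - 1 = 2 axes carry between-group variation.
print(np.sum(bg_vals > 1e-10), np.sum(np.real(cva_vals) > 1e-10))  # 2 2
```

Note that neither calculation classifies anything; both stop at projected scores.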

Classification should find the posterior probabilities of group association based on the prior probabilities of group association.  The form of classification typically presented with CVA assumes equal prior probabilities (which might be silly if group sizes vary greatly).  Under that assumption, posterior probabilities are determined directly by Mahalanobis distances (a smaller distance to a group mean means a higher posterior probability of association with that group).  A post-CVA classification analysis might not report the actual probabilities, but rather whether individuals were correctly classified.  Keep in mind that with 10 groups, for example, a posterior probability of 11% (larger than 1/10) means classification to the correct group, even if the support is underwhelming.  But Mahalanobis distance requires inverting a covariance matrix, which is impossible if p > n, as the matrix is singular.  Something would have to be done to contrive the distance, whether that means using a generalized inverse or fewer PCs than n - 1.
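The singularity problem is easy to see numerically.  In this simulated numpy example (group sizes and dimensions are arbitrary), the pooled within-group covariance matrix has rank at most n minus the number of groups, so it cannot be inverted when p > n:

```python
# Sketch: with p > n, the pooled within-group covariance is singular,
# so the Mahalanobis distances behind the usual classification cannot
# be computed without a workaround (generalized inverse, or fewer PCs).
import numpy as np

rng = np.random.default_rng(2)
n, p, g = 30, 60, 2               # p > n, two groups
X = rng.normal(size=(n, p))
groups = np.repeat([0, 1], n // 2)

means = np.vstack([X[groups == gi].mean(axis=0) for gi in groups])
resid = X - means
W = resid.T @ resid / (n - g)     # pooled within-group covariance, p x p

rank_W = np.linalg.matrix_rank(W)
print(rank_W, "of", p)            # 28 (= n - g), far below p = 60
```

Any choice that makes this matrix invertible (a generalized inverse, or keeping fewer than n - 1 PCs) changes the distances, and therefore the classification.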

To be clear, CVA and bgPCA find eigenvectors (canonical vectors and principal components, respectively) and project data onto these vectors.  These analyses stop at that point.  I do not know what “classification with bgPCA” is, but if it means using projected scores from bgPCA for classification, that is simply wrong and should be avoided.


  2. How does p > n influence the results of CVA, bgPCA, or even procD.lm?
Technically, it does not influence CVA or bgPCA at all.  It influences classification.  If p > n, an alternative for the covariance matrix must be found so that it can be inverted.  Any choice for doing this is arbitrary.  If you know of software or a function that performs classification along with bgPCA, make sure you know what it is doing.  Something is amiss here.

Because there is no matrix inversion to worry about, procD.lm is unaffected by p > n.
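A quick numpy sketch of why (simulated data, not the procD.lm code itself): the sums of squares in such a model are sums of squared values of the fitted values and centered data, equivalent to traces of cross-product matrices, so no inverse is ever needed, whatever the dimensionality.

```python
# Sketch: sums of squares for a group-effect model with p >> n.
# No covariance matrix is inverted anywhere, so p > n is no obstacle.
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 200                    # p much larger than n
X = rng.normal(size=(n, p))
groups = np.repeat([0, 1], n // 2)

Xc = X - X.mean(axis=0)
fitted = np.vstack([Xc[groups == k].mean(axis=0) for k in groups])

SS_total = np.sum(Xc**2)          # total SS = trace of Xc' Xc
SS_effect = np.sum(fitted**2)     # group-effect SS = trace of F' F
R2 = SS_effect / SS_total
print(round(R2, 3))
```

RRPP then assesses the effect by permuting residuals and recomputing these sums of squares, which likewise never requires an inverse.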

We have a function, prep.lda, in RRPP that lets you take control of the decision of how many PCs to use for classification and the choice of prior probabilities before performing LDA (linear discriminant analysis, the same as CVA) with the MASS::lda function in R.  By doing this, one takes ownership of the classification rather than relying on the contrivances some other functions might use.  The take-home point is to know exactly how classification is performed in any function that provides it.  Along the way, some assumptions are made, and they might not align with your analytical goals.
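To make that workflow concrete, here is a numpy sketch of the same idea on simulated data (the number of PCs, the priors, and the group structure are all arbitrary choices here, not RRPP code): reduce to k < n PCs first, so the pooled covariance is invertible, then classify by Mahalanobis distance with explicit priors.

```python
# Sketch of the prep.lda idea in plain linear algebra: YOU choose the
# number of PCs (k) and the priors, instead of a hidden default.
import numpy as np

rng = np.random.default_rng(4)
n, p, k = 30, 80, 5               # p > n; keep only k PCs
groups = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[groups == 1] += 1.0             # a real group difference (simulated)

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:k].T                 # scores on first k PCs; now k < n

means = np.vstack([Z[groups == g].mean(axis=0) for g in (0, 1)])
resid = Z - means[groups]
W = resid.T @ resid / (n - 2)     # pooled covariance, now invertible
Wi = np.linalg.inv(W)
priors = np.array([0.5, 0.5])     # an explicit choice, stated up front

# Assign each specimen to the group minimizing Mahalanobis distance,
# adjusted by log-priors as in LDA.
d2 = np.array([[(z - m) @ Wi @ (z - m) for m in means] for z in Z])
assigned = np.argmin(d2 / 2 - np.log(priors), axis=1)
acc = np.mean(assigned == groups)
print(acc)
```

Note this is resubstitution accuracy (the same specimens train and test the rule), which is optimistic; cross-validation would be the honest follow-up.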

Hope that helps!
Mike




Dominika Bujnakova

Oct 7, 2025, 3:46:42 AM
to geomorph R package

Hello Mike,

Thank you very much for your detailed and insightful reply! It was extremely helpful.

I should clarify one point: I actually used groupPCA (from Morpho) rather than bgPCA (geomorph), which I mistakenly named in my previous message. Based on the function description, though, it seems to fall into the same conceptual category as bgPCA or CVA.

I have now corrected the analyses accordingly and, in the process, gained a better understanding of the distinctions between these methods and the implications of p > n. 😊

Thank you again for taking the time to explain this so thoroughly!

Best wishes,
Dominika

Mike Collyer

Oct 7, 2025, 7:03:23 AM
to geomorph R package
Hi Dominika,

I have just a couple points of clarification so people reading this thread will not get confused.  First, geomorph does not have a bgPCA function.  Second, thanks for clarifying that you used Morpho::groupPCA.  Although you used this function, it produces bgPCA objects, so functions like classify.bgPCA or predict.bgPCA work with the objects produced by groupPCA.  Finally, as I described before, classification is performed by the Morpho::classify suite of functions, not within groupPCA itself.  I checked what the classify.bgPCA function does: it uses either the un-cross-validated scores or the cross-validated scores from bgPCA as data in the classification, depending on what was used in groupPCA.  I’m not sure what first cross-validating scores in bgPCA and then cross-validating Mahalanobis distances in classify would mean, but if cross-validation was not used in bgPCA, then using the bgPCA scores would be the same as using the original data, as long as one did not use only, e.g., the first 2-3 components.  The classify.bgPCA function does allow prior probabilities to be defined, so in this regard it is very much like the MASS::lda function.

One final comment.  The weighting option in groupPCA weights the covariance matrix by group sizes, which is the same as using prior probabilities based on group sizes for classification.  Although there is nothing nefarious about any of these functions, simply using the program defaults could mean weighting the covariance matrix and cross-validating bgPCA scores, then using classify with group-weighted priors and cross-validating already cross-validated scores.  One has to be very careful to understand exactly what each function is doing, based on the arguments used, or spurious results are possible.

Cheers!
Mike
