Phylogenetic PCA and PACA in several modern human populations

69 views
Skip to first unread message

lv xiao

unread,
Mar 18, 2025, 2:06:38 PMMar 18
to geomorph R package
Dear all,

I am working on cranial shape data of several modern human populations (eg, east Africans, southern Europeans, south Asias).

In the paper "Phylogenetically aligned component analysis" (https://doi.org/10.1111/2041-210X.13515), it was stated that phylomorphospace analysis (PA) and Phy-PCA are applicable only when data are species' traits (rather than observations at the individual level).  Here are my questions:

1. Is it correct to perform Phy-PCA by using the mean shape of each of my studied population, together with a tree constructed at population-level (each tip represents a group, not an individual)? Populations are all of the same species, H. sapiens. I do not fully understand why Phy-PCA is only applicable to species's data, so I cannot tell if my group-level analysis is valid for Phy-PCA.
2. If the group-level analysis is valid for Phy-PCA, what prevents, if any, group size to be 1?
3. For PACA, is it possible to perform group-level analysis and individual-level analysis, respectively?
4. For trees without edge legnth information, is it correct to calculate brach length using compute.brlen(tree, method = "Grafen") in ape package and use the results for phylogenetic V-COV matrix construction in gm.prcomp?

Best regards,
Xiao

Adams, Dean [EEOB]

unread,
Mar 18, 2025, 2:40:24 PMMar 18
to geomorph R package
Xiao,

The short answer to your question is that no, it is not correct to use a population-level phylogeny to conduct a phylogenetic ordination (PACA, a phylomorphospace, phylogenetic PCA). This is for the same reason that one should  not use a population or individual level phylogeny in any phylogenetic comparative method.  Here is why. 

For phylogenetic comparative analyses, the goal is to account for the non-independence of observations (species) during the analysis. Here the species means are the data, and we estimate the nonindependence among them using an object covariance matrix, which is based on the phylogeny. That phylogeny is usually bifurcating, and each bifurcating event describes the separation of those species after their speciation event. In other words, after the bifurcation, the two species evolve independently. 

With populations, this is not the case, as there is frequently gene flow between sets of populations. What that means is that their nonindependence is more of a network, and not a phylogeny, and using a phylogeny to describe their expected covariation is incorrect.  In other words, a phylogeny  is the wrong model of expected nonindependence. Instead, we require an object covariance matrix that that describes the network of gene flow among populations, and this will not be bifurcating. For instance, using a migration matrix to account for nonindependence would be far more correct biologically. This issue was described at length in Felsenstein, 2002; Stone et al. 2011.

Now the same logic applies to phylogenetic ordination methods. One can represent populations by a bifurcating phylogeny, but that ignores any gene flow between them (which for populations with a species can occur). This is incorrect. So a better option is to represent the nonindependence using some other matrix, such as a migration matrix.  

To implement this, one can estimate the object covariance matrix using some data that allows the migration matrix to be estimated (genetic correlations between populations work to a first-order approximation), and then use the function 'ordinate' in RRPP (which geomorph calls) to perform the population-level 'phylogenetic' ordination. As data one uses the population means.

Then, if you want to have the individual level data in the plot, one can simply project them into the ordination space directly. Fortunately, the 'ordinate' function allows this, with the 'newdata' component of the function. 

Hope this helps,

Dean
--
Dr. Dean C. Adams
Distinguished Professor
Department of Ecology, Evolution, and Organismal Biology
Iowa State University


From: geomorph-...@googlegroups.com <geomorph-...@googlegroups.com> on behalf of lv xiao <lxi...@gmail.com>
Sent: Tuesday, March 18, 2025 1:06 PM
To: geomorph R package <geomorph-...@googlegroups.com>
Subject: [geomorph-r-package] Phylogenetic PCA and PACA in several modern human populations
 
--
You received this message because you are subscribed to the Google Groups "geomorph R package" group.
To unsubscribe from this group and stop receiving emails from it, send an email to geomorph-r-pack...@googlegroups.com.
To view this discussion, visit https://groups.google.com/d/msgid/geomorph-r-package/6ec61c9a-8fb2-4155-9a63-6692591ada46n%40googlegroups.com.

Omar Mejía

unread,
Mar 18, 2025, 9:11:48 PMMar 18
to geomorph-...@googlegroups.com
Dean explains  very well, I think that  you can use FST, RST or another divergence  estimator,  we used in a population  analysis  of land snails  in the past

Regards

--

lv xiao

unread,
Mar 19, 2025, 1:40:02 PMMar 19
to geomorph R package
Thank you Prof. Dean for the insightful explanation. Back to the phylogenetic context, I have some basic questions regarding what inferences could be made from Phy-PCA and PACA.

1. For Phy-PCA, regressing all PC scores on ecological variables of interest provides improved statistical power compared to using PC scores from PCA. I am not sure if this is understanding is correct and is why Phy-PCA is needed at all.

2. For PACA, data are aligned with phylogeny, what variables could be associated with PAC scores and what kind of questions could be addressed? (Really new to phylogenetics, sorry if that is somwhat naive)

3. For both Phy-PCA and PACA, will the the inference differ from using scores from PCA, given Phy-PCA/PACA, like PCA, just rotates the data, as has been mentioned in the PACA paper. Just to confirm regssion is not rotation invariant.

Thank you.

Xiao

Adams, Dean [EEOB]

unread,
Mar 19, 2025, 2:00:46 PMMar 19
to geomorph-...@googlegroups.com
Xiao,

The point of phylogenetic comparative methods is that we are interested in the relationship between sets of variables (such as shape versus ecology). but our observations are species means, not individuals. Normally we would use linear models (ANOVA, regression etc.) to address the question but because our observations are species values, they are not independent. Thus, the data violate the 'independent' part of the iid assumptions for ordinary least squares.

The solution is to use generalized least squares methods of ANOVA and regression. These allow us to account for nonindependence among observations; in this case, because of phylogenetic relationships. 

Strictly speaking, phylogenetic PCA, PACA and other phylogenetic ordination methods are not required for the analyses above. Instead, the ordinations are used just to view the patterns in morphospace. Thus one should not do regression of ecology vs. phy-PCA of shape, just like it is unnecessary to do to regression of ecology vs PCA of shape.  Just use the full set of shape variables. Also, if the full set of variables from an ordination are used (phylogenetic or otherwise), statistical results should be the same for most methods. 

Dean
--
Dr. Dean C. Adams
Distinguished Professor
Department of Ecology, Evolution, and Organismal Biology
Iowa State University

Sent: Wednesday, March 19, 2025 12:40 PM

To: geomorph R package <geomorph-...@googlegroups.com>
Subject: Re: [geomorph-r-package] Phylogenetic PCA and PACA in several modern human populations
 

lv xiao

unread,
Mar 21, 2025, 2:08:55 AMMar 21
to geomorph R package
Thank you again for the clarification. I noted gm.prcomp provided reconstructed ancestral states through geomorph:::anc.BM. Is it appropriate to instead construct ancestral states using ace function in ape, using coordinates (as below) or PC scores:

library(geomorph)
library(ape)

# Load example data
data(plethspecies)
Y.gpa <- gpagen(plethspecies$land)  
Y.gpa.2d <- two.d.array(Y.gpa$coords)  
phy <- plethspecies$phy 
Y.gpa.2d <- Y.gpa.2d[phy$tip.label,]

# Compute ancestral states with geomorph:::anc.BM for reference
ANCgeomorph <- geomorph:::anc.BM(phy, Y.gpa.2d)

# Compute ancestral states with ace in ape package
ANCace  <- ace(Y.gpa.2d[, 1] phy, type = "continuous", method = "ML") $ace

Using ace feels much like a per-landmark coordinate or per-PC style analysis. Should one instead use geomorph:::anc.BM in the contex of GM?

Xiao

Adams, Dean [EEOB]

unread,
Mar 21, 2025, 8:17:26 AMMar 21
to geomorph R package
Both ought to be equivalent.   Note that in the multivariate world, parameter estimates are dimension by dimension for all manner of statistical model. It is the statistical estimation using covariance matrices (rather than trait by trait variance) that makes a difference.

Dean
--
Dr. Dean C. Adams
Distinguished Professor
Department of Ecology, Evolution, and Organismal Biology
Iowa State University

Sent: Friday, March 21, 2025 1:08 AM

Mike Collyer

unread,
Mar 21, 2025, 9:25:37 AMMar 21
to geomorph-...@googlegroups.com
To follow up on what Dean said, here is a demonstration, using ace and geomorph:::and.BM, to show why we made this function for use in other functions.  Note that geomorph uses the Ané and Ho (2014) algorithm.

> library(geomorph)
> library(phytools)
> n <- 500
> p <- 500
> tree <- pbtree(n = n)
> Y <- fastBM(tree, nsim = p)
>
> # One trait
>
> system.time(ANCaceML  <- ace(Y[, 1], tree, 
+                type = "continuous", method = "ML")$ace)
   user  system elapsed 
  7.906   0.459   8.393 
> system.time(ANCacePIC  <- ace(Y[, 1], tree, 
+                              type = "continuous", method = "pic")$ace)
   user  system elapsed 
  0.000   0.000   0.001 
> system.time(ANCgeomorph <- geomorph:::anc.BM(tree, 
+                                              as.matrix(Y[,1:2]))[,1])
   user  system elapsed 
  0.002   0.000   0.002 

> # similar results?
> pairs(cbind(aceML = ANCaceML, 
+       acePIC =ANCacePIC, 
+       geoPIC = ANCgeomorph), pch = 16, cex = 0.5)
PastedGraphic-1.tiff
Now compare using ace with it’s fastest, PIC, by variable, and geomorph::anc.BM

> system.time(apply(Y, 2, function(x){
+   ace(x, tree, 
+       type = "continuous", method = "pic")$ace
+ }))
   user  system elapsed 
  0.125   0.005   0.131 
> system.time(geomorph:::anc.BM(tree, Y))
   user  system elapsed 
  0.013   0.000   0.013 

10-fold difference

geomorph::anc.BM is intended as an internal function that does one thing but does it fast.  It is not a universal tool, like ace and some other functions are.  One can use our internal functions, but this one is not intended to replicate everything some other top-level functions do.  It is intended only to make sure you do not have to wait hours for analyses to complete.

Best,
Mike



lv xiao

unread,
Mar 23, 2025, 11:58:41 AMMar 23
to geomorph R package
Thank you Prof. Dean and Mike. I digged into the theory and code of Phy-PCA and PACA and wish to confirm some issues.

1. In the following source code of gm.prcomp, should the Cov object be renamed as C? The if...else... statement seems to evaluate how C should be differently scaled in different conditions. The Cov object is not used anywhere in the source code of gm.prcomp.
1.jpg

2. What does unit.tree mean?

3. gm.prcomp reconstructs shape along PCs using shape.predictor, which adds OLS mean shape M in source code of shape.predictor calculated from mshape(A) to arrayspecs(x, p, k)[, , 1]. However, when GLS centering is used in Phy-PCA and PACA, is it necessary to add GLS-weighted mean (1^T %*% C^-1 %*% 1)^-1 %*%1^T %*% C^-1 %*% Y, which is what Revell 2009 called "phylogenetic means"? Additionally, when transform = T, scores along principal axes from Phy-PCA represent transformed GLS residual shape. For visualization of shape, should we "untransform" by premultiplying T from Z=TT^T and add to the GLS-weighted mean?

4. In source code of ordinate, whic his used by gm.prcomp, line 217 below suggests Z is premultiplied by Pcov, which is inverse of T from Z=TT^T. This suggests the GLS residuals Z in eq (4) becomes Z_new = Pcov %*% Z? If this is true, I do not understand why in cases where Cov is non-null but transform = F, Saz is still calculated as cross product of A and  Pcov %*% Z, not cross product of A (equivalent to C when align.to.phy = T) and Z as indicated by eq (4)?
2.jpg
3.jpg

Xiao

Mike Collyer

unread,
Mar 23, 2025, 1:28:24 PMMar 23
to geomorph R package
Xiao,


1. In the following source code of gm.prcomp, should the Cov object be renamed as C? The if...else... statement seems to evaluate how C should be differently scaled in different conditions. The Cov object is not used anywhere in the source code of gm.prcomp.


Yes, that appears to be a mistake that would leave the matrix unscaled.  We will have to fix that.

2. What does unit.tree mean?

It means the height of the tree is 1, also meaning the diagonal of the covariance matrix of phylogenetic covariances has a diagonal of 1s.  There is no reason to do this unless it is desired to have eigenvalues in physignal.eigen that are scaled.


3. gm.prcomp reconstructs shape along PCs using shape.predictor, which adds OLS mean shape M in source code of shape.predictor calculated from mshape(A) to arrayspecs(x, p, k)[, , 1]. However, when GLS centering is used in Phy-PCA and PACA, is it necessary to add GLS-weighted mean (1^T %*% C^-1 %*% 1)^-1 %*%1^T %*% C^-1 %*% Y, which is what Revell 2009 called "phylogenetic means”?

Shape predictor can only use the shape data you provide it.  It does not anticipate how x was obtained.  If one would rather shift the GLS-mean to be the origin, they would have to adjust A, the shape data.  Or if one would rather untransform the scores in x — something we do in picknplot.shape — they can do that.  But no, we are not going to design shape.predictor to be that sophisticated.  Sometimes the sophistication has to be left to the user.

Additionally, when transform = T, scores along principal axes from Phy-PCA represent transformed GLS residual shape. For visualization of shape, should we "untransform" by premultiplying T from Z=TT^T and add to the GLS-weighted mean?

This is what we do in picknplot.shape, except we use an inverse (not transform) to back-transform the scores produced by one of our plots.


4. In source code of ordinate, whic his used by gm.prcomp, line 217 below suggests Z is premultiplied by Pcov, which is inverse of T from Z=TT^T. This suggests the GLS residuals Z in eq (4) becomes Z_new = Pcov %*% Z? If this is true, I do not understand why in cases where Cov is non-null but transform = F, Saz is still calculated as cross product of A and  Pcov %*% Z, not cross product of A (equivalent to C when align.to.phy = T) and Z as indicated by eq (4)?


I’m not sure what is missing.  First:

  tf <- if(!is.null(Cov) && transform.) TRUE else FALSE

indicates that residuals are to be transformed, if TRUE.  It is important to note that Cov is the phylogenetic covariance matrix and A might be also, but for PCA or phy-PCA it is an identity matrix.  Then this step:

  if(tf) Z <- as.matrix(Pcov %*% Z)

transforms residuals if any only if residuals are meant to be transformed.  Then this step, does one of two things:
  
  Saz <-  if(tf || is.null(Cov)) crossprod(A, Z) else crossprod(A, 
                                                        Pcov %*% Z)

It skips multiplying by Pcov if Z was already transformed or if Cov is null (in which case Z is untransformed). If neither tf is TRUE nor Cov is NULL (meaning cov exists), then residuals still need transformation before alignment, which is what the “else” portion does.  SVD is performed on Saz.  But you will notice later, that:

  sy <- if(tf || is.null(Cov)) sum(svd(Z)$d^2) else 
    sum(svd(Pcov %*% Z)$d^2)
  s$d <- s$d^2/sum(s$d^2) * sy / max(1, n - 1)
  s$sdev <- sqrt(s$d)

finishes the job of scaling eigenvalues to be consistent with any of the methods.

I think what could be throwing you off is A can be one of two different matrices, and whether rotating or transforming residuals, Pcov %*% Z is necessary, but not both.  Where the transformation is complete is in the calculation of the eigenvalues.

Not sure if that helps…
Mike

lv xiao

unread,
Mar 24, 2025, 6:50:19 AMMar 24
to geomorph R package
Thank you for all the explanations. For Q4, what I do not understand is exactly your statement "If neither tf is TRUE nor Cov is NULL (meaning cov exists), then residuals still need transformation before alignment", or equivalently "whether rotating or transforming residuals, Pcov %*% Z is necessary". When cov exists, I understand GLS residuals of Y is obtained. But when cov exists and when tf is explicitly set to FALSE, why is it still necessary to calculate Saz used for sbusequent SVD as Saz <- crossprod(A, Pcov %*% Z)? Since tf is FALSE, why isn't Saz <- crossprod(A, Z)?

I tried to understand the code from eq (3) and (4) from the PACA paper and its appendix, but still struggle to understand the above issue. Thank you for any clarifications.
1.jpg

Xiao

Mike Collyer

unread,
Mar 24, 2025, 8:04:38 AMMar 24
to geomorph-...@googlegroups.com, geomorph R package
Xiao,
The gm.prcomp function uses the ordinate function. In ordinate, A and Cov can be different things. In gm.prcomp they are the same matrix. If transforming residuals, the cross – product has already been calculated. If not transforming residuals, but still aligning data to the, covariance matrix, the cross – product needs to be calculated. The series of steps that are confusing you do this.

You are free to make your own calculations and compare them to the results that are produced. In terms of programming, the code does not necessarily translate well to the equations that are found in an article. Statistical computing is an art form that incorporates stratagems  to achieve results, efficiently. I would not look at the coding and expect it to make sense. But if you understand the algebra, you can compute the results with your own equations and determine whether the results are consistent. It is possible that there are mistakes in the coding and of course, we have made some and had to correct them. But please don’t ask me to defend my coding without evidence of mistakes. My only response to that is, I’m sorry that the code does not make sense, but if it produces the correct results, it does not have to make sense. Let me know if it’s not producing the correct results.

Mike


Sent from my iPhone

On Mar 24, 2025, at 6:50 AM, lv xiao <lxi...@gmail.com> wrote:

Thank you for all the explanations. For Q4, what I do not understand is exactly your statement "If neither tf is TRUE nor Cov is NULL (meaning cov exists), then residuals still need transformation before alignment", or equivalently "whether rotating or transforming residuals, Pcov %*% Z is necessary". When cov exists, I understand GLS residuals of Y is obtained. But when cov exists and when tf is explicitly set to FALSE, why is it still necessary to calculate Saz used for sbusequent SVD as Saz <- crossprod(A, Pcov %*% Z)? Since tf is FALSE, why isn't Saz <- crossprod(A, Z)?

I tried to understand the code from eq (3) and (4) from the PACA paper and its appendix, but still struggle to understand the above issue. Thank you for any clarifications.
Reply all
Reply to author
Forward
0 new messages