Understanding procD.pgls outputs

Werner de Gier

unread,

Sep 28, 2022, 7:54:42 AM9/28/22

to geomorph R package

Hi all,

Although I'm fairly new to geomorph, I thought I was doing quite well in understanding my current R code - a externally suggested PGLS analysis on my data, however, changes this a bit. I'll sketch the situation below, hopefully some of you can help me place this in a better context.

I'm currently analysing 182 specimens, based on 22 landmarks per specimen. I already made a few morphospaces, using both a pre-defined in- and outgroup. As an explanatory variable, I'm using "host association" - a 9-level factorial variable, which I used to colour my morphospace. In addition, I use this variable to test the mean shape differences of the grouped specimens, using procd.lm() and pairwise() (this analysis was already discussed in this forum).

For 34 of the 182 species, I have the phylogenetic placement, which I used to make a (limited) phylomorphospace. Now, I want to do a PGLS to find out if the phylogeny is influencing the landmark data in any way, for which I wrote the following code:

y<-two.d.array(Shapes_phy) #Shapes_phy are the alligned procrustes coordinates of the 34 species
x<-as.factor(pruned_data_sorted$group) #This is the explanatory variable (host associaton).

Phylo_anova <- procD.pgls(y~x, tree, lambda = 0.482777) #tree = phylogenetic tree, and the lambda was calculated for the shape data to be 0.482777, based on phylosig().
summary(Phylo_anova)

The resulting P-value is calculated as P = 0.451 (higher than 0.05) -the full output of summary() can be found here:

Analysis of Variance, using Residual Randomization
Permutation procedure: Randomization of null model residuals
Number of permutations: 1000
Estimation method: Generalized Least-Squares (via OLS projection)
Sums of Squares and Cross-products: Type I
Effect sizes (Z) based on F distributions

Df SS MS Rsq F Z Pr(>F)
x 8 2.6601 0.33252 0.2406 0.9901 0.13363 0.451
Residuals 25 8.3961 0.33584 0.7594
Total 33 11.0562

Call: procD.lm(f1 = y ~ x, iter = iter, seed = seed, RRPP = TRUE, SS.type = SS.type,
effect.type = effect.type, int.first = int.first, Cov = Cov, data = data, print.progress = print.progress)

Am I correct in understanding that the landmark data and the placement within the morphospace of the 34 included species is fully independent of their phylogenetic position and relatedness, when grouping the species based on their host-association? I'm having difficulties understanding the results from these types of analyses.

Thanks again, this forum is amazing for a PhD student like myself starting to learn morphometrics!

-Werner de Gier

Adams, Dean [EEOB]

unread,

Sep 29, 2022, 5:18:11 AM9/29/22

to geomorph-...@googlegroups.com

Werner,

The interpretation you stated is backwards. The results are telling you that shape is not associated with host group, once phylogenetic non-independence is taken into consideration.

The models you have performed are both of the form: Y ~ Xb + e . Here you have shape (Y) and host group (X). You are therefore evaluating whether X (host groups) explains variation in Y (shape). This linear model is an ANOVA.

With PGLS, the model is still the same except for one change: now you have: Y ~ Xb + e | phy (‘given’ the phylogeny). The phylogeny is not incorporated as an explanatory (X) variable, but instead is modeling the expected error covariation among observations. In the analysis not including the phylogeny, observations are assumed to be independent. This is an ordinary least squares (OLS) linear model. In the PGLS, the observations are correlated, with an expected correlation as described by the phylogeny and some evolutionary model (in your case, Brownian motion at a given phylogenetic signal level). This is a generalized least squares (GLS) linear model. Hence the name: phylogenetic generalized least squares.

Best,

Dean

Dr. Dean C. Adams (he/him)

Distinguished Professor of Evolutionary Biology

Department of Ecology, Evolution, and Organismal Biology

Iowa State University

https://faculty.sites.iastate.edu/dcadams/

phone: 515-294-3834

--
You received this message because you are subscribed to the Google Groups "geomorph R package" group.
To unsubscribe from this group and stop receiving emails from it, send an email to geomorph-r-pack...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/geomorph-r-package/95ef0e24-6c0f-45a6-b3c3-242d87f50e6an%40googlegroups.com.

Mike Collyer

unread,

Sep 29, 2022, 8:14:03 AM9/29/22

to geomorph R package

Werner, and everyone who is interested,

Dean addressed the main point of your inquiry but I would like to address another point, in case others reading this find themselves in a similar analytical situation. You indicate that you set lambda = 0.482777 based on the optimized value obtained from phytools::phylosig. That function only finds a univariate phylogenetic signal (lambda) and if multivariate data (like shape data) are used, it will “conveniently” just use the first variable. Here is a demo:

> phy <- pbtree(n = 20)

> Y <- cbind(fastBM(phy), fastBM(phy))

> dim(Y)

[1] 20 2

> phylosig(phy, Y, method = "lambda”) # entering multivariate data… it works but shouldn't

Phylogenetic signal lambda : 0.987641

logL(lambda) : -26.7988

> phylosig(phy, Y[,1], method = "lambda”) # evidence that phylosig uses just the first variable

Phylogenetic signal lambda : 0.987641

logL(lambda) : -26.7988

> phylosig(phy, Y[,2], method = "lambda")

Phylogenetic signal lambda : 1.00191

logL(lambda) : -19.3934

As you can see, for the multivariate data, Y, lambda is estimated only for the first variable. THIS IS DANGEROUS! Having pointed this out, what is the best solution? Well, currently we (the field of macroevolutionary biology) do not have an explicit optimization strategy for finding lambda for multivariate data. If the data are not highly multivariate (more variables than observations) or close to being highly multivariate, one could attempt to re-create phylosig with multivariate estimation of log-likelihood but if data are highly multivariate, log-likelihood cannot be estimated. This approach assumes that lambda should be constant for every variable in the multivariate data, which is another topic ready for debate.

Indeed, in procD.pgls, it is assumed that one scalar of lambda is used for every variable, so how could one employ an optimization strategy? Although not perfect, if we assume that the highest model likelihood should be found for a model that minimizes the dispersion of residuals, one could employ an optimization strategy by varying lambda and finding which value of lambda yields the smallest residual error. Here is an example that starts to show how that works, using the same data from the previous example (and truncated for clarity):

> summary(procD.pgls(Y~1, phy = phy, lambda = 0))

df SS MS

Residuals 19 18.733 0.98593

> summary(procD.pgls(Y~1, phy = phy, lambda = 0.1))

df SS MS

Residuals 19 14.997 0.78934

> summary(procD.pgls(Y~1, phy = phy, lambda = 0.2))

df SS MS

Residuals 19 13.304 0.70022

> summary(procD.pgls(Y~1, phy = phy, lambda = 0.4))

df SS MS

Residuals 19 12.256 0.64506

Notice that (1) the model only contains an intercept, meaning I am just entertaining a scaling of the Brownian motion (BM) model of evolutionary divergence, and that (2) the residual SS (RSS) is decreasing as I increase lambda in increments of 0.1. I stopped at lambda = 0.4 but I could continue this procedure until lambda = 1. (Since I simulated data with lambda = 1, procD.pgls with lambda = 1 would probably be near the best outcome.) With real data, there might be a point where RSS is lowest and continued increase of lambda will cause it to rise. Finding the lambda where RSS is lowest is a good solution for determining what lambda should be for your multivariate data. (When you get to an interval, say 0.6-0.7, you might start using smaller increments to zero in on a good value of lambda. If lambda is a little off, say 0.65 instead of 0.646222546, it will have little consequence for you analysis.)

As a caveat for the rigorous analytical minds reading this: this approach assumes that the trace of the residual covariance matrix (RSS above) is comparable and a useful surrogate to the determinant of the residual covariance matrix, which is used to estimate the log-likelihood of models. For only slightly multivariate data, this is likely the case. For shape data, it is probably not a huge concern, even for more complex data. For highly multivariate data, there really is not much choice, since likelihoods cannot be estimated.

Regardless of the quibbles we might have for multivariate lambda optimization, please do not input shape data into phylosig and think that it found an optimized lambda value for you. It actually only found an optimized value for your first variable of the multivariate data set, only. Trial and error would be a much better approach for attempting to optimize lambda for your multivariate data!

Hope this helps.

Mike

To view this discussion on the web, visit https://groups.google.com/d/msgid/geomorph-r-package/CO6PR04MB8427F761F3DBCB26D2FC3509A2579%40CO6PR04MB8427.namprd04.prod.outlook.com.

Werner de Gier

unread,

Oct 3, 2022, 11:52:06 AM10/3/22

to geomorph R package

Dear Mike, Dean,

Thank you both for your extensive answers, this helps a lot!

Best wishes,
Werner

Op donderdag 29 september 2022 om 14:14:03 UTC+2 schreef mlco...@gmail.com:

Reply all

Reply to author

Forward