PERMANOVA/CVA - how to do in geomorph?


Bob LePaul

Apr 14, 2021, 7:25:15 AM
to geomorph R package
Dear colleagues,

I have been reading this paper (DOI: 10.1590/1982-0224-2018-0085), in which the authors state that they conducted a PERMANOVA/CVA using the R packages geomorph, shapes and multigroup in order to determine what percentage of variance is explained by the first, second, third... canonical variate. However, in none of these packages can I find a function that does this. Might anyone here know which functions I should use for this purpose?

Thank you for your time!
Bob LePaul

Mike Collyer

Apr 14, 2021, 7:49:13 AM
to geomorph R package
Dear Bob,

It looks to me that they reference this package: https://cran.r-project.org/web/packages/multigroup/index.html.  It also appears that they could perform CVA in this package.  The PERMANOVA reference could be to a function in this same package or to procD.lm in geomorph.  Because they did not cite the specific functions in the packages they used, it is not possible to fully reconcile how they did their analyses.  Maybe there is more information in some supporting information — I did not investigate that thoroughly — but their description is vague, regardless.

If you want to perform CVA, the best function in R is the lda function of the MASS package (lda = linear discriminant analysis — same as CVA).  There is a function in the RRPP package, on which the geomorph package depends and whose functions are all available to geomorph objects, called prep.lda.  This useful function allows one to take a procD.lm object and prep it for use in lda, so one does not have to wrestle with figuring out how to make lda work if they have covariates other than “group” in their model.

Hope that helps!
Mike 


Bob LePaul

Apr 14, 2021, 8:24:19 AM
to geomorph R package
Dear Michael,

thank you for your help. I have no trouble conducting a CVA itself. What I want to know is how much of the variance in shape is explained by the first, second, third, fourth... canonical variate. That is information that most CVA functions do not provide. Might you know how to extract this information?

Cheers
Bob

Mike Collyer

Apr 14, 2021, 9:17:40 AM
to geomorph R package
Hi Bob,

First, there is no such thing as “percent variance explained” for canonical variates.  What the paper appears to do — what many have done — is apply the same logic by which one divides the eigenvalues from a PCA by the sum of the eigenvalues to find the percentage explained by PCs.  That is possible because the sum of the eigenvalues from PCA is the multivariate variance.  However, CVA performs a singular value decomposition (SVD) of a matrix product, the between-group covariance matrix times the inverse of the within-group covariance matrix; i.e., a multivariate generalization of an F-value from ANOVA.  What some people do is take the squared singular values from the SVD, divide them by their sum, and call this “percent explained variation”.  But this is wrong.  It would be more appropriate to call it “percent of multivariate F”.  Some might call it “the percent of between-group variance relative to within-group variance”, which gets closer to accurate, but is still misleading.

Second, one reason this can be silly: if there are only two groups, the first axis will ALWAYS explain 100% of the summed squared singular values.  This could be deemed impressive, but it does not convey whether the between-group variance is large or small compared to the within-group variance.  For example, the MANOVA might say the group differences are not significant, yet that one axis still explains 100% of what is going on.

Third, perhaps the only utility of doing this is to ascertain, for example, if one has 8 groups and thus 7 total CVs, how clear a picture can be seen in 2 dimensions for inferences one might wish to make across 7 dimensions.

Fourth, I must reiterate that what the authors did in that paper should not be repeated.  Using CVA to suggest anything about the dispersion of shapes in shape space is a misuse of the method.  PCA can do that; CVA cannot.  CVA can give one a picture of the ability to classify individual observations to the correct groups.  That would require projecting confidence ellipses onto the plot, not convex hulls.  The authors appear to be mixing concepts from PCA and CVA together.

Finally, if one wishes to do this with the lda function, there is an $svd component that returns the singular values.  Square these, find the sum of the squared values, and divide each by that sum, and you have the percentages you seek.  If you are using a different function, check whether it returns singular values — these are what you need.
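For illustration, here is a minimal sketch of that calculation, using the built-in iris data purely as a stand-in for any lda fit:

library(MASS)
fit <- lda(Species ~ ., data = iris)
sv <- fit$svd                # singular values, one per canonical variate
round(sv^2 / sum(sv^2), 4)   # the proportions described above; print.lda reports these as "Proportion of trace"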

Hope that helps!
Mike

PS the paper also calls singular values “eigenvalues” in Table 6.  Singular values and eigenvalues are not the same!


Bob LePaul

Apr 14, 2021, 10:07:58 AM
to geomorph R package
Dear Michael,

thank you for the help. In my case I have 12 groups, and I want to see how much of the variation is explained by the first two CVs (relative to the CVs left out), just to make clear to readers that the contribution of the later CVs is relatively low and unimportant. So it seems to be one of the cases where doing so may be worthwhile. Now I guess I have to find the singular values, because right now I have only eigenvalues for every CV. Another colleague of mine has just pointed out that maybe I could use the eigenvalues I have to calculate the “percent of multivariate F” for CV1 as eigenvalue(CV1) / sum of eigenvalues across all CVs, as suggested here: https://www.researchgate.net/post/How-to-calculate-the-explained-variance-per-factor-in-a-principal-axis-factor-analysis

Would that also be a way to go about it, or would it be possible to calculate the singular values from the eigenvalues I have already obtained from a different package? The dataset takes a long time to compute (n = 14,000 and 12 groups), so I would like to avoid redoing that over and over if possible.

Thank you again for your help!

Cheers
Bob

Mike Collyer

Apr 14, 2021, 11:11:04 AM
to geomorph R package
Hi Bob,

Maybe it would help if you explained what package/function or program you are using.  It is possible that what they are calling eigenvalues and what I am calling singular values are the same, or that their eigenvalues are the squares of the singular values.  My point was that the vectors obtained from SVD of the matrix product are generally referred to as canonical vectors, not eigenvectors, and the values from the SVD are singular values.  One can perform SVD on a covariance matrix and the singular values are the same as eigenvalues.  (Now that I think of it, the singular values of a product of covariance matrices should not be squared, as we normally do for matrix cross-products, because the units are already squared.  So summing the singular values rather than the squared singular values would be more appropriate.)

One can also find the characteristic roots via “eigenanalysis” and call these eigenvalues.  However, eigenanalysis performed on a square but non-symmetric matrix — like the product of the between-group covariance matrix and the inverse of the within-group covariance matrix — is not really eigenanalysis in my opinion (which is usually defined for symmetric matrices), so it is a stretch to call the characteristic roots “eigenvalues”.  If this is what your software is doing, it might call them eigenvalues out of convenience.  If your software is performing singular value decomposition — as MASS::lda does — these are singular values.  Again, your software might call these eigenvalues.  Note that the two approaches will not yield the same values, unlike they would if you performed both on a standard covariance matrix.

So, add up the eigenvalues and divide each by that sum to get the percent of whatever is explained.  I would encourage you, however, to know exactly how they are calculated, as this will affect the distribution of “percent whatever explained”.

Cheers!
Mike

Bob LePaul

Apr 14, 2021, 11:16:34 AM
to geomorph R package
Dear Mike,

thank you for your insight. I am using the programs CoordGen and CVAgen, which were the de facto standards for CVA prior to geomorph. In CVAgen, the eigenvalues are from the (variance-covariance between groups)/(variance-covariance within groups) matrix, rather than from the variance-covariance matrix itself. I will try adding up the eigenvalues to get a "percent explained" value for each CV; I hope this approach is accurate. Thank you for your advice.

Cheers
Bob

Miriam Zelditch

Apr 14, 2021, 11:18:46 AM
to geomorph R package
My guess is that what the paper reported is the "percent of the trace" value from the shapes.cva function in the shapes package. The paper is difficult to interpret because the authors refer to three packages, and none of them does what the paper calls a PERMANOVA/CVA. geomorph does what the authors (and Anderson) call PERMANOVA (a permutational ANOVA of the distance matrix, i.e., a Procrustes ANOVA), and based on the table that was presumably done using the procD.lm function in geomorph. The only CVA function in those packages is shapes.cva, and that does return the "percent of the trace".



--
Miriam Zelditch
Associate Research Scientist
Museum of Paleontology
University of Michigan
Ann Arbor, MI  48109-1079

Mike Collyer

Apr 14, 2021, 11:35:39 AM
to geomorph R package
Note that shapes.cva wraps MASS::lda, and the trace and the sum of the singular values are the same thing.  To me this is what CVA does (performs SVD).  The print.lda function also returns a "proportion of trace".

And to emphasize, don’t square singular values.  (This is something we have to do with matrix cross-products and for a moment I crossed wires.)

Maybe somebody can verify if this is what CVAGen does and calls the singular values, eigenvalues, or if it solves characteristic roots.  As it is not an R package, I cannot as easily rip it apart and look at the coding.

Mike

Miriam Zelditch

Apr 14, 2021, 11:48:12 AM
to geomorph R package
I was going to reply after trying to figure out what CVAGen is reporting. The manual wasn't clear, so I was going through the Green Book to figure that out. It's on p. 159, where the formula for Wilks' lambda is given as the product of the eigenvalues of W(W+B)^-1.

Miriam Zelditch

Apr 14, 2021, 11:50:24 AM
to geomorph R package
I would recommend doing this in another program, for two main reasons. One is reproducibility. Many journals now want your data and code, and reviewers want them too, especially when they cannot understand what you did from your text.  If you refer to "variance explained" or "eigenvalues" in the context of a CVA, they will likely ask for your code so they can track down what you meant. As is evident from the paper you cited, just naming packages is not enough: there is no function that does a PERMANOVA/CVA in any of the packages they mention, but there are two that do PERMANOVA by another name (Procrustes ANOVA; the output in the paper looks like geomorph's), and there is one that does a CVA (shapes.cva in the shapes package), which does report a "percent of trace" (presumably what you are looking for). The second reason is that you need to know what the program is reporting, and in this case it was not at all obvious; if it's not clear for a function in R, you can just look at the function. Third — and I know I said "two main reasons" — I would expect this to run much faster in R...

Bob LePaul

Apr 14, 2021, 1:24:07 PM
to geomorph R package
Thank you for your input, everyone. I have now been actively trying to replicate this analysis using the shapes.cva function, and I do indeed get the trace values, which may be very helpful. However, I have two issues:

1) This function puts out only LD1-LD3 data, although there should be 11 CVs to report for 12 groups - not just 3. I would like to have complete data for all 11 CVs. Does anyone know how to access this data?

2) The raw CV values generated by shapes.cva do not match the output of CVAgen at all - theoretically at least the raw CV values should match, unless the algorithm is completely different, right?

Cheers
Bob

Miriam Zelditch

Apr 14, 2021, 2:08:22 PM
to geomorph R package
Try the CVA function in Morpho. I'm more familiar with that one, and it does give different results than the shapes.cva function does.

Mike Collyer

Apr 14, 2021, 2:17:27 PM
to geomorph R package
Bob,

1) This function puts out only LD1-LD3 data, although there should be 11 CVs to report for 12 groups - not just 3. I would like to have complete data for all 11 CVs. Does anyone know how to access this data?

Yeah, you’re a little screwed if you use shapes.cva, at least this version.  You can ask the author to update to avoid the unfortunate mistake it makes.

Here is the problem.  The shapes.cva function wraps the MASS::lda function, which can be seen in the code here:

out <- lda(ans$scores[, 1:pracdim], groups)
    print((out))

The code creates the lda output but, rather than returning it, prints it to the screen.  Internally, the code shaves the output down to 3 dimensions, and once printed, that is all you have — you cannot recover the other dimensions.  So you have no choice but to use something else.  I recommend prep.lda.  This RRPP function chooses not to wrap lda but instead delivers arguments to lda, giving the user full control over lda.  You would have to do GPA beforehand, but that seems to make more sense anyway.  Here is an example with the pupfish data.

library(geomorph) # also loads RRPP, which provides prep.lda
data(pupfish)
fit <- procD.lm(coords ~ Pop * Sex, data = pupfish)

prep.lda(fit, inherent.groups = TRUE) # see groups available
lda.args <- prep.lda(fit) 

library(MASS)
CVA <- do.call(lda, lda.args)
CVA # 3 CVs produced
CVA$scaling # will return all CVs

## If one has covariates, like size, they can also do this

fit <- procD.lm(coords ~ CS + Pop * Sex, data = pupfish)
lda.args <- prep.lda(fit) 

library(MASS)
CVA2 <- do.call(lda, lda.args)
CVA2 

Note that this is the only CVA approach that allows one to use covariates.  Alternatively, one would have to first regress shape on size, obtain residuals, and then perform CVA.  The prep.lda function does this inherently.  All arguments one would use in lda are available this way, and all analyses one would do with lda are available.


2) The raw CV values generated by shapes.cva do not match the output of CVAgen at all - theoretically at least the raw CV values should match, unless the algorithm is completely different, right?

There could be a number of reasons for this: whether eigenanalysis was attempted rather than SVD; the scaling of the vectors; how variances were calculated (divided by n or n - 1).  Maybe you should check whether the two sets of scores are correlated, rather than identical, to see if it is just a scaling issue.
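As a quick, hedged check (scores.lda and scores.cvagen are illustrative names for the two score matrices, with specimens in the same order):

cor(scores.lda[, 1], scores.cvagen[, 1])   # |r| near 1 suggests only a scaling/sign difference
plot(scores.lda[, 1], scores.cvagen[, 1])  # points should fall on a straight line if so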

Best,
Mike

Miriam Zelditch

Apr 14, 2021, 2:29:14 PM
to geomorph R package
There is more wrong with the shapes.cva function than that. You definitely get different results for the scores than you would in any other program (e.g., scores of -600 to 400, whereas CVA in Morpho gives scores from -20 to 20), and the plots of the scores aren't just differently scaled — there is almost nothing visible along CV2 in the shapes.cva plot, but in the Morpho CVA there are three clusters of points, two separated along CV1 and the third along CV2. I don't know when the errors were introduced into the shapes package; I've used it before and I hope I'd have noticed this very odd result.

Bob LePaul

Apr 15, 2021, 4:38:05 AM
to geomorph R package
Good morning everyone,

I have now conducted the analysis as specified by Michael. However, I still do not see a way to obtain "raw" CV values for every individual.

CVA$scaling produces only coefficients of linear discriminants, but not the linear discriminants themselves.

Nor can I write "CVA" to a csv file; it errors out with "cannot coerce class '"lda"' to a data.frame".

Might anyone know how I can actually derive and save the raw CV values from the lda function?

Thank you for your time!

Bob

Mike Collyer

Apr 15, 2021, 5:44:47 AM
to geomorph-...@googlegroups.com
Bob,

You can obtain scores with (using the previous code):

predict(CVA)

Note that this produces posterior probabilities, classification, and scores.  The scores are $x, which can be plotted.
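For instance, continuing the pupfish example above (the file name is just illustrative):

scores <- predict(CVA)$x             # one row per specimen, one column per CV
write.csv(scores, "CV_scores.csv")   # save the scores for later inspection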

Mike


Bob LePaul

Apr 15, 2021, 6:03:35 AM
to geomorph R package
Thank you, Michael.

However, now that I see the output, I see that the order of the CV scores has been scrambled somewhere — it matches neither the order of the tps file nor the order of the grouping variable. Is it possible to output these data in the original order? I need to assign them to other variables and comments that are specific to each individual. Alternatively, would it be possible to append the IDs that were present from readland.tps to the CV output? That would allow me to "unscramble" the order as well.

Cheers
Bob

Bob LePaul

Apr 15, 2021, 6:10:08 AM
to geomorph R package
Furthermore, now that I look at the data in more detail, I can definitely say that the output (at least the output range) from lda is vastly different from what CVAgen produces — lda values are up to ~1.5, whereas they had been in the range of 0.0000x in CVAgen. That really makes me wonder how comparable CVA scores from lda are to CVA scores from CVAgen.

Cheers
Bob

Mike Collyer

Apr 15, 2021, 6:27:27 AM
to geomorph-...@googlegroups.com
For everyone reading, a further comment.

In case you are wondering why, with the lda function, one would use the predict.lda function to get scores for plotting: it is because that is what a linear discriminant analysis does.  The predict.lda function expects to find an lda object and new data, meaning one obtains the linear discriminant functions from one set of data and makes predictions for a set of data whose associations with groups are unknown.  The goal is to ask whether, given the variables measured, one can discern to which group a research subject belongs.

In showing Bob *how* to do this, notice that no new data were added to the predict function, so R just makes predictions for the same data used to obtain the discriminant functions.  The process produces results, but many (myself included) would question whether one *should* do this.  In other words, if you do something like this and submit a manuscript, and I happen to be a reviewer, I will have a lot of comments.

The lda function allows for (leave-one-out) cross-validation (aka jackknife), which is a less circular line of reasoning than making predictions about the research subjects that were used to establish the prediction criteria.  The lda function allows alternative data sets to be used for training the functions and testing the predictions.  The lda function also allows one to use alternative approaches for establishing prior probabilities of classification before calculating posterior classification probabilities; i.e., lda allows for Bayesian inference.  (A small sketch of these options follows the list below.)

Discriminant analysis is a fluid process that should allow a user to:

1. establish prior probabilities for classification of test data
2. define training and testing data, separately
3. perform cross-validation
4. estimate posterior classification probabilities, not just classify subjects as this or that
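To make those options concrete, here is a minimal sketch with MASS::lda, using the built-in iris data purely as a stand-in:

library(MASS)

# leave-one-out cross-validation with explicit (here, flat) prior probabilities
cv.fit <- lda(Species ~ ., data = iris, prior = rep(1/3, 3), CV = TRUE)
table(iris$Species, cv.fit$class)            # cross-validated classifications

# separate training and test data
train <- sample(nrow(iris), 100)
fit <- lda(Species ~ ., data = iris, subset = train)
pred <- predict(fit, newdata = iris[-train, ])
head(pred$posterior)                         # posterior classification probabilities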

If the “CVA” function or program does not allow these things, it means the creator of that function or program has asserted a priori definitions of the things above, and you inherit those assumptions with your analysis.  I cannot advocate this blind faith in an analysis.  It is important, in my opinion, to understand what these things mean and to be able to defend in your manuscript your decisions to use certain prior probabilities, etc.

I do not know what all other programs/functions do.  It looks like the CVA and classify functions in Morpho do all of these things.  The shapes.cva function does not seem to offer any flexibility.  I am not sure what MorphoJ or CVAGen do.  But I can attest that, as a reviewer, I have received manuscripts that state, “We did a CVA using program/function X”, and appear to have blind faith in program/function X to do the right thing.  If you wish to perform a CVA, I encourage you to fully understand its purpose and not use it simply because it produces a plot that makes your groups look different.

Finally, that it produces a plot that makes your groups look different is a whole other issue that has been discussed frequently, and could be misleading.  I won’t launch into that.  But if you are doing CVA just to get the plot, you are probably not using CVA in the most appropriate way.

Cheers!
Mike

Miriam Zelditch

Apr 15, 2021, 7:09:20 AM
to geomorph R package
I may be the only one reading this who has a copy of CVAGen and knows how to produce a file in IMP format... I just ran the same data in LDA and in CVAGen.
In LDA, I used this to project the data onto the CVs (Z refers to the data matrix of PC scores):

proj1 <- Z %*% ldaZ$scaling

I'd have to look up what CVAGen does. The scales are different, but the plot looks the same in both programs, and both yield a similar correct classification rate (77.3%, 79.8%).
Mike gave more substantive arguments about how one ought to use CVA. All I can say about the two programs is that I see no reason to keep using CVAGen if what you are doing in it is projecting data onto the axes, plotting the CVs, estimating a misclassification rate, and obtaining the posterior probabilities for the classification of individuals.


Bob LePaul

Apr 15, 2021, 7:25:57 AM
to geomorph R package
Thanks Miriam,

in my case I am using the raw CV scores not just for plotting but for more complex statistical analyses (linear mixed-effects models — lmer from the lme4 package; PERMANOVAs — adonis from the vegan package) to see between which treatments there are statistical differences in shape. For two of these treatments we already know what the differences in shape represent on a PCA scale and what the real-world consequences of that shape difference are, and we just want to know where the other 10 treatments (which are all instances of less/more of the same treatments) fall in the same relative morphospace: are they closer to the shape of this treatment or that one? Are shape differences between treatment pairs significant? There can be no misclassification, since all treatments were controlled and every individual experienced just one treatment. I wonder how the different output from lda would affect the (many, many) statistical analyses that were already done using the raw CV scores from CVAgen.

It would be greatly appreciated if you could work this issue out and tell us why CVAgen and lda produce different (absolute) values, as this would greatly improve our understanding (and that of others who have only used CVAgen before). My own understanding of the specific mathematical algorithms behind these two programs is not good enough to grasp the difference.

Cheers
Bob

Bob LePaul

Apr 15, 2021, 7:39:39 AM
to geomorph R package
And to reiterate, my biggest gripe with lda is still that the data order seems to be scrambled, so I cannot easily re-connect the scores to other information (such as family identity, age, ...) that I have on the same individuals. I would still be glad if there were a way to keep the individual IDs from the tps file, or the data order, throughout the lda process.

Cheers
Bob

Miriam Zelditch

Apr 15, 2021, 7:55:46 AM
to geomorph R package
I'd suggest looking at the RRPP package, which can fit linear mixed models for shape (and the pairwise function would give you the comparisons among all the treatment groups). It will give you the (Procrustes) distances between the treatment means, and tests of statistical significance using a residual randomization permutation procedure (hence the name, RRPP). This should improve upon the tests done in adonis, unless that package now also has that permutation procedure. In those analyses you'd use the shape data, not the CV scores. The (mis)classification rate is an important part of a CVA because the objective of the method is classification.
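For example, a minimal sketch of that workflow, assuming a 2D matrix of Procrustes shape coordinates (shape.2d) and a factor of treatments (treatment) — both names are illustrative:

library(RRPP)
rdf <- rrpp.data.frame(shape = shape.2d, treatment = treatment)
fit <- lm.rrpp(shape ~ treatment, data = rdf, iter = 999)
anova(fit)                                   # RRPP-based test of treatment effects on shape
PW <- pairwise(fit, groups = rdf$treatment)
summary(PW)                                  # pairwise distances between treatment means, with P-values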

Bob LePaul

Apr 15, 2021, 8:19:11 AM
to geomorph R package
Thank you Miriam,

I have checked out the RRPP package, and while it does seem to do quite well for standard linear models, I do not see a way to specify complex random-effect structures (random slopes, random intercepts, crossed random effects, nested random effects) within this package. That is the main reason why I used lmer on individual CVs here. adonis was just used to support these results by giving me the ability to detect shape differences across multiple CVs. Might you know of a way to specify complex random-effect structures in the RRPP package?

Cheers
Bob

Mike Collyer

Apr 15, 2021, 8:20:55 AM
to geomorph-...@googlegroups.com
Bob,

I’ll say this directly.  Do not use CV scores as data in other analyses.  They are not data.  They should not be perceived as data.  They are statistical scores.  They are results from a statistical analysis.

If you have shape data, inferences should be made on analyses of shape data, not a non-linear transformation of them based on a statistical model.

Good luck!
Mike

Mike Collyer

Apr 15, 2021, 9:30:36 AM
to geomorph R package
Bob,

Sorry, I missed this.  I'm not sure what you mean, because the lda function should not be scrambling the order of the data.  The predict.lda function, as Miriam showed in her email, does Data %*% Scaling to obtain scores.  However, if you do not trust this, you can do the following:

 predict(CVA, newdata = original.data)

where original.data are the data used to obtain CVA.  This will force the order to be the same.
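If you also want the specimen IDs from the TPS file attached to the scores, here is a hedged sketch, assuming the landmarks were read with readland.tps(..., specID = "ID") into an array called lands (the name is illustrative) and the specimen order was preserved through the analysis:

scores <- predict(CVA)$x
rownames(scores) <- dimnames(lands)[[3]]   # specimen IDs carried over from the TPS file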

Cheers!
Mike

Miriam Zelditch

Apr 15, 2021, 9:38:30 AM
to geomorph R package
The function that did the scaling that I showed returns the scores in the same order as the data. 

Adams, Dean [EEOB]

Apr 15, 2021, 10:27:08 AM
to geomorph-...@googlegroups.com

I will reiterate Mike’s point.  Do not use CV scores as data in further downstream analyses. Inferences based on these are using data that are statistically manipulated via the CVA model (one could say distorted). They do not represent the actual patterns in the data, but are predictions from the CVA model.

 

Yes I (and Mike) recognize that using CV scores in subsequent analyses is unfortunately quite common in many biological fields. That does not make it an appropriate practice. It simply is not.

 

Dean

 

Dr. Dean C. Adams

Director of Graduate Education, EEB Program

Professor

Department of Ecology, Evolution, and Organismal Biology

Iowa State University

https://www.eeob.iastate.edu/faculty/adams/

phone: 515-294-3834

dthu...@gmail.com

Apr 15, 2021, 1:04:10 PM
to geomorph R package
I have a question about using the CV scores in downstream analyses. If I had 2 groups that are discriminated in a CVA, would the scores accurately show a relative difference in shape? For example, when plotted, group A has all negative scores and group B has all positive scores (ideally) — say, the scores are distributed from 0 to -10 in group A and from 0 to +10 in group B. Could I use the scores of individual specimens to infer the relative amount of shape difference?

Thus, a shape at -10 would be much more different from a shape at +10 than from one at +1.  The numbers would have no meaning other than as a measure of the relative degree of difference.

Dave Thulman

Mike Collyer

Apr 15, 2021, 2:16:02 PM
to geomorph R package
Dave,

Simply, no.  I suppose one could argue about what you mean by “accurate”, because maybe the scores do or don't show much shape difference where there is or is not, respectively, a shape difference between groups.  But they CAN imply shape differences where there are none.

Here is an example:

n <- 40
p <- 200
shape <- matrix(rnorm(n*p), n, p)
group <- factor(rep(1:2, each = n/2))

library(RRPP)
library(MASS) # needed for lda below
PCA <- ordinate(shape)
plot(PCA, pch = 19, col = group) # no difference in shape
fit <- lm.rrpp(shape ~ group)
anova(fit) # no difference in shape
CVA <- lda(x=shape, grouping = group)
boxplot(predict(CVA)$x ~ group, ylab = "CV Score") # apparent shape difference

Here are the results I got in my random run:

Analysis of Variance, using Residual Randomization
Permutation procedure: Randomization of null model residuals 
Number of permutations: 1000 
Estimation method: Ordinary Least Squares 
Sums of Squares and Cross-products: Type I 
Effect sizes (Z) based on F distributions

          Df     SS     MS     Rsq      F      Z Pr(>F)
group      1  222.6 222.55 0.02837 1.1097 1.0162  0.151
Residuals 38 7620.8 200.55 0.97163                     
Total     39 7843.4                                    

Call: lm.rrpp(f1 = shape ~ group)



The CV scores show a tendency to reveal group differences in shape for simulated noise.  This might be exacerbated with more than two groups and many variables.  CVA will find the vectors that maximize difference between two groups.  Increasing variable number increases the possible ways to separate groups and the first — and only — CV does a good job of visualizing that.

These scores neither are nor should be considered data.  One should not try to convince themselves of when it is acceptable to consider them as data.  


What you can infer from your example — and probably better with posterior classification probabilities — is whether you have the ability to look at the shape of a research subject and declare that it is from group A or group B.  If you want to know the amount of shape difference, measure the Procrustes distance between the shape means.  That is the amount of shape difference between groups.
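As a minimal sketch of that last step, assuming Procrustes-aligned coordinates in Y.gpa$coords and a grouping factor group (both names illustrative):

library(geomorph)
mA <- mshape(Y.gpa$coords[, , group == "A"])
mB <- mshape(Y.gpa$coords[, , group == "B"])
sqrt(sum((mA - mB)^2))   # (approximate) Procrustes distance between the two mean shapes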

Mike

David Thulman

Apr 15, 2021, 2:25:59 PM
to geomorph-...@googlegroups.com
I think I wasn't clear. I first performed a DFA on a group of points and found 2 groups that were highly significantly different and in the jackknife misclassification table, properly identified ~90% of the specimens to group (pretty good for archaeology).

Based on that result, I used the DF scores as capturing relative differences among shapes in each of the groups. Then those were plotted geographically to look at how the relative differences were distributed, and those results were used to infer information exchange (think gene flow).

My question should have been predicated on the initial DF discrimination.


David Thulman

Apr 15, 2021, 2:33:36 PM
to geomorph-...@googlegroups.com
Following on, my goal is to test an anthropological version of Sewall Wright's genetic neighborhood hypothesis, i.e., people in closer contact should make artifacts that are more similar to one another than those made by people farther apart. Here, similarity is modeled as artifact shape.

Adams, Dean [EEOB]

Apr 15, 2021, 3:06:04 PM
to geomorph-...@googlegroups.com

David et al.,

 

Using DF scores in this fashion is a long climb for a short slide (the older among us may appreciate that quote!).

 

There is nothing to be gained by using the mean DF scores for each group to generate TPS deformation grids, versus using the mean shapes to generate those TPS grids. They will be virtually indistinguishable.  An example of this is shown using the code below.

 

Of course, the problem with the DFA plotting approach is the fact that one must incorporate several additional mathematical gyrations and R-coding steps; meaning there are more places where one can go astray. For instance, DFA will not work on the GPA-aligned coordinates, as they contain redundant dimensions (which results in a singular covariance matrix). Thus one must first perform PCA and remove the redundant dimensions. Then one performs the DFA, then makes predictions, then finds the means, etc. More steps, which leads to the danger of messing things up.

 

I reiterate (again) that I would avoid CVA/DFA. I avoid it for generating shape deformation (TPS) plots, for statistical plots, and especially for all downstream statistical analyses. For the first, it is redundant with standard mean-based TPS grids; for the second and third, it is misleading at best and flat-out wrong at worst.


Dean

 

##

library(geomorph)
library(MASS)

data("plethodon")
Y.gpa <- gpagen(plethodon$land)
mean <- mshape(Y.gpa$coords)

## Overall means
fit <- procD.lm(Y.gpa$coords ~ plethodon$species)
Y.hat <- fit$fitted
gp.mns <- arrayspecs(Y.hat[c(1, 11), ], p = 12, k = 2) # brute-force b/c I know this dataset

## Means from DFA
## NOTE: first must remove redundant dimensions, or LDA is working with singular matrices
shape.pc <- prcomp(two.d.array(Y.gpa$coords))$x[, 1:20] # remove GPA redundancies
lda.fit <- lda(shape.pc, plethodon$species)
lda.scores <- predict(lda.fit)$x
pred.shape <- shape.predictor(Y.gpa$coords, x = lda.scores,
                  pred1 = tapply(lda.scores, plethodon$species, mean)[1],
                  pred2 = tapply(lda.scores, plethodon$species, mean)[2])

# Now plot!
par(mfrow = c(2, 2))
plotRefToTarget(mean, gp.mns[, , 1], mag = 2, links = plethodon$links)
mtext("mean1")
plotRefToTarget(mean, gp.mns[, , 2], mag = 2, links = plethodon$links)
mtext("mean2")
plotRefToTarget(mean, pred.shape$pred1, mag = 2, links = plethodon$links)
mtext("DFmean1")
plotRefToTarget(mean, pred.shape$pred2, mag = 2, links = plethodon$links)
mtext("DFmean2")
par(mfrow = c(1, 1))

 

Dr. Dean C. Adams

Director of Graduate Education, EEB Program

Professor

Department of Ecology, Evolution, and Organismal Biology

Iowa State University

https://www.eeob.iastate.edu/faculty/adams/

phone: 515-294-3834

 

Ellen R.

Dec 22, 2021, 8:02:33 AM
to geomorph R package
Hello,

I have a question regarding the function prep.lda(). Is it OK to use this function when we have size as a covariate in our model (along with groups) but allometries are heterogeneous?

Ellen

Mike Collyer

Dec 22, 2021, 8:34:26 AM
to geomorph R package
Dear Ellen,

Yes.

In fact, the example in the help file for the function does just that.

Best,
Mike

Ellen R.

Dec 22, 2021, 9:40:40 AM
to geomorph R package
Dear Dr. Collyer,

Thank you for answering. I saw that example (on the Pupfish dataset) and thought I could try to check that myself; after plotting, the allometries really seemed homogeneous to me. I guess I made a mistake by rushing and relying only on visuals.

Best,
Ellen