Incorrect grouping suddenly, plot.gm.prcomp


Anthony Lapsansky

Jun 3, 2021, 9:39:46 PM
to geomorph R package
Hi everyone,

I ran into a strange issue this week that I have been unable to resolve. I decided to re-run some analyses for the final draft of my dissertation and noticed that the grouping for plot.gm.prcomp (done via bg = as.factor(gdf$groups)) is no longer correct. Plot 1 (below) is from 2 months ago; Plot 2 is from today.

The PCA plot is inverted, which does not matter to me. The spacing of species is slightly different (look at the three isolated points near the bottom of Plot 1 and top of Plot 2), which I assume is due to slight changes with the new features in gm.prcomp. 

But, most concerning to me is that the grouping factors no longer map correctly. I am worried that when I re-run my analyses, this problem will persist during statistical analyses. Or, maybe it is just a plotting issue?

I manually examined the grouping factors and all looks good. I also plotted a simmap and the groups are correct. All of the tips of the tree are present and seem to be in the correct order. 

Any thoughts? I have not changed the code at all from when I generated Plot 1 (below).

Thank you for your help!

Tony


Plot 1 - from 2 months ago:

Rplot01.png

Plot 2 - from today:

Rplot02.png

Bryan H. Juarez

Jun 3, 2021, 10:06:51 PM
to geomorph-...@googlegroups.com
Hi Tony,

You can flip the axes using the flip = argument. It seems to me that you are comparing two different versions of plot.gm.prcomp. The old version uses bg =, whereas the new version uses node.bg within the phylo.par argument. Have you updated your code so that you supply the phylo.par argument a list of plotting parameters such as node.bg?

Might be worth confirming that you matched both the data and grouping factor to the tips on the tree.
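One way to do that check is a quick comparison of names, as in the minimal sketch below. The object names here (Y, tips, groups) are toy stand-ins, not from the thread; in practice, specimen names come from dimnames of the landmark array and tip labels from the tree.

```r
# Minimal sketch of an alignment check, using toy stand-ins. In practice,
# specs would be dimnames(Y)[[3]] and tips would be tree$tip.label.
specs  <- c("sp_a", "sp_b", "sp_c")                      # specimen names, data order
tips   <- c("sp_b", "sp_a", "sp_c")                      # tip labels, tree order
groups <- setNames(factor(c("g1", "g2", "g1")), specs)   # named grouping factor

stopifnot(setequal(specs, tips))           # same taxa present in data and tree
stopifnot(setequal(specs, names(groups)))  # same taxa present in the factor

# Membership is not enough -- check the ORDER too:
all(specs == names(groups))  # TRUE here; FALSE would mean reordering is needed
```

The key point is that setequal() only tests membership, so a final element-wise order check is still needed before mapping a factor onto plotted points.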

Best,
Bryan

--
You received this message because you are subscribed to the Google Groups "geomorph R package" group.
To unsubscribe from this group and stop receiving emails from it, send an email to geomorph-r-pack...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/geomorph-r-package/0b86397d-6d7e-4609-b588-271912f9bd2cn%40googlegroups.com.


--
Bryan H. Juarez, PhD 
He/Him
Adams Lab
EEOB Dept.
Iowa State University
Twitter: @bhjuarez


Mike Collyer

Jun 3, 2021, 10:12:27 PM
to geomorph R package
Hi Tony,

Your description isn’t quite helpful, as you don’t indicate which versions of geomorph you were using.  I suspect you used a pre-geomorph 4.0 version and then geomorph 4.0 the second time, based on how your results changed.  But you are incorrect about one thing: you did change the code (even though you didn’t realize it).  I suspect you used 5 GPA iterations the first time and 10 the second.  This is because the default arguments changed from max.iter = 6 to max.iter = 10, and I think you just used the defaults.  (Personally, I would use around 50 if it did not mean waiting too long.)  The second time around, you had different data after GPA.

There could be an unanticipated issue with the order of the PC scores.  Make sure they are in the same order as your data.  If not, we would like to hear about it.  But also if not, then gdf$groups would also not be in the same order as the PC scores.
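As a small base-R illustration of that check: prcomp() stores scores in $x with the row names of the input, so a mismatch with a grouping factor is easy to detect. (gm.prcomp output is prcomp-like, but check names(PCA) in your version; the object names below are hypothetical.)

```r
# PC scores inherit row order from the input; verify before mapping a factor.
set.seed(1)
Y <- matrix(rnorm(12), nrow = 4,
            dimnames = list(c("t1", "t2", "t3", "t4"), NULL))
p <- prcomp(Y)
all(rownames(p$x) == rownames(Y))  # TRUE: scores are in the data's order
```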

This is a guess without further evidence.

Best,
Mike



Bryan H. Juarez

Jun 3, 2021, 10:23:23 PM
to geomorph-...@googlegroups.com
So as not to waste your time: it is not necessary to use phylo.par unless one wants finer control over plotting parameters; bg = would just get passed to plot. Apologies!

Anthony Lapsansky

Jun 4, 2021, 12:28:24 AM
to geomorph R package

Alright, I figured it out! My apologies for not providing the necessary details. I tried rolling back my geomorph version and tried messing with the plotting inputs. I suspect you are right about the versions, Mike: I am using version 4.0.0 now and was likely using 3.3.2 before. But the real problem is that my .csv file containing all of my factors was in taxonomic order (because I was confirming that I categorized everything correctly), not alphabetical by genus_species! Something about mapping the species names to the data must work differently for simmap; otherwise, I am not sure how that worked. I am going to see if I can't re-write my code so that alphabetical order of species is not required, just so that doesn't happen again.
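For anyone hitting the same snag, a minimal sketch of that fix: reorder the factor table to the data's specimen order with match(), so no particular order in the .csv is required. All object names here are hypothetical toy values, not the actual data.

```r
# Specimen names in the order of the shape data (toy values); in practice
# these would come from the dimnames of the landmark array.
specs <- c("a_sp", "c_sp", "b_sp")

# Factor table as read from the .csv, in a different (e.g., taxonomic) order:
fac <- data.frame(species = c("a_sp", "b_sp", "c_sp"),
                  group   = c("g1", "g2", "g1"))

# Reorder rows of fac to match the data, rather than assuming any order:
fac <- fac[match(specs, fac$species), ]
all(fac$species == specs)  # TRUE after reordering
```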

That is a great detail about the GPA iterations! I will switch to using 50!

Thank you for the quick responses!

Regards,
Tony

Mike Collyer

Jun 4, 2021, 10:05:45 AM
to geomorph-...@googlegroups.com
As just a general comment for phylogenetic comparative data, it is always a challenge for us programmers to align the data with the structure of the tree.  For example, if a covariance matrix is created for the expected phylogenetic covariances under Brownian motion, even though the rows and columns of that matrix are ordered the same, they are probably not ordered the same as the data.  Some functions reorder the data to match the covariance matrix or — as we do internally in geomorph — reorder the covariance matrix to match the data.  There are also functions that require a step of data processing in order to make sure data and trees are aligned.  The function comparative.data in caper is a good example.  The philosophy in caper is that one has to get the data arranged appropriately to use other functions like pgls; our philosophy in procD.pgls is to use tip name matches internally in the function.  Either approach will throw an alarm if taxa are missing or spelled differently.

But in the general R environment, most functions expect that the user has taken the time to align data in their construction of data frames.  No errors will be thrown if the rows of Y (the data) and X (the linear model design) are not aligned.  You made the data frame (or chose not to bother); you assume responsibility.  For perhaps benevolent reasons, most of us programmers who make functions to deal with comparative data take steps to help users align data with tree structure.

In my opinion, however, users have to remain vigilant and take it upon themselves to check the order of their data.  It could be that the simmap function Tony used made the effort to rearrange either the data or the group factor, based on tip names.  Such a convenience could translate to issues in another function that does not do that.  For example, gm.prcomp is a general plotting function that can incorporate phylogenetic information but is not specifically and only a phylogenetic comparative method.  Before one uses something like a group factor to color-code points or change their symbols — general R plotting attributes — it is imperative to make sure that the factor is ordered the same as the data, especially if using a combination of other functions that might reorder data, trees, or independent variables.

As yet another comment, there was a time for us older folks when using R meant knowing how to code in R, at least in some capacity.  There are so many packages today that require little coding and do a lot of things for users, somewhat using R as a medium for a more “hands free” experience.  This can lull a user into a false sense of security, that the machine is doing all the right things.  Some of us who stray outside of R for programming, maybe in a language like C++, become quite accustomed to everything having to be programmed just right, or the analysis blows up (and if it happens too often, computer screens get smashed).  One of the reasons that R is universally popular — besides its nice price tag (free) — is that its style of programming means programmers can catch errors, perhaps helping the user to understand why with useful error messages, or, for better or worse depending on perspective, circumvent the errors and produce a result anyway.  I encourage all R users to entertain the thought, in any analysis or graphical summary, “Maybe I received results I should not have received.”  I encourage this with trepidation, because I do not want to encourage you to doubt geomorph’s programming at every step, but as Tony’s example highlights, sometimes unintended results are produced although the coding looks right.  (This could also be worse: the results might not have been conspicuously misleading!)

Tony’s example also elucidated another common issue: the use of function defaults.  Functions have arguments, and choices have to be made arbitrarily for the default value (for example, whether to use bending energy or minimized distances for sliding semilandmarks).  There is no intention for function arguments to be static — if that were the case, we would hard-wire functions and not allow the flexibility.  But in my experience, most users just use what we set as defaults.  Understanding how to vary arguments, and why one might do that, makes a user more intimately knowledgeable about their data and the methods available in GM.  In addition to the vigilance, I encourage responsible exploration.  (Responsible, because flipping switches, pulling levers, and turning knobs without any appreciation for what these devices do could be a bit dangerous.)  Exploration that includes modifying arguments can be enlightening.

Apologies for the long soliloquy. I’m probably avoiding the real work I need to do today and this was a nice distraction.

Cheers!
Mike 

Anthony Lapsansky

Jun 4, 2021, 3:07:27 PM
to geomorph R package
Hi Mike,

Thank you for the details!  I really appreciate (and I suspect others do, as well) when you and the rest of the geomorph team give these sorts of long answers. I have a bookmark folder full of them!

I do have one additional question, if you have the time!

First, version details:
geomorph          * 4.0.0      2021-04-02 [1] CRAN (R 4.0.5)

platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          4                           
minor          0.5                         
year           2021                        
month          03                          
day            31                          
svn rev        80133                       
language       R                           
version.string R version 4.0.5 (2021-03-31)
nickname       Shake and Throw 

Question:
When I increased my GPA max.iter to 50, the plotAllSpecimens and plotOutliers functions indicate that things have gone off the rails. I suspect the result is related to the sliding of my semi-landmarks, as it seems that everything slides more with increasing iterations, to the point that the configurations are no longer representative of the original shape (an outline of a wing and the covert feathers). In other words, I never reach convergence. No matter the iterations, the summary of the GPA indicates that it took max.iter + 1 iterations to converge.

Call:
gpagen(A = shapesGeo, curves = shapesGeo$curves, PrinAxes = TRUE,  
    max.iter = 50, ProcD = TRUE, approxBE = TRUE, Parallel = FALSE) 



Generalized Procrustes Analysis
with Partial Procrustes Superimposition

4 fixed landmarks
86 semilandmarks (sliders)
2-dimensional landmarks
51 GPA iterations to converge
Minimized squared Procrustes Distance used


I have a large number of semi-landmarks relative to fixed landmarks (86 to 4, respectively), which may be related to the issue.

I tried some 'lever pulling' (responsibly, I hope), and the lack of convergence does not depend on whether I use the approximation of bending energy or not.

So my question is, should I stick with the shapes that look more real to me? But if that is the case, it would seem counter to the idea of letting the semi-landmarks slide at all.

Below are some images of the alignments based on the number of max.iterations from 10 - 50. As you can see, things are fairly consistent from 10 to 30 iterations, but get a bit wild at 40 and above!

Thank you!
Tony 

10 iterations.png

20 iterations.png


30 iterations.png
40 iterations.png
50 iterations.png

Anthony Lapsansky

Jun 4, 2021, 3:23:21 PM
to geomorph R package
Apologies, that function call is a bit off, because I tried using Procrustes Distance too. All of the images are with GPAs performed by minimizing bending energy.

Call:
gpagen(A = shapesGeo, curves = shapesGeo$curves, PrinAxes = TRUE,  
    max.iter = 50, ProcD = FALSE, approxBE = TRUE, Parallel = FALSE) 



Generalized Procrustes Analysis
with Partial Procrustes Superimposition

4 fixed landmarks
86 semilandmarks (sliders)
2-dimensional landmarks
51 GPA iterations to converge
Minimized Bending Energy used




Mike Collyer

Jun 4, 2021, 4:15:15 PM
to geomorph R package
Hi Tony,

With 86 sliders and only 4 fixed landmarks, the approximated TPS option is probably not going to be a good solution for you.  I just noticed that the help file cuts off (for some reason) when explaining that if the landmark configurations are nearly 100% sliders, the approximation probably won’t save much time.  We will have to fix that note.

The reason for the max.iter + 1 outcome is that the max.iter applies to the number of iterations to use with sliding semilandmarks, but one GPA rotation is performed after they are done.

The reason I suggested 50 iterations is because, unlike with fixed landmarks (or maybe minimized Procrustes distance), the convergence criterion (Q) might not decrease in every iteration.  For example, using the scallop data, look at these results:

> library(geomorph)
> data(scallops)
> m.iter <- seq(3,50,1)
> result <- sapply(1:length(m.iter), function(j){
+   gpagen(scallops$coorddata, curves = scallops$curvslide, surfaces = scallops$surfslide,
+          print.progress = FALSE, max.iter = m.iter[[j]])$Q
+ })
> plot(m.iter, result, type = "b", xlab = "Iterations", ylab = "Q")

[plot: Q vs. iterations, default tolerance]

You can see why, in this example, more iterations would be better.  You can also see that things get worse around iteration 20, and again at 30, before getting better.  
One might decide around iteration 15 that things are “good enough”.  One other argument that can be adjusted is the tolerance (to which Q is compared), which is the value used to decide whether convergence has happened.  Adjusting the tolerance to 0.01 in the previous example, we get this:

[plot: Q vs. iterations, tolerance = 0.01]

The flat line is because convergence happened well before the maximum iterations in each run and was stopped.  Maybe that was not sensitive enough.  Let’s use 0.0065:

[plot: Q vs. iterations, tolerance = 0.0065]

You can see now that the arbitrary choice of iteration number or tolerance can yield different outcomes.  There are a lot of areas of statistics like this, where a solution is sought through iterative procedures and lack of convergence is a problem, which forces one to make some arbitrary decisions.  Having said that, I would not have thought things could get so much worse!  I have just never seen a case where it happened, but like I said, most people use the function defaults and maybe have never probed the limits of this technique.  Yours is a good example that weird s**t can happen.

Repeating the example I just did here might take a very long time with your serious data but you could consider doing it over shorter intervals.  You could also adjust the tolerance (default = 0.0001, which is really, really small).

Regardless, don’t let things go off the rails.  I suspect that with too many iterations, curves can become more sinuous than they really are, especially for data like yours which are practically all curves.  Furthermore, with the scallops, we have surface sliders, which use a plane comprising several neighboring points; curves use tangent vectors — two different techniques.  The ability of points to slide off curves has been noted before.

In all of GM, the spatial constraints on semilandmarks remain — in my opinion — the most underdeveloped area of research.  We have one theoretical paper introducing BE as the way to do it, involving arbitrary decisions (like how many neighbors to use), and we have minimized Procrustes distance (a poor model, assuming semilandmarks are rather independent).  Nobody has really explored other relevant spatial models/processes, or how something like the density of sliders or the arbitrary choice of neighbor points sampled to make a tangent plane affects outcomes.

But this is more why I made the point about responsible exploration.  In the case of gpagen and semilandmarks, our defaults are not based on years of conscientious research.  This truly is a trial-and-error analysis, which involves tinkering.  For those of us who learned to drive cars that had carburetors, GPA with semilandmarks is like turning the carburetor screw until the engine sounds right!

One last note.  The Morpho package functions, sliders2d and sliders3d, have an argument to adjust a weight (dampening factor) to help prevent sliders from sliding off curves (another arbitrary choice though).  The sliders3d function also has as an argument to check if the convergence criterion increases and stop the analysis, with a default that is TRUE.  The plots above show that can be dangerous.

So, to reiterate…  never trust the defaults in any function in any package of R, ever!  (Stated better, know what the arguments do and know how it can affect your analysis.)

Cheers!
Mike





Adams, Dean [EEOB]

Jun 5, 2021, 10:13:25 AM
to geomorph-...@googlegroups.com

To add one other point that ties this wonderful discussion to the ‘classic’ theory of GMM:

With only fixed landmarks, the mathematics and algorithm of GPA is guaranteed to converge. With sliders, it is not.

Mike’s explanation and examples show this in detail. And as Mike also mentioned, this should not cause alarm, as it is rather common for numerous procedures in computational statistics. Rather, one must be aware of it and take care in its application.

Dean

 

Dr. Dean C. Adams

Director of Graduate Education, EEB Program

Professor

Department of Ecology, Evolution, and Organismal Biology

Iowa State University

https://faculty.sites.iastate.edu/dcadams/

phone: 515-294-3834
