Phylogenetic comparative analysis with INLA

Richard Bailey

Aug 18, 2021, 8:35:11 AM
to R-inla discussion group
Hi,

I'm currently trying to replicate the following MCMCglmm model using INLA, but without success so far. The objective is to estimate the coefficients of the effects of 'fixed effect' predictors on the response, while controlling for non-independence among data points (species) due to their evolutionary relationships ('phylogeny'). Under the Brownian motion model I'm assuming, the covariance between two species is proportional to their shared evolutionary history (the time from the root to their most recent common ancestor), so it decreases linearly as the evolutionary distance between them increases.

A phylogeny in this case is an ultrametric, bifurcating representation of evolutionary distances in time ('branch lengths') among species, with tips (the individual species) connected by internal nodes (which also connect to each other via other nodes), and a root node connecting everything. Branch lengths vary between pairs of connected nodes and tips, but all tips are the same distance from the root.
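
(As a concrete illustration of that covariance structure, and separate from the example below: under Brownian motion the covariance between two species is the branch length they share from the root down to their most recent common ancestor, which is exactly the matrix that ape::vcv.phylo() returns. A minimal sketch:)

library(ape)
set.seed(1)
tree <- rcoal(5)       #A small random ultrametric tree#
V <- vcv.phylo(tree)   #V[i,j] = branch length shared by tips i and j (root to their MRCA)#
diag(V)                #Identical for all tips because the tree is ultrametric#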

#############################################
#Example MCMCglmm analysis setup on a simulated dataset#
#############################################

library(MCMCglmm);
data(bird.families);#This is a representation of the phylogenetic relationships among bird families (although I refer to 'species' elsewhere, in this example it's families)#

#Simulate some data
set.seed(123456)
phylo.effect<-MCMCglmm::rbv(bird.families, 1, nodes="TIPS")#"Random Generation of MVN Breeding Values and Phylogenetic Effects". "TIPS" are the families - we don't have trait values for internal nodes. The '1' here refers to the variance of the simulated values#

set.seed(23456)
phenotype<-phylo.effect+rnorm(dim(phylo.effect)[1], 0, 1)#Adding the non-phylogenetic component of trait variation#

test.data<-data.frame(phenotype=phenotype, taxon=row.names(phenotype)) 

#Create a sparse inverse matrix of the phylogeny#
 Ainv<-MCMCglmm::inverseA(bird.families)$Ainv#This includes all tips (species) and internal nodes (for the latter we have no trait data)#

#Run the analysis#
prior<-list(R=list(V=1, nu=0.002), G=list(G1=list(V=1, nu=0.002))); 

model<-MCMCglmm(phenotype~1, random=~taxon, ginverse=list(taxon=Ainv), data=test.data, prior=prior);

This estimates the intercept and the phylogenetic and residual variance components. The estimate and uncertainty of the intercept in this model are modified due to the non-zero phylogenetic random effect for bird families.

I've tried several INLA model setups to replicate this, but without success. These include copying the animalINLA model setup, with either model="generic0" or model="generic2", and supplying the inverse matrix for the bird families via Cmatrix. That matrix includes only the tips, but I created it by subsetting the Ainv object above to just the tips, rather than by inverting a covariance matrix built from the tips alone; according to the MCMCglmm documentation this would seem to be okay. Here's the model:

inla(phenotype ~ 1 +
       f(spN, model = "generic2",
         hyper = list(theta1 = list(param = c(0.5, 0.5), fixed = FALSE),
                      theta2 = list(param = c(0.5, 0.5), fixed = FALSE)),
         constr = TRUE, Cmatrix = AinvSM),
     data = testdata,
     control.family = list(hyper = list(theta = list(initial = 10, fixed = TRUE))))

I'm totally guessing here, but I wonder if INLA is assuming that the bird families are equidistant in evolutionary time? I used numeric 1:N families as rownames in Cmatrix, and made an equivalent numeric variable in the data table (spN) to cross reference with it. But if INLA is assuming equidistance, what is the purpose of the Cmatrix?

I compared MCMCglmm with INLA (using the animalINLA setup with "generic2" as above) for the intercept and the phylogenetic random effect, using 100 datasets simulated as above. I found that the intercept means were roughly proportional (r^2 = 0.5), but the INLA credible intervals were much narrower and almost always excluded the true population value. For the phylogenetic random effect there is a relationship between the MCMCglmm and INLA estimates (r^2 ~ 0.2), but it's noisier and not proportional (slope ~ 0.5).

If I've done something completely wrong please let me know!

I see there have been related questions previously, but so far I don't have the impression there's a conclusive answer.

Many thanks for your time!

All the best, Richard.




Helpdesk

Aug 18, 2021, 9:20:43 AM
to Richard Bailey, R-inla discussion group
Hi,

It seems like the models are not the same?

In inla, you fix the observational log precision to 10, but I do not see you doing that with MCMCglmm (I think you estimate that one from the data).

Also, the priors do not seem to be the same to me, or?

Also, with Gaussian observations like here, there are no approximations beyond numerical integration, hence it's only an 'error' if there is a real programming error.

--
Håvard Rue
he...@r-inla.org

Gregor Gorjanc

Aug 19, 2021, 3:05:09 AM
to R-inla discussion group
Richard,

You must not subset your precision matrix (Ainv)! Doing this destroys the model structure, since the precision matrix encodes the conditional model structure. You can subset the covariance matrix (which encodes the marginal model structure) and invert it if you want, but given that we can set up the phylogenetic precision matrix directly, I see no point in doing this.
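
A tiny numerical illustration (with a made-up covariance matrix, nothing phylogenetic) of why the two are not the same:

set.seed(1)
V <- crossprod(matrix(rnorm(25), 5, 5)) + diag(5)  # some positive-definite 5x5 covariance matrix
Q <- solve(V)                                      # its precision matrix
keep <- 1:3                                        # pretend these are the tips
Qsub  <- Q[keep, keep]                             # subsetting the precision matrix
Qmarg <- solve(V[keep, keep])                      # inverting the subsetted (marginal) covariance
max(abs(Qsub - Qmarg))                             # clearly non-zero: these are different models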

Gregor

Richard Bailey

Aug 19, 2021, 5:41:26 AM
to R-inla discussion group
Many thanks Håvard and Gregor,

I'm at the bottom of the learning curve with setting up models in INLA - at the moment I'm basically just copying and pasting model setups from other packages (specifically animalINLA and phyr) and hoping they do what I want. I'm trying to match the MCMCglmm model setup as exactly as possible. The possibilities for then combining phylogenetic with spatio-temporal effects seem exciting.

> In inla, you fix the observational log precision to 10, but I do not see you doing that with MCMCglmm (I think you estimate that one from the data).

Thanks, I changed
control.family = list(hyper = list(theta = list(initial = 10, fixed = TRUE)))
to
control.family = list(hyper = list(theta = list(initial = 0, fixed = FALSE)))

Is that an improvement? I'm not sure exactly what this is doing. The simulated data are sampled with phylogenetic trait variance =1 and the residual variance =1, and I want to estimate both.

> Also, the priors do not seem to be the same to me, or?

Thanks for pointing this out. Is there a tutorial on how to set up priors for this model? I think the MCMCglmm priors for the random-effect and residual variances are inverse-Wishart (inverse-Gamma in the univariate case), but I need to check. The INLA model I pasted above is copied from animalINLA and I don't yet know what every option is doing.

> You must not subset your precision matrix (Ainv)! Doing this destroys the model structure, since the precision matrix encodes the conditional model structure. You can subset the covariance matrix (which encodes the marginal model structure) and invert it if you want, but given that we can set up the phylogenetic precision matrix directly, I see no point in doing this.

Thanks Gregor. I've now tried subsetting then inverting the covariance matrix, rather than using the whole inverse matrix, just because I'm not sure what to do with the rownames for the sparse matrix if I include the full matrix with internal nodes. At the moment, the tips (i.e. all the rows) in the matrix are labelled 1:N to match spN in the data table.

Richard.

Gregor Gorjanc

Aug 20, 2021, 5:09:34 AM
to R-inla discussion group
Richard,

I advise against subsetting the covariance matrix and inverting it, compared to building the inverse (precision) matrix directly. Why? Because inversion in floating point introduces some error. It is best to go directly with the precision matrix, which you can build from the phylogeny. Furthermore, you could be interested in the inferred values of the "internal" nodes in your downstream analysis (after model fitting). Also, subsetting the covariance matrix and inverting it is in essence still estimating all the nodes, just not directly: the removed nodes become absorbed into a more involved precision matrix among the sampled nodes (tips). The precision matrix of the subset will be denser and will encode the relationships among your sampled nodes (tips) induced by the removed internal nodes.
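
A rough sketch of the sparsity point, with a random tree just for illustration:

library(ape)
library(MCMCglmm)
library(Matrix)

set.seed(1)
tree  <- rcoal(50)                            # random ultrametric tree with 50 tips
Qfull <- inverseA(tree, nodes = "ALL")$Ainv   # precision over tips + internal nodes: sparse
Qtips <- solve(vcv(tree))                     # precision over tips only: essentially dense
c(full = nnzero(Qfull) / length(Qfull),       # proportion of non-zero entries in each
  tips = sum(abs(Qtips) > 1e-8) / length(Qtips))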

I will not comment on the priors - I suggest you look at each model's documentation (in PDF), which describes the model parameterisation and default priors, which you can control via options such as control.family and others!

gg 

Richard Bailey

Aug 20, 2021, 11:11:06 AM
to Gregor Gorjanc, R-inla discussion group
Thanks Gregor, ultimately I'll run it as you advise, once I'm sure I'm correctly cross-referencing the full inverse matrix with the data table. As a side note, the ?MCMCglmm documentation says, under the option 'nodes': "...For phylogenies "TIPS" estimates effects for the tips only, and for pedigrees a vector of ids can be passed to nodes specifying the subset of individuals for which animal effects are estimated. Note that all analyses are equivalent if omitted nodes have missing data but by absorbing these nodes the chain may mix better. However, the algorithm may be less numerically stable and may iterate slower, especially for large phylogenies." I guessed from this that subsetting the full inverse matrix would give the same result, given that the internal nodes have no data, but I may have misinterpreted.

However, I still think there could be an issue with the setup of the INLA analysis. After some changes to the model I presented before, I'm now getting good estimates of the phylogenetic random effect, BUT including the phylogenetic random effect in the model decreases uncertainty in the 'fixed effect' model relative to excluding it, whereas it should increase uncertainty due to statistical non-independence (as happens when running the MCMCglmm model I presented). I think the INLA model may be interpreting the phylogenetic random effect as an additional 'phylogenetic noise' component on top of the linear fixed-effect model, rather than treating it as statistical non-independence. So fitting a model like:

y = A + u

where A is the intercept and u is a random effect whose covariance matrix is based on the phylogenetic relationships. Like a GAM, where some structured (non-random) noise is controlled for in order to better estimate the linear model. I think what I want is to estimate just y = A, without the 'u', but accounting for the fact that each row in the data represents less than one independent data point, according to a known pattern of covariances among data points (i.e. the phylogenetic covariance matrix).

I'll probably post another question focused on this.

Best, Richard.



Gregor Gorjanc

Aug 20, 2021, 11:36:49 AM
to Richard Bailey, R-inla discussion group
Hi Richard,

You are running

inla(phenotype ~1 + f(spN, model="generic2", ..., Cmatrix = AinvSM), ...)

which uses a Gaussian likelihood (the default):

p(phenotype | Intercept, u, sigma2e) ~ N(1*Intercept + Z*u, I*sigma2e), so phenotype = 1*Intercept + Z*u + e
p(u | A, sigma2a) ~ N(0, A*sigma2a)
p(sigma2e) ~ ...
p(sigma2a) ~ ...

To match IDs between your phenotype data (spN) and the precision matrix (AinvSM), it's best to code your full list of spN (internal nodes and tips) as 1:N, so that the precision matrix rows and columns refer to the 1:N effects, and then in the phenotype data you will only have those of 1:N that are actually observed. I hope this makes sense. Say you have a 5x5 precision matrix, so 1:N is 1:5. Then say you have observations on 3, 4 and 5 (the tips); your spN column in the phenotype data will then only contain 3, 4 and 5. You will still get estimates for all 5 effects, spN 1 to spN 5, since they are all connected via the precision matrix.
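
Here is a minimal sketch of that coding with a toy tree and made-up phenotypes (in your case the precision matrix would come from the bird phylogeny, and you would want to think about the priors):

library(ape)
library(MCMCglmm)
library(INLA)

set.seed(1)
tree <- rcoal(3)                                   # 3 tips + 2 internal nodes = 5 nodes in total
Ainv <- inverseA(tree, nodes = "ALL")$Ainv         # 5x5 sparse precision matrix over all nodes
N    <- nrow(Ainv)

spN.tips <- match(tree$tip.label, rownames(Ainv))  # 1:N codes of the observed tips
dimnames(Ainv) <- list(as.character(1:N), as.character(1:N))  # recode rows/columns as 1:N

testdata <- data.frame(phenotype = rnorm(3),       # made-up phenotypes, one per tip
                       spN = spN.tips)             # only the observed tips appear in the data

res <- inla(phenotype ~ 1 + f(spN, model = "generic0", Cmatrix = Ainv),
            data = testdata)
res$summary.random$spN                             # estimates for all N levels, tips and internal nodes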

Are you fitting the same model in MCMCglmm? I am not fully clear, but from your description it seems you are thinking about phenotype = 1*Intercept + u (so not Z*u, and also no residual). Does this mean you have only a single record per each level of spN? If so, then indeed you won't have sufficient data structure to separate u and e (the "Z*u + e" is just "u + e"). You can then use control.family to fix sigma2e to a "numerical" zero, but be careful about what the theta in control.family is (it is the log precision; see the documentation).

gg

Richard Bailey

Aug 24, 2021, 6:03:27 AM
to Gregor Gorjanc, R-inla discussion group
Thanks for all your help Gregor. It looks like I've found a model that works slightly better than the MCMCglmm version for both the intercept and phylogenetic random effect, at least for the simulated datasets I made based on intercept=0, phylogenetic var=1, non-phylogenetic var=1:

inla.m1 = inla(phenotype ~ 1 +
                 f(spN, model = "generic2",
                   hyper = list(theta1 = list(param = c(0.5, 0.5), fixed = FALSE),
                                theta2 = list(param = c(0.5, 0.5), fixed = FALSE)),
                   constr = FALSE, Cmatrix = AinvN),
               data = testdata,
               control.family = list(hyper = list(theta = list(initial = 0, fixed = FALSE))),
               control.compute = list(waic = TRUE, config = TRUE))

It's very similar to the animalINLA model with "generic2", but I made changes within control.family from "...(initial = 10, fixed = TRUE)" to the above, and changed within f() "constr = TRUE" to "constr = FALSE". The latter solved the problem of intercept posteriors being too tight and non-overlapping with the true value.

I was able to use the full inverse covariance matrix as you described, thanks! Just as a side note, the function MCMCglmm::inverseA correctly scales the inverse covariance matrix for the phylogeny (by default). I'm not sure if this is true for the package 'phyr' - I followed the instructions in the example under ?pglmm and got a differently scaled matrix that produced incorrect estimates in the analysis. 

> Does this mean you have only a single record per each level of spN?

In these simulated datasets, yes (and that's pretty common in phylogenetic comparative analyses). In the MCMCglmm model, the residual variance is the variance in y left unexplained by the phylogenetic covariance matrix (the random effect), which assumes Brownian motion evolution.

> Are you fitting the same model in MCMCglmm? I am not fully clear,

Neither am I! Ultimately I want to fit a more complex glmm, so I'll continue testing on simulated data for the more complex scenarios to make sure I'm getting meaningful results.

Cheers, Richard.

r.di...@gmail.com

Apr 29, 2022, 12:13:50 PM
to R-inla discussion group
Hi Richard,

I just stumbled across your post from last year while searching for something else, and read the thread with interest, in particular since I am a coauthor of "phyr" and coded up the INLA-based portion of the package, and I'm always looking to improve it. I was unaware of the "generic2" formulation of the model (phyr uses "generic0"). Reading the paper linked to in animalINLA, my interpretation is that the "generic2" and "generic0" setups are equivalent except that "generic2" makes it easier to estimate the derived h^2 heritability parameter (which is rarely used in phylogenetic analyses, in my experience). The paper also claims the "generic2" setup does not give correct DIC measures but "generic0" does. Since model comparison is of interest in phyr, so far I am leaning towards keeping that formulation. In your experience, have you found any other benefits of using "generic2"? I don't know if the same issue applies to WAIC, which I prefer to use over DIC anyway. Anyone here know about this?

I wanted to ask what you meant by the scaling used in the phyr example leading to "incorrect" estimates? Incorrect estimates of the fixed effects? Note that the scaling done by MCMCglmm::inverseA simply rescales the phylogeny to have a root-to-tip distance of 1 before doing the inversion (and only works on an ultrametric tree). The scaling in the phyr example (Vphy <- Vphy/(det(Vphy)^(1/nspp))) instead rescales the covariance matrix itself to have a generalised variance of 1. This is to make the results more comparable across models, even ones using different phylogenies. This rescaling, however, changes the scale of the precision matrix, such that the prior might have to be reformulated with this taken into account, which is possibly why you got different estimates than you were expecting.
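
To make the two scalings concrete, here is a rough sketch with a toy tree (not your data):

library(ape)
library(MCMCglmm)

set.seed(1)
tree <- rcoal(10)                       # random ultrametric tree
nspp <- Ntip(tree)

# inverseA default: rescale the tree so the root-to-tip distance is 1, then invert
Ainv <- inverseA(tree, nodes = "TIPS", scale = TRUE)$Ainv

# phyr example: rescale the covariance matrix itself to have generalised variance 1
Vphy <- vcv(tree)
Vphy <- Vphy / (det(Vphy)^(1/nspp))
det(Vphy)                               # ~1 by construction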

Note that phyr also uses MCMCglmm::inverseA internally to calculate the correct precision matrix; however, at the moment we use the tips-only option. I also came to the conclusion some time ago that we should use the whole phylogeny, including internal nodes, to calculate the precision matrix, and include missing data for the internal nodes. As mentioned above, this should be more accurate, but I have also found that for large phylogenies it is orders of magnitude faster, presumably because the full precision matrix is sparse, whereas the tips-only covariance matrix is dense, and INLA is very efficient for sparse matrices (this still seems very counter-intuitive to me, but I have been convinced by experience that it is true).

Anyway, I have not yet had the time to implement this change in phyr, partly because the main effects of interest in phyr are the 'nested' phylogenetic effects, which actually use the Kronecker product of the phylogenetic covariance and an iid covariance for the "sites" (rough sketch below), and I need to think carefully about how to do this correctly with the internal nodes (it is probably not as complicated as I think). I have been wondering for a while whether I could use the "group" argument of f() to accomplish an equivalent model, but haven't had time to look into it properly. If anyone has any thoughts on this, I would love to hear them!
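
For anyone curious, the 'nested' covariance I mean is roughly the following (a tips-only toy sketch):

library(ape)

set.seed(1)
tree  <- rcoal(5)
Vphy  <- vcv(tree)                      # 5x5 phylogenetic covariance (species)
nsite <- 3
Vnest <- kronecker(diag(nsite), Vphy)   # 15x15 covariance over site:species combinations

# the precision matrix has the same Kronecker structure, because the inverse of a
# Kronecker product is the Kronecker product of the inverses
Qnest <- kronecker(diag(nsite), solve(Vphy))
max(abs(Qnest - solve(Vnest)))          # ~0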

All the best,
Russell
--
Russell Dinnage PhD
Research Assistant Professor
Institute of Environment
Department of Biological Sciences
Florida International University
Miami, Florida, United States

Richard Bailey

May 16, 2022, 9:28:25 AM
to r.di...@gmail.com, R-inla discussion group
Hi Russell,

Thanks for getting in touch, and sorry to be slow to reply - just back from fieldwork. The current state of play is that I've switched back to using MCMCglmm, as I know that package much better and I don't have enough time to sort out all the question marks in my mind to be confident enough to use INLA/phyr. Initially I wanted to use INLA because it includes the beta family, which MCMCglmm doesn't have, but I've decided to just logit-transform the proportions instead. I'll try to answer your questions, although I may mis-remember some details as it's a few months since I ran the analyses.

In terms of testing the methods, I made a very simple set of simulated datasets using MCMCglmm::rbv, exactly following the example code for that function, which no doubt uses the root-tip=1 scaling. I don't remember what else I did with respect to simulating data. I then fitted an intercept only model with just the inverted phylogeny as a random effect.

The biggest issue, if I remember correctly, was that in phyr the Bayesian=TRUE and Bayesian=FALSE models produced results for the intercept that were not comparable (I can't remember if there were differences in the random-effects variance/precision estimates), and while the non-Bayesian model's confidence intervals overlapped as expected with the true value of the intercept, the Bayesian credible intervals did not. (Again, apologies if I'm mis-remembering something.)

One issue I found when I followed up using INLA directly was the option "constr=TRUE/FALSE" within f(). When this is set to TRUE (as I think it is in phyr?), the fixed-effect credible intervals were tight around the sample mean and rarely overlapped with the true value, whereas when I changed it to "constr=FALSE", the credible intervals were much wider and usually overlapped with the true value. Maybe "constr=TRUE" produces better random-effects precision estimates (I think animalINLA used this option because they said it's more 'stable')? But for using the model to control for non-independence when estimating fixed effects, it seems like it should be "constr=FALSE". I have no idea what this option is doing...


> Reading the paper linked to in animalINLA, my interpretation is that the "generic2" and "generic0" setups are equivalent except that "generic2" makes it easier to estimate the derived h^2 heritability parameter (which is rarely used in phylogenetic analyses, in my experience). The paper also claims the "generic2" setup does not give correct DIC measures but "generic0" does. Since model comparison is of interest in phyr, so far I am leaning towards keeping that formulation. In your experience, have you found any other benefits of using "generic2"? I don't know if the same issue applies to WAIC, which I prefer to use over DIC anyway. Anyone here know about this?

This was also my interpretation of "generic2", but when I ran it, I always got different estimates for the first parameter (which I assume to be the precision of the random effect) than for the single parameter using "generic0". So, I'm not sure if my understanding of the second parameter was correct.

I'm also unclear on whether "generic0" is equivalent to fitting the matrix inverse in MCMCglmm, which, according to Hadfield, is equivalent to assuming Brownian motion evolution along a phylogeny.

One other potential issue with making an equivalent model with INLA to the MCMCglmm model (which was my original intention, and the main reason I posted a question) is that INLA estimates precision whereas MCMCglmm estimates variance, so I doubt (though I haven't verified) it would be possible to create equivalent priors. I never got as far as figuring out how to set the priors, I just used defaults.

> I wanted to ask what you meant by the scaling used in the phyr example leading to "incorrect" estimates? Incorrect estimates of the fixed effects?

I'm trying to remember why I used the word "incorrect". Maybe I was assuming that the scaling used by MCMCglmm was the only correct one, and that the one you suggested ought to be doing the same.

> Note that phyr also uses MCMCglmm::inverseA internally to calculate the correct precision matrix; however, at the moment we use the tips-only option. I also came to the conclusion some time ago that we should use the whole phylogeny, including internal nodes, to calculate the precision matrix, and include missing data for the internal nodes. As mentioned above, this should be more accurate, but I have also found that for large phylogenies it is orders of magnitude faster, presumably because the full precision matrix is sparse, whereas the tips-only covariance matrix is dense, and INLA is very efficient for sparse matrices (this still seems very counter-intuitive to me, but I have been convinced by experience that it is true).

Yes, I think the improvement is because including the internal nodes makes the precision matrix sparse. Apparently, in MCMCglmm this improves mixing of the MCMC and leads to higher effective sampling with fewer iterations. In MCMCglmm it's completely straightforward: the internal nodes (and also any species in the phylogeny with no response data) are just treated automatically as missing response data, and their values can be predicted post hoc via the phylogenetic relationships.

> Anyway, I have not yet had the time to implement this change in phyr, partly because the main effects of interest in phyr are the 'nested' phylogenetic effects, which actually use the Kronecker product of the phylogenetic covariance and an iid covariance for the "sites", and I need to think carefully about how to do this correctly with the internal nodes (it is probably not as complicated as I think). I have been wondering for a while whether I could use the "group" argument of f() to accomplish an equivalent model, but haven't had time to look into it properly. If anyone has any thoughts on this, I would love to hear them!

I've tried something similar with MCMCglmm and it also works fine without any special adjustments when including internal nodes - I guess the non-existent site:internal node combinations are just treated as missing response data like before (they exist in the kronecker product matrix, so as predictors they're not missing). I don't know if it's equally straightforward with INLA.

Thanks again for your feedback, best wishes,

Richard.

Russell Dinnage

May 18, 2022, 9:35:31 AM
to Richard Bailey, R-inla discussion group
Hi Richard,

Thanks very much for getting back to me. This is very helpful feedback. I think the issue with the intercept is an interesting one. Out of curiosity, in your simulations did you always assume the root value was zero? Using constr = TRUE is necessary to make the intercept identifiable, so when they say the model is more 'stable', it means that the estimate of the intercept is not too strongly influenced by its prior. Note that the default prior is centred at zero, so it is possible that when you set constr = FALSE the model was giving the correct intercept estimate because zero happened to be the correct answer. However, setting constr = TRUE forces a constraint that is not biologically realistic; that is, there is no reason that the evolutionary process would necessarily have led to deviations in the traits that exactly cancel each other out to zero.

I think the overall issue is that the Brownian motion phylogenetic model is just weakly identified. After all, it is implicitly estimating (n + (n-1)) rate parameters for n effective data points. With the Gaussian restriction the effective number of parameters is lower than that, but if you look at the effective number of parameters estimated by INLA for the Brownian motion model, it is typically between n/2 and n, so there is generally an effective sample size of less than 2 per parameter. So expecting it to accurately estimate both the intercept (i.e. the ancestral state at the root) and all the evolutionary rates is asking a bit much.

My understanding of "generic2" is that the first parameter is the precision of the phylogenetic effect and the second parameter is the precision of the "environmental" effect, i.e. the iid 'random noise' component, which should be equivalent to the Gaussian error precision term when using a "generic0" Gaussian model. This is why you need to fix the Gaussian error term in the "generic2" model, otherwise you have two iid terms (see the sketch at the end of this message). I notice in the code you posted you didn't fix the family error precision term, so this may be why the results were different between the two models. Anyway, it sounds like you have moved away from the INLA models for the time being, so I suppose we should leave it at that. Just FYI, I am working on an R package that will be able to fit flexible phylogenetic models in INLA with a user-friendly interface, which you might be interested in. I should be doing an initial release within the next month or so. If you are interested I can keep you posted on that?
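
For concreteness, here is roughly what I mean by fixing the Gaussian error term (a sketch reusing the object names from your earlier posts, with the generic2 priors left at their defaults):

m.gen2 <- inla(phenotype ~ 1 +
                 f(spN, model = "generic2", Cmatrix = AinvN),
               data = testdata,
               # fix the Gaussian observation log-precision at a large value (noise ~ 0),
               # so the iid part of generic2 plays the role of the residual term
               control.family = list(hyper = list(theta = list(initial = 10, fixed = TRUE))))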

All the best,
Russell

Thomas F Johnson

Jun 15, 2023, 7:41:57 AM
to R-inla discussion group
Interesting discussion that I have been bouncing around for about two years. Just on the constr = F discussion, my results are definitely looking dodgy whenever I set constr = T

set.seed(3.141)

library(ape)
library(mvtnorm)
library(phytools)
library(INLA)
library(ggplot2)
library(brms)


tmp_tr = pbtree(n = 20) #Generate a random tree of 20 tips
cov = vcv.phylo(tmp_tr, corr = T) #Produce a covariance matrix scaled to a maximum of 1
rownames(cov) = c(1:20) #Index the matrix with numbers instead of tip names
colnames(cov) = c(1:20)
coef = data.frame(
  id = c(1:20),
  vec = as.vector(rmvnorm(1, rep(0,20), cov))) #Simulate data from this covariance
sd(coef$vec) #Look at the standard deviation of this simulated data
phylosig(tmp_tr, coef$vec, method = "lambda") #Test phylogenetic signal in the simulated data - it's high (big surprise...)!

#Use the simulated coefficients to produce random slopes with a degree of error
cmb_df = NULL
for(a in 1:20){
  tmp_df = data.frame(
    x = 1:10,
    y = c(1:10)*coef$vec[a] + rnorm(10,0,2),
    group = a)
  cmb_df = rbind(cmb_df, tmp_df)
}


#Visualise the data - some very variable slopes on show. The whole premise of this analysis is that we expect slopes to be more similar in closely related species, and this needs to be accounted for when trying to derive an estimate for the y ~ x slope (the parameter we are most interested in)
ggplot(cmb_df) +
  geom_line(aes(x = x, y = y, group = group))

prec = solve(cov) #Convert the correlation matrix into a precision matrix


pcprior = list(prec = list(
  prior="pc.prec",
  param = c(10, 0.1))
) #Simple pc prior

#The first model is a simple random-slope model with an iid structure
inla_iid = inla(y ~ x +
                  f(group,
                    x,
                    model = "iid",
                    constr = T,
                    hyper = pcprior),
                data = cmb_df,
                control.family = list(hyper = pcprior))
summary(inla_iid) #95% CIs around x vary from 0.45 - 0.64

#Same model as above, but without the pc prior. This difference in prior should cause small discrepancies in the outputs.
brm_iid <- brm(
  y ~ x + (x|group),
  data = cmb_df,
  family = gaussian(),
  chains = 1
)
summary(brm_iid) #95% CIs around x vary from 0.20 - 0.87. Quite different, but perhaps that's just because of differences in the prior... bear with me, as I'm not really interested in the absolute difference between inla and brms, but in how different options in inla change things (or not)


#Now we specify that the precision matrix from the phylogeny should describe the slopes (e.g. slopes exhibit phylo signal). Our expectation here would be that, by addressing non-independence in slopes, our uncertainty around the y ~ x relationship should increase
inla_constrT = inla(y ~ x +
                      f(group,
                        x,
                        model = "generic0",
                        Cmatrix = prec,
                        constr = T,
                        hyper = pcprior),
                    data = cmb_df,
                    control.family = list(hyper = pcprior))
summary(inla_constrT) #But no! We have an identical estimate to the simple random slope model (0.45 - 0.64)

brm_cov <- brm(
  y ~ x + (x|gr(group, cov = A)),
  data = cmb_df,
  family = gaussian(),
  data2 = list(A = cov),
  chains = 1
)
summary(brm_cov) #In contrast, with brms the 95% CI is now -0.44 - 1.58. The standard deviation around the y ~ x relationship is now almost an order of magnitude greater than in INLA, and nearly three times larger than in the initial brms model

#Interestingly, if we re-run the phylogenetic inla model but set constr = F, the results all of a sudden become more comparable to brms again
inla_constrF = inla(y ~ x +
                      f(group,
                        x,
                        model = "generic0",
                        Cmatrix = prec,
                        constr = F,
                        hyper = pcprior),
                    data = cmb_df,
                    control.family = list(hyper = pcprior))
summary(inla_constrF) #95% CIs range from -0.82 - 1.96

#And importantly, are comparable to what we would expect based on our simulation
print(quantile(coef$vec, probs = c(0.025, 0.975)))

#I have no idea what is going on under the hood here, but setting constr = T seems to screw with inference. Constr = F feels far more reliable despite all of the advice disagreeing with this. I would love it if someone could explain what is going on here

Helpdesk (Haavard Rue)

Jun 15, 2023, 7:48:42 AM
to Thomas F Johnson, R-inla discussion group

There is no correct choice between constr=T and F; it's up to you.

with

~ ... + x + f(group, x, model="iid", constr=T or F)

then with 'T' you'll get sum(group)=0, but what you might want is the weighted sum to be 0, so that

x + f()

has the interpretation of 'mean slope + variations around'.

If you know the weights, you can use 'extraconstr' instead.
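
For example, something like this (an untested sketch; the weights are yours to choose, and equal weights would just reproduce the ordinary sum-to-zero constraint):

w <- rep(1, 20)   # one weight per level of 'group'; replace with your own weights
inla(y ~ x +
       f(group, x, model = "iid",
         constr = FALSE,
         extraconstr = list(A = matrix(w, nrow = 1), e = 0)),
     data = cmb_df)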

I think the main issue is the variance of 'x' and f(): without constr they can be higher due to confounding, but with some data I think the effect is small.

H


Richard Bailey

Jun 15, 2023, 9:02:12 AM
to Thomas F Johnson, R-inla discussion group
Hi Thomas,

(I know you know a lot of this stuff already, but in case anyone else is reading and wants to know the objectives of these analyses).

I haven't gone through your email in detail, but regarding constr=T, what I found specifically was that my fixed-effect posterior estimates of the intercept (true value the data were simulated from = zero) were centred on the sample mean, with very tight credible intervals. Therefore, they didn't overlap with the true value of the intercept (0). Whatever is causing this outcome, it is unhelpful for standard phylogenetically informed comparative analysis. The typical objective is to incorporate the increased uncertainty due to correlations (evolutionary non-independence) among data points - reducing the effective number of data points - into estimation of fixed-effect parameters. Sometimes the objective is to estimate the phylogenetic variance components, but that's not how most people use this method - they are focused on dealing with the statistical non-independence issue while estimating fixed effects.

For me, setting constr=F produced results that made far more sense, with increased uncertainty in the intercept estimate compared to the non-phylogenetic model, and credible intervals from almost all simulated datasets overlapping with the true value (0). I didn't try adding any more fixed effects, or using a different intercept than zero.

Best regards,
Richard.
