Statistical Comparisons and Adjacency

Théo Driancourt

Mar 20, 2024, 10:29:48 AM
to BioGeoBEARS
Hi everyone, I'm discovering biogeography through BioGeoBEARS and I'm currently dealing with a problem.
So for my project, we set 9 areas and constrained the possibilities in the states_list manually.

Running the models to compare them statistically, I ran a DEC model with no change to the states_list (which I am calling 'unconstrained'), a DEC model with the manual changes to the states_list (which I am calling 'constrained'), an unconstrained DEC+J model, and a constrained DEC+J model.
The idea is to produce a table similar to the one found in K.V. Klaus & N.J. Matzke (Systematic Biology, 2020).

The goal of that table is to statistically compare these models, but I'm not sure how to do it. Are my constrained models nested inside DEC-unconstrained, and am I therefore allowed to run an LRT on these pairs: (DEC-unconstrained & DEC-constrained), (DEC+J-unconstrained & DEC+J-constrained), (DEC-unconstrained & DEC+J-constrained)?
If not, am I even allowed to compare their AICs?

Thanks everyone,
Théo.

Théo Driancourt

Mar 21, 2024, 7:47:50 AM
to BioGeoBEARS
Edit :

I was also wondering whether changing the states_list has to be considered a parameter? I would guess not, because, just like the +d models (dispersal multipliers), it doesn't add any free parameter. But you never know!

Thanks.

Nick Matzke

Apr 11, 2024, 6:52:32 PM
to BioGeoBEARS
Apologies for missing these questions.  Answers...

On Friday, March 22, 2024 at 12:47:50 AM UTC+13 etudesbi...@gmail.com wrote:
Edit :

I was also wondering whether changing the states_list has to be considered a parameter? I would guess not, because, just like the +d models (dispersal multipliers), it doesn't add any free parameter. But you never know!

A. Short answer: no.  Likelihoods, AIC, AICc, BIC, etc., are only comparable when you have different models fit to *identical* datasets.  The list of states/ranges, including the fact that AB is one of the allowed states in a list of 8 states from Null to ABC, is part of the data.  So if you change the list of allowed states, you are basically changing the data, and the likelihoods are incomparable.  You can see this in the reductio ad absurdum case where you reduce the list of allowed ranges to exactly match the list of observed ranges.  This would probably get a good log-likelihood, but what are you then assuming?  If you had another clade in the same regions, would you reduce their list of allowed ranges to a different list of observed ranges?  So, the conventional wisdom is: decide your areas/ranges from first principles, taking into account the known discreteness in the geography and how the clade is likely responding to it.
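To make this concrete, here is a minimal R sketch, assuming the standard BioGeoBEARS helper rcpp_areas_list_to_states_list(); which ranges get dropped below is purely hypothetical.  The point is just that two runs set up with different states_lists are fit to different state spaces, i.e. different data, so their LnL/AIC values are not directly comparable.

library(BioGeoBEARS)

areas <- c("A", "B", "C")
max_range_size <- 3

# Full state space: null range + all combinations of up to 3 areas
states_full <- rcpp_areas_list_to_states_list(areas = areas,
                                              maxareas = max_range_size,
                                              include_null_range = TRUE)

# "Constrained" state space: manually dropping two ranges (a hypothetical choice)
states_cut <- states_full[-c(5, 7)]

length(states_full)  # 8 states: null, A, B, C, AB, AC, BC, ABC
length(states_cut)   # 6 states -- a different state space, hence different data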

B. The long answer is: kind of, maybe, but no one has published a detailed argument for it.  Technically, one can imagine a transition matrix where certain states are not accessible -- where all of the rates going to those states are 0.0.  So, you could put parameters on those rates leading to unobserved states, and fit them, and then do that model comparison, because what you are changing is the rate parameters, not the data.  But, again, you could easily end up in a situation where you have a free parameter for every state, and have more parameters than observations. 
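A toy base-R sketch of that idea (this is not BioGeoBEARS code; the states and the 0.1 rate are purely illustrative): keep the full state space, but zero every rate leading into the state you want to be inaccessible, so the data stay the same and only the rate parameters change.

states <- c("null", "A", "B", "AB")
Q <- matrix(0.1, nrow = 4, ncol = 4, dimnames = list(states, states))  # toy rates
diag(Q) <- 0
Q[, "AB"] <- 0          # make "AB" inaccessible: every rate into it is 0.0
diag(Q) <- -rowSums(Q)  # rows of a rate matrix must sum to zero
Q

In principle those zeroed entries could instead be left as free parameters for the optimizer to estimate, which is the 'fit it as rates' version of the comparison described above.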

(C. This is actually basically what the "DEC*" model did for the null-range state -- see the bioRxiv paper mentioned in the Google Group archives.  The effects were dramatic for certain datasets, because if you have a state that is unobserved (like the null-range state), the model optimization will try *very hard* to get parameters that make that state unlikely.  For datasets with all single-area ranges, this means that DEC would usually produce e=0.0 or a low value, but when the null state is removed, DEC* would produce e=Infinity because there was no way for single-area ranges to turn into the null range, which basically converts the DEC model into a standard unordered character model.  The real solution here would be a full SSE model, which uses lineage extinction instead of a null range, but these have their own issues, i.e. more parameters, complexity, getting reasonable lineage extinction rates, etc.)

I suspect there is a way to compromise between A and B/C, perhaps a complex Bayesian analysis where the list of allowed states can change and itself be optimized.  But whether it would be worth the time & effort to set it up, for typical datasets with 10s to 100s of species/ranges, is dubious to me...but people are welcome to try!

 
Thanks.
On Wednesday, March 20, 2024 at 3:29:48 PM UTC+1 Théo Driancourt wrote:
Hi everyone, I'm discovering biogeography through BioGeoBEARS and I'm currently dealing with a problem.
So for my project, we set 9 areas and constrained the possibilities in the states_list manually.

Running the models to compare them statistically, I ran a DEC model with no change to the states_list (which I am calling 'unconstrained'), a DEC model with the manual changes to the states_list (which I am calling 'constrained'), an unconstrained DEC+J model, and a constrained DEC+J model.
The idea is to produce a table similar to the one found in K.V. Klaus & N.J. Matzke (Systematic Biology, 2020).


The conventional meaning of "constrained" vs. "unconstrained" is about rates, e.g. constraining parameters or dispersal rates to 0.0, so I would keep those words for that.  The "nesting" language is also about parameters: j=0.0 is nested inside a +J model with j as a free parameter, so DEC is nested inside DEC+J.
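For the comparison that is a genuine case of nesting (DEC inside DEC+J, fit to the same states_list), here is a minimal base-R sketch of the LRT and AIC arithmetic; the log-likelihoods below are placeholders, to be replaced with the LnL values from your own bears_optim_run() results (e.g. via get_LnL_from_BioGeoBEARS_results_object()).

LnL_DEC  <- -123.4   # placeholder: DEC fit, 2 free parameters (d, e)
LnL_DECJ <- -118.9   # placeholder: DEC+J fit, 3 free parameters (d, e, j)

D <- 2 * (LnL_DECJ - LnL_DEC)                  # LRT statistic
pval <- pchisq(D, df = 1, lower.tail = FALSE)  # df = difference in number of free parameters

AIC_DEC  <- 2 * 2 - 2 * LnL_DEC
AIC_DECJ <- 2 * 3 - 2 * LnL_DECJ
c(LRT = D, p = pval, AIC_DEC = AIC_DEC, AIC_DECJ = AIC_DECJ)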


 
The goal of that table is to statistically compare these models, but I'm not sure how to do it. Are my constrained models nested inside DEC-unconstrained, and am I therefore allowed to run an LRT on these pairs: (DEC-unconstrained & DEC-constrained), (DEC+J-unconstrained & DEC+J-constrained), (DEC-unconstrained & DEC+J-constrained)?
If not, am I even allowed to compare their AICs?

It's definitely not a case of nesting statistical models. 

Can likelihoods be compared anyway?  I am dubious.  Typically removing unobserved states will improve the likelihood, but what does that say?  Do you actually know that the removed range was an impossible range, one which would not be reached if you re-ran the evolutionary history of this group 1000 times?   Or are you just tweaking the data in a kind of manual, unrecorded human fitting of the model around the observations?

Decisions about the list of allowed areas and states/ranges are, I think, best based on a priori considerations, and additionally on computational speed where necessary.

Cheers!
Nick


 


 