The +w model option in BioGeoBEARS

552 views

Skip to first unread message

Nick Matzke

unread,

Apr 8, 2016, 12:06:44 AM4/8/16

to bioge...@googlegroups.com

I wrote this up in response to a question, but figured it would be better to share on the group:

========================================

BASICS OF THE +w MODEL IN BIOGEOBEARS

========================================

Hi - thanks for the question about the +w model. There may not be scripts for the specific case of using the +w model. The manuscript using +w for the first time is in review, its current form is:

Dupin, Julia; Matzke, Nicholas J.; Sarkinen, Tiina; Knapp, Sandra; Olmstead, Richard; Bohs, Lynn; Smith, Stacey (2015). Bayesian estimation of the global biogeographic history of the Solanaceae. Submitted.

It is somewhat similar to the +x model, which modifies dispersal probability by distance^x, described in:

Van Dam, Matthew; Matzke, Nicholas J. (2016). Evaluating the influence of connectivity and distance on biogeographic patterns in the south-western deserts of North America. Journal of Biogeography. Special paper, published online 3 March 2016.

http://dx.doi.org/10.1111/jbi.12727

LONG VERSION OF THE +w MODEL

The parameter "w" modifies user-specified manual dispersal multipliers. The main reason to do this is that you, the researcher, might not feel comfortable pulling dispersal multipliers between regions out of a hat. Should cross-ocean dispersal be a 0.1 multiplier or a 0.001 multiplier? Etc.

The parameter "w" causes this change to the dispersal rates (and dispersal weight, for j):

dispersal rate = (base dispersal rate) * (manual dispersal multiplier matrix) ^ w

During the ML search, various values of "w" are tried (if you have set w to be free), and if they result in the model conferring a higher likelihood on the data, they those values are favored. If w=1, then the user's manual dispersal multipliers were the best fit. If w>1, then the effect of the dispersal multipliers is increased, and the differences between different dispersal multipliers are increased. If w<1, then the differences between different dispersal multipliers are decreased. If w=0, then the manual dispersal multipliers are wiped out completely.

The (potential) advantage of this is that the researcher does not have to pretend that they know for sure what the correct dispersal multipliers should be -- instead, these can be estimated from the data.

However, one weakness of this approach is that the relative scaling between different dispersal multipliers is still a user decision; this is only partially solved by having the w problem. In general, I prefer using the +x model (dispersal probability multiplied by distance^x), since distances are more objective. However, distances might not always cover all hypothesized geography of interest.

Users can, of course, combine +w, +x, and other features (different base models, the +n environmental distance model, multiple types of distances, etc.). WORD OF WARNING: AS YOU ADD MORE FREE PARAMETERS, THE MAXIMUM LIKELIHOOD OPTIMIZATION ALGORITHMS CAN HAVE MORE PROBLEMS FINDING THE MAXIMUM LIKELIHOOD

ALSO: I THINK IT IS SOMEWHAT DUBIOUS, SCIENTIFICALLY, TO JUST RUN EVERY MODEL YOU CAN THINK OF, CREATING HUNDREDS OF DIFFERENT MODELS, BY COMBINING ALL OF THE DIFFERENT MODEL OPTIONS YOU CAN THINK OF, AND THUS BEING MAXIMALLY OBJECTIVE/AGNOSTIC.

At this stage of biogeographical science, it is probably better to have clear hypotheses in mind, and aim at testing those. Running zillions of models is (a) wasteful of your time/money, (b) increases the chance of having some ML searches fail, invalidating your model comparisons; (c) increases the chances of some model doing best "by chance", appearing to give a scientifically interesting signal, but actually just representing the fact that you tried a zillion models and one fit better by chance.

Reasonable people can disagree about what the right balance is on these issues, but people should at least think about them. [end editorial - discussion welcome]

SHORT VERSION OF THE +w MODEL

You load a dispersal multipliers file by uncommenting this:

#BioGeoBEARS_run_object$dispersal_multipliers_fn = "manual_dispersal_multipliers.txt"

You turn on "w" by editing the BioGeoBEARS_model_object parameters table:

# Add w as a free parameter

BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["w","type"] = "free"

BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["w","init"] = 0

BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["w","est"] = 0

BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["w","min"] = 0

BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["w","max"] = 3

So, when w=0,

dispersal rate = (base dispersal rate) * (manual dispersal multiplier matrix) ^ 0

dispersal rate = (base dispersal rate) * 1

dispersal rate = (base dispersal rate)

Thus, you have 2 null models to test:

Null model 1: w fixed to 0: this equals the situation you have when you no dispersal multiplier matrix at all

Alternate model 1a: w free. This model has 1 extra parameter.

Compare the two with a Likelihood Ratio Test.

Null model 2: w fixed to 1: this equals the situation when you assume your dispersal multiplier matrix is correct

Alternate model 1b: w free. This model has 1 extra parameter.

Compare the two with a Likelihood Ratio Test.

NOTE ON ZEROS IN MANUAL DISPERSAL MULTIPLIER MATRICES

Sometimes people have BioGeoBEARS crashes because of "hard zeros" in their manual dispersal multipliers matrix. If you have a zero dispersal rate between two areas, but your data require that a dispersal event occur (particularly in a particular time-bin), then your data will have -Inf log-likelihood, and the analysis will crash. One solution is to change the 0 values in the manual dispersal multipliers matrix to e.g. 0.00000001.

OPTIMIZATION ISSUES

Alternate models 1a and 1b are actually the same model, but you might start the search ("init"=initial value) for 1a at w=0, and for 1b at w=1. If both searches end up with the same estimate ("est") for w, then you can be reasonably sure the ML search is working well. If they don't, then, probably, the likelihood surface is fairly flat (your data don't have a strong preference for the value of w), and you may have to use these sorts of strategies to find the ML solution:

- parameter rescaling (BioGeoBEARS_run_object$rescale_params=TRUE),

- tighten the limits

- use longer searches ($speedup=FALSE, $use_optimx=FALSE)

- different start points

and/or

- manual tries with human guesses at the value of w

Some of this could be automated in the program, kind of, but I believe it is better to have humans *looking* and *thinking about* and *understanding* their optimzation results, rather than treating ML optimization as a black box that always works.

Once you have ML estimates, you can compare null model 1, null model 2, alternate model 1a-1b with AIC, AICc, etc.

Advice on the general process of optimization and model choice:

Advice On Statistical Model Comparison In BioGeoBEARS

http://phylo.wikidot.com/advice-on-statistical-model-comparison-in-biogeobears

BioGeoBEARS Mistakes To Avoid - Issues with Maximum Likelihood (ML) optimization

http://phylo.wikidot.com/biogeobears-mistakes-to-avoid#optim

File formats: http://phylo.wikidot.com/biogeobears#links_to_files

NOTE: THE AREAS IN THE MANUAL DISPERSAL MULTIPLIERS FILE SHOULD BE IN THE SAME ORDER AS IN THE HEADER OF THE GEOGRAPHY FILE. That means *both* the columns *and* rows should be in the order of the areas in the geography file.