Run Biogeobears on multiple trees?

Liming Cai

unread,

Aug 12, 2015, 9:46:38 PM8/12/15

to BioGeoBEARS

Hi Nick,

I am doing a Biogeobears analysis across multiple bootstrap trees. I notice that there is a source code available for such purpose in BioGeoBEARS_on_multiple_trees_v1.R. I prepared all the input files, including one that contains the filenames of the tree files. I scan this file into the newick_fns argument as follows:

newick_fns = scan(“treelist.txt", what="character", sep=NULL)

And trying to run the run_bears_optim_on_multiple_trees function with following line:

res = run_bears_optim_on_multiple_trees(BioGeoBEARS_run_object, newick_fns, model_name="Elatinaceae_DEC", geog_fns, resfns, run_or_collect="run", start_treenum=1, end_treenum=length(newick_fns), runslow=TRUE, plot_params=FALSE)

As a result, it turns out I can only run the first tree. And when the analysis goes into the second tree, the following error occurred (this is a test run, so there are only three trees):

Could you please help me with that? In addition, is it correct to have each bootstrap tree separately in a file, and list the filename as a string in newick_fns? There’s also a function to summarize the result the plot the parameters onto a master tree. I think I need guide to use those functions as well: what the input argument should be like and how to call the function correctly in a R console.

ps. My input files and R scripts are attached here.

Thank you very much and best regards,

Liming

BioGeoBears_Analysis_1508.zip

Nick Matzke

unread,

Aug 13, 2015, 8:22:02 AM8/13/15

to BioGeoBEARS

Thanks for the question, and files.

OK, so the initial problem here was that geog_fns should have as many filenames as newick_fns. I originally programmed this for a case where OTUs and geography were not identical between runs, because of uncertain fossil placement. But, usually, geography stays the same, so I've added a check.

I also discovered that even when it was looping through successfully, it was using the same master tree (!) rather than actually reading the different input trees (the script was missing the readfiles_BioGeoBEARS_run and section_the_tree commands inside the loop. So, if anyone else was using this feature, please re-run. I've uploaded the revised file to PhyloWiki.

Script is below, and modified zipfile. Cheers!

Nick

PS: I should say -- this is beta code, so "use at your own risk." I put up the multiple-trees file semi-accidentally, the "pseudo-Bayesian" approach of running models over a bunch of trees is not a high priority for me. I know it is popular and commonly requested, but:

(a) The cost-benefit of the pseudo-Bayesian approach seems dubious to me, at least for typical analyses (the tree is actually pretty confident, non-time-stratified analysis, the pseudo-Bayesian averaged result looks about the same as the standard ML result)

(b) There are some theoretically dubious features about it (see mini-rant at the end of the script, below) -- plus, another one I just thought of -- if you have a collection of bootstrap or posterior trees, when are nodes "equivalent". E.g., if:

- node 1 in tree 1 has species A and species B above it, and

- node 1351 in tree 2 has species A and species B above it,

- are they equivalent if (A,B) is sister to (C,D) in tree 1, and sister to (Z, X) in tree 2?

That's what we are averaging when we average the pseudo-Bayesian analyses. It gets even more complex when you think about "corners" at the branch bottoms. I made some fairly arbitrary decisions about this stuff in the script, others might prefer a different way.

(c) I figure anyone who knows how to run BioGeoBEARS can probably figure out a for-loop and customize it themselves for whatever scale of problem they are doing; it's difficult to program for all of the different ways people might want to do it (as there are significant issues of speed, file storage, etc., at least for large analyses).

==================

###########################################################

# Run Biogeobears on multiple trees and store the output

###########################################################

#Load libraries and additional source code

###########################################################

library(optimx)

library(FD)

library(snow)

library(parallel)

library(BioGeoBEARS)

library(stringr)

source("http://phylo.wdfiles.com/local--files/biogeobears/BioGeoBEARS_add_fossils_randomly_v1.R")

source("http://phylo.wdfiles.com/local--files/biogeobears/BioGeoBEARS_basics_v1.R")

source("http://phylo.wdfiles.com/local--files/biogeobears/BioGeoBEARS_calc_transition_matrices_v1.R")

source("http://phylo.wdfiles.com/local--files/biogeobears/BioGeoBEARS_classes_v1.R")

source("http://phylo.wdfiles.com/local--files/biogeobears/BioGeoBEARS_detection_v1.R")

source("http://phylo.wdfiles.com/local--files/biogeobears/BioGeoBEARS_DNA_cladogenesis_sim_v1.R")

source("http://phylo.wdfiles.com/local--files/biogeobears/BioGeoBEARS_extract_Qmat_COOmat_v1.R")

source("http://phylo.wdfiles.com/local--files/biogeobears/BioGeoBEARS_generics_v1.R")

source("http://phylo.wdfiles.com/local--files/biogeobears/BioGeoBEARS_models_v1.R")

source("http://phylo.wdfiles.com/local--files/biogeobears/BioGeoBEARS_on_multiple_trees_v1.R")

source("http://phylo.wdfiles.com/local--files/biogeobears/BioGeoBEARS_plots_v1.R")

source("http://phylo.wdfiles.com/local--files/biogeobears/BioGeoBEARS_readwrite_v1.R")

source("http://phylo.wdfiles.com/local--files/biogeobears/BioGeoBEARS_simulate_v1.R")

source("http://phylo.wdfiles.com/local--files/biogeobears/BioGeoBEARS_SSEsim_makePlots_v1.R")

source("http://phylo.wdfiles.com/local--files/biogeobears/BioGeoBEARS_SSEsim_v1.R")

source("http://phylo.wdfiles.com/local--files/biogeobears/BioGeoBEARS_stochastic_mapping_v1.R")

source("http://phylo.wdfiles.com/local--files/biogeobears/BioGeoBEARS_stratified_v1.R")

source("http://phylo.wdfiles.com/local--files/biogeobears/BioGeoBEARS_univ_model_v1.R")

source("http://phylo.wdfiles.com/local--files/biogeobears/calc_uppass_probs_v1.R")

source("http://phylo.wdfiles.com/local--files/biogeobears/calc_loglike_sp_v01.R")

source("http://phylo.wdfiles.com/local--files/biogeobears/get_stratified_subbranch_top_downpass_likelihoods_v1.R")

source("http://phylo.wdfiles.com/local--files/biogeobears/runBSM_v1.R")

source("http://phylo.wdfiles.com/local--files/biogeobears/stochastic_map_given_inputs.R")

source("http://phylo.wdfiles.com/local--files/biogeobears/summarize_BSM_tables_v1.R")

calc_loglike_sp = compiler::cmpfun(calc_loglike_sp_prebyte) # crucial to fix bug in uppass calculations

calc_independent_likelihoods_on_each_branch = compiler::cmpfun(calc_independent_likelihoods_on_each_branch_prebyte)

############################################################

# Universal setup, input file path

############################################################

wd = "/Users/lcai/Dropbox/Davis\ Lab/2013-2014_Elatinaceae/BioGeoBears_Analysis_1508/"

wd = "/drives/Dropbox/_njm/__packages/BioGeoBEARS_setup/z_help/Liming/"

setwd(wd)

extdata_dir = np(system.file("extdata", package="BioGeoBEARS"))

extdata_dir

list.files(extdata_dir)

trfn = "RAxML_best0821.newick"

geogfn = "speciesrange.data"

max_range_size = 3

newick_fns = scan("treelist", what="character", sep=NULL)

#geog_fns = "speciesrange.data"

geog_fns = scan("geoglist", what="character", sep=NULL)

resfns = scan("outputnames", what="character", sep=NULL)

############################################################

# Run DEC on multiple trees

############################################################

BioGeoBEARS_run_object = define_BioGeoBEARS_run()

BioGeoBEARS_run_object$force_sparse=FALSE

BioGeoBEARS_run_object$speedup=TRUE # shorcuts to speed ML search; use FALSE if worried (e.g. >3 params)

BioGeoBEARS_run_object$use_optimx = TRUE

BioGeoBEARS_run_object$calc_ancprobs=TRUE # get ancestral states from optim run

BioGeoBEARS_run_object$timesfn = "time"

BioGeoBEARS_run_object$dispersal_multipliers_fn = "sevenarea.mat.rm.ram"

BioGeoBEARS_run_object$areas_allowed_fn = "areas_allowed.txt"

BioGeoBEARS_run_object$geogfn = geogfn

BioGeoBEARS_run_object$trfn = trfn

BioGeoBEARS_run_object$max_range_size = max_range_size

BioGeoBEARS_run_object$num_cores_to_use=5

BioGeoBEARS_run_object$force_sparse=FALSE

BioGeoBEARS_run_object = readfiles_BioGeoBEARS_run(BioGeoBEARS_run_object)

BioGeoBEARS_run_object = section_the_tree(inputs=BioGeoBEARS_run_object, make_master_table=TRUE, plot_pieces=FALSE)

BioGeoBEARS_run_object$return_condlikes_table = TRUE

BioGeoBEARS_run_object$calc_TTL_loglike_from_condlikes_table = TRUE

BioGeoBEARS_run_object$calc_ancprobs = TRUE

BioGeoBEARS_run_object

BioGeoBEARS_run_object$BioGeoBEARS_model_object

BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table

check_BioGeoBEARS_run(BioGeoBEARS_run_object)

# multiple tree settings

#' @BioGeoBEARS_run_object Set up the inference model you want to run on

#' each tree, like you would for a normal single-tree run.

#' @newick_fns A list of the Newick files (e.g., you should extract some trees

#' from a BEAST NEXUS MCMC output)

#' @model_name The name you would like added to output filenames

#' @geog_fns A list of corresponding geography files (by default, these are just .geog instead of .newick)

#' @resfns A list of results filenames for .Rdata files, either for saving BioGeoBEARS analyses on each tree, or

#' for loading previously-saved runs (by default, these are just _model_name.Rdata instead of .newick)

#' @run_or_collect If you want to run BioGeoBEARS on each tree (slow), use "run". If you just want to

#' collect the results over all the trees, use "collect". For both, pick "both".

#' @start_treenum Default 1. Change if you want to skip some trees

#' @end_treenum Default is length(newick_fns). Change if you want to run a subset of trees.

#' @runslow If FALSE, old, saved .Rdata results are loaded via load(). Default TRUE generates new .Rdata files and

#' saves them via save()

res = run_bears_optim_on_multiple_trees(BioGeoBEARS_run_object, newick_fns, model_name="Elatinaceae_DEC", geog_fns, resfns, run_or_collect="run", start_treenum=1, end_treenum=length(newick_fns), runslow=TRUE, plot_params=FALSE)

res

# Now, collect the saved results from each individual run

res2 = run_bears_optim_on_multiple_trees(BioGeoBEARS_run_object, newick_fns, model_name="Elatinaceae_DEC", geog_fns, resfns, run_or_collect="collect", start_treenum=1, end_treenum=length(newick_fns), runslow=TRUE, plot_params=FALSE)

#res2

names(res2)

# This has the (averaged) state probabilities at nodes and corners

# for the master tree (whatever you put in BioGeoBEARS_run_object)

dim(res2$state_probs_at_nodes_across_all_trees)

dim(res2$state_probs_at_corners_across_all_trees)

# This has the averaged parameter values

res2$optim_results_mean

# To plot with the standard BioGeoBEARS plot function you will need to

# - run a standard analysis on the master tree, and

# - then copy these average state probabilities into that res object

# (into the state probabilities at nodes, and state probabilities at branch bottoms)

# - then copy res2$optim_results_mean over the standard optim results

# Nick's random rant on the pseudo-Bayesian approach

#

# After you've done all this, you'll have a plot of probabilities of states,

# averaged over many trees. Yay.

#

# It will probably look pretty much like a plot of one run on the single ML or MCC tree.

# This is why I am skeptical of the popularity of "Bayes DIVA", "Bayes Lagrange", etc.

# approaches. They aren't fully Bayesian -- more like pseudo-Bayesian. It is mostly

# done to mollify reviewers who heard somewhere it was a good idea. The danger is that

# people think they are "accounting for uncertainty" by doing this approach. The REALLY BIG

# uncertainty, though, is in things like model misspecification, missing species (due to

# sampling or extinction). And the "single most probable state" plots hide the uncertainty

# just the ML ancestral state estimates anyway (do stochastic mapping to real feel that

# uncertainty).

#

# It would be better to spend time thinking about and discussing all that stuff,

# rather than spending days running a model on a bunch of (typically) very

# similar trees that (typically) all say about the same thing.

#

# The best argument for doing the pseudo-Bayesian approach is if you have a particular

# node you are interested in, and the geography estimated at that node is strongly

# affected by nearby uncertainty in the tree. In that case,

# running a model over a bunch of trees can be useful.

#

# I may be overly skeptical. But I have yet to see a case where the pseudo-Bayesian approach

# gave any real additional insight. It's only likely if your tree is very uncertain, but

# typically people get enough data to avoid that. Very often, I guess, people just want to

# be able to say, "we ran it on 100 trees and got the same result". But really, if your

# tree is decently estimated, you actually knew that going in, from first principles.

#

# The pseudo-Bayesian approach might be useful if your dating is highly uncertain (as is common),

# and you are doing a time-stratified analysis where uncertain nodes are near the time-strata.

# This would be the case I would focus on.

#

==================

Liming.zip

Liming Cai

unread,

Aug 17, 2015, 12:25:43 PM8/17/15

to BioGeoBEARS

Hi Nick,

Thank you very much for the thoughtful advice and comments.

Best regards,

Liming

Liming Cai

unread,

Aug 23, 2015, 9:08:12 PM8/23/15

to BioGeoBEARS

Hi Nick,

I have a following question. This might be naive for you, but after I ran the analysis and got the averaged state probs at corners/nodes across all trees, I ran into problems when I want to plot this on my master tree. Supposing that the results from master tree are loaded by resDEC and results from multiple trees are loaded with res2. I simply copy res2 to resDEC as follows:

> resDEC$optim_result=res2$optim_results_mean

> resDEC$ML_marginal_prob_each_state_at_branch_bottom_below_node=res2$state_probs_at_corners_across_all_trees

> resDEC$ML_marginal_prob_each_state_at_branch_top_AT_node=res2$state_probs_at_nodes_across_all_trees

Then I tried to plot the result what I usually do with single-tree analysis. The following error occurred:

> plot_BioGeoBEARS_results(results_object, analysis_titletxt, addl_params=list("j"), plotwhat="pie", label.offset=0.3, tipcex=0.3, statecex=0.3, splitcex=0.3, titlecex=0.8, plotsplits=TRUE, cornercoords_loc=scriptdir, include_null_range=TRUE)

Read 9 items

Read 404 items

Error in floating.pie.asp(XX[i], YY[i], pie[i, ], radius = xrad[i], col = piecol) :

floating.pie: x values must be non-negative

Called from: floating.pie.asp(XX[i], YY[i], pie[i, ], radius = xrad[i], col = piecol)

Browse[1]>

I wonder whether I did it correctly to plot the results like this. Or are there any additional analysis should be conduct to summarize the results before I can plot them onto a master tree?

Best regards,

Liming

ps. Here is the google drive link for the files:

https://drive.google.com/file/d/0B7vdvJ-uDyIFT0Q3eUpJZllONGc/view?usp=sharing

On Thursday, August 13, 2015 at 8:22:02 AM UTC-4, Nick Matzke wrote:

Nick Matzke

unread,

Aug 23, 2015, 9:21:49 PM8/23/15

to bioge...@googlegroups.com

That's pretty weird. It suggests you have negative values in the state_probs somehow, which then cause the pie chart to malfunction. But I don't see how you could get that out of an averaging operation. I am doing workshops today/tomorrow, so it will be awhile before I can look. You could look for negative values in the state_probs and try to trace where they came from. Perhaps more probable is NA values caused by some mis-match between your actual master_tree and some other tree that got put in accidentally?

Cheers, Nick

--
You received this message because you are subscribed to the Google Groups "BioGeoBEARS" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biogeobears...@googlegroups.com.
To post to this group, send email to bioge...@googlegroups.com.
Visit this group at http://groups.google.com/group/biogeobears.
To view this discussion on the web visit https://groups.google.com/d/msgid/biogeobears/59d49dc9-933b-403d-a511-9ca807dfc5ad%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Pau Carnicero Campmany

unread,

Jan 21, 2016, 8:24:40 AM1/21/16

to BioGeoBEARS, mat...@nimbios.org

Hi all,

I got the same error as Liming.

When I checked the objects res2$state_probs_at_nodes_across_all_trees and res2$state_probs_at_corners_across_all_trees I found out that they contain the probabilities of each node and each tree, not the average probabilities of all trees analyzed. These objects have n times (the number of trees analyzed) the number of rows of res$ML_marginal_prob_each_state_at_branch_top_AT_node (probabilities at nodes for the single tree analysis), I expected them to be the same size..

So maybe this is the problem? How can I obtain the averaged probabilities at nodes and corners?

Thanks,

Nick Matzke

unread,

Jan 27, 2016, 1:45:37 AM1/27/16

to BioGeoBEARS, mat...@nimbios.org

I took a look at this, somehow I left out the summarization step. I have revised the multiple-trees script, and run it on the dataset. I am pasting the script below, with an updated mini-essay on why the whole "averaging over many runs" approach of "Bayes-DIVA", "Bayes-Lagrange", and/or "Bayes-BioGeoBEARS" is kind of dubious in the first place (for starters, it's not really Bayesian, more like pseudo-Bayesian).

Cheers!

Nick

EXAMPLE SCRIPT -- RUNNING BIOGEOBEARS ON MULTIPLE TREES,

AND SUMMARIZING VIA A DUBIOUS AVERAGING OF "COMMON" ("SHARED") NODES AMONG TREES

Input/output files: http://phylo.wikidot.com/local--files/biogeobears/running_BioGeoBEARS_multiple_trees.zip

=====================

###########################################################

# Run BioGeoBEARS on multiple trees and store the output

tipranges = getranges_from_LagrangePHYLIP(lgdata_fn=geogfn)

tipranges

# Run the analysis, on the master tree

# For a slow analysis, run once, then set runslow=FALSE to just

# load the saved result.

runslow = TRUE

resfn = "master_tree_single_run_v1.Rdata"

if (runslow)

{

master_tree_res = bears_optim_run(BioGeoBEARS_run_object)

master_tree_res

save(master_tree_res, file=resfn)

resDEC = master_tree_res

} else {

# Loads to "master_tree_res"

load(resfn)

resDEC = master_tree_res

}