Anonymized: Interpreting Biogeographical Stochastic Mapping (BSM) in BioGeoBEARS

1,106 views
Skip to first unread message

Nick Matzke

unread,
May 8, 2016, 3:03:47 AM5/8/16
to bioge...@googlegroups.com
Here is an email I just sent on the basics of Biogeographic Stochastic Mapping (BSM), and interpreting plots. 

I am anonymizing it to post more generally, as it may help others. I've also added more headers and links.

Cheers!
Nick


> On Sun, May 8, 2016 at 3:51 PM, XXX wrote:
> Dear Nick,
> I'm struggling to understand the BSM outputs, but I have found little (or none) advance so far.
> I was able to run the analysis with your wonderfully clear scripts, but I still have no idea how to interpret the histograms (I have attached here one example).
> Today I came across this nice paper on New World bats:
> and that helped me to better formulate my questions:
> 1) What exactly does "the most frequent modes of range inheritance" mean? How can I interpret that from the histogram? For example, in the bats paper the authors say "Range copying and range subset within the same large area represent 84.1% of the total number of events at cladogenesis, or 160 of the 191 nodes in the phylogeny".
> 2) Which also brings my next question: How can I correlate the histogram with each node of my phylogeny? How on Earth do they know that "Anagenetic dispersal was more frequent between the two continental regions, particularly from South America to Central and North America"? I would love to get that information for my dataset.
> I think the bottomline of my problem is to really understand what does the histogram mean, how to interpret it, and how I can correlate that information with my tree.
> I think that understanding this is the 'last frontier' to come with a full circle with my PhD investigation. Even if this analysis is just a complement to the main BGB results, I am sure that understanding it would enrich my discussion, and definitely help me to continue to scratch the surface of this amazing model-based biogeography.
> Thank you so much for all your attention so far! Have a nice weekend!
> XXX



Hi XXX!

Glad you're looking at BSM stuff!  Apologies that the documentation isn't all there, I haven't even had time to get the basic paper on the BSM method out yet! Thanks for the link to the SysBio paper ( http://sysbio.oxfordjournals.org/content/65/3/432.abstract ), I'm glad this stuff is getting used so fast, although it should be considered "beta" I guess until it is officially published.



DEFINITION OF "A HISTORY" (on a phylogeny)

As you know, in real-life biogeographic history, there was only one series of events (dispersal events etc.) that took place as a particular clade evolved.  The problem is that we don't know that exact history.  So, we are trying to *infer* that history, using various phylogenetic inference methods.

The True History, if we had it, would have an exact count of the number of events of each type that had occurred on the phylogeny, and each event would have taken place at an exact time.  Similarly, if we had the exact history of a DNA base, or of a morphological character, every substitution event would have occurred at some exact time point.


INFERRING ANCESTRAL STATE PROBABILITIES WITH MAXIMUM LIKELIHOOD / BAYESIAN INFERENCE

Of course, we don't have this history, so we have to infer it. The most common methods are likelihood and Bayesian, which both rely on stochastic models (transition rate matrices).  These methods are actually calculating the probability of the observed data *under all histories possible under the model*, conditional on the data and the inferred model parameters.  This is impressive, because there are an infinite number of exact histories.  These are the methods that, at each node, give you the probability of that node being in state 1, state 2, etc.  You can actually calculate these probabilities anywhere on the tree, not just at nodes, but nodes are computationally convenient.


TYPICAL MISTAKES MADE WHEN INTERPRETING ANCESTRAL STATE PROBABILITY PLOTS (or ancestral geographic range probability plots)

The main problem with these methods is that people often want to know more than just the probabilities -- they want to know when events occurred, and how many.  You can get a sense of this from looking at the probabilities plots, but it can be vague, and humans tend to over-simplify and interpret the probabilities plots in a "parsimony-thinking" sort of way.  

Trying to count events "by eye" from the probability plot tends to produce:

(a) too few events, 

(b) the timing is very vague ("probably between these two nodes"), 

(c) over-interprets the probabilities (logic like this: "The most-probable range at node 10 is ABC, so I'll just assume it was ABC, even though ABC only had 10% of the total probability"). 

Problem (d) is that people want to directly convert ML-inferred parameters like d, e, and j into relative probabilities of events, but this cannot be done for several overlapping reasons
- j is a weight not a rate, 
- DEC and other biogeography models have additional events at cladogenesis in addition to d and e events, and these interact in a complicated way; and 
- everything interacts with the distribution of the ranges at the tips, especially if the models don't perfectly fit the data, and models never perfectly fit the data



WHAT BIOGEOGRAPHIC STOCHASTIC MAPPING IS DOING

All that stochastic mapping does is produce a series of exact histories that are all approximately equally probable "realizations" of the true history.  Each of these histories -- each "stochastic map" -- is a sample from the distribution of possible histories (conditional on the tree, model, model parameters, and data). Each exact history contains the exact events, their exact direction, and the exact time of each event.  Stochastic mapping is basically a form of conditional simulation.

By generating 50 or 100 of these histories (or 1000, as in that SysBio paper, although that's more than are really needed, 100 ought to be plenty, but for smaller datasets 1000 could still be fast), you can then get statistical distributions on (1) the number of events of each type, (2) their timing, (3) their direction, etc. 





Physically counting the events, and looking for specific things (like a specific dispersal between area A and area B between 10 and 20 million years ago) can totally be done, by simple operations on R tables (data.frames).  But this requires some basic R skills with lists and tables, in order to manipulate the BSM output tables. (Try here: http://phylo.wikidot.com/2014-summer-research-experiences-sre-at-nimbios-for-undergra )

Basically, each stochastic map produces a table listing the cladogenetic events (=events at nodes), and their times and types etc., and a table listing the anagenetic events (= on branches).  These tables contain the time of the event, the type of event, etc.  Once you've learned out to count things in tables, etc., it's easy. (Also, you will have to understand the table column headings, but I think that should be fairly self-explanatory -- dangerous assumption I guess.)



FIGURING OUT NODE NUMBERS ON PHYLOGENETIC TREES, USING prt()

To find possible events near a particular node, you will have to find the APE node number for that node.  There are various methods, I like the BioGeoBEARS function prt(), which stands for "print tree to table".

This code is one way to do that:

trfn = "tree.newick"
tr = read.tree(trfn)
tr_table = prt(tr, get_tipnames=TRUE)

The table "tr_table" now has a row for each node or tip, the node number of that node, the length of the branch/edge below it, the list of tip names that descends from that node, etc.  I find this much easier to read than the default R tree format.

You will note that the events tables produced by BSM have a similar format to the prt() output tables.

Cheers, Nick



INTERPRETING HISTOGRAMS

P.S.: Interpreting the histograms: in that figure you send, all that was done was:

1. Look a one stochastic map (one possible exact history).

2. Count how many of each type of event there were in that one history.  E.g., 5 jump dispersal events, 30 sympatry/range copying events, 3 vicariance events, etc.

3. Repeat 1 and 2 50 times

4. Summarize those counts as histograms.  The histograms say these are the average number of events in one history:

anagenetic dispersal (d): 0
narrow sympatry (y): ~ 30
subset sympatry (s): ~10
vicariance (v): ~5
founder events (j): ~6

We could easily get standard deviations and 95% credibility intervals if we wanted.

Note that, in DEC or DEC+J, the weights of y, s, and v are the SAME. So you might naively expect the same number of each type of event. But this does not occur, because (a) weights convert to rates depending on the number of each type of event that is possible, and (b) the data will favor certain events over others. 

Parameters and events under various models:

Interpreting the weight parameters y, s, v, and j: 

Cheers!
Nick

DEC+J_Anonymous_BSM_histograms_of_event_counts.pdf
Reply all
Reply to author
Forward
0 new messages