*beast summarizing tree

Jfuchs

Mar 25, 2010, 2:45:03 PM

to beast-users

Dear all,

I am using *BEAST to reconstruct the phylogeny of a group with 54
species (1 individual sampled per species) and 10 loci.
I am experimenting with the data and levels of complexity, basically
starting the analyses with two unlinked loci, then adding another one,
and so on until all 10 are in the matrix.
I followed the tutorial and built all the XML files using BEAUti 1.5
(unlinked topologies and model parameters, checked against the example
files with the pocket gophers). In the first two-locus analyses, all
ESS values are above 200 for all parameters.

Yet I am now unsure how to visualize the species tree in practice. Since I
unlinked the topologies of the two loci (and did two independent
runs on the data set), does the ‘final tree’ (that is, the species
tree) get reconstructed by combining the four tree files in
LogCombiner and then summarizing them in TreeAnnotator? I could not
find this on the wiki or in the online help.

Thanks a lot
Best
Jerome

pepster

Mar 25, 2010, 3:25:49 PM

to beast-users

*BEAST logs the species tree in XXX.species.trees (XXX being the
analysis base name). You can use any tool that takes a NEXUS trees
file as input, such as TreeAnnotator, treelog, or DensiTree.
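
As a rough illustration of what such a file contains, here is a minimal Python sketch that pulls the tree statements out of NEXUS text and discards an initial burn-in fraction. It assumes one `tree ...` statement per line, which is how such logs are usually laid out; the function name, the 10% default, and the toy trees are hypothetical, and this is not a substitute for LogCombiner or TreeAnnotator.

```python
import re

# A tree statement in a NEXUS trees block starts with the keyword "tree".
# Assumption: each statement sits on its own line.
TREE_LINE = re.compile(r"^\s*tree\s+\S+", re.IGNORECASE)


def trees_after_burnin(nexus_text, burnin_fraction=0.1):
    """Return the tree statements left after dropping the first fraction."""
    trees = [ln for ln in nexus_text.splitlines() if TREE_LINE.match(ln)]
    n_drop = int(len(trees) * burnin_fraction)
    return trees[n_drop:]


# Toy stand-in for the contents of an XXX.species.trees file.
example = """#NEXUS
Begin trees;
tree STATE_0 = ((A:1,B:1):1,C:2);
tree STATE_1000 = ((A:1,C:1):1,B:2);
tree STATE_2000 = ((A:1,B:1):1,C:2);
tree STATE_3000 = ((A:1,B:1):1,C:2);
End;
"""

kept = trees_after_burnin(example, burnin_fraction=0.25)
print(len(kept), "trees kept after burn-in")
```

Counting the trees this way is mainly a sanity check that the species tree log holds as many samples as expected before it is summarized.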

Note: Remember that a single member per species is not recommended for
a species tree analysis.

-Joseph

Peter Unmack

Mar 31, 2010, 7:14:26 PM

to beast...@googlegroups.com

G'day there

> Note: Remember that a single member per species is not recommended for
> a species tree analysis.

I have actually been told the opposite. My understanding is that it
depends on what type of analysis you are doing. If you are doing a within
species analysis then you want all of the individuals that you have data
for included. If you are doing dating between species then you want to
have complete sampling of all of the species included (if at all
possible). However, if you start including individuals within species for
a between species analysis then things get tricky as you should set up
different tree priors for calculating within species divergences to the
between species divergences which starts getting complicated if you have a
lot of species in your analysis. Thus it is far easier to simply include
one representative per species. Obviously if your within species
variation is very low then this won't affect dates for most nodes.
However, if variation within is quite high then this may be more of an
issue.

When I used the same tree prior with multiple representatives within
species versus single representatives I got very similar mean dates, but
my confidence intervals were larger with multiple representatives.

I don't really claim to know anything about what I am talking about, so if
this seems incorrect please correct me.

Cheers
Peter

pepster

Apr 1, 2010, 5:32:49 PM

to beast-users

On Apr 1, 12:14 pm, "Peter Unmack" <peter.goo...@unmack.net> wrote:
> G'day there
>
> > Note: Remember that a single member per species is not recommended for
> > a species tree analysis.
>
> I have actually been told the opposite.

Can you please cite the article you are referring to?

> My understanding is that it
> depends on what type of analysis you are doing. If you are doing a within
> species analysis then you want all of the individuals that you have data
> for included. If you are doing dating between species then you want to
> have complete sampling of all of the species included (if at all
> possible). However, if you start including individuals within species for
> a between species analysis then things get tricky as you should set up
> different tree priors for calculating within species divergences to the
> between species divergences which starts getting complicated if you have a
> lot of species in your analysis. Thus it is far easier to simply include
> one representative per species. Obviously if your within species
> variation is very low then this won't affect dates for most nodes.
> However, if variation within is quite high then this may be more of an
> issue.

The main purpose of *BEAST is to perform a "species tree" analysis
given multiple individuals from multiple species.
Using a single representative per species is not forbidden, but may be
problematic. Still, given multiple loci it might be better than the
alternatives, or it might not; I have not done a comprehensive
simulation study.

-Joseph

Chris

Apr 1, 2010, 9:03:00 PM

to beast-users

Hi Peter, Jerome, et al...

For what it is worth, I can offer some anecdotal thoughts, since I've
run a number of exploratory *BEAST analyses using empirical data,
looking at different combinations of individuals and loci.

With only a single individual per taxon, estimates of piecewise
effective population sizes seem to have somewhat larger variances, and
there also may be a slight upward bias in Ne. However, given an
adequate number of MCMC iterations, I've obtained results that are
very similar to data with multiple individuals. Joseph should correct
me if I'm wrong, but in the absence of multiple individuals per taxon,
the estimates of Ne are based largely on inferred patterns of lineage
sorting among loci. So if you have enough loci, excluding the
potential information provided by multiple individuals within each
locus may (or may not) be a reasonable strategy for simplifying other
aspects of your analysis.

Regarding sampling of species in the focal clade of interest (e.g.,
genus, section, etc.), I think one concern is the interaction with
your tree prior. Under a birth-death or Yule model of speciation,
incomplete taxon sampling will bias divergence times toward the
present. There is a way to correct for this using a <sampleRate>
element as a subset of the <birthDeathModel> element, if you can
estimate the proportion of extant taxa included in the data (under the
assumption of random sampling).

In general, more loci are clearly better, provided there is sufficient
coverage of taxa across multiple loci. When species are missing
sequence data at one or more loci, they can be coded as dummy entries
(e.g., a single sequence coded as missing data) at that locus.
However, these dummy sequences tend to float around in their
respective gene trees, which can lead to "rogue taxa" in the species
tree... depending on how much information is available for those taxa
at other loci. This problem can be especially acute when a species
only has sequence data for a single locus, regardless of how many
individuals are included. So there is a potential tradeoff between
more loci vs. proportion of missing data.

Considering the number of parameters and complexity of the *BEAST
method, all of these factors can interact in unpredictable ways, with
potential implications for bias, variance, computational time, etc.
As Joseph pointed out, these questions are ripe for some simulation
work... but I suspect there will be no general answer to what the
optimal sampling design should be for any given study. At a minimum, I
would strive to sample taxa as completely as possible across multiple
loci/individuals, and perform extensive sensitivity analyses to
determine how your estimates respond to different data structures and
model assumptions.

Best,
Chris

pepster

Apr 2, 2010, 3:14:57 PM

to beast-users

On Apr 2, 2:03 pm, Chris <cdrum...@uidaho.edu> wrote:
> Hi Peter, Jerome, et al...
>
> For what it is worth, I can offer some anecdotal thoughts, since I've
> run a number of exploratory *BEAST analyses using empirical data,
> looking at different combinations of individuals and loci.
>
> With only a single individual per taxon, estimates of piecewise
> effective population sizes seem to have somewhat larger variances, and
> there also may be a slight upward bias in Ne. However, given an
> adequate number of MCMC iterations, I've obtained results that are
> very similar to data with multiple individuals. Joseph should correct
> me if I'm wrong, but in the absence of multiple individuals per taxon,
> the estimates of Ne are based largely on inferred patterns of lineage
> sorting among loci. So if you have enough loci, excluding the
> potential information provided by multiple individuals within each
> locus may (or may not) be a reasonable strategy for simplifying other
> aspects of your analysis.

Yes, but the main point is that with a single individual per species,
Ne for extant taxa *can't* be inferred, even with an infinite number
of loci. Ne for extant taxa then simply follows the prior, which can
bias divergence times or even the topology.
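
The reason can be sketched with the standard coalescent density for the lineages of one gene tree inside one species-tree branch. The Python below is an illustration under simplifying assumptions, not the *BEAST implementation: the population size is taken as constant along the branch, and `theta` is a hypothetical population-size parameter scaled so that each pair of lineages coalesces at rate 1/theta. With a single lineage in a terminal branch there are no pairs, so the density is the same for every value of theta.

```python
import math


def branch_log_density(n_lineages, coalescent_times, branch_length, theta):
    """Log coalescent density of one gene-tree history within one branch.

    n_lineages:        lineages entering the branch at its recent end
    coalescent_times:  sorted coalescence times, measured from the recent end
    branch_length:     duration of the species-tree branch
    theta:             assumed constant; pairwise coalescent rate is 1/theta
    """
    log_p = 0.0
    k = n_lineages
    previous = 0.0
    for t in coalescent_times:
        pairs = k * (k - 1) / 2
        log_p -= pairs * (t - previous) / theta  # waiting time with k lineages
        log_p -= math.log(theta)                 # one specific pair coalesces
        k -= 1
        previous = t
    pairs = k * (k - 1) / 2
    log_p -= pairs * (branch_length - previous) / theta  # no further events
    return log_p


# One individual per species: the terminal branch carries no signal on theta.
single_small = branch_log_density(1, [], 0.5, theta=0.01)
single_large = branch_log_density(1, [], 0.5, theta=100.0)

# Two individuals that coalesce inside the branch: the density depends on theta.
pair_small = branch_log_density(2, [0.1], 0.5, theta=0.01)
pair_large = branch_log_density(2, [0.1], 0.5, theta=100.0)

print(single_small == single_large, pair_small == pair_large)
```

Adding loci multiplies together more terms of the same form, so for a single lineage the product stays flat in theta however many loci there are.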

>
> Regarding sampling of species in the focal clade of interest (e.g.,
> genus, section, etc.), I think one concern is the interaction with
> your tree prior. Under a birth-death or Yule model of speciation,
> incomplete taxon sampling will bias divergence times toward the
> present. There is a way to correct for this using a <sampleRate>
> element as a subset of the <birthDeathModel> element, if you can
> estimate the proportion of extant taxa included in the data (under the
> assumption of random sampling).

Personally, this would be the least of my worries. Modeling speciation
as a birth-death process is a convenient choice (perhaps I am wrong
and there are studies showing why that should be the case). In any
case, inferring a death rate or a sample rate from the data is almost
impossible when you have to infer the tree as well.
I am not saying the prior is not important: it is a Bayesian method,
and the prior can dominate when the amount of information in the data
decreases. But I think this is another candidate for a nice simulation
study. I have the tools but not the time, and am willing to
collaborate :)

-Joseph

Chris

Apr 2, 2010, 4:08:10 PM

to beast-users

Thanks Joseph for the clarification... I wasn't thinking about the
difference between estimating Ne for terminal vs. internal branches.
But aren't the estimates for terminal Ne also influenced by the
inferred mean value (species.popMean) across all branches? So even
though the prior distribution might exert disproportionate influence
in both directions, the data from internal branches could still inform
the estimates for terminals?

Along this line, I've been wondering whether it would be useful to
implement a model where Ne is assumed to be constant throughout the
topology, rather than piecewise along branches. Although that might
not be terribly realistic, it would cut the number of parameters
dramatically, and might be useful for end-users when there isn't much
data to estimate Ne for the terminals.

I'll defer to you and Tanja on the merits of Yule vs. birth-death
models, and whether simultaneous estimation of extinction rates and
tree inference is a reasonable approach. There is a large body of
literature showing that incomplete sampling of extant taxa results in
a bias toward more recent divergence times ("pull of the present"), at
least for a posteriori analyses of diversification rates. I would
expect the same in this case, but that might not be true, or the bias
could be negligible in light of other issues.

Agreed that all of this is fertile ground for simulation work...

Best,
Chris

Suzanne Williams

Apr 7, 2010, 4:34:06 AM

to beast-users

There has been some discussion about how many individuals per species to
include in a species-level phylogeny; however, as I understand it, two
different approaches are being considered. The first uses BEAST with
either a Yule or birth-death prior, both of which I thought work best
with only a single specimen from each species.

The second uses the BEAST programme with a method named *BEAST, which
incorporates priors to cope with population-level AND species-level
variation, so multiple specimens of each species can be included. This
seems to me to be the best approach, but I am not clear about how to
implement it, and there was no supplementary information (no XML file)
with the most recent paper published this year to check. Does anyone
know more about how to use this method? And can anyone confirm whether I
have understood correctly?

Thanks,

Suzanne

Chris

Apr 7, 2010, 10:53:37 AM

to beast-users

Hi Suzanne,

You're right, there are a lot of different questions floating around in
this thread... I'll take a stab at clarifying these two methods for
estimating a species tree; other folks should jump in if I get it
wrong or make things more confusing...

In a concatenated BEAST analysis, the Yule or birth-death prior
applies to a single tree linked across loci. However, this approach
does not address the possibility of loci having different (possibly
conflicting) genealogical histories due to incomplete lineage sorting.
If there is considerable intraspecific variation, such that there are
many deep coalescent events among species, then selecting a single
individual (even randomly) could bias your results. Moreover, even if
species are monophyletic at a given locus (or across all loci), there
could be deep coalescent events between clades along other internal
branches. Unfortunately, under certain patterns of gene tree
discordance (the infamous "anomaly zone"), concatenation can be
positively misleading, rather than having the desired outcome of
averaging the phylogenetic signal across loci.

A lesser issue is that the Yule or birth-death prior doesn't quite
correspond to the data when multiple individuals per species are
included in a concatenated analysis, since both of these models assume
that tips represent extant species completely sampled from the clade
of interest. The birth-death prior is slightly more complex, in that
it allows for estimates of background extinction, which can influence
the inference of branch lengths. Likewise, both priors can be tweaked
to account for the effects of incomplete random sampling, e.g., with
an a priori estimate of your sample rate. However, as Joseph noted
earlier, these types of mismatch between the tree prior and the data
might not be a big deal in practice... If the data are sufficiently
informative, the influence of a Yule or birth-death prior should be
small, and will decrease with increasing amounts of data. But you
still don't escape the more general problem of lineage sorting among
independently segregating loci.

In contrast, the Yule or birth-death prior in *BEAST applies to the
species tree (which by definition only has one tip per species).
Theoretically, this model is attractive, since it accommodates sequence
variation within species, as well as differences in genealogical
histories among loci. However, it includes many more free parameters
than concatenation, and may suffer from high variance, slow mixing and
convergence, etc... For larger species tree analyses, the amount of
information (loci/individuals) and computational time required to
estimate all of these parameters accurately can become prohibitive.
Compared to concatenation, I've also seen that *BEAST is far more
sensitive to missing data (e.g., only one individual per species, or
not all species sampled for all loci). So depending on the size and
structure of the data, these problems may be so extreme as to preclude
the useful implementation of *BEAST.

While the tree prior in *BEAST does account for multiple individuals,
the question of a Yule vs. birth-death model (with or without complete
sampling of extant taxa) remains... Again, Joseph has argued (and I
agree) that unless the data are so weak that the prior dominates the
results, these issues are not as critical.

There are several *BEAST XML files in the "examples" folder of BEAST.
You can generate your own XML in BEAUti, but you will need a
tab-delimited text file that links each sequence to a species. This
file can be imported using the "import traits" option in the "traits"
window of BEAUti, for example:

species
species speciesA sequence1 sequence2 sequence3
species speciesB sequence4 sequence5
species speciesC sequence6 sequence7 sequence8
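
For anyone assembling such a file by script, here is a minimal Python sketch. It assumes a two-column, tab-delimited layout: a `traits<TAB>species` header, then one sequence name and its species per line. That layout is an assumption, not something confirmed in this thread, and it differs from the listing above, so compare the output with the mapping files that accompany the examples before importing it. The function names and the species and sequence names are hypothetical.

```python
def mapping_rows(species_to_sequences):
    """Flatten {species: [sequence names]} into (sequence, species) pairs."""
    return [
        (sequence, species)
        for species, sequences in species_to_sequences.items()
        for sequence in sequences
    ]


def format_mapping(species_to_sequences, header=("traits", "species")):
    """Build the mapping text; the header and column order are assumptions."""
    lines = ["\t".join(header)]
    lines += [f"{seq}\t{sp}" for seq, sp in mapping_rows(species_to_sequences)]
    return "\n".join(lines) + "\n"


# Hypothetical assignment of sequences to species.
assignment = {
    "speciesA": ["sequence1", "sequence2", "sequence3"],
    "speciesB": ["sequence4", "sequence5"],
    "speciesC": ["sequence6", "sequence7", "sequence8"],
}

mapping_text = format_mapping(assignment)
print(mapping_text, end="")
```

Keeping the assignment in a dictionary keyed by species makes it easy to spot a species that ended up with a single sequence.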

Ultimately, I think the best approach is to try a range of methods and
sampling designs. If the results are similar, then the data are robust
to the choice of model, and life is good. More often though, I
suspect there will be differences, which can make the biological
interpretation difficult... and in the absence of simulations,
untangling the reasons for these differences will be challenging.
Another good strategy is to run analyses that only sample from the
prior distribution, so that you can evaluate the relative influence of
the prior vs. data on your posterior distribution.
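
One way to make that comparison concrete is to summarize the same parameter from the two runs side by side. The Python below is a minimal sketch: it assumes the logs are tab-delimited with a header row and optional `#` comment lines, uses the `species.popMean` column mentioned earlier in this thread purely as an example, and reads from inline toy text rather than real output. The function names, burn-in fraction, and numbers are hypothetical.

```python
import csv
import io
import statistics


def read_log_column(handle, column, burnin_fraction=0.1):
    """Read one numeric column from a tab-delimited log, skipping '#' lines."""
    rows = (ln for ln in handle if ln.strip() and not ln.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    values = [float(row[column]) for row in reader]
    return values[int(len(values) * burnin_fraction):]


def summarize(values):
    """Return (mean, sample standard deviation) of the retained samples."""
    return statistics.mean(values), statistics.stdev(values)


# Toy stand-ins for a prior-only run and a run that includes the data.
prior_log = io.StringIO(
    "# toy log\nstate\tspecies.popMean\n0\t1.0\n1000\t3.0\n2000\t5.0\n3000\t7.0\n"
)
data_log = io.StringIO(
    "state\tspecies.popMean\n0\t2.0\n1000\t2.1\n2000\t1.9\n3000\t2.0\n"
)

prior_mean, prior_sd = summarize(read_log_column(prior_log, "species.popMean", 0.25))
data_mean, data_sd = summarize(read_log_column(data_log, "species.popMean", 0.25))
print(f"prior-only: mean={prior_mean:.3f} sd={prior_sd:.3f}")
print(f"with data:  mean={data_mean:.3f} sd={data_sd:.3f}")
```

If the two summaries are nearly identical for a parameter, the data are probably contributing little to that estimate; a much narrower spread in the run with data suggests the opposite.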

Ok, apologies for rambling on... hope this helps.

Best,
Chris

Chris

Apr 7, 2010, 11:27:41 AM

to beast-users

Just wanted to add another quick comment... Although *BEAST is
expected to perform much better with multiple individuals per species,
I think the implications of using this method with only a single
individual per species (or perhaps a consensus sequence w/
intraspecific polymorphism excluded) remain an open question.

-Chris

Suzanne Williams

Apr 7, 2010, 1:06:41 PM

to beast-users

Dear Chris,

Thank you so much! Your answer was very helpful.

Your discussion about concatenated analyses brought up a question I have
asked before - should tree topologies be linked between multiple gene
partitions? Even though my aim is now to use *BEAST, I would still be
interested to hear your opinion. It seems to me to make sense to link trees
across genes as one single topology is required. This way, each partition is
effectively weighted against each other because different rates of evolution
among gene partitions can be taken into account. Otherwise, if trees are
unlinked, and some sort of consensus is obtained, then surely this is giving
each gene partition equal weight.

For what it's worth, in the past I found no significant difference (using
Bayes Factors) between trees calculated using Yule or Birth-Death priors
(multiple genes, trees linked). In these cases I was missing LOTS of taxa
because I was looking at family level phylogenies, with only representative
taxa from some genera.

Thanks again for your help.

Best wishes,

Suzanne

--------------------------------------------------------
Dr Suzanne Williams
Zoology Dept
Natural History Museum
Cromwell Rd
London SW7 5BD
United Kingdom
Tel: + 44 (0) 207 942 5351 (office) 5774 (lab)
Fax: +44 (0) 207 942 5867

http://www.openairlabs.org/research-curation/staff-directory/zoology/s-williams/index.html

Chris

Apr 7, 2010, 1:16:16 PM

to beast-users

Hi Suzanne,

Again, I'll defer to Joseph and the other developers, but here are my
thoughts if I understand your question correctly.

When topologies are unlinked in an ordinary BEAST analysis, you'll
just estimate separate gene trees, basically the same as independent
MCMC runs for each locus. If you want to estimate a posterior
distribution of species trees by combining information from multiple
loci, then your choices - at least in BEAST - are concatenation (i.e.,
linked topologies) or *BEAST (i.e., unlinked topologies, except when
there is physical chromosomal linkage w/out recombination, e.g.,
mitochondrial or chloroplast genes). These two models are
fundamentally different in the way they use information from multiple
loci.

Both approaches will yield a posterior distribution of trees... but
the big question is which approach gives a better estimate of the
species tree??? Simulations using *BEAST and other methods (BEST,
STEM, etc.) suggest that concatenation performs much worse, at least
in some situations. But often simulations are based on idealized
versions of data that can be poor approximations to real-world data
(few loci, uneven sampling of sequences/species, unknown models,
missing data, evolutionary time scales, etc.). So while I think the
limited simulation results available thus far can be a useful guide,
it is just as important to examine the sensitivity of your empirical
data to different model assumptions, tempered by a biologist's
knowledge of the system you are working with.

For both BEAST and *BEAST, there is a wide array of options for
linking/unlinking various parameters and setting priors, thereby
accounting for different rates of evolution among partitions. There is
ample evidence in the literature that rate variation is important, and
this is implicitly included in the *BEAST model when you unlink the
clock models for each locus. Those decisions can be based on
convention (e.g., different rates/clocks/substitution models for
each gene, 1st/2nd vs. 3rd position codon, etc.) and/or some type of
model fitting approach.

Bayes factors derived from the harmonic mean estimator of the marginal
likelihood have been trendy for the past several years, but the
estimator has been widely criticized in the statistical literature for
its instability and potentially infinite variance. It might be better to
consider whether the overall results (topologies, branch lengths,
etc.) are different, rather than relying on a single metric. In any
case, it is nearly impossible to explore the full range of different
model parameterizations, unless your data set is exceedingly small.
Nonetheless, reviewers may expect to see this type of approach. I
think Marc Suchard is working on a thermodynamic integration approach
to estimating Bayes factors, which should address some of these
concerns. There are also reversible-jump MCMC algorithms for model
averaging, but I'm not sure if there has been any progress toward
incorporating this in BEAST.
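
The instability of the harmonic mean approach is easy to see numerically. The Python below is a minimal sketch, not what BEAST or Tracer does internally: it computes the log of the harmonic mean of sampled likelihoods in log space and shows how one low-likelihood sample can shift the estimate. The function name and the toy log-likelihood values are hypothetical.

```python
import math


def log_harmonic_mean(log_likelihoods):
    """Log of the harmonic mean of likelihoods, computed stably in log space.

    harmonic mean = n / sum(1 / L_i), so its log is
    log(n) - logsumexp(-log L_i).
    """
    n = len(log_likelihoods)
    negated = [-x for x in log_likelihoods]
    biggest = max(negated)
    log_sum = biggest + math.log(sum(math.exp(v - biggest) for v in negated))
    return math.log(n) - log_sum


# 999 samples with identical log likelihood.
stable = [-1000.0] * 999
# The same samples plus one visit to a much worse region of parameter space.
with_outlier = stable + [-1050.0]

print(log_harmonic_mean(stable))
print(log_harmonic_mean(with_outlier))
```

A single sample out of a thousand moves the estimate by tens of log units here, far more than the differences usually interpreted as strong support for one model over another, so two runs of the same model can disagree depending on whether the chain happened to visit such a state.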

The question of species tree analyses is very interesting and far from
resolved...

Best,
Chris
