dealing with large data sets in BEAST


Tiina Sarkinen

Feb 13, 2008, 9:33:01 AM2/13/08
to beast...@googlegroups.com
I'm writing to ask about people's experiences and current views on analysing large sequence
data sets in the BEAST package. As sequence data accumulate, more and more of
us are dealing with large (200-500 or more) sequence data sets. When using
BEAST, our current approach is to generate a user-defined starting topology,
which we insert into the XML file before running BEAST. There are several ways
to derive a starting topology, and we wonder if you have any thoughts on how
best to do this.

As an example, we work on legumes, for which a large matK sequence data set for
the whole family (460 taxa currently, with more being added) can be used with 12
fossil calibrations to derive age estimates (originally published by Matt Lavin
et al. in 2003 in Systematic Biology). We can't run BEAST without inserting a
user-defined starting tree into the XML file. People generate this starting
topology in several different ways, and the question is which way, if any, is
best.

When working with around 260 taxa, we can run our data set with only one fossil
constraint at the base of the tree to get a starting topology, which is then
used for another analysis with all 12 fossil constraints.

Alternatively, when working with more taxa (>320, in our experience), we can't
even run BEAST with a single fossil constraint to get a starting topology.
So we normally run MrBayes to get a starting topology with branch lengths,
make it ultrametric in the r8s program, and then insert that ultrametric
topology into the XML file for BEAST (a method also used by our collaborators
in Holland). People may have other approaches too, and we wanted to gather
ideas on how best to get a starting tree.

Another question is how to deal with data partitioning in large matrices. Our
collaborators in Holland do this by directly editing the XML file, following
Couvreur (http://tlpcouvreur.googlepages.com/beastpartitioning).
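In outline, the hand-editing approach splits one alignment into several `<patterns>` blocks by site coordinates, each of which then gets its own site model and likelihood. A hedged sketch in BEAST 1.x element names (the ids and the from/to positions below are illustrative, not from our data set):

```xml
<!-- Sketch (BEAST 1.x): two partitions carved out of one alignment
     by site position. Ids and coordinates are illustrative only. -->
<patterns id="matK.patterns" from="1" to="1200">
    <alignment idref="alignment"/>
</patterns>
<patterns id="trnL.patterns" from="1201" to="2000">
    <alignment idref="alignment"/>
</patterns>
<!-- Each partition then gets its own substitution model, <siteModel>,
     and <treeLikelihood> pairing its patterns with the shared treeModel,
     so the partitions differ in substitution model but share one tree. -->
```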

If you have any ideas, opinions, or methodological comments on how best to deal
with large matrices and data partitions in BEAST, it would be really good to
hear what you think. Any response appreciated!

--
Tiina Sarkinen
D.Phil. student
Department of Plant Sciences, University of Oxford
South Parks Road, Oxford, OX1 3RB, UK

Couvreur, Thomas

Feb 18, 2008, 10:17:48 AM2/18/08
to tiina.s...@plants.ox.ac.uk, beast...@googlegroups.com
Hi Tiina,

The method with which you generate the starting tree (and corresponding
branch lengths) is not very important as long as it is a reasonable
hypothesis of the relationships between the specimens in your dataset,
allowing BEAST to compute a good enough initial tree likelihood
(assuming, of course, that you are also looking to infer the tree topology
with BEAST). The starting tree gives BEAST a little nudge in the right
direction instead of having it crash at the start because it began too
far from the optimum. Apart from that, the starting tree doesn't
influence the outcome of the analysis. At least that is what I have seen
with my datasets...

What I do is run a maximum parsimony analysis and then make the tree
ultrametric with r8s. MP is good for large datasets because you get
your basic tree quickly, using the parsimony ratchet for example. With
MrBayes you would actually have to do the Bayesian analysis twice, and
that can take forever with large datasets... Up to now this approach has
always worked.

I am hoping that my website will be obsolete in the near future :) I
haven't come across any other way to partition a dataset in BEAST
(correct me if I'm wrong).

Good luck!
Thomas
