tMRCA and ESS


Couvreur, Thomas

Apr 30, 2007, 7:41:06 AM
to beast...@googlegroups.com
Dear BEASTers,

I am using BEAST to estimate divergence ages at a few nodes in my
phylogeny. Unfortunately within my family fossils are rare, and I only
have one calibration point deep down. I have placed a normal
distribution on that calibration age.
I then provide tMRCA elements in the XML file for the nodes whose ages
I am interested in estimating, so I can see whether the ESS is
adequate or not.
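For readers unfamiliar with the element: a tMRCA statistic in a BEAST 1.x XML file is typically declared roughly as follows (the taxon-set id and taxon names here are placeholders):

```xml
<!-- Define the clade of interest (taxon names are placeholders) -->
<taxa id="ingroup">
    <taxon idref="taxonA"/>
    <taxon idref="taxonB"/>
</taxa>

<!-- Report the age of the MRCA of that clade -->
<tmrcaStatistic id="tmrca(ingroup)">
    <mrca>
        <taxa idref="ingroup"/>
    </mrca>
    <treeModel idref="treeModel"/>
</tmrcaStatistic>
```

The statistic is then logged by adding `<tmrcaStatistic idref="tmrca(ingroup)"/>` inside the file `<log>` element, so its trace and ESS show up in Tracer.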

After running BEAST for 10 million generations (2 weeks...) I still
can't get acceptable ESSs for those nodes (ESS around 20-30 only). I
also get low ESSs for a few other parameters such as 'coalescent' or
'meanrate'. I have tried all the suggestions on the BEAST website with
no effect. I tried sampling more frequently (from every 1000
generations to every 100) with no effect. I have also tried adjusting
the operators, but without success.
Does anybody know what I should do or try in order to get better ESSs
for the tMRCAs? If I can't get better ESSs then I suppose my estimates
are unreliable and I won't be able to use them.
Should I try running the chain for 20 or 30 million generations?
Thanks for any comments
Thomas

---------------------------------------
Thomas L.P. COUVREUR Ph.D. student
Wageningen University, Biosystematics Group
National Herbarium of the Netherlands - Wageningen branch
Generaal Foulkesweg 37
6703 BL Wageningen
The Netherlands
email: Thomas....@wur.nl
http://www.bis.wur.nl/UK/Staff/Thomas+Couvreur/
Phone: +31 317 48 31 61
Fax: +31 317 48 49 17
Skype pseudo (free calling via internet): isolona21
-----------------------------------------------------------

alexei....@gmail.com

Apr 30, 2007, 8:21:10 AM
to beast-users
Hi Thomas,

(1) If you were running this analysis on a Windows machine then you
will find a 2-3 times speed up with the new version of BEAST (1.4.2)
that will be released in the next couple of days. Also, how many rate
categories did you have on the gamma distribution? The time BEAST
takes is linear with the number of gamma categories - so a model with
12 gamma categories takes 3 times longer than one with 4 gamma
categories. For most applications 4-6 categories does a good job.
Also, for many data sets codon partitioning fits better than gamma
(see Shapiro et al, 2006) and this model is 4 times faster than gamma.

(2) It sounds like you have lots of sequences. This may well mean that
you need to run a very long chain (i.e. >100 million states), or
alternatively a large number of shorter chains (20 million states x
10 computers). I have started to do some research into the way
chain length needs to increase as the number of sequences increases. It
appears to be quadratic. So if you double the number of sequences then
you need to quadruple the length of the chain to get the same ESS. So
you could analyse a subsample of, say, 20 sequences and use this to
estimate how long a chain you need to analyze 40 or 80 or 160
sequences and obtain the same ESS.
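To make that rule of thumb concrete, here is a minimal sketch (the quadratic scaling is an empirical observation from the paragraph above, not an exact law):

```python
# Extrapolate required chain length under the assumption that it grows
# quadratically with the number of sequences.

def scaled_chain_length(base_n, base_length, new_n):
    """Chain length expected to give the same ESS for new_n sequences,
    given that base_length sufficed for base_n sequences."""
    return int(base_length * (new_n / base_n) ** 2)

# If 20 sequences needed a 10-million-state chain:
print(scaled_chain_length(20, 10_000_000, 40))   # 40,000,000
print(scaled_chain_length(20, 10_000_000, 160))  # 640,000,000
```

So doubling the sequences quadruples the chain, and an 8-fold increase in sequences implies a 64-fold longer chain, which is why subsampling first is a cheap way to budget the full analysis.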

(3) Read the manual for more suggestions on improving ESSs including
changing the weights on some operators. However, in general if you
have much more than 100 sequences you should expect the analysis to
involve a lot of computation. Doing multiple runs and combining them
is a good idea.

Cheers
Alexei



archdev

May 2, 2007, 12:36:19 AM
to beast-users
Dear Thomas,

I'm not too experienced in BEAST myself, but have played around a bit.
In addition to Alexei's suggestions, I've got the following
additional ones:

1. Have you tried running BEAST using simpler models (e.g. HKY, strict
clock, constant population)? By starting with the simplest models and
then gradually stepping up, you can have a better idea of how long
it'll take to complete the analysis using the more complicated models
you want. Also, if you fail to see convergence under the simplest
models, then perhaps you need a better dataset.

2. Another way I've found to be quite 'useful' in shortening the
computation time is to consider using 'narrower' priors. For example,
if you are looking at plant evolution, and you are very sure you don't
want to consider "unreasonable" molecular clock rates (say, faster
than HIV evolution), then restricting your priors to exclude those
rates can give a significant improvement in speed and sampling
efficiency. Also, check the trees from your BEAST runs. If they show
up some "unreasonable" topologies (usually due to unreasonable rates
in some branches of the tree), then perhaps it'll be useful to enforce
some monophyletic clusters on the tree. These factors can often
contribute to small ESSs under relaxed clock models, especially the
lognormal clock.

On the other hand, by specifying more stringent priors for your
analysis, you are basically disregarding the possibility of these
"unlikely" scenarios and could potentially miss something unexpected
and important, e.g. greatly accelerated evolutionary rates during some
period in the past. In the end, it's up to you to decide whether it's
worth doing so.

Hope that helps,
archdev

Couvreur, Thomas

May 2, 2007, 4:23:14 AM
to evil_di...@yahoo.com, beast-users
Hello,

Thanks very much for all the suggestions, I will try them out and see
what comes of it.

In fact, most of my ESSs are fine; it is just the ones for the tMRCA
elements of the nodes I want an HPD for. The parameters of the models
(I have a partitioned dataset) are all fine too. What is worrying is
that even with longer chains there seems to be no improvement...

My dataset is of 64 taxa and 7720 characters. What exactly does one mean
by a "better dataset"? Is that in terms of more sequence data? Better
calibration points, more calibration points? My tree is almost fully
resolved with 95% of nodes having a 1.00 PP. I am not sure I will be
able to get a better tree than that.

Using simpler models: That seems to be a good approach. I will try it
out. But how useful is it to get proper ESSs with a simple model, when I
should be using a more appropriate model (which in my case is more
complicated)? Does anybody know if the date estimates are influenced by
the model we use? I mean if I get good ESSs with the simple model, would
my age estimates be valid?

I shall also wait for the new version to come out, and will start fresh
with that one!

Thanks for all the advice,

Thomas

Andrew Rambaut

May 2, 2007, 4:31:35 AM
to thomas....@wur.nl, beast-users
Dear Thomas,

Take a look at the traces for these tMRCAs. Are they bi-modal, or do
they have a general trend over the course of the run? The ESS is the
chain length divided by the autocorrelation time (ACT), which is the
number of steps apart two samples have to be in order to be
uncorrelated. Thus if there is bi-modality or a trend, the ACT can be
very large (and the ESS small).
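As a numerical sketch of that definition: the toy estimator below (an illustration only, not Tracer's exact algorithm) sums the sample autocorrelations until they first go non-positive, takes ACT = 1 + 2*sum(rho), and returns N/ACT.

```python
def ess(samples):
    """Crude effective sample size: N / ACT, where ACT is estimated by
    summing lag autocorrelations until the first non-positive value."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    act = 1.0
    for lag in range(1, n // 3):
        rho = sum((samples[i] - mean) * (samples[i + lag] - mean)
                  for i in range(n - lag)) / ((n - lag) * var)
        if rho <= 0:
            break
        act += 2 * rho
    return n / act

# A strongly autocorrelated chain has a large ACT, hence a small ESS:
import random
random.seed(1)
chain, x = [], 0.0
for _ in range(2000):
    x = 0.95 * x + random.gauss(0, 1)   # AR(1) with high correlation
    chain.append(x)
print(ess(chain))   # far below the 2000 samples drawn
```

A well-mixed trace gives an ESS close to the number of samples logged; a sticky, trending, or bi-modal trace gives a large ACT and a small ESS, which is exactly what low tMRCA ESSs are reporting.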

Andrew


___________________________________________________________________
Andrew Rambaut
Institute of Evolutionary Biology University of Edinburgh
Ashworth Laboratories Edinburgh EH9 3JT
EMAIL - a.ra...@ed.ac.uk TEL - +44 131 6508624

Simon Ho

May 2, 2007, 4:51:20 AM
to beast-users
Hi Thomas,

You mention that 95% of your nodes have PPs of 1.0, but check the PP
of the node being used for calibration. If it is not 1.0 (or close to
1.0), then your calibration prior might actually be applying to
different nodes during the MCMC. This will tend to lead to a bimodal
trace for the node, or worse. This reasoning also applies to nodes
that are being dated.

If this is the situation, then running the MCMC for longer will not
substantially improve the results.

One way to check whether this is causing a problem is to fix the tree
topology (start the MCMC with your majority-rule consensus tree and
remove all tree topology operators).
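In BEAST 1.x XML this amounts to supplying a starting tree and deleting the topology-changing operators; roughly (element names follow typical BEAUti output; check them against your own file):

```xml
<operators id="operators">
    <!-- Keep operators that move node heights only, e.g.: -->
    <scaleOperator scaleFactor="0.75" weight="3">
        <parameter idref="treeModel.rootHeight"/>
    </scaleOperator>
    <uniformOperator weight="30">
        <parameter idref="treeModel.internalNodeHeights"/>
    </uniformOperator>

    <!-- Delete the topology-changing operators:
         subtreeSlide, narrowExchange, wideExchange, wilsonBalding -->
</operators>
```

With those operators removed, the MCMC samples node heights and other parameters on a fixed topology, so a bi-modal tMRCA trace caused by topology switching should disappear.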

Simon


archdev

May 2, 2007, 5:38:08 AM
to beast-users
Dear Thomas,

The unfortunate thing is that you need good ESS for the tMRCA in order
for the estimate to be reliable. My bet is that the ESS values of
coalescent, rootHeight, ucld.mean (or uced.mean) and meanRate for your
runs are all quite low too. Am I right?

There is no straightforward definition of a "good" dataset. It all
depends on the question you are asking. From my limited experience, if
you have a few clusters of modern sequences, and these clusters are
separated by considerable distances (such that one expects their
divergence to be quite ancient), then you will have trouble estimating
their divergences under the more complex models. Put another way,
if the divergence occurred hundreds of thousands to millions of years
ago, and you have only a number of sequences that are decades apart,
then you are trying to estimate divergence using relatively
isochronous sequences. In this case, you have to specify some strong
priors as advised in the documentation. Better and more calibration
points in ancient parts of the tree act as good, strong priors, and
should help you a lot in estimating the tMRCA.

I recommend using simpler models to get an idea of how long it will
take for the more complex models to run. Of course, using a parameter-
rich model is desirable in the majority of cases, but when the
sampling efficiency is not satisfactory, one is simply forced to
resort to a simpler model. Yes, the age estimates will be influenced
by the model. The age estimate will be as "valid" as the model used to
obtain the estimate.

The problem you are facing (lack of convergence? poor sampling
efficiency?) is a known limitation of the MCMC method. Don't give up.
If a simpler model is all one can achieve, then so be it.

BTW, I think the new version is already out, though when I last
checked, the native likelihood core library for Windows was placed in
the lib subdirectory. You have to copy the dll into the same directory
as beast.exe or the current directory (at least if you are starting
from the GUI version).

Regards,
archdev

Couvreur, Thomas

May 2, 2007, 10:45:53 AM
to evil_di...@yahoo.com, beast-users
Dear all,
Once again thanks for all the comments!

Yes, some (but not all) of the traces of the tMRCAs are bi-modal (one
is even tri-modal), as are those of the other parameters that do not
reach a suitable ESS (see below). They indeed correspond to a high
ACT. However, in answer to Simon, the node of my calibration point has
a PP of 1.00. In fact only one of the nodes I am trying to date has a
weak PP (0.78). I have already tried fixing the topology to the
majority-rule consensus tree, but that didn't change the results much
(it didn't improve the ESSs).
Finally, in response to archdev, my ESSs for coalescent, ucld.mean and
meanRate are also low (but not rootHeight, which has a very large ESS).

For one of the tMRCA nodes (quite deep down) I have a very skewed
posterior, with the region to the right sampled much more heavily than
the rest of the range. What does that indicate? An inappropriate prior
given the data?
And any suggestions on how I can get rid of the bimodal situation?

I have now downloaded the new version and will start from scratch. I
will try very simple models to begin with, as suggested by archdev. If
that works ok, I will then try a very long run (50 million
generations) on two different computers with the more complex model.

I will fiddle a bit to see what I can do and try not to give up!
Thanks, and sorry for overloading the email boxes...
