Basic question on effect of missing data on divergence time estimation?

Alexander Gamisch

Mar 21, 2011, 9:30:35 AM
to beast-users
Dear Beast users,

I am new to BEAST and have a very basic question, which I hope can be
answered by some of the more advanced users. I looked through the manual
and searched the group but could not find any useful information yet.

I am having trouble understanding how to deal with missing data in my
alignment and its effect on divergence time estimation in BEAST.

I used an alignment of 60 sequences (one sequence per species) of a
chloroplast marker, assembled from GenBank. Because they come from
GenBank, the sequences all have different lengths (ranging from 1000
to 1700 bp).
I use published fossil ages for calibration, but my problem is how to
deal with the gaps of missing data.

The manual says that gaps are treated as missing data, but when I use
the mixed-length alignment I get younger estimates with a narrower
95% HPD interval for my target group (for example, ranging from 6 to
16 Myr).

In contrast, when I trim all sequences to equal length (or fill them
with "?" to equal length), or substitute short GenBank sequences with
my own (longer) sequences of the same species, the age estimates for
this group get older (for example, ranging from 8 to 22 Myr).
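
The two treatments described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy sequences, the function names, and the choice of "?", "-" and "N" as missing-data symbols are hypothetical, and it is not how BEAST or BEAUti process an alignment internally.

```python
# Characters treated as "no data" in this sketch (an assumption).
MISSING = set("?-N")

def mark_terminal_gaps(seq, fill="?"):
    """Replace leading and trailing '-' with `fill`, leaving internal gaps alone."""
    core = seq.strip("-")
    if not core:  # sequence consists of gaps only
        return fill * len(seq)
    lead = len(seq) - len(seq.lstrip("-"))
    trail = len(seq) - len(seq.rstrip("-"))
    return fill * lead + core + fill * trail

def trim_to_complete_columns(seqs):
    """Keep only columns in which every sequence has a real base.

    Note that this also drops columns with internal indels, not just
    the ragged ends.
    """
    rows = {name: s.upper() for name, s in seqs.items()}
    width = len(next(iter(rows.values())))
    keep = [i for i in range(width)
            if all(r[i] not in MISSING for r in rows.values())]
    return {name: "".join(r[i] for i in keep) for name, r in rows.items()}

# Hypothetical toy alignment: sp2 is short at the 5' end, sp3 at the 3' end.
toy = {"sp1": "ACGTACGT", "sp2": "--GTACG-", "sp3": "ACGTAC--"}

filled = {name: mark_terminal_gaps(s) for name, s in toy.items()}
trimmed = trim_to_complete_columns(toy)

print(filled)
print(trimmed)
```

The filled version keeps all eight columns, while the trimmed version keeps only the four columns shared by all three sequences.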

So I have three very basic questions:

1) Is it OK to incorporate short, incomplete sequences into an
alignment of longer sequences?

2) Do I have to trim them all to equal length?

3) Or is it better to fill the shorter, incomplete sequences with "?" or "N"?

I would be very happy if anybody could give me a clue!

Kind regards

Alexander Gamisch

Patrice Showers Corneli

Mar 21, 2011, 10:30:53 AM
to beast...@googlegroups.com
All,

I'd like to hear back on this one as well.

It is a great question that deserves some discussion.

Seems to me that the deeper times are a result of greater average (per site) substitution rates. Including the gaps makes the sequences longer but they still have the same information about substitutions. So the average goes down and the rate of evolution (or time) goes down.

If the missing data are truly insertion/deletion events, I don't think it would do to leave them out. But if the data is simply missing then perhaps for the purposes of estimating divergence times those sites ought to be excluded.
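
The averaging argument above can be made concrete with a raw pairwise distance. The sketch below is an assumption-laden toy: the sequences and function names are hypothetical, and a p-distance is not what BEAST computes (its likelihood treats missing sites differently). It only shows how the choice of denominator changes a per-site average.

```python
# Characters treated as "no data" in this sketch (an assumption).
MISSING = set("?-N")

def p_distance_shared(a, b):
    """Proportion of differing sites, counting only sites where both have data."""
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x not in MISSING and y not in MISSING]
    if not pairs:
        return None  # nothing to compare
    return sum(x != y for x, y in pairs) / len(pairs)

def p_distance_all_columns(a, b):
    """Same count of differences, but divided by the full alignment length."""
    diffs = sum(x != y for x, y in zip(a.upper(), b.upper())
                if x not in MISSING and y not in MISSING)
    return diffs / len(a)

# Hypothetical pair: one substitution, last two sites missing in the second sequence.
full = "ACGTACGTAC"
short = "ACGTTCGT??"

print(p_distance_shared(full, short))       # 1 difference over 8 comparable sites
print(p_distance_all_columns(full, short))  # 1 difference over 10 columns
```

Dividing by all columns gives the lower value, which is the effect described above; dividing by the shared sites only does not dilute the average.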

Patrice Showers Corneli
cor...@biology.utah.edu

> --
> You received this message because you are subscribed to the Google Groups "beast-users" group.
> To post to this group, send email to beast...@googlegroups.com.
> To unsubscribe from this group, send email to beast-users...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/beast-users?hl=en.
>

Alexander Gamisch

Mar 22, 2011, 8:23:41 AM
to beast...@googlegroups.com
Thanks a lot for your reply!

I have done some testing to shed more light on the problem: it seems there is not much difference whether I fill the gaps with "?" or leave them as they are. Both approaches lead to more or less similar divergence time estimates.

But there is still a big difference if I trim the sequences, as mentioned before, which leads to much deeper times compared to the untrimmed alignment.

I am still confused: if I include the gaps, the sequences get longer but still carry the same information on substitutions, so the average goes down and the time goes down? A short, gappy sequence has fewer substitutions than a full sequence... would this not lead to deeper times, since the short sequences are less divergent from the full ones because they have fewer informative characters?

So it is still very tricky to know how to deal with incomplete sequences and with the alignment. As I understand it, I would trim the alignment to equal length and thus analyse only a subset of the marker region for divergence time dating. But this may be highly misleading.

Or is it better to keep the gaps? What do you do if you only have a partial alignment of a large region (for example, only a 1000 bp alignment of a 6000 bp marker)?

Maybe I am asking the same question again and again, but I am really confused about this topic.

I would be happy for any opinion!

Kind regards

Alexander Gamisch




Andrew Rambaut

Mar 22, 2011, 8:05:02 PM
to beast...@googlegroups.com
Dear Alexander,

Do you think your regions (the region for which you have all 60 sequences
and the region for which you are missing some data) have the same rate of
evolution? Perhaps the additional regions in the longer sequences are
faster evolving, giving a higher average rate when they are included.

You could divide your sequences into two partitions to allow different
rates of evolution and see what happens.
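
One way to prepare such a split is sketched below in Python. Everything here is an assumption for illustration: the toy alignment, the function names, and the column boundary are hypothetical. The two resulting blocks would still need to be saved as separate alignments and set up as separate partitions by hand.

```python
# Characters treated as "no data" in this sketch (an assumption).
MISSING = set("?-N")

def split_columns(seqs, boundary):
    """Cut every sequence at the same column, giving two alignment blocks."""
    left = {name: s[:boundary] for name, s in seqs.items()}
    right = {name: s[boundary:] for name, s in seqs.items()}
    return left, right

def missing_fraction(seqs):
    """Fraction of all cells in the block that carry no data."""
    total = sum(len(s) for s in seqs.values())
    missing = sum(ch in MISSING for s in seqs.values() for ch in s.upper())
    return missing / total

# Hypothetical toy alignment: sp2 lacks the second half of the marker.
toy = {"sp1": "ACGTACGT", "sp2": "ACGT????", "sp3": "ACGAACGA"}

left, right = split_columns(toy, boundary=4)

print(left, missing_fraction(left))
print(right, missing_fraction(right))
```

Reporting the missing fraction per block is a quick check that the boundary really separates the complete region from the region with missing data.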

Andrew


___________________________________________________________________
Andrew Rambaut
Institute of Evolutionary Biology University of Edinburgh
Ashworth Laboratories Edinburgh EH9 3JT
EMAIL - a.ra...@ed.ac.uk TEL - +44 131 6508624


Alexander Gamisch

Mar 24, 2011, 11:40:10 AM
to beast-users
Thank you very much for your suggestion!

In the alignment, the shorter sequences are not evenly distributed:
some of them align at the 5' end and others at the 3' end of the
alignment. So the gaps do not fall in one particular region that could
easily be defined as a partition. Because of that, I tried partitioning
the alignment half and half, unlinked the substitution model and clock
model, and linked the trees. This again led to a similarly shallow
estimate of divergence time for this group as before... suggesting that
rates across the whole region do not differ too much?

After some literature searching, it seems to me that it is common to
trim the matrix to equal length in order to avoid erroneous estimates
of divergence times (Schenk and Hufford. Systematic Botany 35(3):
578-592. 2010).

Since I do not know any better way, I will just exclude sequences that
are too short from the analyses and trim the rest of the sequences to
equal length.

I am still happy for any opinion!

Kind regards

Alexander Gamisch

John Schenk

Mar 24, 2011, 2:09:43 PM
to beast-users
Hi Alexander,

Lemmon et al. (2009; full citation below) did a nice job looking at
the impact of missing data on phylogeny estimates and found that
among-site rate variation combined with missing data has the most
severe consequences for topology and branch length estimates. I
therefore like Andrew’s suggestion of creating partitions (even if you
have to create three of them) as one method to deal with missing data
(given that the priors on branch lengths and rate heterogeneity are
appropriate; see Lemmon et al. 2009). One benefit I see in this
approach is that you would be taking into account some of the
uncertainty due to missing data.

As far as I am aware, there is nothing inherently wrong with trimming
your data down and running the analysis, but most people are reluctant
to do so because of the potential for losing phylogenetic signal. If
you are still able to robustly resolve relationships with a pared-down
matrix, it may be more reasonable than an analysis that is overwhelmed
by missing data. Alternatively, deleting those short sequences is
another approach, as long as the taxa are not vital to your study.
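
Deleting short sequences while protecting vital taxa could look like the Python sketch below. The 0.6 coverage threshold, the toy sequences, and the function names are hypothetical choices for illustration, not a recommended cutoff.

```python
# Characters treated as "no data" in this sketch (an assumption).
MISSING = set("?-N")

def coverage(seq):
    """Fraction of alignment columns in which this sequence has a real base."""
    return sum(ch not in MISSING for ch in seq.upper()) / len(seq)

def drop_short(seqs, min_coverage, keep=()):
    """Remove sequences below `min_coverage`, except taxa listed in `keep`."""
    return {name: s for name, s in seqs.items()
            if name in keep or coverage(s) >= min_coverage}

# Hypothetical toy alignment of 10 columns.
toy = {
    "sp1": "ACGTACGTAC",  # complete
    "sp2": "ACGT------",  # only 40% of columns covered
    "sp3": "??GTACGTAC",  # 80% of columns covered
}

filtered = drop_short(toy, min_coverage=0.6)
with_vital = drop_short(toy, min_coverage=0.6, keep={"sp2"})

print(sorted(filtered))
print(sorted(with_vital))
```

The `keep` argument lets a taxon that is vital to the study stay in the matrix even when its sequence is short.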

~John

Lemmon, A. R., J. M. Brown, et al. (2009). "The Effect of Ambiguous
Data on Phylogenetic Estimates Obtained by Maximum Likelihood and
Bayesian Inference." Systematic Biology 58(1): 130-145.

Alexander Gamisch

Mar 30, 2011, 3:53:48 AM
to beast-users
Thank you very much for your reply!

Your opinion and the citation you gave is really very useful!

I will delete the sequences that are too short (and not vital to my
study) and then try both approaches with the rest of the alignment. I
will post an update on the outcome some time later!

Kind regards

Alexander Gamisch