How does BEAST deal with missing sites and gaps now?


zheng hongxiang

May 26, 2011, 8:27:11 PM
to beast-users
Hello, all.
I really want to know how BEAST deals with missing sites ('N' and '?') and
gaps ('-'). I have checked
http://groups.google.com/group/beast-users/browse_thread/thread/d5b6da3792f59740/d8a37b57de073788?lnk=gst&q=missing#d8a37b57de073788,
but I could not get a definite reply there.

I have also checked the manuals and could find no information. I know
that PAML deletes every site in which a gap or missing character is found. In
MrBayes, gaps and missing characters likewise do not contribute any
phylogenetic information. I want to know whether BEAST uses a similar
strategy, or gives N an equal probability for each of the 4 nucleotides?

Gabrielle (Abby) Harrison

Jun 2, 2011, 3:08:29 AM
to beast...@googlegroups.com
Dear Zheng Hongxiang,

Did anyone answer this question? I would be curious to know the answer as well.

Yours sincerely

Abby Harrison


--
You received this message because you are subscribed to the Google Groups "beast-users" group.
To post to this group, send email to beast...@googlegroups.com.
To unsubscribe from this group, send email to beast-users...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/beast-users?hl=en.




--
Dr. G. L. A. Harrison
University of Oxford
OX1 3SY
Ph:+(0)1865 281532;
Fax:+(0)1865 281890;
Mb:+ 07581285242.

zheng hongxiang

Jun 2, 2011, 4:46:23 AM
to beast...@googlegroups.com
Sorry, there is no answer.

2011/6/2 Gabrielle (Abby) Harrison <a.g.l.h...@googlemail.com>



--
Zheng Hongxiang
State Key Laboratory of Genetic Engineering and MOE Key Laboratory of Contemporary Anthropology,
School of Life Sciences and Institutes of Biomedical Sciences,
Fudan University,
220# Handan Road, Shanghai (200433)
P.R.China

Zheng Hongxiang
School of Life Sciences, Fudan University, Shanghai
MOE Key Laboratory of Contemporary Anthropology

Andrew Rambaut

Jun 2, 2011, 5:12:51 AM
to beast...@googlegroups.com

On 27 May 2011, at 01:27, zheng hongxiang wrote:

> I have also checked the manuals and could find no information. I know
> that PAML deletes every site in which a gap or missing character is found. In
> MrBayes, gaps and missing characters likewise do not contribute any
> phylogenetic information. I want to know whether BEAST uses a similar
> strategy, or gives N an equal probability for each of the 4 nucleotides?

BEAST treats gaps the same way as PAML and MrBayes: a gap character ('-', '?' or 'N') is treated as missing and does not contribute any probability to the likelihood for that branch and site. This is the same as saying that there is equal marginal probability for the 4 nucleotides. Joe Felsenstein's book, 'Inferring Phylogenies', has a section describing this. One additional thing to note is that, by default, BEAST treats any of the IUPAC ambiguity codes as gaps or missing data (i.e., an R is treated as an N). This simple approximation allows a considerable speed-up (up to about 50%) in the likelihood calculation, but if ambiguities are important to your analysis you can override this behaviour. See this posting:

http://groups.google.com/group/beast-users/browse_thread/thread/d0a07cb06b185ca5/df26a3a217146601?lnk=gst&q=gaps#df26a3a217146601

Andrew

___________________________________________________________________
Andrew Rambaut
Institute of Evolutionary Biology University of Edinburgh
Ashworth Laboratories Edinburgh EH9 3JT
EMAIL - a.ra...@ed.ac.uk TEL - +44 131 6508624

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
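
To illustrate the treatment of missing data described in the reply above, here is a minimal Python sketch. It is not BEAST's code: the function name `tip_partials` and the `use_ambiguities` flag are hypothetical, chosen only for this example. It follows the general idea from Felsenstein's pruning algorithm, where each tip carries a vector of partial likelihoods over A, C, G, T. A known base is 1 for that base and 0 elsewhere; '-', '?' and 'N' are 1 for all four bases; and an IUPAC code such as R is either treated like N (the default described above) or restricted to the bases it allows.

```python
STATES = "ACGT"

# Standard IUPAC nucleotide ambiguity codes and the bases each one allows.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def tip_partials(char, use_ambiguities=False):
    """Return the partial-likelihood vector (A, C, G, T) for one tip character.

    use_ambiguities is a hypothetical switch for this sketch: when False,
    every ambiguity code is treated as fully missing, like 'N'.
    """
    c = char.upper()
    if len(c) == 1 and c in STATES:
        # Known base: only that state is compatible with the observation.
        return [1.0 if s == c else 0.0 for s in STATES]
    if use_ambiguities and c in IUPAC:
        # Ambiguity honoured: every base the code allows is compatible.
        return [1.0 if s in IUPAC[c] else 0.0 for s in STATES]
    # '-', '?', 'N' (and ambiguity codes by default): all states compatible.
    return [1.0, 1.0, 1.0, 1.0]


for ch in "AN-?R":
    print(ch, tip_partials(ch), tip_partials(ch, use_ambiguities=True))
```

With the default setting, R and N give the same all-ones vector, which is why the tip adds no information at that site; with ambiguities honoured, R is compatible only with A and G.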

zheng hongxiang

Jun 2, 2011, 5:34:09 AM
to beast...@googlegroups.com
Thank you, Dr Rambaut.

I think PAML removes the whole site wherever a missing character or a gap is found, so the whole site is excluded even if it was originally a polymorphic/divergent site, and it then offers no information at all. From what you describe, BEAST's strategy seems safer because it retains the information offered by the other sequences (those without missing data or gaps) at the site that contains the gap or missing character.

An example site is as follows.

A
A
N
T
T

In short, PAML deletes the whole site while BEAST keeps the whole site, but gives equal marginal probability to the 4 nucleotides for the single 'N' only. Is this interpretation correct?
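
The interpretation in this message can be checked numerically on the example column. The sketch below is a toy under stated assumptions, not what PAML or BEAST actually runs: it assumes a Jukes-Cantor model, a star tree with an arbitrary branch length of 0.1 for every tip, and hypothetical helper names. It compares the likelihood of the column A, A, N, T, T with the likelihood computed from the four known tips alone, and with the sum over the four possible bases in place of N.

```python
import math

STATES = "ACGT"


def tip_partials(char):
    """One-hot vector for a known base, all ones for anything else (N, ?, -)."""
    if char in STATES:
        return [1.0 if s == char else 0.0 for s in STATES]
    return [1.0] * 4


def jc69(t):
    """Jukes-Cantor transition probability matrix for branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def site_likelihood(column, t=0.1):
    """Likelihood of one column on a star tree (assumed topology)."""
    p = jc69(t)
    total = 0.0
    for root in range(4):
        term = 0.25  # equal base frequencies under JC69
        for char in column:
            partial = tip_partials(char)
            term *= sum(p[root][j] * partial[j] for j in range(4))
        total += term
    return total


def complete_deletion(columns):
    """Drop every column containing anything other than A, C, G or T."""
    return [col for col in columns if all(ch in STATES for ch in col)]


column = "AANTT"
with_n = site_likelihood(column)
without_n = site_likelihood("AATT")
summed = sum(site_likelihood("AA" + x + "TT") for x in STATES)

print(math.isclose(with_n, without_n))  # the N tip drops out of the product
print(math.isclose(with_n, summed))     # same as summing over A, C, G, T
print(complete_deletion([column, "AAATT"]))  # deletion discards the N column
```

Under these assumptions both comparisons hold, because each row of the transition matrix sums to 1: the tip with N multiplies the likelihood by 1, while the A/A/T/T tips still contribute. Complete deletion, by contrast, throws the column away.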


2011/6/2 Andrew Rambaut <a.ra...@ed.ac.uk>

Derek

Jul 30, 2011, 11:54:52 AM
to beast-users
Dear Andrew,

I am very interested in setting up a 5x5 matrix for my evolutionary
model to handle single indels. My multiple sequence alignment has a
number of significant single indels but no gaps that are longer
than a single position. The published articles I have read about including
single indels in evolutionary models have shown positive results. In my
particular phylogeny, the indels are critical. Is there a way with
BEAST to set up a 5x5 matrix, for example the F84+Gaps model as
described by McGuire, Denham and Balding (2001) in "Models of Sequence
Evolution for DNA Sequences Containing Gaps"? Any other 5x5
model along those lines would probably be just as workable.

Sincerely,

Derek



Andrew Rambaut

Jul 30, 2011, 1:02:54 PM
to beast...@googlegroups.com
Dear Derek,

It might make more sense to code the gaps as an additional binary data matrix. That way, you would have a rate of insertion/deletion relative to point mutation. By going to a 5x5 matrix you add a lot of additional parameters that you may not have sufficient data to infer. However, if you do want to define your own substitution model, take a look at this tutorial:

http://beast.bio.ed.ac.uk/General_Data_Type

Best,
Andrew
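
As a rough illustration of the point about additional parameters, the sketch below counts free parameters for an unrestricted time-reversible model with n states. This is a generic count assumed here for illustration only: constrained models such as HKY or F84+Gaps have fewer parameters, and the function name is hypothetical.

```python
def reversible_free_parameters(n_states):
    """Free parameters of an unrestricted time-reversible model.

    Exchangeabilities: n(n-1)/2, minus 1 because the overall rate is fixed.
    Frequencies: n, minus 1 because they sum to 1.
    """
    exchangeabilities = n_states * (n_states - 1) // 2 - 1
    frequencies = n_states - 1
    return exchangeabilities + frequencies


for n in (2, 4, 5):
    print(n, "states:", reversible_free_parameters(n), "free parameters")
```

On this count, moving from 4 to 5 states takes an unrestricted reversible model from 8 to 13 free parameters, whereas a separate two-state partition for the indels adds only a few.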

Marc Suchard

Jul 31, 2011, 10:32:23 AM
to beast-users

You may also want to consider using stochastic Dollo (see Alekseyenko
et al 2008) to model the 0/1 insertion-deletion characters, since the
usual binary models allow for back-mutation. For indel characters,
back-mutation violates our standard definitions of homology.

best, Marc

Derek

Jul 31, 2011, 11:26:31 AM
to beast-users
Dear Andrew,

Thank you for your entirely cogent response. I will definitely explore
both possibilities.

Best,

Derek.


Derek

Aug 1, 2011, 12:13:20 PM
to beast-users
Marc,

I read the paper and this is yet another good suggestion. I think my
problem at this point is how to tame the BEAST. I've been studying the
XML reference guide and the examples to understand how the models are
built and the parameters defined.

Let's say, for example, I want to follow Andrew's suggestion and use
HKY, adding a <binarySubstitutionModel> and defining a parameter for
the mutation rate of G|A|T|C to -. If I use a <siteModel> element,
then I'm only allowed to define one substitution model at a time.
Should I be using the <sampleStateModel> element, and then define
both the <HKYmodel> and <binarySubstitutionModel> in it? Do you know of
any resources where I can find more sample XML code?

I am afraid that with enough tweaking, I may get BEAST to erroneously
reach whichever conclusion I want it to reach.

Best,

Derek

Andrew Rambaut

Aug 1, 2011, 7:50:53 PM
to beast...@googlegroups.com
Dear Derek,

You should be able to load a binary (0,1) dataset into BEAUti (NEXUS format) and everything will be set up for you.

Andrew


___________________________________________________________________


Andrew Rambaut
Institute of Evolutionary Biology University of Edinburgh
Ashworth Laboratories Edinburgh EH9 3JT

EMAIL - a.ra...@ed.ac.uk TEL - +44 131 6508624
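
For readers wondering what a binary (0,1) NEXUS file might look like, here is a minimal Python sketch that writes one. The DATA block layout is standard NEXUS; the taxon names, the function name and the exact FORMAT line are assumptions for illustration, and it is worth checking that your version of BEAUti reads the resulting file as a binary partition.

```python
def binary_nexus(matrix):
    """Build a NEXUS DATA block from a dict of taxon -> '0'/'1'/'?' string."""
    lengths = {len(seq) for seq in matrix.values()}
    if len(lengths) != 1:
        raise ValueError("all rows must be non-empty in number and equal in length")
    nchar = lengths.pop()
    width = max(len(name) for name in matrix)
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(matrix)} NCHAR={nchar};",
        '  FORMAT DATATYPE=STANDARD SYMBOLS="01" MISSING=? GAP=-;',
        "  MATRIX",
    ]
    for name, seq in matrix.items():
        # Pad taxon names so the character columns line up.
        lines.append(f"  {name.ljust(width)}  {seq}")
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"


# Hypothetical three-taxon example with two indel characters.
example = {"taxon_A": "10", "taxon_B": "11", "taxon_C": "0?"}
print(binary_nexus(example))
```

The returned string can be saved to a file and loaded alongside the nucleotide alignment as a second partition.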

Amelia Chemisquy

Aug 2, 2011, 4:56:48 PM
to beast...@googlegroups.com
Hi,
I am having a problem with TreeAnnotator. When I try to read the .trees
file from my BEAST analysis I get the following error message:
Unable to parse number correctly: empty string.
Any ideas what it could be?

Thanks!
Amelia

Dra. Amelia Chemisquy
División Mastozoología
Museo Argentino de Ciencias Naturales "Bernardino Rivadavia" - CONICET
Av. Angel Gallardo 470 - C1405DJR -
Buenos Aires - Argentina -
Tel/Fax.: (5411) 4982-0306 / 1154 / 5243 / 4494 - Int. 210
http://www.macn.secyt.gov.ar/


Derek

Aug 6, 2011, 5:48:09 PM
to beast-users
Are the steps: 1. write a script to convert the nucleotide data for all
our taxa to (0,1), where 0 represents A, T, G, C and 1 represents a gap,
then 2. load into BEAUti the original nucleotide data plus the 0,1
format data, as two separate partitions?

My apologies for needing to ask.

Derek

Derek

Aug 6, 2011, 9:03:32 PM
to beast-users
Dear Andrew,

What of the TKF91 model to solve my (our) indel problems? I see in the
BEAST XML reference that this model is still supported (although I
don't see any working examples in the tutorials). Was the TKF91 model
problematic?

Best,

Derek



RobinvanVelzen

Aug 6, 2011, 9:38:04 PM
to beast-users
Dear Derek,

I suspect Andrew suggested something like the following: have a
partition with your DNA sequence data with the indels as N or - or ?,
which will be treated as unknown. Separately, you can add a partition
with binary coded indel characters. No need to recode the whole
alignment. Simply recode only the characters that have indels, with
ATCG as 1 and gap as 0 (or vice versa). So if your sequence alignment
has 10 characters with indels, your binary partition is only 10
characters long.

One other paper to reference with regards to indel coding may be
Simmons & Ochoterena 2001. But I guess that with single-nucleotide
indels only your coding should be relatively straightforward.

As Marc rightly pointed out, characters that were lost in a deletion
event cannot logically be regained, which is why you may want to set a
stochastic Dollo model (no back-mutations) for the partition with
binary coded indels.

As far as I know (but I am no expert on this) the TKF91 model is used
mainly for multiple sequence alignment of length-variable sequences,
rather than for phylogeny reconstruction. But others may correct me on
this.

I hope this helps...

Best,

Robin
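
The recoding described in this message can be sketched in Python as follows. This is one possible reading, under stated assumptions: only '-' counts as a gap, N and ? are carried into the binary partition as '?', gaps are single positions coded independently, and the function and taxon names are hypothetical. The nucleotide alignment itself is left unchanged and used as the first partition.

```python
def code_indels(alignment, gap="-", missing="N?"):
    """Return (indel_columns, binary_partition) for an aligned dict of sequences.

    Only columns containing at least one gap are recoded:
    nucleotide -> '1', gap -> '0', missing -> '?'.
    """
    taxa = list(alignment)
    length = len(alignment[taxa[0]])
    if any(len(alignment[t]) != length for t in taxa):
        raise ValueError("sequences must be aligned to the same length")

    # Columns (0-based) in which at least one taxon has a gap.
    indel_columns = [
        i for i in range(length)
        if any(alignment[t][i] == gap for t in taxa)
    ]

    binary = {}
    for t in taxa:
        row = []
        for i in indel_columns:
            ch = alignment[t][i]
            if ch == gap:
                row.append("0")
            elif ch.upper() in missing:
                row.append("?")
            else:
                row.append("1")
        binary[t] = "".join(row)
    return indel_columns, binary


# Hypothetical alignment: gaps in columns 1 and 3, missing data in taxon_C.
alignment = {
    "taxon_A": "ACG-TA",
    "taxon_B": "ACGTTA",
    "taxon_C": "A-G?NA",
}
columns, binary = code_indels(alignment)
print(columns)
print(binary)
```

Here two of the six alignment columns contain gaps, so the binary partition is two characters long, in line with the "only 10 characters long" example in the message.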

Andrew Rambaut

Aug 6, 2011, 10:49:49 PM
to beast...@googlegroups.com
Thanks Robin for responding. I am travelling at the moment and haven't
had a chance to get through my emails.

Robin is correct here. What Marc and I were suggesting is to separate
out the nucleotides and the indels into partitions, load them into
BEAUti, and give the Dollo model to the indels.

I believe TKF91 is not really a workable model. If you want to do the
joint tree/alignment approach then Marc's BAli-Phy is probably the only workable
solution at the moment.

Andrew


___________________________________________________________________
Andrew Rambaut
Institute of Evolutionary Biology University of Edinburgh
Ashworth Laboratories Edinburgh EH9 3JT

EMAIL - a.ra...@ed.ac.uk TEL - +44 131 6508624
