Treating ambiguities as informative sites in BEAST

1,623 views
Skip to first unread message

Alastair Potts

unread,
Sep 18, 2013, 2:58:22 PM9/18/13
to beast...@googlegroups.com
Hello all,

Most phylogenetic software treats the IUPAC ambiguity codes as "uncertainty" rather than "known" polymorphisms - and quite often the IUPAC code represents the latter.

We have demonstrated that treating intra-individual site polymorphisms can dramatically improve phylogenetic reconstruction for NJ, MP and ML (http://sysbio.oxfordjournals.org/cgi/reprint/syt052?ijkey=BhTtcB92GZeXzEs&keytype=ref) and I would very much like to try this approach in BEAST.

I have been playing around with the BEAST structure and I can't figure out how to do this. I have used BEAST before, in passing, but have never had try too hard to get under the hood and mess around. I would greatly appreciate any help or suggestions on how to increase the parameter matrix to include the transition rate estimates between all IUPAC codes (i.e. ambiguities treated as informative).

I am fully aware of the <treeLikelihood id="treeLikelihood" useAmbiguities="true"> code. This simply treats the ambiguity codes as a little less ambiguous (e.g. R is treated as R, not N).

I expect that there will be dramatic increases in run-time. Hopefully this will not be so debilitating as to make the approach unfeasible.

Again, any help would be greatly appreciated.

Thanks in advance,

Cheers,
Alastair

 

Arman Bilge

unread,
Sep 18, 2013, 6:58:36 PM9/18/13
to beast...@googlegroups.com
Hi Alistair,

To achieve the model that you describe, I suggest that you consider defining your own general datatype in the XML (as in the example here) where you define the IUPAC letters as state codes instead of ambiguity codes. The datatype you define can be used alongside the general substitution model, which can be referenced in your <treeLikelihood> element.

Good luck!
Arman


--
You received this message because you are subscribed to the Google Groups "beast-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beast-users...@googlegroups.com.
To post to this group, send email to beast...@googlegroups.com.
Visit this group at http://groups.google.com/group/beast-users.
For more options, visit https://groups.google.com/groups/opt_out.

Alastair Potts

unread,
Sep 22, 2013, 10:46:12 AM9/22/13
to beast...@googlegroups.com
Good day Arman and beast group users,

Thank you for your suggestion. I have tried to follow your suggestion, but find myself freefalling down the Beast rabbit hole and I definitely require further assistance, please.

I have two questions:

1) Can the GTR model be applied to a generalDataType? As far as I can tell, the settings within this do not allow it (i.e. rateAG, rateAC etc.).

2) How does one set the parameters that are nested within the generalSubstitutionModel parameterisation? Below are further details to this question.

I have defined the generalDataType with the worst case scenario of intra-individual site polymorphisms (2ISPs) - I would reduce this on a dataset specific basis excluding any 2ISPs that aren't present in the dataset to avoid inferring rates for states that are not present. 

<generalDataType id="twisp-informative-datatype">
<state code="A"/>
<state code="C"/>
<state code="G"/>
<state code="T"/>
<state code="Y"/>
<state code="R"/>
<state code="M"/>
<state code="S"/>
<state code="W"/>
<state code="K"/>
<state code="H"/>
<state code="V"/>
<state code="B"/>
<state code="D"/>
<state code="N"/>
<state code="-"/>
<state code="X"/>
<ambiguity code="?" states ="ACGT"/>
</generalDataType>

Here is the suggested generalSubstitutionModel...
<generalSubstitutionModel randomizeIndicator="true">
<generalDataType idref="twisp-informative-datatype"/>
<frequencies>
<frequencyModel idref="frequencyModel-twisp"/> <!-- how to define this? -->
</frequencies>
<rates>
<diagonalMatrix idref="diagonalMatrix-twisp"/> <!-- how to define this? -->
</rates>
</generalSubstitutionModel>

Below are the parameters referred to in the generalSubstitutionModel  parameter above, plus any offshoots from these parameters. I cannot find any documentation as to what precisely these parameters mean or how to change them. 

<!------------------------------------------------------------------------------------->
<frequencyModel normalize="true" dataType="codon-yeast">
  <frequencies>
    <compoundSymmetricMatrix idref="compoundSymmetricMatrix7"/>
  </frequencies>
</frequencyModel>

<diagonalMatrix>
  <designMatrix idref="designMatrix10"/>
</diagonalMatrix>

<designMatrix addIntercept="true" form="foo" colDimension="1" rowDimension="1">
  <parameter idref="parameter8"/>
  <compoundSymmetricMatrix idref="compoundSymmetricMatrix9"/>
  <compoundSymmetricMatrix idref="compoundSymmetricMatrix6"/>
</designMatrix>

<parameter value="0.5 1.0" dimension="1" upper="0.5 1.0" lower="0.5 1.0"/>

<compoundSymmetricMatrix asCorrelation="true">
  <diagonal>
    <designMatrix idref="designMatrix9"/>
  </diagonal>
  <offDiagonal>
    <compoundParameter idref="compoundParameter8"/>
  </offDiagonal>
</compoundSymmetricMatrix>

<compoundParameter>
  <designMatrix idref="designMatrix4"/>
</compoundParameter>
<!------------------------------------------------------------------------------------->

Thanks in advance for any help.

Regards,
Alastair

Arman Bilge

unread,
Sep 22, 2013, 6:17:54 PM9/22/13
to beast...@googlegroups.com
Hi Alastair,

To answer your questions:

1) The <generalSubstitutionModel/> is, if I understand it correctly, an implementation of the GTR model for a generic datatype. It defines a rate parameter for every potential state–state pair.

2) The exact implementation details are beyond my current knowledge as well; perhaps one of the BEAST developers can chime in. You should expect to be defining frequency and rate matrices with dimensions determined from the number of states that you define in your datatype. Something that you can try is making a mockup XML in BEAUti that uses discrete traits and inspecting the generated elements for an example. Let me know how that works for you.

Best, 
Arman

Alastair Potts

unread,
Sep 23, 2013, 9:30:34 AM9/23/13
to beast...@googlegroups.com
Hi Arman and other beast users,

Thanks for your suggestion. Playing around with the traits option in Beauti has given me further insights into the functioning of the <generalDataType> parameter.

I think this is a dead end for my purposes. The <generalDataType> seems to be exclusively for character state reconstructions and the <generalSubstitutionModel> can only have a CTMC model (I think). Thus, I don't think that I shall obtain a 2ISP-informative approach using this set of parameterisations.

Are there any other takers to try and solve this problem? (How to treat intra-individual site polymorphisms - R, Y, etc. - as informative characters in Beast?). Is it even possible despite the large set of editable parameters in Beast?

Thanks in advance,
Regards,
Alastair

Marc Suchard

unread,
Sep 23, 2013, 10:36:02 AM9/23/13
to beast...@googlegroups.com
This is certainly possible and regularly done.  Just keep in mind that YMR, etc., are specific ambiguity codes, not state codes, in your data-type specification.

Alastair Potts

unread,
Sep 23, 2013, 2:50:03 PM9/23/13
to beast...@googlegroups.com
Hi Marc,

Thanks for you response, although it is not exceptionally helpful. Any chance you could provide an example of an 2ISP-informative approach using Beast? 

Also, IUPAC defines YMR etc. as "ambiguity" codes. However, if, for example, one creates a consensus sequence from multiple clones from an individual and there is a C and T at a position to cause a Y coding, then there is nothing "ambiguous" about this. It is a known polymorphism. Most genes are not present as a single copy in the nuclear genome and this can give rise to such polymorphism.

Most "ambiguities" are treated as uncertainty by a wide range of phylogenetic software yet these are valid polymorphisms. I would be very interested to find the 2ISP-informative approach as outlined in the paper that I included in my original post being used regularly in Beast. 

Thank you for your time,
Regards,
Alastair 




--
You received this message because you are subscribed to a topic in the Google Groups "beast-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/beast-users/rWqz6FcTdKU/unsubscribe.
To unsubscribe from this group and all its topics, send an email to beast-users...@googlegroups.com.

To post to this group, send email to beast...@googlegroups.com.
Visit this group at http://groups.google.com/group/beast-users.
For more options, visit https://groups.google.com/groups/opt_out.



--
Dr. Alastair J. Potts
Claude Leon Postdoctoral Fellow
Botany Department
Nelson Mandela Metropolitan University

Cell #: 082 491-7275
Office #: 041 504-4375 (Mon, Thurs, Fri)
Fax #: 086 273-2675

Diego Caraballo

unread,
Oct 15, 2015, 2:34:34 PM10/15/15
to beast-users
Hi Alastair, 

I wondered if you received any further suggestion to deal with IUPAC ambiguities. I have a matrix with lots of potentially informative ambiguities, and would really appreciate to learn a way to make BEAST take them into account.

Cheers,

Diego

Nick Matzke

unread,
Oct 15, 2015, 4:06:53 PM10/15/15
to beast...@googlegroups.com
The answer is higher up in the thread, but I'm not sure folks are getting the basic idea.  You need to make a custom transition rate matrix.

Typical DNA rate matrix: 4x4, ie 4 possible ancestral states, 4 possible descendant states (ACGT).

If you want to have a polymorphism be its own character state, then you need a bigger rate matrix. For example, 4 bases + 6 possible 2-base combinations would be 10 states, requiring a 10x10 rate matrix. You can make custom rate matrices with generalSubstitutionModel, but it would require some skill with Beast XML.

I am pretty dubious whether coding polymorphisms as informative would change most typical DNA-based phylogenies, but maybe there are extreme cases where it would matter...

Cheers! Nick





For more options, visit https://groups.google.com/d/optout.

Diego Caraballo

unread,
Oct 15, 2015, 4:18:39 PM10/15/15
to beast...@googlegroups.com
Thanks Nick, you 've been very clear. I'll work on the XML file with the general SubstitutionModel, and see if it works out.

Cheers,

Diego


For more options, visit https://groups.google.com/d/optout.



--
-- 
Diego Caraballo

Laboratorio de Fisiología y Biología Molecular
Depto. de Fisiología y Biología Molecular y Celular
Facultad de Ciencias Exactas y Naturales
Ciudad Universitaria. Pabellón II. 2do piso.
C1428EHA-Buenos Aires. Argentina

Tel. 4576-3368/86 int 113
Fax. 4576-3321

Andrew Rambaut

unread,
Oct 15, 2015, 9:14:36 PM10/15/15
to beast...@googlegroups.com
By default, BEAST treats all IUPAC ambiguities as completely ambiguous (missing data - equivalent to a gap or N). This is done because it provides a considerable speed up in computation. However, if you want BEAST to use the information in ambiguity codes then you need to find all lines in your XML similar to the following:

<treeLikelihood id="treeLikelihood" useAmbiguities="false">

and change to

<treeLikelihood id="treeLikelihood" useAmbiguities=“true">

Search for ‘useAmbiguities’.

This is the correct way of using the information in ambiguities - if you have a code ‘R’ then you mean the actual base could be an ‘A’ or a ‘G’ but you don’t know which. The model will then correctly use this information. 

Best,
Andrew

Diego Caraballo

unread,
Oct 15, 2015, 11:23:44 PM10/15/15
to beast...@googlegroups.com
Thanks, Andrew.

griffinia

unread,
Oct 3, 2016, 2:40:59 PM10/3/16
to beast-users
Andrew,

I was very excited to see your response to this issue until I searched my *BEAST formatted input file generated with BEAST2.4.3.  No where in the XML file can I find the phrase ''useAmbiguities'.  Is the ability to get BEAST to use IUPAC codes as distinct states NOT available in BEAST2?  Or not available when one uses the STARBEAST template?  Or could it have something to do with using bmodeltest to set the site model?

Thank you,
Alan

Remco Bouckaert

unread,
Oct 3, 2016, 4:47:13 PM10/3/16
to beast...@googlegroups.com
Hi Alan,

To use ambiguities in BEAST 2, just set the useAmbiguities attribute to true on the elements with spec=“TreeLikelihood” (or spec=“ThreadedTreeLikelihood” if you use the threaded version), e.g.

<distribution id="treeLikelihood.ant" spec="TreeLikelihood" tree="@Tree.t:tree" useAmbiguities="true">

If you set up the analysis in BEAUti, in the partition panel the last column in the partition table allows you to toggle between using ambiguities or not (the latter is the default, since it is about twice as fast and the analysis is hardly affected if there are very few ambiguities).

Cheers,

Remco


--
You received this message because you are subscribed to the Google Groups "beast-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beast-users...@googlegroups.com.
To post to this group, send email to beast...@googlegroups.com.

griffinia

unread,
Oct 4, 2016, 6:44:00 AM10/4/16
to beast-users
Dear Remco,

Thank for very much.  However, the last column in the BEAUTi GUI on my PC is a column headed by "..."  I cannot expand it any further to determine what it is for.  In that column, there is an unchecked box for each locus.  Is this to where you refer?  Do I just check the box for each partition?  I attach a jpg of the GUI with the column boxed in red.

Thanks again,

Alan

On Wednesday, September 18, 2013 at 2:58:22 PM UTC-4, Alastair Potts wrote:
beautii.jpg

Remco Bouckaert

unread,
Oct 4, 2016, 1:54:32 PM10/4/16
to beast...@googlegroups.com
Hi Alan,

Yes, the last column labelled ‘…’ is the one for using ambiguities — if you hover the mouse over the checkbox (underneath the ‘…' label) a popup with some text should be shown that explains it.

As you suggested, if you want to use ambiguities for all partitions just make sure all checkboxes are ticked in that column.

Cheers,

Remco


--
You received this message because you are subscribed to the Google Groups "beast-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beast-users...@googlegroups.com.
To post to this group, send email to beast...@googlegroups.com.
Visit this group at https://groups.google.com/group/beast-users.
For more options, visit https://groups.google.com/d/optout.
<beautii.jpg>

Alan Meerow

unread,
Oct 4, 2016, 5:12:08 PM10/4/16
to beast...@googlegroups.com
Thanks again, Remco.

Best,
Alan
You received this message because you are subscribed to a topic in the Google Groups "beast-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/beast-users/rWqz6FcTdKU/unsubscribe.
To unsubscribe from this group and all its topics, send an email to beast-users...@googlegroups.com.

Alexander Berry

unread,
Nov 9, 2016, 3:20:55 PM11/9/16
to beast-users
Sorry to dig this thread up again but I don't think I'm understanding. Is there a way to use Ambiguity codes as states for diploids when using EBSP? That is to say, in a diploid, "R" is heterozygous for A and G, NOT either A OR G. My understanding is that useAmbiguities="true" is interpreted as either homozygous A or G.

Thanks,

Alex

Huw A. Ogilvie

unread,
Nov 10, 2016, 4:56:49 AM11/10/16
to beast-users
The EBSP model assumes that locus sequences are phased haplotypes. That is, if you have a sequence "ARTY", this wouldn't be sufficient for EBSP - the two phased haplotypes could be "AATT" and "AGTC", or they could be "AGTT" and "AATC".

Alexander Berry

unread,
Nov 10, 2016, 10:16:38 AM11/10/16
to beast-users
Thanks, Huw. I don't think that phasing is important for building phylogenies from SNP data, so I'm not concerned if the input is recognized as AATT/AGTC instead of an alternate phasing combination. Is the only way to do this by splitting each sequence into two sequences and inputting 200 haploid sequences (instead of 100 diploid sequences) into Beauti? How do I then tell BEAST that a given pair of sequences is, in fact, a pair?

Thanks for your help.

Huw A. Ogilvie

unread,
Nov 10, 2016, 8:00:28 PM11/10/16
to beast-users
Hi Alexander,

For coalescent models like EBSP that expect phased data, yes you should split your 100 diploid sequences into 200 haploid sequences. The models in BEAST that I'm aware of don't incorporate information about which haploid sequences are from the same individual or different individuals, so there is no way to specify which are paired.

You're right that when building phylogenies from SNP data phasing doesn't matter. As far as I am aware (Remco?), SNAPP will treat ambiguity codes as heterozygous sites.
Reply all
Reply to author
Forward
0 new messages