Heterozygous sites in RAxML: how to implement them?

Gabriele Sgarlata

unread,

Oct 18, 2018, 1:32:37 PM10/18/18

to raxml

Hello,

I have RAD-seq data and I would like to extract SNPs data to build phylogenetic tree in RAxML-NG. I'm working with diploid organisms, and I noticed that in the new RAxML-NG, heterozygous sites can be included.

In the documentation is specified the state order as GENOTYPE (diploid unphased).

However, it is still not very clear to me how heterozygous site should be implemented in the alignement. I'm working with unphased genotypes, so shall I implement heterozygous site with the two alternative bases separated by "/"? For instance:

>individual1

ATTTC/GAGG

>individual2

ATTTA/TAGG

.....

Thank you

Alexey Kozlov

unread,

Oct 18, 2018, 9:54:41 PM10/18/18

to ra...@googlegroups.com

Hi Gabriele,

sorry for the confusion, this feature is still kind of experimental and
thus wasn't properly documented.

You should use standard IUPAC ambiguity code to represent
heterozygosity, please see update wiki page:

https://github.com/amkozlov/raxml-ng/wiki/Input-data#state-encoding--order

Apart from this, unphased genotype data has 10 states and thus requires
a different model/substitution matrix.

So far, we have implemented just a few:

GTJC = all substitution rates are equal (Jukes-Cantor)

GTGTR = all rates are distinct (many free parameters!)

GTGTR4 = all nucleotide DNA mutation rates A<->C<->G<->T are distinct
but genotype-independent, e.g. A/A<->A/C = A/C<->C/C (this is basically
a classical DNA-GTR "projected" into genotype state space)

GTHKY4 = as above, but based on DNA-HKY

Hope this helps,
Alexey

> --
> You received this message because you are subscribed to the Google
> Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to raxml+un...@googlegroups.com
> <mailto:raxml+un...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout.

Gabriele Sgarlata

unread,

Oct 19, 2018, 5:01:16 AM10/19/18

to raxml

HI Alexey,

First of all, thank you for your reply. Ok, then, if it is still experimental, I will use IUPAC coding as you suggested. However, from reading other posts in the Google Groups (https://groups.google.com/forum/#!topic/raxml/1eCZ0uh9nN8), I thought that heterozygous sites should be removed. I'm happy to know that now they can be used.

Please, if you think I misunderstood something, let me know.

To conclude, do you confirm that, once the diploid unphased option will be well implemented, we will need to indicate heterozygous sites in the alignment as Allele1/Allele2?

Thank you for the update in the wiki page and for your clarifications.

Alexey Kozlov

unread,

Oct 19, 2018, 8:09:45 AM10/19/18

to ra...@googlegroups.com

Hi Gabriele,

there is still a bit of confusion, let me try to clarify. With this sort
of data, you basically have three options:

1) Remove heterozygous sites as suggested in the older thread you
linked, and use DNA model *with* asc. bias correction (e.g. "--model
GTR+G||||+ASC_LEWIS")

2) Keep heterozygous sites, encode them with IUPAC and use DNA model
*without* asc. bias correction (e.g. "--model GTR+G||||"). In this
case, A/C means we are simply uncertain whether it's A or C.

3) Keep heterozygous sites, encode them with IUPAC and use GENOTYPE
model with or without asc. bias correction (e.g. "--model
GTGTR4+G+ASC_LEWIS||||" or "--model GTGTR4+G||||"). In this case, we
explicitly model A/C, A/A and C/C as three distinct states.

Best,
Alexey

> > an email to raxml+un...@googlegroups.com <javascript:>
> > <mailto:raxml+un...@googlegroups.com <javascript:>>.

> > For more options, visit https://groups.google.com/d/optout

> <https://groups.google.com/d/optout>.

Gabriele Sgarlata

unread,

Oct 19, 2018, 9:49:59 AM10/19/18

to raxml

Hi Alexey,

Ok, now I think to have understood.

To summarise, the genotype model will threat ambiguous sites ("Y"), which are the heterozygous sites, by modelling three possible states (as in your example). While the DNA model will keep some uncertainty whether it is an A or a C, but heterozygous sites will need anyway to be coded as "Y", according to IUPAC code.

Thank you again for your time and clarifications.

Bests,

Gabriele

Alexey Kozlov

unread,

Oct 19, 2018, 10:56:20 AM10/19/18

to ra...@googlegroups.com

Hi Gabriele,

> Ok, now I think to have understood.
> To summarise, the genotype model will threat ambiguous sites ("Y"),
> which are the heterozygous sites, by modelling three possible states
> (as in your example). While the DNA model will keep some uncertainty
> whether it is an A or a C, but heterozygous sites will need anyway to
> be coded as "Y", according to IUPAC code.

almost :)

With genotype model, "Y" means exactly one possible state: A/C.
Conversely, A/A would be encoded by "A", and "C/C" -- by "C".

With DNA model, there is no such state as "A/C", and so "Y" simply means
uncertainty about the true state, i.e. we say it's either "A" or "C"
(=not "G" or "T"), but we can't tell which one of both.

Best,
Alexey

> <https://groups.google.com/forum/#!topic/raxml/1eCZ0uh9nN8>), I

Gabriele Sgarlata

unread,

Oct 19, 2018, 11:01:13 AM10/19/18

to raxml

Very clear.

I must say that you are a great teacher, with patience and wish to make things clear.

Thank you again.

Alexey Kozlov

unread,

Oct 19, 2018, 11:11:12 AM10/19/18

to ra...@googlegroups.com

thank you :)

Sara Rocha

unread,

Nov 7, 2019, 9:49:41 AM11/7/19

to raxml

Hi!

Grimm

unread,

Nov 8, 2019, 11:47:49 AM11/8/19

to raxml

Hej you two,

A little correction (lots of people confuse the ambiguity codes I noticed) and a tip what you can do when you want to test whether there is substantial information in the heterozygous sites. You can also use RAxMLs MULTISTATE option.

With genotype model, "Y" means exactly one possible state: A/C.
Conversely, A/A would be encoded by "A", and "C/C" -- by "C".

With DNA model, there is no such state as "A/C", and so "Y" simply means
uncertainty about the true state, i.e. we say it's either "A" or "C"
(=not "G" or "T"), but we can't tell which one of both.

A/C would be M not Y :)

The labels for ambiguity codes have a simple logic.

Y stands for pYrimidine(s) which is C (cytosine) and T (thymine)
R is for puRine(s), adenosine (no y) and guanine (again no y)
W mean Weak, so it's A/T, forming a weak (double) bond vs.
S for Strong, C/G, triple bond in the double-helix
K stands for Keto-group, which you have in G and T
M stands for aMino-group, which you have in A and C

You don't have them with single-copy heterozygous data but the tripletts are easy, too. It's just one letter after the one missing:

B means A missing = C, G, T
D means C missing = A, G, T
H means G missing = A, C, T
V means T missing = A, C, G (U is uracil, hence, we have to jump to V)

Regarding option 1, 2 and 3, and what to expect and when to choose what.

Heterozygotic data is a simple version of what we called "twisps" (2ISP, intra-individual site polymorphism) in this paper, where we tested their information content in 2ISP-rich data using, among other things, the MULTISTATE option of RAxML:

Potts AJ, Hedderson TA, Grimm GW. 2014. Constructing phylogenies in the presence of intra-individual site polymorphisms (2ISPs) with a focus on the nuclear ribosomal cistron. Systematic Biology 63:1–16. https://academic.oup.com/sysbio/article/63/1/1/1687378

The standard implementation in RAxML is, in contrast to MP, distance and Bayes implementations (back then, not sure they changed it in MrBayes) not ambiguity-naive, as Alexej pointed out for his possibility 2. Depending on the diversity in data (how well sorted it is), it makes a difference whether you treat eg. R as G/T (implemenation in RaxML) or ? (implementation in MrBayes 2 and 3.1). Possibility one woud be ambiguity-naive.

As a quick test to see how informative the heterozygous sites are, you just run your data following Alexej's suggestion 1 (replace by ?) and 2. If the trees are the same, the heterozygous sites are irrelevant for the inference. If the trees differ in important aspects (placement of your heterozygotic accessions), then you go for Alexej's option 2 and 3. I guess 1 is quicker than 2 and 3 is the slowest.

The experimental genotype implementation (Alexej's possibility 3) is the more sophisticated version (since it used both the mutation rates and the ambiguity codes) of our recoding procedure: we simply replaced A, C, G, T, R, Y, etc by 0, 1, 2, 3, 4 and run the same matrices as DNA or MULTISTATE.

Using ambiguities like that for additional information was a double-edged sword under ML. When your data is 2ISP-rich (in your case: a lot of heterozygotic sites), it's unproblematic and you get fine phylogenetic hypothesis (we preferred networks since evolution is usually not tree-like towards the leaves), also outperforming standard ML (we also did simulations), but if it's 2ISP-poor, the MULTISTATE recoding will over-penalise the change from a monomorphic to a dimorphic (heterozygous) state, e.g. A<>G would have a higher probability than A<>R, which makes no sense. In those case, the standard run (as DNA, option 2) produced more sensible ML trees.

The new genotype implementation should be much less vulnerable to that problem, I guess.

Cheers, Guido

Below the split support changes on the real-world data (using low to high polymorphic multi-copy data, so the worst-possible data for phylogenetics)

(the P index is the level of heterozygosity, -A is the standard implementation, ie. ignorant for NJ and MP, partly aware for ML; -I using the ambiguity codes as additional states, ordered for NJ and MP, 13-state matrix for ML)

Sara Rocha

unread,

Nov 11, 2019, 7:14:15 AM11/11/19

to ra...@googlegroups.com

thanks a lot about this detailed explanation!

--
You received this message because you are subscribed to a topic in the Google Groups "raxml" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/raxml/WyelRC1X1Sg/unsubscribe.
To unsubscribe from this group and all its topics, send an email to raxml+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/raxml/9b13bf50-cdff-4412-9afc-d5fb5b88ace8%40googlegroups.com.

--

Sara Rocha

Post-Doc Researcher

Phylogenomics Group University of Vigo (Spain)
http://darwin.uvigo.es/wordpress/?page_id=575