indel coding


SoniaN

Jul 27, 2021, 5:31:38 PM
to raxml
Hello. Can indel data be used in raxml? If so, is there a code you would recommend for scoring the indels? I am examining a data set for just a few loci (easy to inspect visually) and there are interesting indel patterns that would be nice to use as characters. Thank you in advance for any education you can offer on this.

Alexandros Stamatakis

Jul 28, 2021, 2:29:04 AM
to ra...@googlegroups.com
Dear Sonia,

There is no such option in RAxML, but you can in principle model gaps as
a 5th state, although I guess this is not what you want to do.

Another option is to model indel patterns as a separate binary matrix
and concatenate it with the molecular data matrix.

Alexis

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
Affiliated Scientist, Evolutionary Genetics and Paleogenomics (EGP) lab,
Institute of Molecular Biology and Biotechnology, Foundation for
Research and Technology Hellas

www.exelixis-lab.org

Grimm

Jul 28, 2021, 11:50:51 AM
to raxml
Hi Sonia,

Adding to Alexis: depending on the complexity of your indels, you can also use the binary option to code step-matrices, or use multistate coding instead,
e.g.
tip1 AAAGGGAAA
tip2 AAA---AAA
tip3 AAAAAAAAA

one could use a ternary coding
tip1 AAAGGGAAA = 0
tip2 AAA---AAA = 1
tip3 AAAAAAAAA = 2
(In this particular case one might even treat this character as ordered, given the mutation steps needed to generate such a sequence pattern.)
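
A minimal Python sketch of such a multistate coding, assuming the alignment is held as a plain dict of name to aligned sequence and that you have picked the indel region by eye. The function name `multistate_code` and the 0-based, half-open `start`/`end` coordinates are my own choices for illustration, not anything RAxML provides:

```python
def multistate_code(alignment, start, end):
    """Give every distinct sequence variant in columns [start, end) its own state.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    States are numbered in order of first appearance, so the numbering
    itself carries no meaning (the character is unordered).
    """
    codes = {}    # variant string -> state number
    result = {}   # taxon name -> state number
    for name, seq in alignment.items():
        variant = seq[start:end]
        if variant not in codes:
            codes[variant] = len(codes)
        result[name] = codes[variant]
    return result


alignment = {
    "tip1": "AAAGGGAAA",
    "tip2": "AAA---AAA",
    "tip3": "AAAAAAAAA",
}
print(multistate_code(alignment, 3, 6))
```

For the three tips above this assigns 0, 1 and 2, matching the hand coding. Treating the character as ordered would need a deliberate choice of state order, which this sketch does not attempt.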

Two more things:

First, a warning: gap coding can be a double-edged sword, as it may invite false positives. For instance, duplications in non-coding chloroplast gene regions can be highly convergent. So if the gap pattern doesn't match the point mutations, relying on the binary partition puts you on thin ice. If the convergent gap patterns outnumber the (phylogenetically sorted) point mutations, the binary gap-coding partition will outcompete the nucleotide-substitution-based tree.

The good news: you don't need to code gaps as binaries at all if the matrix also has good signal in the variable sites. Since gaps are treated like N's under ML, they are considered when optimising the tree. ML under the standard substitution model is semi-aware of gaps: the tip probability vector of a gap is the same as for N/missing data, p = (1,1,1,1). For example, take this four-taxon problem:

tip1 AAAAAAAAA
tip2 AAAAAAAAA
tip3 AAA---AAA
tip4 AAA---AAA

Parsimony will give you a star tree, but ML will prefer the tip1 + tip2 | tip3 + tip4 split over the alternatives because it sees an alignment pattern involving a change from A (1,0,0,0) to N (1,1,1,1). That is, if there are gaps showing splits congruent with the variable sites, they will already stabilise your topology even when you just leave them as they are in the alignment.
This also means that if you code your gaps as, e.g., a binary partition, you need to exclude them from the nucleotide partition. Otherwise you effectively count their signal twice.
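
As a rough illustration of that last point, here is a Python sketch that codes each distinct gap run as one binary character and, in the same pass, drops the gapped columns from the nucleotide matrix so the signal is not used twice. The helper names (`gap_runs`, `split_gaps`) are made up for this example, and the coding rule is deliberately naive: overlapping gaps of different extent become independent characters, and leading or trailing gaps are treated as indels although they are often just missing data.

```python
def gap_runs(seq):
    """Return the half-open (start, end) intervals of consecutive '-' in seq."""
    runs, start = [], None
    for i, ch in enumerate(seq):
        if ch == "-" and start is None:
            start = i
        elif ch != "-" and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(seq)))
    return runs


def split_gaps(alignment):
    """Split an alignment into a gap-free nucleotide matrix and a 0/1 indel matrix.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    Returns (nucleotides, binary, indels), where indels lists the coded runs.
    """
    names = list(alignment)
    runs_by_taxon = {n: set(gap_runs(alignment[n])) for n in names}
    indels = sorted(set().union(*runs_by_taxon.values()))
    # 1 = this taxon has exactly this gap run, 0 = it does not
    binary = {
        n: "".join("1" if run in runs_by_taxon[n] else "0" for run in indels)
        for n in names
    }
    # every column covered by a coded gap leaves the nucleotide partition
    gapped = {i for start, end in indels for i in range(start, end)}
    nucleotides = {
        n: "".join(ch for i, ch in enumerate(alignment[n]) if i not in gapped)
        for n in names
    }
    return nucleotides, binary, indels
```

Applied to the four-taxon example above, this yields one binary character (0 for tip1 and tip2, 1 for tip3 and tip4) and a six-column nucleotide matrix.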

To be on the safe side, always run the standard analysis (no coding), the combined analysis (nucleotides + binary codes), and the separate analyses (remaining nucleotides only, gap codes only), and compare the trees. They should converge on the same tree and differ only in how well they resolve some parts of it.
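
For the combined run, the two matrices have to end up in one alignment with a partition file telling the program which columns are which. The Python sketch below writes both as strings. It assumes a relaxed PHYLIP layout and the `DNA, name = from-to` / `BIN, name = from-to` partition lines as I know them from RAxML 8; the function name and the partition names `nucleotides` and `indels` are arbitrary, and you should check the manual of the version you use before relying on the exact syntax.

```python
def concatenate(nucleotides, binary):
    """Join a nucleotide matrix and a binary matrix and describe the partitions.

    nucleotides, binary: dicts mapping the same taxon names to equal-length rows.
    Returns (phylip_text, partition_text).
    """
    names = list(nucleotides)
    n_nuc = len(nucleotides[names[0]])
    n_bin = len(binary[names[0]])
    # relaxed PHYLIP: header with taxon and column counts, then "name sequence"
    lines = [f"{len(names)} {n_nuc + n_bin}"]
    lines += [f"{name} {nucleotides[name]}{binary[name]}" for name in names]
    # partition lines (assumed RAxML 8 style): nucleotides first, then indel codes
    partitions = [
        f"DNA, nucleotides = 1-{n_nuc}",
        f"BIN, indels = {n_nuc + 1}-{n_nuc + n_bin}",
    ]
    return "\n".join(lines) + "\n", "\n".join(partitions) + "\n"
```

The separate runs then simply use the nucleotide matrix and the binary matrix on their own.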

The exception to the usually-doesn't-pay-to-gap-code rule is when your sequences are differentiated mostly by gap patterns and only very few point mutations, because then it may be hard for the algorithm to optimise a substitution model at all and infer a meaningful tree. In such a case one just binarises all alignment patterns,
e.g.
tip1 GAAAAAAGG  = 101
tip2 AAAAAAAGG  = 001
tip3 AAA---AAA  = 010
tip4 GAA---AAA  = 110
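
One way to automate this kind of binarisation, sketched in Python under my own assumptions: only columns with exactly two states are coded, adjacent columns with the same pattern are merged into one character, a gap always gets state 1, and otherwise the alphabetically first nucleotide gets state 0. The polarity rule and the name `binarise_patterns` are arbitrary choices for the example.

```python
def binarise_patterns(alignment):
    """Recode every two-state alignment pattern as a single 0/1 character.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    Invariant columns and columns with more than two states are skipped.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    coded = {name: "" for name in names}
    previous = None  # pattern of the column just before, if it was coded
    for col in range(length):
        column = [alignment[name][col] for name in names]
        # nucleotides in alphabetical order, gap last, so a gap is coded 1
        states = sorted(set(column), key=lambda s: (s == "-", s))
        if len(states) != 2:
            previous = None
            continue
        pattern = tuple(states.index(s) for s in column)
        if pattern == previous:
            continue  # same event as the neighbouring column, already coded
        previous = pattern
        for name, bit in zip(names, pattern):
            coded[name] += str(bit)
    return coded


alignment = {
    "tip1": "GAAAAAAGG",
    "tip2": "AAAAAAAGG",
    "tip3": "AAA---AAA",
    "tip4": "GAA---AAA",
}
print(binarise_patterns(alignment))
```

Under this rule tip4, which carries both the leading G and the gap, comes out as 110.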

Happy coding,
Guido