how RAxML codes/identifies invariant sites in SNP data

ZWC

unread,

Apr 26, 2021, 4:12:11 PM4/26/21

to raxml

I have a dataset of ~260 individuals with about 30,000 SNPs. I removed invariant sites during filtering, and visually scanning the alignment it does not appear that there are any invariant sites. However, RAxML is saying there are invariant sites and therefore cannot run when trying to account for ascertainment bias. I get that, but I don't understand why it is identifying invariant sites.

My outgroups do have a lot of SNPs with no call. What I'm wondering is if the outgroups having missing data, but all of the my ingroup specimens have the same base, is this still identified as an invariant site? If so, since it is difficult to scan a 260 X 30,000 matrix for such situations, am I better off just eliminating SNPs where the outgroups have missing data? Or is there a better way to deal with this type of invariant site?

Thanks!

Alexey Kozlov

unread,

Apr 26, 2021, 8:15:57 PM4/26/21

to ra...@googlegroups.com

we just recently answered a similar question, please see here:

https://groups.google.com/g/raxml/c/tQGe9r3T0AQ/m/7r_axFkiAwAJ

Best,
Alexey

> --
> You received this message because you are subscribed to the Google Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to
> raxml+un...@googlegroups.com <mailto:raxml+un...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/raxml/038f48f6-5e9e-46d6-ab7e-7465cc77e2aan%40googlegroups.com
> <https://groups.google.com/d/msgid/raxml/038f48f6-5e9e-46d6-ab7e-7465cc77e2aan%40googlegroups.com?utm_medium=email&utm_source=footer>.

ZWC

unread,

Apr 27, 2021, 4:46:09 PM4/27/21

to raxml

Hi Alexey,

Thanks for pointing me towards that thread. I somehow didn't see it when I searched. I don't have a full alignment, but rather only the SNPs. I used the script that the author of the other thread mentioned that removes invariant sites and was able to successfully do a run with ascertainment bias. However, my two outgroups have very long branches and nest within my ingroups. When I ran a "regular" analysis without removing these invariant sites using that script and without ascertainment bias, the tree looks roughly "normal" in that the outgroups are outgroups and the the relationship among populations looks roughly good.

My outgroups have a lot of missing data, and I'm wondering if this could be throwing off RAxML and whether I'm better off trying to run an analysis with only the SNPs that the outgroups have data?

Thanks for any insight that anyone has.

Zach

Alexandros Stamatakis

unread,

Apr 27, 2021, 11:21:45 PM4/27/21

to ra...@googlegroups.com

Dear Zach,

Regarding the outgroup issue:

You could also first infer a phylogeny on the ingroup and then either
use the RootDigger tool to determine the root via a non-reversible model
of substitution or place the outgroups onto the ingroup tree, after the
ingroup tree inference using the EPA-NG tool. Both tools will also allow
you to quantify root placement uncertainty.

Alexis

> <https://groups.google.com/d/msgid/raxml/038f48f6-5e9e-46d6-ab7e-7465cc77e2aan%40googlegroups.com?utm_medium=email&utm_source=footer
> <https://groups.google.com/d/msgid/raxml/038f48f6-5e9e-46d6-ab7e-7465cc77e2aan%40googlegroups.com?utm_medium=email&utm_source=footer>>.

>
>
> --
> You received this message because you are subscribed to the Google
> Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to raxml+un...@googlegroups.com
> <mailto:raxml+un...@googlegroups.com>.
> To view this discussion on the web visit

> https://groups.google.com/d/msgid/raxml/0b3bec9f-ef47-4cf5-9731-1e5951e307f9n%40googlegroups.com
> <https://groups.google.com/d/msgid/raxml/0b3bec9f-ef47-4cf5-9731-1e5951e307f9n%40googlegroups.com?utm_medium=email&utm_source=footer>.

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.exelixis-lab.org

Grimm

unread,

Apr 28, 2021, 9:41:57 AM4/28/21

to raxml

And add-on to Alexi's suggestion, which you really should follow: you do want an ingroup tree not affected by outgroup-ingroup branching artefacts, which is always a risk when only very distant outgroups (or poorly sampled, e.g. with missing critical data) are available.

Usually when we do EPA, we want high LWRs but when placing distant outgroups the situation can be quite different.

Long-branching, very distant outgroups that do not trigger ingroup-outgroup branching artefacts (e.g. ingroup-outgroup LBA) ideally provide a somewhat ambiguous root signal, i.e. one can expect that e.g. EPA places outgroups at different ingroup branches and you get LWRs < 1 for more than one, but typcially connected branches in the ingroup tree for each queried outgroup. I.e. you end up with a potential root area rather than a specific, high-supported root. While this may look disencouraging on the first sight (and may require some explanation during review), it's actually a good result for certain evolutionary situations, such as a (relatively) long-isolated ingroup that underwent a fast ancient radiation. The following pic is an example how an outgroup-EPA result can look for a clearly distinct ingroup (that probably underwent explosive initital radiation) and an outgroup (here: sisterclade genera/species) providing only diffuse signal: the arrows give the summed probabilities based on EPA of 49 queried outgroup accessions – full paper is open access: Liede-Schumann et al. 2019, PeerJ; the full outgroup-EPA result is in Supplement file 4)

There are two more special cases to look out for:

If the LWRs of all alternative placements are generally very low and scatter across the tree or if your individual outgroups exclusively place in EPA with high LWR in distant (opposite) parts of the ingroup tree, their data isn't informative at all to define any outgroup-based root.

If your outgroups place with the most-distant, longest ingroup tip, you may look at inevitable ingroup-outgroup LBA. I haven't tested it but I guess RootDigger handles LBA situations better than EPA (the bulk rule is ML has 50:50 chance to escape LBA, and this may also apply to EPA).

If you have a broad outgroup sample, it's not unusual to find both LB-attracted outgroups producing high LWRs at wrong tips/subtrees representing high-evolved groups as well as well (phylogenetically speaking)-placed outgroups splitting their LWRs close to the actual ingroup root.

Good outgroup digging, Guido

ZWC

unread,

Apr 28, 2021, 4:08:17 PM4/28/21

to raxml

Hi Alexis and Guido,

Thanks for your helpful responses. I am attempting to run EPA-NG. I ran only my ingroup taxa in RAxML to generate a tree and info file to use in EPA. However, I'm getting an error (on CIPRES) that says "Bad tree/ref msa combination". If I run it locally it says the tree contains taxa not in the reference msa. Originally, it seemed that this was because an _ had been appended to the taxa names in the msa. But even when I remove them and the names look the same in the tree and the msa, it is still throwing that error.

Any thoughts on what could cause this?

Thanks again,

Zach

Alexandros Stamatakis

unread,

Apr 29, 2021, 12:33:24 AM4/29/21

to ra...@googlegroups.com

Hi Zach,

I guess Pierre (the developer of EPA-NG) can haelp you with this.

The most likely issue is that there is still some sort of mismatch
between the #taxa and/or taxon names in the reference tree and the
reference MSA.

Alexis

> – full paper is open access: Liede-Schumann et al. 2019, /PeerJ/
> <http://dx.doi.org/10.7717/peerj.8999>; the full outgroup-EPA result
> is in Supplement file 4 <https://doi.org/10.7717/peerj.8999/supp-4>)

>
> Liede-SchumannEtAl2019Fig3.png
>
> There are two more special cases to look out for:
>

> If the LWRs of *all alternative placements are generally very low*
> /and/ scatter across the tree or if your individual outgroups
> exclusively place in EPA with *high LWR in* *distant (opposite)
> parts* of the ingroup tree, their data isn't informative at all to

> define any outgroup-based root.
>
> If your outgroups place with the most-distant, longest ingroup tip,

> you may look at *inevitable ingroup-outgroup LBA*. I haven't tested

> <https://groups.google.com/d/msgid/raxml/0b3bec9f-ef47-4cf5-9731-1e5951e307f9n%40googlegroups.com?utm_medium=email&utm_source=footer

> <https://groups.google.com/d/msgid/raxml/0b3bec9f-ef47-4cf5-9731-1e5951e307f9n%40googlegroups.com?utm_medium=email&utm_source=footer>>.
>
>
> --
> Alexandros (Alexis) Stamatakis
>
> Research Group Leader, Heidelberg Institute for Theoretical Studies
> Full Professor, Dept. of Informatics, Karlsruhe Institute of
> Technology
>

> www.exelixis-lab.org <http://www.exelixis-lab.org>

>
> --
> You received this message because you are subscribed to the Google
> Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to raxml+un...@googlegroups.com
> <mailto:raxml+un...@googlegroups.com>.
> To view this discussion on the web visit

> https://groups.google.com/d/msgid/raxml/c0139497-6195-463a-a2b2-1c11deb3c8dfn%40googlegroups.com
> <https://groups.google.com/d/msgid/raxml/c0139497-6195-463a-a2b2-1c11deb3c8dfn%40googlegroups.com?utm_medium=email&utm_source=footer>.

Pierre Barbera

unread,

Apr 29, 2021, 3:17:57 AM4/29/21

to ra...@googlegroups.com

Hi Zach,

I agree with Alexis, its likely that there is still a mismatch. If you
run EPA-ng with --verbose it should print some size statistics about the
reference and query files, maybe this will already give you some
indication. Feel free to post the log file here if you want.

Pierre

MSc Pierre Barbera

Phone: +49 6221 533 258
Fax: +49 6221 533 298
E-Mail: pierre....@h-its.org

HITS gGmbH
Schloss-Wolfsbrunnenweg 35
D-69118 Heidelberg
Amtsgericht Mannheim / HRB 337446
Managing Director: Dr. Gesa Schönberger
Scientific Director: PD Dr. Wolfgang Müller

ZWC

unread,

Apr 29, 2021, 7:48:14 AM4/29/21

to raxml

Okay, evidently my eyes were not able to distinguish between _ and - yesterday! Looks to be fixed and running for now. Thanks!

Zach

Reply all

Reply to author

Forward