raxml-ng with high-level missing data

Jiaqi

Jun 8, 2021, 8:55:37 PM

to raxml

Hi,

I am working on both ancient and modern samples and therefore my data set contains a lot of missing sites.

Currently I have a VCF file and a SNP-only multiple sequence alignment (MSA). However, per-sample missingness ranges from 10% to 90% (ancient samples have a lot of missing sites, coded as "N" or "-"). I would like to use those variant sites to construct an ML tree, but I am wondering whether my missing data will be a big issue when using raxml-ng with ascertainment bias correction, and how I should estimate the base frequencies. (I saw this: +ASC_STAM{w1/w2/../wn} (Stamatakis' method with per-state invariable site numbers w1 w2 ... wn).)

Do you have any suggestions?

Thanks,

Jiaqi

Alexandros Stamatakis

Jun 9, 2021, 1:24:00 AM

to ra...@googlegroups.com

Dear Jiaqi,

> I am working on both ancient and modern samples and therefore my data
> set contains a lot of missing sites.
> Currently I have a vcf file and a snp-only Multiple Sequence Alignment.
> However, per-sample data missingness can range from 10% to 90% ( ancient
> samples have a lot of missing sites as "N" or "-").

So the 10% to 90% missing data refers to the SNP-only dataset?

> I would like to use
> those variant sites to construct a ML tree, but I am wondering if my
> missing data will be a big issue when using raxml-ng with Ascertainment
> bias correction,

Missing data can affect the reconstruction because it is, well, just
missing (i.e., no signal is present). Apart from that, whether or not
you use ascertainment bias correction should make no difference to how
RAxML-NG handles missing data.

> and How should I estimate the base frequencies? ( I saw
> this : +ASC_STAM{w1/w2/../wn} (Stamatakis' method with per-state
> invariable site numbers w1 w2 ... wn))
> Do you have any suggestions

Those are not base frequencies but the numbers of invariable sites in
the original large MSA: w1 ... wn are the counts of sites that consisted
entirely of As, Cs, Gs, and Ts, respectively.
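
For illustration, those per-state counts could be obtained from the full MSA roughly as follows. This is only a minimal sketch under stated assumptions: the function names are made up for this example, a column is counted only if every sample shows the same unambiguous base (columns containing "N", "-" or any other character are skipped), and the A/C/G/T order of the numbers in the model suffix should be checked against the RAxML-NG documentation.

```python
# Hypothetical helpers, not part of RAxML-NG.

def invariant_site_counts(sequences):
    """Count alignment columns that consist entirely of A, C, G or T.

    sequences: list of equal-length aligned strings (the full MSA).
    Assumption: a column with any missing or ambiguous character
    is not counted as invariable.
    """
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    for column in zip(*sequences):
        observed = {ch.upper() for ch in column}
        if len(observed) == 1:
            base = next(iter(observed))
            if base in counts:
                counts[base] += 1
    return counts


def asc_stam_suffix(counts):
    """Format the counts as a +ASC_STAM{w1/w2/w3/w4} model suffix.

    Assumption: the states are listed in the order A, C, G, T.
    """
    return "+ASC_STAM{%d/%d/%d/%d}" % (
        counts["A"], counts["C"], counts["G"], counts["T"])
```

The resulting suffix would be appended to the substitution model used for the SNP-only alignment. How columns that are invariable apart from missing data should be treated is a judgment call that this sketch does not settle.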

If you don't want to compute them, you can also infer a tree without
asc. bias correction on the full MSA. The sites consisting entirely of
As, Cs, Gs and Ts will be compressed automatically in this case, so the
computations will only take a tiny bit longer.

This was our conclusion from that ascertainment bias correction paper
anyway: if you have the full MSA (and not only SNPs) just use it :-)

Alexis


--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
Affiliated Scientist, Evolutionary Genetics and Paleogenomics (EGP) lab,
Institute of Molecular Biology and Biotechnology, Foundation for
Research and Technology Hellas

www.exelixis-lab.org

Grimm

Jun 9, 2021, 8:46:57 AM

to raxml

Hi Jiaqi,

A further practical tip, especially since you are working with ancient (aDNA? fossils via morpho-partition?) samples.

First infer a backbone topology from the well-sampled OTUs (tip taxa; you may drop near-identical tips as well), then use RAxML's evolutionary placement algorithm (EPA) to place the dropped OTUs, i.e. treat them as queries. This lets you see which of the poorly sampled tips may be ancestral (closer to the root) and which are trivial, signal-wise. Ancestral tips will be placed in the deep (root-proximal) parts of the backbone topology, usually with relatively low, split LWRs (likelihood weight ratios) if they are actual ancestors, and act as rogues during tree inference when included. Trivial accessions will be placed with high, near-unambiguous LWRs within the terminal subtrees.

Using a backbone topology avoids getting lost in tree space (which can happen depending on which alignment patterns are missing in the poorly sampled accessions). It also greatly speeds up finding a first tree, while still letting you assess the affinity of all the other tips, which would slow down the analysis and may introduce topological ambiguity. For the final paper, the reviewers will of course insist on the full tree, no matter how unstable it is :) But to get an idea of what one's Swiss-cheese matrix can possibly resolve, trying to infer a full tree is usually a waste of time when some tips cover nearly 100% of the alignment patterns and others only 10%.

The simplest test of the effect of missing-data accessions is to infer a series of trees using different missing-data cut-offs. You may also want to give RogueNaRok a try, to test prior to tree inference how ambiguous an OTU's signal will be, i.e. how roguish your poorly sampled accessions may act during tree inference.
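
The cut-off series could be prepared along these lines. This is a rough sketch with hypothetical helper names: it assumes the alignment is already loaded as a dict mapping sample name to aligned sequence, and that "N", "-" and "?" all mean missing.

```python
# Hypothetical helpers for building one sub-alignment per cut-off.

MISSING = frozenset("N-?")  # assumption: these characters code missing data


def missing_fraction(seq):
    """Fraction of sites in one aligned sequence that are missing."""
    if not seq:
        return 1.0
    return sum(ch.upper() in MISSING for ch in seq) / len(seq)


def subsets_by_cutoff(alignment, cutoffs):
    """Return {cutoff: sub-alignment}, keeping only samples whose
    missing fraction is at most the cut-off.

    alignment: dict mapping sample name -> aligned sequence.
    """
    fractions = {name: missing_fraction(seq)
                 for name, seq in alignment.items()}
    return {
        cutoff: {name: alignment[name]
                 for name, frac in fractions.items() if frac <= cutoff}
        for cutoff in cutoffs
    }
```

Each sub-alignment would then be written out and analysed separately, and the resulting trees compared. The same per-sample measure could also be used to pick the well-sampled OTUs for the backbone and leave the rest as queries. Note that dropping samples from a SNP-only alignment can turn some columns invariable, so these would probably need to be filtered again before a run with ascertainment bias correction.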