Ancestral state estimate - undetermined columns

Vanessa Bieker

unread,

Sep 6, 2022, 8:40:55 AM9/6/22

to raxml

Hi!

I would like to estimate the ancestral state for a sequence alignment. The issue is that I have several regions where all my sequences in the alignment have Ns. As these are removed from the RAxML analysis, the resulting ancestral state does not have the same length as the original alignment. Is there any way to keep those undetermined sites so that the ancestral state has the same length as the original alignment?

Alexandros Stamatakis

unread,

Sep 6, 2022, 9:23:19 AM9/6/22

to ra...@googlegroups.com

Are you using RAxML or RAxML-NG?

Alexis

> --
> You received this message because you are subscribed to the Google
> Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to raxml+un...@googlegroups.com
> <mailto:raxml+un...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/raxml/729ccbd3-6907-43b7-b352-ed1caaeef309n%40googlegroups.com
> <https://groups.google.com/d/msgid/raxml/729ccbd3-6907-43b7-b352-ed1caaeef309n%40googlegroups.com?utm_medium=email&utm_source=footer>.

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
Affiliated Scientist, Evolutionary Genetics and Paleogenomics (EGP) lab,
Institute of Molecular Biology and Biotechnology, Foundation for
Research and Technology Hellas

www.exelixis-lab.org

Vanessa Bieker

unread,

Sep 6, 2022, 2:29:13 PM9/6/22

to raxml

I am using RAxML-NG.

Thanks for the fast reply!

Alexandros Stamatakis

unread,

Sep 8, 2022, 2:03:35 AM9/8/22

to ra...@googlegroups.com

you could try with the following option:

--force [ <CHECKS> ] disable safety checks
(please think twice!)

Alexis

> <https://groups.google.com/d/msgid/raxml/729ccbd3-6907-43b7-b352-ed1caaeef309n%40googlegroups.com?utm_medium=email&utm_source=footer

> <https://groups.google.com/d/msgid/raxml/729ccbd3-6907-43b7-b352-ed1caaeef309n%40googlegroups.com?utm_medium=email&utm_source=footer>>.
>
>
> --
> Alexandros (Alexis) Stamatakis
>
> Research Group Leader, Heidelberg Institute for Theoretical Studies
> Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
> Affiliated Scientist, Evolutionary Genetics and Paleogenomics (EGP)
> lab,
> Institute of Molecular Biology and Biotechnology, Foundation for
> Research and Technology Hellas
>

> www.exelixis-lab.org <http://www.exelixis-lab.org>

>
> --
> You received this message because you are subscribed to the Google
> Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to raxml+un...@googlegroups.com
> <mailto:raxml+un...@googlegroups.com>.
> To view this discussion on the web visit

> https://groups.google.com/d/msgid/raxml/d0adfe77-9540-4498-a564-5401a3fbd0b5n%40googlegroups.com
> <https://groups.google.com/d/msgid/raxml/d0adfe77-9540-4498-a564-5401a3fbd0b5n%40googlegroups.com?utm_medium=email&utm_source=footer>.

Grimm

unread,

Sep 8, 2022, 8:13:32 AM9/8/22

to raxml

Hi Vanessa,

May I ask why you want to keep the entirely undetermined columns at all?

Are you using an extended alignment with undetermined partitions because you want to re-intergrate the (sub?)alignment used in the ancestral state reconstruction (or the reconstructed ancestral sequence) into a larger master aligment, hence, the need for total-length conservation?

Cheers, Guido

Vanessa Bieker

unread,

Sep 10, 2022, 5:26:32 AM9/10/22

to raxml

@Alexis Thanks, I will give it a try.

@Grimm I have an alignment of a whole chromosome for several closely related species based on shotgun sequencing data mapped to a reference genome ignoring indels and deletions. I would like to estimate the ancestral state and then use it to polarise my site frequency spectrum for several downstream analysis. For that, the ancestral state and the reference sequence need to be exactly the same length.

Cheers,

Vanessa

Grimm

unread,

Sep 12, 2022, 8:48:08 AM9/12/22

to raxml

Hi Vanessa,

I see. But you cannot polarise with an ambiguous ancestral state anyhow. One reason Alexi said "think twice" before turning off the safeguards. Another may be inflating computation times because you feed the algorithm with indiscriminative data. Always keep in mind that all gaps/? are treated as N = A/C/G/T when using nucleotides in a ML framework; i.e. for the algorithms completely (or nearly) indetermined columns don't represent "no data" but "anything possible" data.

Is there any reason to not just eliminate the gappy sites – those only known for the reference but not all or most of the chromosome assemblies mapped on it – before doing the ancestral state reconstruction, and base the frequency spectrum only on the (supposedly) homologous sites rather than the entire chromosome strand? Spontaneously, I don't see what one could learn from site frequencies in an indel/length-polymorphic region of a chromosome, where we (because of the length-polymorphism, if anything other than just duplications or deletions) cannot be sure about homology of individual nucleotides at all.

A general question is always whether one really needs to include the reference at all for the ancestral state reconstruction and your downstream analyses when looking a closely related tip set. Is the reference from the ingroup itself or is it from a sister/cousin lineage? I.e. Outgroup-ish? Note that when working with closely related tips, outgroups rarely provide any critical information. There is even the risk that (too distant, given the data at hand) outgroups may be harmful. If the reference is not from your ingroup, it's always a good idea to run a tree with and without it, to see if any ingroup branches are affected by inclusion of the outgroup/reference. The (unrooted) ingroup-only tree must be identical to the according (unrooted) subtree in the ingroup+outgroup tree, otherwise the reference/outgroup is problematic.

Good luck for the reconstruction, Guido

Vanessa Bieker

unread,

Sep 12, 2022, 11:38:12 AM9/12/22

to raxml

Hi Guido,

Thanks a lot for your detailed and very helpful answer.

The reference is from an ingroup species and the ingroup is an adaptive radiation, so they are relatively recent and thus closely related. The main reason for wanting to include the gaps in the ancestral state is because I have more samples than those in the alignment. But these additional samples are sequenced to a lower depth, so I'm using genotype likelihoods in angsd for most analysis. So the site frequency spectrum is based on the bam files rather than a called genotypes from a vcf. The ambigious sites are excluded from the analysis but need to be included in the ancestral state fasta file to fit with the positions in the bam file.

One thing I noticed when I tested my command with the --force option is that the ancestral state gives a D instead of an N at completely undetermined positions.

Cheers,

Vanessa

Grimm

unread,

Sep 12, 2022, 12:18:27 PM9/12/22

to raxml

Hi Vanessa,

The ambigious sites are excluded from the analysis but need to be included in the ancestral state fasta file to fit with the positions in the bam file.

Then the switching off safeguards is pretty much the easiest thing to do.

One thing I noticed when I tested my command with the --force option is that the ancestral state gives a D instead of an N at completely undetermined positions.

That's to be expected because each site with completely undetermined positions has the same tip probability vector in all tips (that's the main difference between parsimony trait mapping and ML). The tips' probability vector p for N under ML is p(1,1,1,1); the substitution models we typically use and optimise on the fly within the pre-defined possible model space include the information about the overall frequency of As, Cs, Gs and Ts in the input data and their probility to be substituted by each other, as modelled over the tree considering also the input branch-lengths. In parsimony, if the tips are N all interior nodes (the hypothetical MRCA's) will be N, same state than all tips, irrespective of any root-tip distance along the tree. Under ML the probability to maintain an N may undergo subtle changes because of our optimising of the model parameters, and differ from clade to clade.

If I remember correctly, the ML character mapping implemented in IQTree and RAxML-ng applies asymmetric substitution models, i.e. we may model a higher probability that e.g. C's change into A's,G's and T's than vice versa. Within the framework of the substitution model the probability of the ancestral state for C is then naturally smaller than being A, G, or T: p(<<0.25, ~0.33, ~0.33, ~0.33). Hence, a (deep) D ("not C"), even when all tips have p(1,1,1,1).

That we have some D, some N, maybe even another not-X ambiguity code is part of the magic of the Gamma distribution (if included in the model): allowing non-fixed per-site variation.

Alexey and Alexi may have a link showing at which point, node probability vector, the output choses an ambiguity code over a defined base or each other (couldn't find it in the github advanced tutorial using a very quick browse).

Also when using symmetric models, we may observe something like this but the effect should be smaller.

For your interpretation of the results, this is, however, irrelevant. Again to come back to Alexi's "think twice, if you do": The 2nd though in this case is, the D is a purely model-triggered result, but not based on your actual input data. So, we would not include any ancestral base for the all-N or most-N sites as a result of our data+analysis (e.g. including them in tabulations or graphical representations).

/G.

Grimm

unread,

Sep 12, 2022, 1:57:24 PM9/12/22

to raxml

Just noticed (got distracted by something else)

Must of course be p(~0.33, <<0.25, ~0.33, ~0.33) as reconstructed probability vector for a D, if we see the node states as a probability vector for A,C,G or T.

Vanessa Bieker

unread,

Sep 23, 2022, 4:26:16 AM9/23/22

to raxml

Hi Guido,

Thanks a lot for your detailed explanation! My background is in population genomics, so your answer has helped me a lot in getting a much better understanding in what RAxML is actually doing.

Best,

Vanessa

Guido

unread,

Sep 23, 2022, 4:45:01 AM9/23/22

to ra...@googlegroups.com

Hi Vanessa,

>
> Thanks a lot for your detailed explanation! My background is in
> population genomics, so your answer has helped me a lot in getting a
> much better understanding in what RAxML is actually doing.

You're welcome.

My background is geology-palaeontology and (just-pure) genetics ending
up in inter- and intra-species phylogenetics, using RAxML since it (the
programme) was a toddler (i.e. when Alexi had finished his Ph.D. and we
were introduced). Usually with data that fell/fall in the "hard" to
"impossible" category. Hence, I have the benefit of having been there
myself, lost in the tree-space (the mountain ranges and plains), while
knowing people, who could tell me why :)

Cheers, Guido

Reply all

Reply to author

Forward