ancestral states and gaps

137 views
Skip to first unread message

Krister Swenson

unread,
Jun 14, 2023, 7:18:41 AM6/14/23
to raxml
Hi Alexis!

I'm using raxml to estimate the probability distribution on the states of characters for internal nodes in the tree.

Looking in the `ancestralProbs` files there are only 4 columns with probabilities, which sum to 1.  The State column, however, does have a gap character, corresponding to the gaps seen in the `ancestralStates` file.

Could you please describe to me the conditions under which a gap is inferred as the ancestral state, and why there isn't a fifth p_GAP column in the `ancestralProbs` file?

Thanks.

Alexandros Stamatakis

unread,
Jun 14, 2023, 8:18:52 AM6/14/23
to ra...@googlegroups.com
Hi Krister,

You should switch to using RAxML-NG as it actually implements a full
ancestral state reconstruction using Ziheng Yang's algorithm.

Personally, I would not use the old RAxML ancestral state implementation
any more.

If you insist though, I can look up and try to understand what I
implemented in standard RAxML more than a decade ago :-)

Alexis
> --
> You received this message because you are subscribed to the Google
> Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to raxml+un...@googlegroups.com
> <mailto:raxml+un...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/raxml/395a5fba-9d00-48e8-b7a0-a460f968c44fn%40googlegroups.com <https://groups.google.com/d/msgid/raxml/395a5fba-9d00-48e8-b7a0-a460f968c44fn%40googlegroups.com?utm_medium=email&utm_source=footer>.

--
Alexandros (Alexis) Stamatakis

ERA Chair, Institute of Computer Science, Foundation for Research and
Technology - Hellas
Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.biocomp.gr (Crete lab)
www.exelixis-lab.org (Heidelberg lab)

Grimm

unread,
Jun 14, 2023, 10:29:17 AM6/14/23
to raxml
Hi Krister,

we may save Alexi the dive-back-into-time. What are the four p values that lead to a gap in the state column? I may have some fresher memories regarding what my beloved old-RAxML came up with.

It had a logic back then but, as Alexi says, you may want to change to RAxML-NG; especially if you work with nucleotide data (?!, hence 4 columns).

Cheers, Guido.

Krister Swenson

unread,
Jun 15, 2023, 7:13:25 AM6/15/23
to raxml
Oh man... sorry for the confusion.  I'm using RAxML-NG v. 0.9.0.
I didn't know anyone still used the original raxml.

The output of the ancestral reconstruction, with raxml-ng, gives a file with the 4 nucleotide probabilities (that sum to 1). So it seems like in the following example, although a gap has 0 probability in the model it is inferred as being the state for site 89.

Node       Site    State      p_A                   p_C                  p_G                  p_T
2156_X1 89        -             0.238338990 0.248995661 0.309204388 0.203460961

I'm wondering how raxml has decided that this should be a gap.  I have subtrees where a certain column has nothing but gaps, so it would be appropriate  that the ancestral sequence has a gap, but it is strange to see probabilities that sum to 1 without gaps being considered.

Thanks for the help.

Grimm

unread,
Jun 15, 2023, 7:35:43 AM6/15/23
to raxml
There are some traditionalists out there :)

The standard ML implementation treats gaps like missing data (N or ?); if the probability for each nucleotide is the same, like here, this would result in an N/- as the reconstructed ancestral state meaning the ancestral state reconstruction has been ambiguous. Classic RAxML would have done this exactly like this for this case, for (near-)equal p(X) the returned "State" would be "-".

But I have to pass this on to the programmers, whether RAxML-ng should (still) do this. It would be indeed easier to grasp for biological users, if it pops out an N as many of us don't know that Ns equal gaps in standard ML frameworks (there has been a lot of likewise questions here on the RAxML GoogleGroup). But from an inference point of view, there's simply no difference between Ns and gaps when we work with the standard nucleotide substitution model. There is (should be) a difference, when we e.g. go for genotype/mixing models and reconstruct ancestral polymorphisms; a gap would have the tip-probability vector p(0,0,0,0), while an N would have p(1,1,1,1); and a R (A or G) e.g. p(1,0,1,0)

/G

PS If you want to treat the gaps as an explicit fifth state, you can recode your matrix as multistate (A = 0, C = 1, G = 2, T = 3, - = 4) for RAxML. RAxML-ng provides a very sophisticated to analyse multistate data, i.e. you could provide with an according substitution matrix reflecting the standard optimised 4x4 nucleotide matrix and adding a probability to lose a nucleotide.

Krister Swenson

unread,
Jun 15, 2023, 9:38:21 AM6/15/23
to raxml
Thank for the tips Guido!
The genotype/mixing models are more like what I was expecting, but it sounds like I can do something with a custom matrix.

I wonder if raxml-ng infers an ancestral state of '-' when the probabilities are all about the same, as classic raxml did, or if there is something new?

Alexandros Stamatakis

unread,
Jun 15, 2023, 9:48:44 AM6/15/23
to ra...@googlegroups.com
Hi Krister,

Please wait until Alexey who designed RAxML-NG gets back from vacation,
he can elucidate on how he implemented this.

Alexis
> https://groups.google.com/d/msgid/raxml/395a5fba-9d00-48e8-b7a0-a460f968c44fn%40googlegroups.com <https://groups.google.com/d/msgid/raxml/395a5fba-9d00-48e8-b7a0-a460f968c44fn%40googlegroups.com> <https://groups.google.com/d/msgid/raxml/395a5fba-9d00-48e8-b7a0-a460f968c44fn%40googlegroups.com?utm_medium=email&utm_source=footer <https://groups.google.com/d/msgid/raxml/395a5fba-9d00-48e8-b7a0-a460f968c44fn%40googlegroups.com?utm_medium=email&utm_source=footer>>.
>
> --
> Alexandros (Alexis) Stamatakis
>
> ERA Chair, Institute of Computer Science, Foundation for
> Research and
> Technology - Hellas
> Research Group Leader, Heidelberg Institute for
> Theoretical Studies
> Full Professor, Dept. of Informatics, Karlsruhe
> Institute of Technology
>
> www.biocomp.gr <http://www.biocomp.gr> (Crete lab)
> www.exelixis-lab.org <http://www.exelixis-lab.org>
> (Heidelberg lab)
>
> --
> You received this message because you are subscribed to the Google
> Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to raxml+un...@googlegroups.com
> <mailto:raxml+un...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/raxml/4e4f1ed4-b5b7-4403-b2ff-4400e2d65f43n%40googlegroups.com <https://groups.google.com/d/msgid/raxml/4e4f1ed4-b5b7-4403-b2ff-4400e2d65f43n%40googlegroups.com?utm_medium=email&utm_source=footer>.

Grimm

unread,
Jun 15, 2023, 10:05:06 AM6/15/23
to raxml
Was about to write, there'll be soon an answer from Alexi and Alexej :)

In case you can't stand waiting, there's an easy empirical check for the thresholds and the state they return by just tabulating the p(X) against the state info. If you only have A,C,G,T,- as ancestral states, split probabilities are not translated into ambiguity codes. If R,Y,etc. pop up, there'll be thresholds, at which point p(X) trigger a different ambiguous state and at which point, e.g. all p(X) > 0.2, the gap state is returned.

If you expect past polymorphism in your data and the output only has the five states but no ambiguity codes, it should be ok to translate the resulting p(X) vectors into such codes rather than to just take the returned state from the according column. Note that the differential p(X) are the actual result of the algorithm, the state column is mainly an accomodity for applicants, a sort of quick translation of the differential p(X) profiles for each site.

/G

Krister Swenson

unread,
Jun 15, 2023, 11:55:30 AM6/15/23
to raxml
Thanks, you two.

I can wait until vacation is over to hear the exact details, no problem.

Oleksiy Kozlov

unread,
Jun 19, 2023, 8:21:54 AM6/19/23
to ra...@googlegroups.com
Hi Krister,

sorry for the delay!

Indeed, there is a threshold

prob_eps = 0.5 / num_states

and all states with estimated probabilities above (max_state_prob - prob_eps) are considered part of
the ambiguous ancestral state. In your example, all four states have probabilities higher than

0.309204388 - 0.5 / 4 = 0.184204388

and hence undetermined ancestral state (="gap") is returned.

Please see here for more implementation details:

https://github.com/amkozlov/raxml-ng/blob/master/src/AncestralStates.cpp#L80

Hope this helps,
Oleksiy
> https://groups.google.com/d/msgid/raxml/395a5fba-9d00-48e8-b7a0-a460f968c44fn%40googlegroups.com <https://groups.google.com/d/msgid/raxml/395a5fba-9d00-48e8-b7a0-a460f968c44fn%40googlegroups.com> <https://groups.google.com/d/msgid/raxml/395a5fba-9d00-48e8-b7a0-a460f968c44fn%40googlegroups.com <https://groups.google.com/d/msgid/raxml/395a5fba-9d00-48e8-b7a0-a460f968c44fn%40googlegroups.com>> <https://groups.google.com/d/msgid/raxml/395a5fba-9d00-48e8-b7a0-a460f968c44fn%40googlegroups.com?utm_medium=email&utm_source=footer <https://groups.google.com/d/msgid/raxml/395a5fba-9d00-48e8-b7a0-a460f968c44fn%40googlegroups.com?utm_medium=email&utm_source=footer> <https://groups.google.com/d/msgid/raxml/395a5fba-9d00-48e8-b7a0-a460f968c44fn%40googlegroups.com?utm_medium=email&utm_source=footer <https://groups.google.com/d/msgid/raxml/395a5fba-9d00-48e8-b7a0-a460f968c44fn%40googlegroups.com?utm_medium=email&utm_source=footer>>>.
> >
> > --
> > Alexandros (Alexis) Stamatakis
> >
> > ERA Chair, Institute of Computer Science, Foundation for
> > Research and
> > Technology - Hellas
> > Research Group Leader, Heidelberg Institute for
> > Theoretical Studies
> > Full Professor, Dept. of Informatics, Karlsruhe
> > Institute of Technology
> >
> > www.biocomp.gr <http://www.biocomp.gr> <http://www.biocomp.gr <http://www.biocomp.gr>>
> (Crete lab)
> > www.exelixis-lab.org <http://www.exelixis-lab.org> <http://www.exelixis-lab.org
> <http://www.exelixis-lab.org>>
> > (Heidelberg lab)
> >
> > --
> > You received this message because you are subscribed to the Google
> > Groups "raxml" group.
> > To unsubscribe from this group and stop receiving emails from it, send
> > an email to raxml+un...@googlegroups.com
> > <mailto:raxml+un...@googlegroups.com>.
> > To view this discussion on the web visit
> >
> https://groups.google.com/d/msgid/raxml/4e4f1ed4-b5b7-4403-b2ff-4400e2d65f43n%40googlegroups.com <https://groups.google.com/d/msgid/raxml/4e4f1ed4-b5b7-4403-b2ff-4400e2d65f43n%40googlegroups.com> <https://groups.google.com/d/msgid/raxml/4e4f1ed4-b5b7-4403-b2ff-4400e2d65f43n%40googlegroups.com?utm_medium=email&utm_source=footer <https://groups.google.com/d/msgid/raxml/4e4f1ed4-b5b7-4403-b2ff-4400e2d65f43n%40googlegroups.com?utm_medium=email&utm_source=footer>>.
>
> --
> Alexandros (Alexis) Stamatakis
>
> ERA Chair, Institute of Computer Science, Foundation for Research and
> Technology - Hellas
> Research Group Leader, Heidelberg Institute for Theoretical Studies
> Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
>
> www.biocomp.gr <http://www.biocomp.gr> (Crete lab)
> www.exelixis-lab.org <http://www.exelixis-lab.org> (Heidelberg lab)
>
> --
> You received this message because you are subscribed to the Google Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to
> raxml+un...@googlegroups.com <mailto:raxml+un...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/raxml/3244cede-2441-4324-b9d4-8f9662986f93n%40googlegroups.com
> <https://groups.google.com/d/msgid/raxml/3244cede-2441-4324-b9d4-8f9662986f93n%40googlegroups.com?utm_medium=email&utm_source=footer>.

Krister Swenson

unread,
Jun 19, 2023, 8:41:14 AM6/19/23
to raxml
Thanks for clearing that up Oleksiy.
Reply all
Reply to author
Forward
0 new messages