RNA models and bootstrap values of trees inferred with RNA models

55 views
Skip to first unread message

Vân Nguyễn

unread,
Mar 12, 2025, 12:23:26 AMMar 12
to raxml

Hi everyone,

I’m working on a project to infer phylogenetic trees while incorporating RNA secondary structure. I’ve gone through the PHASE documentation, but I’m still unclear on how the RNA models (6/7/16) function. From what I understand, these models seem to focus on the stem regions. Do they also account for helices, and if so, how?

Additionally, I’d appreciate any insights on how bootstrap values are calculated for RNA-based trees.

Thanks in advance for your help! I look forward to hearing back from you.

Best,

Van


Alexandros Stamatakis

unread,
Mar 16, 2025, 12:51:02 AMMar 16
to ra...@googlegroups.com, Alexandros...@gmail.com
Dear Van,

> I’m working on a project to infer phylogenetic trees while incorporating
> RNA secondary structure. I’ve gone through the PHASE documentation, but
> I’m still unclear on how the RNA models (6/7/16) function. From what I
> understand, these models seem to focus on the stem regions. Do they also
> account for helices, and if so, how?

I assume you are referring to my old implementation in standard RAxML.

It focuses on the Stem region, so two complementary bases in the stem
are transformed into one state which in the most general case has 4 x 4
possible states, hence the 16 x 16 substitution matrix. The other models
result from some restrictions on this general 16 x 16 state model that
are biologically inspired. I would need to look this up again myself as
I implemented these models about 10 years ago, but this is the general
idea.

> Additionally, I’d appreciate any insights on how bootstrap values are
> calculated for RNA-based trees.

You just apply the normal bootstrap procedure. I would assume (I would
need to look it up in the code) that in stem regions, the linked
complementary sites are drawn at random as one unit with replacement and
the sites not in a stem just like in a standard bootstrap for DNA data.

> Thanks in advance for your help! I look forward to hearing back from you.

Hope this helps a bit.

Alexis

>
> Best,
>
> Van
>
>
> --
> You received this message because you are subscribed to the Google
> Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to raxml+un...@googlegroups.com
> <mailto:raxml+un...@googlegroups.com>.
> To view this discussion visit
> https://groups.google.com/d/msgid/raxml/82d6d550-758a-44c6-b54f-e25ffd84fa3an%40googlegroups.com <https://groups.google.com/d/msgid/raxml/82d6d550-758a-44c6-b54f-e25ffd84fa3an%40googlegroups.com?utm_medium=email&utm_source=footer>.

--
Alexandros (Alexis) Stamatakis

ERA Chair, Institute of Computer Science, Foundation for Research and
Technology - Hellas
Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.biocomp.gr (Crete lab)
www.exelixis-lab.org (Heidelberg lab)

Bui Quang Minh

unread,
Mar 18, 2025, 12:50:26 AMMar 18
to raxml
Hi Alexis, Thank you for the quick response!

I have a couple of follow-up questions (Van is my PhD student).

1. I can understand that you use RNA models for the stem regions, but what happens to the non-stem regions? Do you consider them (i.e. using a DNA model) or ignore them from phylogenetic reconstruction? If the former, I can imagine that you treat the data as a two-partitioned alignment, a stem partition (where you apply RNA model) and a non-stem partition (where you apply a DNA model), and you link the branch lengths between the two partitions? 

2. How do you treat gap/missing data in the RNA model? For example, with a simple pairwise sequence setting, what is the likelihood of going from the pair "AN" to "GC"? I could imagine an analogy to the DNA model, that you would sum up the likelihood of four possible transitions by replacing "N" or "-" with proper state. In this case it would be the sum of four likelihoods AA -> GC, AC-> GC, AG->GC and AU->GC? But this is just my guess. 

Thank you for digging out the 10-year old code ;-) I'm sorry if that has been asked before.

Minh

Alexandros Stamatakis

unread,
Mar 18, 2025, 3:09:35 AMMar 18
to ra...@googlegroups.com
Dear Minh,

How are you ?

> 1. I can understand that you use RNA models for the stem regions, but
> what happens to the non-stem regions? Do you consider them (i.e. using a
> DNA model)

I just use standard DNA models for the non-stem regions.

> or ignore them from phylogenetic reconstruction? If the
> former, I can imagine that you treat the data as a two-partitioned
> alignment, a stem partition (where you apply RNA model) and a non-stem
> partition (where you apply a DNA model),

yes.

> and you link the branch lengths
> between the two partitions?

I guess so, but I would need to look this up in the code.

> 2. How do you treat gap/missing data in the RNA model? For example, with
> a simple pairwise sequence setting, what is the likelihood of going from
> the pair "AN" to "GC"? I could imagine an analogy to the DNA model, that
> you would sum up the likelihood of four possible transitions by
> replacing "N" or "-" with proper state. In this case it would be the sum
> of four likelihoods AA -> GC, AC-> GC, AG->GC and AU->GC? But this is
> just my guess.

I assume this is what I might be doing.

> Thank you for digging out the 10-year old code ;-) I'm sorry if that has
> been asked before.

This has never been asked before. So I guess you'd like me to look into
the code and confirm our conjectures, correct?

Alexis
> https://groups.google.com/d/msgid/raxml/82d6d550-758a-44c6-b54f-e25ffd84fa3an%40googlegroups.com <https://groups.google.com/d/msgid/raxml/82d6d550-758a-44c6-b54f-e25ffd84fa3an%40googlegroups.com> <https://groups.google.com/d/msgid/raxml/82d6d550-758a-44c6-b54f-e25ffd84fa3an%40googlegroups.com?utm_medium=email&utm_source=footer <https://groups.google.com/d/msgid/raxml/82d6d550-758a-44c6-b54f-e25ffd84fa3an%40googlegroups.com?utm_medium=email&utm_source=footer>>.
>
> --
> Alexandros (Alexis) Stamatakis
>
> ERA Chair, Institute of Computer Science, Foundation for Research and
> Technology - Hellas
> Research Group Leader, Heidelberg Institute for Theoretical Studies
> Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
>
> www.biocomp.gr <http://www.biocomp.gr> (Crete lab)
> www.exelixis-lab.org <http://www.exelixis-lab.org> (Heidelberg lab)
>
> --
> You received this message because you are subscribed to the Google
> Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to raxml+un...@googlegroups.com
> <mailto:raxml+un...@googlegroups.com>.
> To view this discussion visit
> https://groups.google.com/d/msgid/raxml/cfce44fa-c7f8-43b9-a41a-6442e5bf7a88n%40googlegroups.com <https://groups.google.com/d/msgid/raxml/cfce44fa-c7f8-43b9-a41a-6442e5bf7a88n%40googlegroups.com?utm_medium=email&utm_source=footer>.

Grimm

unread,
Mar 18, 2025, 5:32:06 AMMar 18
to raxml
Hi, a little add-on to Alexi's answer.

Re: DNA-model for non-stem RNA regions. If you want a perfect-as-possible model, you could partition the non-stem regions into terminal loops and others. Terminal loops of conserved stems are typically short but sequentially conserved within lineages (they can actually be used as barcodes within taxonomic groups of various hierarchies, pending the organism one is studying), while the non-stem regions can be diverse regarding both their nucleotide composition, structural constraints and sequences.

Re: Ignoring them for phylogenetic reconstruction: ML is robust against missing data, so gaps don't matter a lot. However, it can be tricky to ascertain homology pending how broad the taxonomic coverage of a data set is for the less-conserved parts. For instance, the V-Regions of the LSU (25S/28S rRNA genes) may encode for structurally diverse extra stems and large loopy bits (AT-rich strands), which can be hard to align across the entire dataset but very informative for groups of taxa (phylogenetic lineages) whithin a large data set. The simplest way to deal with this is just try&error:
  1. Infer a tree on the total data, using two/three partitions (stem - non-stem, stem - terminal loops - other non-stem)
  2. Infer a tree only on the stem-regions (using rRNA substitution model vs. standard nucleotide model, see Note below)
If 2 loses too much towards the leaves but 1 is hard-to-accomplish because of homology issues within V-Regions and alike, a supertree approach can be an option. Use the total alignment to infer a backbone tree based on the stem-regions and subsets to optimise trees for groups of taxa to populate the backbone tree. 

Re: How do you treat gap/missing data in the RNA model? What Alexi describes is what the model does. However, you cannot have gaps in a stem/pairing to be modelled. You may have extra basepairs in stems, some of which a (functionally, RNA maturation process) strongly constrained and sequentially non-random, i.e. their nucleotide relates to the preceding and/or subsequent pair (e.g. the extra nucleotide at the end of the 5.8S rRNA, which are otherwise strictly paired with the 5' end of the 25S rRNA, see attached pic: it's GCG|CC or GUG|CC, but never GGG|CC and rarely GAG|CC). Evolution-wise, you don't look at a substitution from AN to GC, what you may look at is a substitution of AA (paired UU) to GCG (paired CC). The correct treatment would hence be to exclude these extra nucleotides.

However, for real-world data, their effect on the tree inference is necligible because they are rare enough and sequentially equally conserved as the pairing nucleotides of their stems. Note: A reason why rRNA models haven't been overly used may be that for many groups of organisms, it doesn't matter a lot to change from the standard substitution model to the rRNA substitution models, but it matters more how sequentially conserved a structural aspect is. See attached pic for the 5.8S rRNA: the most-conserved bits are the yellow highlighted ones, which define start and end of the 5.8S rRNA genes and strictly paired with bits of the 25S/28S rRNA (reason being, they used to be one unit in the dawn of time; the ITS2 evolved from a V-Region of the primordial LSU). But the sequence of the 3' end is more conserved than that on the 5' end. The longer of the two "pins" to the 3' end is more conserved in its small terminal loop sequence (which can be 4 or 5 nt-long) than in its stem, the "apple guy" to their left shows very little length-polymorphism and while mutations accumulate in the large terminal loop forming it's head, the sequences making up the small loop forming up its left "foot" and its "belly" are again pretty conserved within angiosperm families and even orders. Finally, there are very few mutations in the unpaired base pairs of the ground loop(s) of the 5.8S rRNA. In the 25S rRNA mutations are much more frequent in the V-regions (which include loops and stems) than in the core structure regions (internal loops and critical stems) conserved across all metazoans. I.e. partitioning into structural regions (e.g. conserved core structure vs. V-Regions) possibly matters more for the tree inference than using rRNA models for stem regions and 4x4 substitution model for non-stem regions. Maybe something worthwile testing for your data.

Good inferences.

ExtraNucleotidesInStems.png

Bui Quang Minh

unread,
Mar 18, 2025, 5:49:48 PMMar 18
to ra...@googlegroups.com
Hi Alexis,

Yes it'd be great if you can confirm these with your code.

Thanks!
Minh
> You received this message because you are subscribed to a topic in the Google Groups "raxml" group.
> To unsubscribe from this topic, visit https://groups.google.com/d/topic/raxml/6bPdSuFjWCU/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to raxml+un...@googlegroups.com.
> To view this discussion visit https://groups.google.com/d/msgid/raxml/af5e2de3-3940-4af9-bce6-9a6e105e92f5%40gmail.com.

Alexandros Stamatakis

unread,
Mar 20, 2025, 4:08:46 AMMar 20
to ra...@googlegroups.com
Dear Minh and Van,

I looked at the code.

So what happens is that standard RAXML parses the secondary structure
file (format is specified in the old RAxML version 8 manual) and based
on this info internally splits the input data into two partitions, one
containing the normal DNA data in the loops and one containing the 6, 7,
or 16-state models.

So internally, an additional partition is being created. The two
partitions are then bootstrapped independently and the branch lengths
should be linked by default.

You can grep for "SECONDARY" in the standard RAxML source files to find
all respective operations.

I also discussed this with Alexey yesterday and we both think that it
would make more sense to move the partitioning to an external tool that
then just passes two partitions to the downstream ML tool.

If you look at my old code it is a very ugly index war that you may want
to avoid.

Hope this helps,

Alexis

Bui Quang Minh

unread,
Mar 27, 2025, 1:48:53 AMMar 27
to raxml
Hi Alexis, Thank you for the confirmation! I agree, better have an additional tool to prepare input partition file for RAxML.

@Grimm: Thanks for the additional bonus info, that's very helpful ;-)

Minh

Reply all
Reply to author
Forward
0 new messages