RAxML Compatibility with 3Di Character Alignments and Likelihood Comparability


david ferreiro garcia

Apr 29, 2025, 5:15:15 AM
to raxml
Dear RAxML development team,

RAxML is traditionally designed to analyze alignments composed of the twenty standard amino-acid characters. Since the 3Di models employ a distinct structural-alphabet character set, could you kindly advise whether RAxML can correctly process an alignment encoded in 3Di characters when specified to use a model tailored to those characters? 

If so, are there any special settings or custom code modifications required to ensure that the program interprets and optimizes over the 3Di states rather than standard amino acids?

Assuming that RAxML can indeed handle 3Di-encoded alignments under the appropriate model, would the resulting likelihood values be directly comparable to those obtained from the corresponding amino-acid alignment (i.e., the same alignment positions, but represented in the standard twenty-state alphabet)? 

In other words, can differences in likelihood under the 3Di model versus an empirical amino-acid model be interpreted on the same scale, or should one apply corrections or normalization when contrasting these two analyses?

Thank you for your time,  

Best regards,

J Hengstler

Apr 30, 2025, 5:07:00 AM
to raxml

Hello,

RAxML can process arbitrary alphabets, among them 3Di, with a few special parameters and files.

I am not sure I remember the 3Di alphabet correctly: it has 20 states, but I do not know whether it uses the same letters as the amino-acid code. I will therefore describe the steps needed to define a custom state encoding (charmap.txt).
You need to define a mapping from your alignment characters to model characters. In your case the mapping should be one-to-one, simply renaming each 3Di character to an amino-acid character. The mapping is given in a file of the format described here.

The matrix should look like this for you:
Ade Cyt Gua Thy ...
A 1 0 0 0
B 0 1 0 0
C 0 0 1 0
D 0 0 0 1
...

The last line of this matrix defines gaps as ambiguous for every model character (see the example in the link above). Mind the header lines described in the linked wiki, which I did not include here.
If 3Di happens to use the same alphabet characters as the amino-acid code, you can skip this step.
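
A minimal sketch of how one might check whether the charmap step can be skipped, assuming the alignment is in FASTA format. The helper names (`read_fasta`, `non_amino_acid_symbols`), the gap symbols, and the example sequences are illustrative assumptions, not part of RAxML:

```python
# The 20 standard amino-acid one-letter codes.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
# Symbols treated as gaps or missing data (an assumption for this sketch).
GAP_SYMBOLS = set("-?")


def read_fasta(text):
    """Parse FASTA text into a dict mapping sequence name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def non_amino_acid_symbols(records):
    """Return the alignment characters that are outside the amino-acid code."""
    seen = set()
    for seq in records.values():
        seen.update(seq.upper())
    return seen - AMINO_ACIDS - GAP_SYMBOLS


# Made-up example alignment, for illustration only.
example = """>taxon1
DVVLCPQ-SD
>taxon2
DPVLVVQ-SD
"""
extra = non_amino_acid_symbols(read_fasta(example))
if extra:
    print("custom charmap needed for:", sorted(extra))
else:
    print("alphabet matches the amino-acid code")
```

If the returned set is empty, the alignment only uses amino-acid letters and the custom state encoding should not be necessary.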

Next, you need to provide the custom substitution model as a file in PAML format. You then specify this model on the command line as "PROTGTR{paml.txt}+M{charmap.txt}". An example of how the paml.txt file is structured can be found here. The file contains the lower triangle of the rate matrix, followed by the stationary frequencies of your 3Di model. If you don't have a custom state encoding, you don't have to append the +M part.
RAxML is not optimized for models with an uncommon number of states. Since 3Di happens to have 20 states like AA, it benefits from the same optimizations as any AA model (this is also why you can use PROTGTR for the model definition).
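
As a rough sketch of the file layout described above (lower triangle of the rate matrix, then the stationary frequencies), the following writes such text from a symmetric matrix. The function name `format_paml` is hypothetical, and the equal rates and frequencies are placeholders rather than a real 3Di model:

```python
def format_paml(rates, freqs):
    """Return PAML-style text: lower triangle of a symmetric rate matrix,
    a blank line, then the stationary frequencies."""
    n = len(freqs)
    if len(rates) != n or any(len(row) != n for row in rates):
        raise ValueError("rates must be an n x n matrix matching freqs")
    if abs(sum(freqs) - 1.0) > 1e-6:
        raise ValueError("stationary frequencies should sum to 1")
    lines = []
    # Row i holds the rates between state i and states 0..i-1.
    for i in range(1, n):
        lines.append(" ".join(f"{rates[i][j]:.6f}" for j in range(i)))
    lines.append("")  # blank line between the triangle and the frequencies
    lines.append(" ".join(f"{f:.6f}" for f in freqs))
    return "\n".join(lines) + "\n"


# Placeholder values only: equal rates and equal frequencies for 20 states.
n_states = 20
rates = [[0.0 if i == j else 1.0 for j in range(n_states)]
         for i in range(n_states)]
freqs = [1.0 / n_states] * n_states

paml_text = format_paml(rates, freqs)
# Save paml_text as paml.txt to use it in the model string.
print(len(paml_text.split()), "values written")
```

For 20 states this gives 190 rates plus 20 frequencies. The order of the states in the file has to match the order the model expects, so it is worth checking against the linked example before running an analysis.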

Lastly, likelihoods from different datasets cannot be compared, so the likelihood of the amino-acid data and that of the structural 3Di data are not comparable.

Kind regards,
Johannes Hengstler

david ferreiro garcia

Apr 30, 2025, 7:20:20 AM
to raxml

Hi Johannes,

Thank you very much for the detailed instructions on how to set up and run RAxML with a custom 3Di alphabet and model (charmap.txt + PROTGTR{…}+M{…}). I appreciate the clear steps for defining the character mapping and substitution matrix in PAML format.

I do have one follow-up question about the comparability of likelihoods: you noted that “likelihoods between different datasets cannot be compared,” even when both datasets use the same number of states (20), the same character labels, and the very same sequence alignment as a starting point (one translated into amino-acid states, the other into 3Di structural states). I understand that the likelihood is tied to the specific data type and its underlying distribution, but given that our 3Di alignment is generated directly from the AA alignment (so it has identical length, taxa, and state alphabet), could you elaborate a little on the statistical rationale for why the two likelihoods—and therefore BIC or AIC scores—remain fundamentally incomparable?

I’m interested in evaluating which model (AA vs. 3Di) yields more reliable phylogenetic estimates. Ideally one would benchmark against a known “true” tree, but simulated alignments are inherently biased by the simulation model itself. Do you have recommendations for robust model-comparison workflows under different alphabets that minimize such circularity?

Thanks again for your help!

Best regards,

David Ferreiro

Hengstler, Johannes

Apr 30, 2025, 9:28:27 AM
to raxml

Hi David,

I might have been wrong about the comparability. We have spent the past six months trying to get phylogenetic inference working with a structural alphabet (a different one from 3Di), and we did not compare likelihoods out of fear that changing the underlying data source (i.e. exchanging amino-acid data for structural data) has an effect that we cannot accurately correct for with BIC/AIC. But this has probably been done before, when people compared codon models with amino-acid models, so maybe it works. I am not sure.


In any case, we came to the conclusion that the signal in structural data is worse. We compared against reference taxonomies for various phyla and were unable to match the performance of amino-acid models, either with the structural sequences alone or with a combined approach using both sequences at once. In our case the structural alphabet had more states (36) than the amino-acid alphabet, so without BIC our likelihoods were not comparable. Thus, I don't know whether comparing them would have yielded different results.
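
For reference, the AIC and BIC scores mentioned above follow the standard definitions and can be computed from a reported log-likelihood as in the sketch below. The numbers are placeholders, not results from any analysis, and whether such scores can be compared between an AA and a 3Di encoding is exactly the open question in this thread:

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Placeholder inputs for illustration only.
log_likelihood = -12345.6  # log-likelihood reported by the tree search
n_params = 250             # free parameters (branch lengths, rates, ...)
n_sites = 400              # alignment length used as the sample size

print("AIC:", round(aic(log_likelihood, n_params), 1))
print("BIC:", round(bic(log_likelihood, n_params, n_sites), 1))
```

Lower scores indicate a better fit once the number of parameters is penalized. Using the alignment length as the sample size is a common convention, assumed here.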

However, a similar study to ours, if less in-depth, has been done with 3Di as well, so you may be able to compare to their methods: https://doi.org/10.1101/2024.08.02.606352

Best regards,

Johannes

