How to improve the ratio between alignment sites and number of free parameters?

Gustavo Avelar Molina

unread,

Dec 16, 2023, 9:20:23 AM12/16/23

to raxml

Dear RAxML-NG users,

I am a (relatively) beginner at phylogenetic tree inference and my question may sound more general than specifically about RAxML-NG itself.

I am trying to make a phylogenetic tree of homologous proteins of around 270 amino acids. These sequences were obtained by blasting a reference protein sequence that exhibits the biological function of interest. Then, I align those sequences using Clustal Omega or Muscle5. Now, I am at the stage of tree inference using RAxML-NG in a HPC cluster. I've tested the 50, 100, 250, and 500 (minimal identity of ~65%) hits closest to the query, and my objetive is to infer a tree with the first 5000 hits (minimal identity of ~45%), since there is an instance in literature of a protein with the desired biological function at this identity level.

With 50 sequences an on, the tree inference goes fine (except for approx. 10 near-zero branches that appear to grow proportionally with the number of sequences in the analysis, but I suppose this is OK, since many sequences from BLAST are very similar). With 250 sequences and above, I get a warning about the number of free parameters being greater than the alignment size/sites. For instance, with 500 sequences, I have ~1000 free parameters and ~350 alignment sites.

I've played a bit with the BLAST filters and tested two different MSA programs (as mentioned) with different options, but they have no influence on the total number of free parameters and a small influence in the alignment size/sites. When I varied the evolutionary model parameters to make it as simple as possible, their influence in the total number of parameters is very small. I appears that the total number of free parameters varies proportionally with the number of sequences, but the alignment size increases at lower rate.

My question is: how does one change the ratio between alignment sites and the number of free parameters so they comply with the requirements for an accurate/reliable tree inference, particularly with 5000 sequences?

I appreaciate any help you may provide.

Best regards,

Gustavo

Grimm

unread,

Dec 18, 2023, 10:16:54 AM12/18/23

to raxml

Hi Gustavo,

The number of aligned sites are irrelevant, relevant is only the number of "distinct alignment patterns". It may be quite low in a case like yours depending how homologuous the harvested protein data really is and the taxonomic breadth of the source organisms. The critical thing is the sample structure: since you blast against a single reference to decide what is harvested, one doesn't know whether the 50, 200 and 450 extra and putatively less similar sequences belong to substantially diverse groups or are very similiar among themselves. In the last step (so far: 500), you may add 250 sequences that have only 65% similarity to your reference but near-100% with each other. Or 1 that is most dissimilar just above the threshold but the other 249 are near-identical to the (part of) 150 added in the filtering step before (or those before-before, in fact many of them could just a little bit more dissimilar than the first 50)

The easiest way to change the ratio is simply to reduce the number of tips to a meaningful and representative set by eliminating the high-similar sequence groups, thus, terminal (often random, too) noise and indiscriminate, in a phylogenetic tree-inference context, data. Right now you may look a thicket, which would need to be first weeded out to see a tree. E.g. do a blast all vs all and a fixed threshold to define groups of near-identical sequences in your total sample (e.g. 95% identity, 90%, or increments, whatever much reduces the number of tips) and then randomly select two placeholders from each group to establish a phylogenetic backbone for the putative homologues. One you have that, you just use EPA-ng to place all the others in the found backbone tree as a check-up. EPA is the reason to keep two tips per group, as the jPlace result is more straightforward to interpret than when the reference tree only includes a single tip per high-similarity cluster.

You could also pick random placeholders from the sequence groups that have e.g. 95, 90, ...% identity scores with the reference to get a tip-reduced dataset with a better ratio. And then again use EPA-ng to place the remainder of the 5000.

You could also go for a stepwise addition, not adding the next 50, 100, etc. Blast hits but only add sequences that have a minimum dissimilarity to any of those already included at the step before. This should also boil down the number substantially as well while not losing tips that may be of important to understand the protein's genealogy.

In any case, there's no point in inferring a 5000-tip tree for a matrix that pushes up such a warning, it's much better to have a (much) smaller tree with tips that cover the range of sequence variation in the putative homologues and place all others in this backbone protein tree using, e.g., EPA-ng: https://github.com/pierrebarbera/epa-ng

Cheers, Guido.

PS Here's three papers you could give a look in the context of pruning down tip-sets with too many, too similar sequences, to a meaningful sample, and treeability of data:

Haag J, Höhler D, Bettisworth B, Stamatakis A. 2022. From easy to hopeless - predicting the difficulty of phylogenetic analyses. Molecular Biology & Evolution doi:10.1093/molbev/msac1254

Mavian et al. 2020. Sampling bias and incorrect rooting make phylogenetic network tracing of SARS-COV-2 infections unreliable. Proceedings of the National Academy of Sciences. https://doi.org/10.1073/pnas.2007295117

Stamatakis A, Göker M, Grimm GW. 2010. Maximum likelihood analysis of 3,490 rbcL sequences: Scalability of comprehensive inference versus group-specific taxon sampling. Evolutionary Bioinformatics 6:73-90. http://dx.doi.org/10.4137/EBO.S4528

Alexandros Stamatakis

unread,

Dec 20, 2023, 11:58:18 AM12/20/23

to ra...@googlegroups.com

Dear Gustavo,

I guess the problem are the branch lengths that are free parameters of
the model. So, for instance a tree with n sequences has 2n-3 branches
that are free parameters. In other words, the more short sequences you
add to the tree, the number of free parameters will increase linearly
with the number of sequences, which I believe is what is causing this
warning.

A helpful tool to steer dataset assembly in our new Pythia tool
(https://academic.oup.com/mbe/article/39/12/msac254/6832260) that
predicts the difficulty of a phylogenetic analysis (basically just the
signal strength) for a given MSA.

So in your case, you can use Pythia to determine the MSA with the lowest
diffculty which gives you an objective criterion to select the
phylogenetically most appropriate way to assemble your data.

Alexis

> --
> You received this message because you are subscribed to the Google
> Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to raxml+un...@googlegroups.com
> <mailto:raxml+un...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/raxml/08f2e258-74b1-4ef5-bf85-00ad333b9630n%40googlegroups.com <https://groups.google.com/d/msgid/raxml/08f2e258-74b1-4ef5-bf85-00ad333b9630n%40googlegroups.com?utm_medium=email&utm_source=footer>.

--
Alexandros (Alexis) Stamatakis

ERA Chair, Institute of Computer Science, Foundation for Research and
Technology - Hellas
Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.biocomp.gr (Crete lab)
www.exelixis-lab.org (Heidelberg lab)

Reply all

Reply to author

Forward