Number of free parameters vs alignment size

Črt Pantner

Jul 17, 2025, 2:40:41 PM
to raxml
Greetings to all,

I am a microbiology student working on my master's thesis in the field of bioinformatics, specifically on the phylogeny of pore-forming proteins. I have around 470 proteins in total, obtained from a JGI clustering run, for which I would like to construct a phylogenetic tree. For this purpose, I aligned them using MAFFT with the L-INS-i algorithm and determined the best-fit substitution model (which turned out to be WAG+G4) using ModelTest-NG.

After checking the alignment with the raxml-ng --parse command, I called RAxML-NG with:

raxml-ng --msa only_id_maxiter_1000_einsi.fasta --model WAG+G4 --prefix einsi_1 --threads 6 --tree pars{25},rand{25}

However the calculation finished with:

WARNING: Number of free parameters (K=940) is larger than alignment size (n=742).
         This might lead to overfitting and compromise tree inference results!

On this topic, I have read previous threads regarding overfitting in the absence of phylogenetic signal in the alignment. To combat this, I picked two proteins at random from each cluster and placed the remainder of the proteins onto the tree using EPA-ng. I find the resulting tree quite unintuitive, and the tree format seems to be incompatible with my visualization software of choice. I have also tried a more rigorous cleaning

My question is thus: is creating two or perhaps three subtrees, one for each cluster that contains the majority of my proteins and one for the remaining ~10% of proteins, a viable strategy? If so, does RAxML-ng support the creation of a final supertree, if that is even necessary? I have read through the man page within the program and through the wiki, but cannot find such an option.

Thank you for all your support in advance.

My next 

Oleksiy Kozlov

Jul 21, 2025, 7:48:01 AM
to ra...@googlegroups.com
Hello Črt,

> WARNING: Number of free parameters (K=940) is larger than alignment size (n=742).
>          This might lead to overfitting and compromise tree inference results!

As you can see, the ratio between the number of free parameters (mostly tree branches in this case)
and MSA columns is not awful, although certainly far from optimal.
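
For a rough sense of where K=940 comes from, here is a minimal Python sketch. It assumes that K is the number of branch lengths of an unrooted binary tree (2N - 3 for N taxa) plus one free parameter for the gamma shape of G4, with WAG itself contributing none because it is a fixed empirical matrix. How RAxML-NG counts K internally is not stated in this thread, and the function names are made up for illustration.

```python
def free_parameters(n_taxa, model_params=1):
    """Branch lengths of an unrooted binary tree (2N - 3) plus model parameters.

    model_params=1 is an assumption: one gamma shape parameter for WAG+G4.
    """
    return 2 * n_taxa - 3 + model_params


def taxa_for_k(k, model_params=1):
    """Invert free_parameters(): number of taxa implied by a reported K."""
    return (k - model_params + 3) // 2


def max_taxa_without_warning(n_sites, model_params=1):
    """Largest taxon count for which K does not exceed the alignment length."""
    return (n_sites - model_params + 3) // 2


K, n_sites = 940, 742  # values from the warning message
print(taxa_for_k(K))                      # taxa implied by K under the assumptions
print(max_taxa_without_warning(n_sites))  # taxa at which K <= n would hold
```

Under these assumptions the reported K corresponds to 471 taxa (consistent with "around 470"), and K would drop to n or below at 372 taxa or fewer. That is only the point where K no longer exceeds n, not a guarantee of a reliable tree.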

You will likely get a few near-zero branches in the resulting tree, which are not so reliable.

You can also check the phylogenetic difficulty of your MSA using PyPythia:

https://github.com/tschuelia/PyPythia

or --pythia command in raxml-ng 2.x:

https://github.com/amkozlov/raxml-ng/releases/tag/2.0-beta2

> My question is thus: is creating two or perhaps three subtrees, one for each cluster that contains
> the majority of my proteins and one for the remaining ~10% of proteins a viable strategy.

It is worth trying, then you can compare cluster subtrees to the respective parts of the full tree
and check whether you see major differences.

> If so,
> does RAxML-ng support the creation of a final supertree, if that is even necessary? I have read
> through the man page within the program and through the wiki, but don't seem to find the option.

Nope, there is no such option.

As an alternative, you can consider using a topological constraint:

https://github.com/amkozlov/raxml-ng/wiki/Input-data#topological-constraint

Hope this helps,
Oleksiy


J Hengstler

Jul 21, 2025, 9:22:32 AM
to raxml
Hi Črt,

The problem with using a supertree method or phylogenetic placement (EPA-ng) is that you aren't getting rid of the free parameters; you are just hiding them from the inference tools (so essentially you will get rid of the warning, but you won't solve the problem). The only way to ensure mathematically that the result becomes reliable is to use longer sequences (which I guess is not possible for you) or to throw away some of your taxa (for example by stratified sampling, identically distributed sampling, or using centroid sequences from a clustering approach).

If that is also not an option, all the methods you described (supertrees and placement) have the same issue: you do not have enough data for your model. You are using 50 starting trees for your tree search, so the first step to verify how stable your result is would be to compare all 50 final trees (you can find them in the einsi_1.raxml.mlTrees file). Check their pairwise RF distances and their branch lengths. If all distances between the final trees are large, you cannot obtain a reliable tree with your data (at least not a bifurcating tree; there are some things you can do if you are fine with polytomies, for example consensus trees or collapsing short edges).
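
To illustrate the comparison, here is a minimal sketch in plain Python that computes pairwise Robinson-Foulds (RF) distances from Newick strings by comparing their non-trivial bipartitions. It assumes unquoted taxon labels and one tree per line, as in a typical .mlTrees file. The function names are hypothetical, and raxml-ng's own --rfdist command is probably the more convenient route in practice.

```python
import re
from itertools import combinations


def bipartitions(newick):
    """Return (taxon set, set of non-trivial splits) for one Newick tree."""
    text = re.sub(r":[^,();]+", "", newick)         # drop branch lengths
    tokens = re.findall(r"[(),]|[^(),;\s]+", text)
    stack, clades, prev = [set()], [], None
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            clade = stack.pop()
            clades.append(frozenset(clade))
            stack[-1] |= clade
        elif tok != "," and prev != ")":            # a label right after ")" is a support value
            stack[-1].add(tok)
        prev = tok
    leaves = frozenset(stack[0])
    ref = min(leaves)                               # orient every split away from one fixed taxon
    splits = set()
    for clade in clades:
        side = clade if ref not in clade else leaves - clade
        if 1 < len(side) < len(leaves) - 1:
            splits.add(side)
    return leaves, splits


def pairwise_rf(newicks):
    """RF distance (number of splits found in only one tree) for every pair."""
    parsed = [bipartitions(nwk) for nwk in newicks]
    if any(leaves != parsed[0][0] for leaves, _ in parsed):
        raise ValueError("all trees must contain the same taxa")
    return {(i, j): len(parsed[i][1] ^ parsed[j][1])
            for i, j in combinations(range(len(parsed)), 2)}


# Toy trees; for real data one could read the lines of einsi_1.raxml.mlTrees instead.
trees = [
    "((A:0.1,B:0.2)90:0.3,C:0.1,(D:0.1,E:0.1)80:0.2);",
    "((B,A),(E,D),C);",
    "((A,C),(B,D),E);",
]
print(pairwise_rf(trees))
```

Dividing each value by 2 * (N - 3), the maximum for fully resolved unrooted trees with N taxa, gives a relative RF distance between 0 and 1 that is easier to compare across data sets.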

If you need more advice, the Pythia score that Oleksiy described will help determine how "hopeless" your dataset is ;)

Kind regards
Johannes

Črt Pantner

Jul 22, 2025, 1:39:30 PM
to raxml
Hello Oleksiy,

thank you very much for your response. I have tried using PyPythia v2.0.0 to determine the difficulty of my MSA. The result was a value of 0.41, which seems to suggest that the alignment should not be too difficult to analyze.

This result has raised a few specific questions, and I would be very grateful for your advice. 

My understanding is that scores closer to 0 indicate a strong phylogenetic signal that will result in one distinct tree topology with only one significant peak, whereas the closer the score gets to 1, the higher the number of distinct tree topologies with substantial RF distances between them. If so, what threshold would you consider the point where one should become concerned about this problem?
As far as I understand, one option is the one Johannes recommended: checking all the final trees. Would another option be to perform multiple RAxML-ng runs, compare the resulting best trees between runs, identify inconsistently placed taxa, and simply comment on this in the discussion?

Furthermore, seeing that I got a result of 0.41 (which seems relatively low), does this mean that removing duplicate sequences would solve my problem and that (more importantly) I can consider the warning message from RAxML-ng "resolved"? I have tried removing duplicate sequences (since the first tree was computed without duplicates removed), but the warning message persists, which makes sense since RAxML-ng automatically removes duplicates.
Or should I do a few rounds of data cleanup, perhaps by removing sequences that "disturb" the MSA by hand, or perhaps using trimAl (which I have been avoiding since, in my previous experience, a very large portion of my proteins usually gets "cut")?

Thank you again for your time and insight and sorry for the abundance of questions!

Črt Pantner

Črt Pantner

Jul 22, 2025, 1:39:30 PM
to raxml
Dear Johannes,

thank you for your detailed reply. You are in fact correct in your assumption that I cannot use longer sequences, since the proteins I am working with are quite short (on average around 250 to 300 amino acids, depending on cutoff criteria). I will, however, look at the other options you mentioned, combined with the Pythia score, in order to hopefully resolve this problem.

With regards,

Črt Pantner

Grimm

Jul 23, 2025, 4:37:38 AM
to raxml
Sorry, to add another take on this.

The Pythia score is relatively low; below 0.5 is manageable in a tree environment. What Johannes pointed out is valid but, in practice, may be less of a concern. One cannot really tell, as it all depends on the actual signal the matrix produces for certain subtrees. Every data set has its peculiarities, and these values only point us to its strengths or weaknesses.

As Oleksiy pointed out, you probably have tips (or groups of tips) that are difficult to place in a tree because they are near-identical and result in correspondingly flat, poorly resolved subtrees. They are detrimental to inference: they increase the Pythia score, have little branch support, and waste computation time while providing no additional phylogenetic information. The simple solution to the problem is informed pruning to get a backbone tree, and then populating that backbone tree.

What I would do is:
  1. Run a quick-and-dirty comprehensive tree. More important than the tree topology is the bootstrap (BS) support for its branches.
  2. Prune that tree to a set of representative sequences representing terminal clades, defined by e.g. a distance threshold (for the members of a subtree), a branch support threshold, or algorithms that delimit species (I don't know whether they work for protein data, though). E.g. start at the tips and keep only one representative per branch that has BS > 40, or whatever threshold gives you a sizeable reduction in the number of tips. Or keep the 2 most distant members of each terminal subtree whose members have a pairwise distance below 0.05.
  3. Calculate the Pythia score of the pruned data set and, if you like, iterate (e.g. start with a distance threshold of 0.05 and increase in 0.05 steps, or with a BS of 10, 20, 30, ...). The aim is to get a tree that is as pruned as possible, with a Pythia score as low as possible, while still covering as many as possible of the clades seen in the full tree.
  4. Generate a backbone tree with the representative sequences. You will probably get a tree with appreciable branch support along all branches and fine phylogenetic distances (i.e. via the tree) between tips, which makes a good reference tree for EPA.
  5. Place all pruned tips in that backbone tree using EPA-ng. If the LWR (likelihood weight ratio) << 1, take a look at the jplace file and pin the split support on your reference tree. Maybe add some representatives for overlooked clades (over-pruning). Note: queries with split LWR on a highly discriminative reference tree are those rogues that can increase Pythia scores, decrease branch supports, and trigger polytomies in the widely used but insufficient consensus trees (see below).
  6. If you want a total tree:
    1. Generate subalignments including the pruned queries and the representative sequences of the subtrees in which they were placed (LWR ~ 1).
    2. Run local trees.
    3. Supertree them with the tree from Step 4, the reference tree, as the core, the phylogenetic backbone.
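
The pruning idea in step 2 can be sketched in Python as follows. Note that this is a simplification and not the tree-based procedure described above: it swaps in a greedy pass over pairwise p-distances computed directly from the alignment, keeping a sequence as a new representative only if it is at least `threshold` away from every representative kept so far. The 0.05 default comes from the steps above; the function names and the toy alignment are made up for illustration.

```python
def p_distance(a, b):
    """Fraction of differing residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0  # no shared columns: treat as maximally distant
    return sum(x != y for x, y in pairs) / len(pairs)


def pick_representatives(alignment, threshold=0.05):
    """Greedy selection: returns (representatives, sequence -> representative map)."""
    reps, assigned = [], {}
    for name, seq in alignment.items():
        for rep in reps:
            if p_distance(seq, alignment[rep]) < threshold:
                assigned[name] = rep      # close to an existing representative: prune it
                break
        else:
            reps.append(name)             # far from all representatives: keep it
            assigned[name] = name
    return reps, assigned


# Toy aligned sequences (hypothetical data)
aln = {
    "s1": "ACDEFGHIKL",
    "s2": "ACDEFGHIKL",
    "s3": "ACDEFGHIKV",
    "s4": "AC-EFGHIKL",
}
reps, assigned = pick_representatives(aln, threshold=0.05)
print(reps)
print(assigned)
```

The `assigned` map records which pruned sequences belong to which representative, which is the information needed later when the pruned tips are placed back (step 5) or when subalignments are built (step 6). Raising `threshold` in steps, as suggested in step 3, yields progressively smaller backbone sets. The result depends on input order, so a tree-based or proper clustering approach is preferable for real analyses.
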
And, on a general note, please stop using consensus trees: they are inferior summaries of phylogenetic trees, and their polytomies can have different data (signal) or biological reasons and hence are meaningless in a phylogenetic context (PS: consensus trees, being summaries, are not phylogenetic trees; only fully dichotomous trees are phylogenetic trees by definition). Decreased branch support can have two main reasons: ancestor-descendant relationships (hard polytomies), or a lack of discriminating signal and internal signal conflict (soft polytomies). If you want to summarise topologically different trees, use consensus networks. E.g. you can use RF distances to compare your initial 50 trees, but you can also visualize their differences with a consensus network or an Adams consensus tree: the Adams consensus only retains polytomies inflicted by rogues.

Rogues may act rogue-ish for one reason or another. E.g. an ancestral protein type X that evolved into types A and B will trigger a (quasi-)hard polytomy, because our phylogenetic trees cannot depict ancestor-descendant relationships; they only resolve sister relationships. In the best case you end up with a highly supported X-A-B polytomy, in the worst case with an X + (A + B) subtree because of long-branch attraction between A and B. Or the subtree is genuine because X is not the last (or a late) common ancestor (LCA, close to the hypothetical MRCA, the most recent common ancestor, i.e. the node connecting the A and B roots) of A and B but an early precursor type (ECA). In EPA, if the reference only includes A and B, an LCA-X can have split LWR across the internodes (branches) representing the A root, the B root, and the A+B root, but an ECA-X could have split LWR involving the next deeper internode(s) and only the A+B root.
Soft and misleading polytomies can be inflicted by a recombinant sequence. Imagine AxB, a recombinant of A and B: it will be attracted by both clades. If A and B are sisters, the soft polytomy triggered in the consensus tree is not a big problem because we still have an AxB+A+B clade, but if they are cousins, the recombinant will be placed in a root-proximal ("basal") position to A or B, decreasing the support for A+AxB or AxB+B. If you then have a sister C to B, you may end up with a well-supported AxB-B-C polytomy, which is wrong because AxB is not a possible sister of C; in fact, it is unrelated to C beyond the fact that it is half-B. Again, EPA can be a great asset here, because a recombinant query will have correspondingly split LWRs if there is enough signal in the query and a high discrimination capacity in the reference. Queries with split LWR to non-connected internodes must, naturally, be excluded when inferring phylogenetic trees. Consensus networks, on the other hand, don't collapse everything into polytomies but represent all alternative splits in a tree sample (strict consensus networks) or in a certain fraction of them (e.g., BS support networks). See Schliep et al. (2017; http://dx.doi.org/10.1111/2041-210X.12760) for examples, R vignettes, and further literature.
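
To find queries with split LWR in practice, something like the following Python sketch could be used on the jplace output. It assumes the standard jplace layout (a "fields" list naming the columns of each placement row, including "edge_num" and "like_weight_ratio", and query names under "n" or "nm"). The 0.9 cutoff, the function name, and the toy data are assumptions for illustration, not values from EPA-ng.

```python
import json


def split_lwr_queries(jplace, min_top_lwr=0.9):
    """Return {query name: [(edge_num, LWR), ...]} for queries whose best
    placement has an LWR below min_top_lwr, sorted by decreasing LWR."""
    fields = jplace["fields"]
    edge_i = fields.index("edge_num")
    lwr_i = fields.index("like_weight_ratio")
    flagged = {}
    for placement in jplace["placements"]:
        ranked = sorted(placement["p"], key=lambda row: row[lwr_i], reverse=True)
        if ranked[0][lwr_i] < min_top_lwr:
            # names are stored either as "n" (list of names) or "nm" (name, multiplicity)
            names = placement.get("n") or [nm[0] for nm in placement.get("nm", [])]
            for name in names:
                flagged[name] = [(row[edge_i], row[lwr_i]) for row in ranked]
    return flagged


# Toy example; for real output one would use json.load() on the .jplace file.
toy = {
    "fields": ["edge_num", "likelihood", "like_weight_ratio",
               "distal_length", "pendant_length"],
    "placements": [
        {"p": [[3, -100.0, 0.98, 0.01, 0.02], [4, -104.0, 0.02, 0.01, 0.02]],
         "n": ["q_clear"]},
        {"p": [[1, -200.0, 0.55, 0.01, 0.02], [7, -200.2, 0.45, 0.01, 0.02]],
         "n": ["q_split"]},
    ],
}
print(split_lwr_queries(toy))
```

Whether the flagged edges are neighbouring internodes (as expected for an ancestral type) or non-connected ones (as expected for a recombinant) still has to be checked against the reference tree; this sketch only lists the candidates.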

Good inference

/G

J Hengstler

Jul 23, 2025, 5:49:36 AM
to raxml
>  which makes sense since RAxML-ng automatically removes duplicates

No, raxml-ng does not do that. It generates a new MSA with duplicates removed as a convenience feature, so you can use that alignment in future runs, but it does not remove duplicates from the run you just started. This is because downstream analyses in your pipeline likely expect all taxa of your input MSA to be present, so if raxml-ng removed them from the tree, your tree would suddenly have fewer tips. However, you should definitely remove them, as they provide no phylogenetic information and just add useless parameters to your model. If you want them in the tree in the end, just attach them next to their twin taxon with the minimum branch length.
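
A minimal Python sketch of that idea, assuming sequences are held in a name -> sequence dict and the tree is a Newick string with unquoted labels. The function names and the eps value of 1e-6 are assumptions for illustration; use whatever minimum branch length fits your analysis.

```python
import re


def collapse_duplicates(seqs):
    """Keep the first of each set of identical sequences.

    Returns (kept sequences, {kept name: [names of its removed twins]}).
    """
    first_seen, kept, twins = {}, {}, {}
    for name, seq in seqs.items():
        if seq in first_seen:
            twins.setdefault(first_seen[seq], []).append(name)
        else:
            first_seen[seq] = name
            kept[name] = seq
    return kept, twins


def reattach(newick, twins, eps=1e-6):
    """Replace each kept tip by a small clade holding the tip and its twins."""
    for rep, dups in twins.items():
        members = ",".join(f"{n}:{eps}" for n in [rep] + dups)
        # match the label only where it stands as a complete tip name
        pattern = r"(?<=[(,])" + re.escape(rep) + r"(?=[:,)])"
        newick = re.sub(pattern, lambda m: f"({members})", newick)
    return newick


seqs = {"A": "MKTLL", "A2": "MKTLL", "B": "MKSLL", "C": "MRSLV"}
kept, twins = collapse_duplicates(seqs)
print(kept)
print(twins)
print(reattach("((A:0.1,B:0.2):0.05,C:0.3);", twins))
```

The kept tip's original branch length stays on the new clade, so the rest of the tree is unchanged. A sequence with several twins ends up in a small polytomy, which seems an honest representation of identical sequences.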

Kind regards
Johannes