Recently I tried AleRax (as kindly advised by your colleague, Prof. Szöllősi) and would like to share my first impressions. I have spotted several bugs to report and also have some proposals for improvement.
I. Improvement requests:
1) Constrained tree search
As one can see from the numerous papers on ALE outgroup-free rooting, correct root reconstruction requires many gene families and yet often gives somewhat ambiguous results. On the other hand, when reconstructing a species tree, we sometimes have a priori knowledge about the tree's true topology and/or root. It seems to me that being able to pass this knowledge to AleRax might significantly improve the accuracy of species tree reconstruction. If the data do not contain enough information for correct rooting, AleRax is more likely to pick a wrong root, which in turn might bias the whole topology search.
One could pass this kind of knowledge in the form of a constraint tree -- a rooted or unrooted multifurcating draft of the species tree defining the bipartitions and (optionally) the root that have to be present in all trees considered during the ML species tree search, as well as in the final species tree (as is done in IQ-TREE 2).
If possible, I would kindly ask you to implement this constraint species tree option in AleRax (and maybe also in SpeciesRax).
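To make the request more concrete, here is a minimal sketch of the check I have in mind for the rooted case. It is only an illustration and not AleRax code: the nested-tuple tree representation and the names clades and satisfies_constraint are my own assumptions, and both trees are assumed to contain the same set of species. A candidate tree passes if every clade of the (possibly multifurcating) constraint tree is also a clade of the candidate.

```python
def clades(tree):
    """Return (leaf set, set of all clades) of a rooted tree.

    A tree is a leaf name (str) or a tuple of subtrees (hypothetical
    representation chosen for this sketch only).
    """
    if isinstance(tree, str):
        leaves = frozenset([tree])
        return leaves, {leaves}
    leaves = frozenset()
    all_clades = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        all_clades |= child_clades
    all_clades.add(leaves)
    return leaves, all_clades


def satisfies_constraint(candidate, constraint):
    """True if every clade of the rooted constraint is a clade of the candidate."""
    _, candidate_clades = clades(candidate)
    _, constraint_clades = clades(constraint)
    return constraint_clades <= candidate_clades


# Multifurcating draft: A, B and C must form a clade; everything else is free.
constraint = (("A", "B", "C"), "D", "E")
ok = ((("A", "B"), "C"), ("D", "E"))
bad = ((("A", "D"), "B"), ("C", "E"))
print(satisfies_constraint(ok, constraint))   # True
print(satisfies_constraint(bad, constraint))  # False
```

During the search, moves producing a tree that fails such a check would simply be skipped. An unrooted constraint would need the same test on bipartitions instead of clades.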
2) Consel file for N most likely species trees
In line with the above, we might not always have enough data to infer a single ML species tree and might instead get a set of nearly equally likely species trees. However, AleRax will still select the most likely tree of this set (even if its advantage is insignificant) without reporting the others, leaving no chance to assess the statistical significance of the selection.
This problem could be solved by reporting a Consel .mt file with per-family likelihoods for the N most likely species trees encountered during the whole ML tree search. Users would then be able to delineate this terrace-like tree set and thus get all the information (all the non-rejected trees) their data can provide.
Do you consider it possible to implement this kind of option in AleRax?
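For illustration, here is a small sketch of how such a file could be written. It rests on my understanding of the Consel .mt layout (a header line with the number of trees and the number of "sites", here gene families, followed by one row of log-likelihoods per tree), which should be checked against the Consel documentation. The function name format_mt and the numbers are made up for the example.

```python
def format_mt(per_family_ll):
    """Format a trees x families matrix of log-likelihoods as .mt-style text.

    per_family_ll[i][j] is the log-likelihood of family j under tree i.
    The layout (header "M N", then M rows of N values) is an assumption
    about the Consel .mt format, not something taken from AleRax.
    """
    n_trees = len(per_family_ll)
    n_families = len(per_family_ll[0])
    if any(len(row) != n_families for row in per_family_ll):
        raise ValueError("every tree needs one value per gene family")
    lines = [f"{n_trees} {n_families}"]
    for row in per_family_ll:
        lines.append(" ".join(f"{ll:.6f}" for ll in row))
    return "\n".join(lines) + "\n"


# Made-up values: 2 candidate species trees, 3 gene families.
example = [
    [-133.1, -98.4, -210.0],
    [-133.9, -98.1, -210.2],
]
print(format_mt(example), end="")
```

With one row per retained species tree, the file could then go through the usual Consel steps to obtain AU test p-values for the whole tree set.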
3) Minor things concerning the log output
To me the new alerax.log report lacks some very useful information present in the generax.log report: the "MPI Ranks" entry and the entire "Input data information" part (including entries like number of gene families, number of species, species coverage, etc.).
But surely it is just a matter of taste and design.
II. Some real bugs:
1) Crash with the species tree pruning option
Invoking the --prune-species-tree option makes the run crash, either during the first transfer-guided step if the UndatedDL model is used, or during the initialization of the conditional clade probabilities if the UndatedDTL model is used. I am not attaching any logs here, but the bug seems to be readily reproducible with any data.
2) Family filtering options not working
AleRax seems to be effectively insensitive to the --max-clade-split-ratio and --trim-ratio options. Their use is reported in the "Run settings" section of the log, but the likelihood does not change at any step of species tree inference relative to a run without them.
III. Minor bugs:
1) Despite running AleRax, the second line of the log file insists it is GeneRax:
[00:00:00] GeneRax was called as follow:
Another possible typo: AleRax writes [SpeciesSearch] instead of [Species search] whenever it mentions model optimization:
[00:15:20] [Species search] After root search: LL=-180646
[00:15:20] [SpeciesSearch] Optimizing model rates (light)
[00:15:23] [Species search] After model rate opt, ll=-180568
[00:15:25] [SpeciesSearch] Optimizing model rates (thorough)
2) For a dataset of ca. 1000 gene families and 50 species, AleRax recommends using 473 cores. However, the running time already increases when I use 70 cores instead of 40, so the recommended maximum must be a huge overestimate.
3) In species tree search mode, the last root search dives to some formidable depth; I guess this is a mistake (?):
[00:14:24] [Species search] Root search with depth=4294967295
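A note on that number: 4294967295 equals 2^32 - 1, the largest unsigned 32-bit integer, which is also what -1 turns into when stored in an unsigned 32-bit variable. I have not looked at the source code, so this is only a guess, but the value might be either a deliberate "unlimited depth" sentinel printed as is, or an accidental wrap-around. The arithmetic can be checked as follows (the helper as_uint32 is just for this illustration):

```python
UINT32_MAX = 2**32 - 1  # largest value of an unsigned 32-bit integer


def as_uint32(value):
    """Reinterpret a Python int as an unsigned 32-bit value (mimics a C cast)."""
    return value & 0xFFFFFFFF


print(UINT32_MAX)                     # 4294967295
print(as_uint32(-1))                  # 4294967295
print(as_uint32(-1) == 4294967295)    # True
```

If it is a sentinel, printing something like "unlimited" in the log would be less confusing.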
4) Even when using the UndatedDL model, in species tree search mode AleRax goes through the transfer-guided topology search step. Since the transfer rate is fixed to zero, no likelihood improvement occurs, as expected, yet the step takes some time and seems to run in vain.
By the way, can't this lead to the topology search stopping prematurely? I mean, after finishing several rounds of local SPR search, AleRax reoptimizes the root and model parameters and tries the transfer-guided search; when that expectedly fails, it stops the entire procedure instead of trying the local SPR search one more time, now with the new root and model parameters.
I have also observed this kind of behaviour in SpeciesRax, where it still makes the transfer-guided moves despite using the UndatedDL model. So either I am not getting something right, or the bug manifests even more seriously in that tool.
IV. Some strange things:
The following two cases are of nearly no practical importance due to the modest effect size and may be just a result of different handling of pseudorandom processes by the different tools or by the different run modes of AleRax. But since AleRax is somewhat "raw", I find it better to report them anyway.
1) Compared to ALEml_undated, AleRax seems to slightly underestimate the final likelihood for the same data (LL=-133.097 by ALE vs. LL=-136.032 by AleRax for a particular single-family example); it also estimates slightly different DTL parameters. But don't these tools compute the likelihood under exactly the same model, so that they should eventually arrive at identical estimates?
2) I have tried the following experiment: I inferred the ML species tree from a set of ca. 1000 gene families and then reconciled these gene families with the inferred species tree, expecting both runs to end up with the same likelihood. Still, the estimated likelihoods differed a bit (ΔLL=1).
Many thanks for upgrading ALE to AleRax and especially for including the long-awaited species tree inference mode! Sorry for such a long report; I hope you'll find at least some parts of it useful.
Best regards,
Stepan