AleRax bugs and requests


Stepan Puhov

Oct 29, 2023, 7:39:07 PM
to GeneRax
Hi Benoit,

Another long email from me :)

Recently I have tried AleRax (as kindly advised by your colleague, Prof. Szöllősi) and would like to share my first usage impression. I have spotted several bugs to report and have also come up with some proposals for improvement.


I. Improvement requests:

1) Constrained tree search
As one can see from numerous papers on ALE outgroup-free rooting, correct root reconstruction requires many gene families and yet often gives somewhat ambiguous results. On the other hand, when trying to reconstruct a species tree, we sometimes have a priori knowledge about the tree's true topology and/or root. It appears to me that being able to pass this knowledge to AleRax might significantly improve species tree reconstruction accuracy. If the data do not contain enough information for correct rooting, AleRax is more likely to pick a wrong root, which, in turn, might bias the whole topology search.
One could pass this kind of knowledge in the form of a constraint tree -- a rooted or unrooted multifurcating species tree draft defining bipartitions and (optionally) the root, that have to be present in all the trees to be considered during the ML species tree search, as well as in the final species tree (like we see it in IQ-Tree2).
If possible, I would kindly ask you to implement this constraint species tree option in AleRax (and maybe also in SpeciesRax).
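
To make the request concrete, here is a minimal Python sketch of what honoring a constraint tree could mean for a rooted candidate topology: every clade of the (possibly multifurcating) constraint must also appear as a clade of the candidate. The nested-tuple tree encoding and the names `clades` and `satisfies_constraint` are assumptions made for this illustration only; they are not part of AleRax or IQ-Tree2, and the unrooted (bipartition) case is not handled.

```python
def clades(tree):
    """Return the set of clades (frozensets of leaf names) of a rooted tree.

    A tree is a leaf name (str) or a tuple of subtrees (hypothetical encoding).
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    leaves(tree)
    return found


def satisfies_constraint(candidate, constraint):
    """True if every clade of the constraint tree is also a clade of the candidate."""
    return clades(constraint) <= clades(candidate)


# Multifurcating constraint: A and B are sisters, everything else is unresolved.
constraint = (("A", "B"), "C", "D", "E")
ok_tree = ((("A", "B"), "C"), ("D", "E"))
bad_tree = ((("A", "C"), "B"), ("D", "E"))

print(satisfies_constraint(ok_tree, constraint))   # True
print(satisfies_constraint(bad_tree, constraint))  # False
```

During a constrained search, a check of this kind would be applied to every proposed move, so that trees violating the constraint are never evaluated.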

2) Consel file for N most likely species trees
In line with the above, we might not always have enough data to infer a single ML species tree and might get a set of nearly equally likely species trees instead. However, AleRax will still select the (possibly insignificantly) most likely tree of this set without reporting the others, leaving no way to estimate the statistical significance of the selection made.
This problem could be solved by reporting a Consel .mt file with per-family likelihoods for N most likely species trees that have been encountered during the whole ML tree search. Then users would be able to delineate this terrace-like tree set and, thus, to get genuinely all information (all the non-rejected trees) their data can provide.
Do you consider it possible to implement this kind of option in AleRax?
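
As an illustration of why per-family likelihoods would be useful, the sketch below applies a RELL-style bootstrap (resampling families with replacement and re-summing their log-likelihoods) to a toy matrix of trees by families. This is a simplified stand-in for the kind of test Consel performs, not its algorithm or its file format; the numbers and the function name `rell_support` are invented for the example.

```python
import random


def rell_support(per_family_ll, replicates=1000, seed=0):
    """Fraction of bootstrap replicates in which each tree has the best total LL.

    per_family_ll[t][f] is the log-likelihood of family f under species tree t.
    """
    rng = random.Random(seed)
    n_trees = len(per_family_ll)
    n_families = len(per_family_ll[0])
    wins = [0] * n_trees
    for _ in range(replicates):
        # Resample families with replacement, then re-sum the LL of each tree.
        sample = [rng.randrange(n_families) for _ in range(n_families)]
        totals = [sum(row[f] for f in sample) for row in per_family_ll]
        wins[totals.index(max(totals))] += 1
    return [w / replicates for w in wins]


# Toy data: 3 candidate species trees, 6 gene families (values are invented).
toy = [
    [-10.0, -12.0, -9.5, -11.0, -10.5, -13.0],   # tree 0
    [-10.2, -11.8, -9.6, -11.1, -10.4, -13.1],   # tree 1, nearly as good
    [-14.0, -15.0, -13.0, -16.0, -14.5, -17.0],  # tree 2, clearly worse
]
support = rell_support(toy)
print(support)
```

Trees whose support stays well above zero across replicates would form the set of candidates that the data cannot reject.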

3) Minor things concerning the log output
To me the new alerax.log report lacks some very useful information present in the generax.log report: the "MPI Ranks" entry and the entire "Input data information" part (including entries like number of gene families, number of species, species coverage, etc.).
But surely it is just a matter of taste and design.


II. Some real bugs:

1) Crash with the species tree pruning option
Invoking the --prune-species-tree option results in the run crashing either during the first transfer-guided step if the UndatedDL model is used, or during the conditional clade probabilities initiation if the UndatedDLT model is used. I am not attaching any logs here, but it seems that the bug is readily reproducible with any data.

2) Family filtering options not working
AleRax seems to be effectively insensitive to the --max-clade-split-ratio and --trim-ratio options. Their use is recorded in the Run settings section of the log, but it produces no change in likelihood at any step of species tree inference.


III. Minor bugs:

1) Despite running AleRax, the second line of the log file insists it is GeneRax:
[00:00:00] GeneRax was called as follow:
Another case of possible typo is AleRax writing [SpeciesSearch] instead of [Species search], whenever it mentions model optimization:
[00:15:20] [Species search] After root search: LL=-180646
[00:15:20] [SpeciesSearch] Optimizing model rates (light)
[00:15:23] [Species search]   After model rate opt, ll=-180568
[00:15:25] [SpeciesSearch] Optimizing model rates (thorough)


2) For a dataset of ca. 1000 gene families and 50 species, AleRax recommends using 473 cores. However, the running time already increases when I use 70 cores instead of 40, so the recommended maximum must be a huge overestimate.

3) In the species tree search mode the last root search dives into some formidable depths, I guess this is a mistake (?):
[00:14:24] [Species search] Root search with depth=4294967295

4) Even when using the UndatedDL model, in the species tree search mode AleRax goes through the transfer-guided topology search step. Given the transfer rate is fixed to zero, as expected, no likelihood optimization occurs, yet the step takes some time and seems just to run in vain.
By the way, can't this lead to prematurely stopping the topology search? I mean, after finishing several rounds of local SPR search AleRax reoptimizes the root and model parameters and tries transfer-guided search, but expectedly not succeeding in it stops the entire procedure, instead of trying local SPR search one more time, now with the new root and model parameters.
I have also observed this kind of behaviour in SpeciesRax, where it still makes the transfer-guided moves despite using the UndatedDL model. So either I am not getting something right, or the bug manifests even more seriously in that tool.


IV. Some strange things:
The following two cases are of nearly no practical importance due to the modest effect size and may be just a result of different handling of pseudorandom processes by the different tools or by the different run modes of AleRax. But since AleRax is somewhat "raw", I find it better to report them anyway.

1) Compared to ALEml_undated, AleRax seems to underestimate slightly the final likelihood for the same data (LL=-133.097 by ALE vs. LL=-136.032 by AleRax for a particular single-family example), it also estimates slightly different DTL parameters. But don't these tools estimate the likelihood under exactly the same model and so eventually have to get identical estimates?

2) I have tried the following experiment: I inferred the ML species tree from a set of ca. 1000 gene families and then I reconciled these gene families with the inferred species tree, expecting that both runs would end up with the same likelihood. Still the estimated likelihoods differed a bit (ΔLL=1).


Great thanks for upgrading ALE to AleRax and especially for including the long-awaited species tree inference mode! Sorry for such a long report, I hope you'll find at least some parts of it useful.


Best regards,
Stepan

Benoit Morel

Oct 30, 2023, 7:30:38 AM
to Stepan Puhov, GeneRax
Hi Stepan,
Thanks a lot for this valuable feedback. I'll reply/fix bugs when I am back from vacation on Thursday!
Benoit


Benoit Morel

Nov 3, 2023, 5:50:48 AM
to Stepan Puhov, GeneRax
Hi Stepan,

Here are some answers.
I]
1) I'll see if I can implement the constrained species tree search. That should not be too difficult, but the code is getting large and I need to make sure that I do it correctly. And I have to find the time to implement it :) 
2)  Same answer
3) Good point, I added the number of ranks and the information about the input data.

II]
1) There is a bug indeed. I'm on it
2) Which values are you using? For instance --trim-ratio should definitely trim some families if you use a number strictly between 0.0 and 1.0

III]
1) Thanks, it's now fixed.
2) Are you sure that you have more than 40 (physical!) cores? We might slightly overestimate the optimal number of cores (assuming that your machine can handle them), but 400 vs 40 is surprising.
3) No, that's not a mistake, we use the maximum depth for the last step. I can update the logs to make it look less weird :)
4) I haven't tried the UndatedDL model for a while, but as far as I remember, we temporarily "force" a positive transfer rate when reconciling gene trees to estimate the transfers (we don't expect "real" transfers, but rather artifact transfers that indicate a mistake in the species tree, so it's ok to temporarily allow transfers). You could check by starting from a random species tree and seeing if the first round of transfer search finds better topologies (which can only happen if it infers transfers).

IV]
1) The model is the same, but our algorithms to compute the likelihood under this model are slightly different. I also observed slightly different values. One difference, for instance, is that the likelihood computation (even assuming the same trees and model parameters) relies on solving equations that might not converge super well. From my experience, this does not affect the results substantially unless the data quality is very low.
2) Maybe one of the two runs was a bit better at estimating the DTL parameters. Do you get very different rates?

Thanks a lot for your valuable feedback

Best,
Benoit



 

Benoit Morel

Nov 3, 2023, 6:19:00 AM
to Stepan Puhov, GeneRax
Hi again,

I think I fixed the bug with the prune mode. But I only managed to reproduce it with UndatedDTL. Can you see if that fixes it for UndatedDTL, and then if you still reproduce with UndatedDL?
Also, I checked the trimming options and you were right, there was another problem, now fixed. Thank you so much for spotting all those problems!

Best,
Benoit

Stepan Puhov

Nov 27, 2023, 6:30:28 PM
to GeneRax
Dear Benoit,

Sorry for not replying for so long.


Thank you for extending the input data logs and fixing the trimming options! The general bug with the species tree pruning option has also been successfully fixed, I've checked that the option is now working both for DL and DTL models.
Yet I've noticed another run-terminating bug associated with the pruning option that seems to be data-dependent. I'm presenting the thorough description and logs in a private letter.


Additionally, I've spotted an occasional joint likelihood decrease in several runs:
[00:03:48]      better tree (transfers:0.125455, trial: 33, ll=-293451, hash=98729)
[00:03:48] [SpeciesSearch] Optimizing model rates (light)
[00:03:57] [Species search]   After model rate opt, ll=-297235
As this issue is quite rare, I think it doesn't actually affect performance, but, maybe, it is still worth noting.
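
A small Python helper along the following lines could be used to spot such drops automatically. It relies only on the `ll=` / `LL=` fields visible in the log excerpts quoted in this thread and assumes that every matching line reports the currently accepted state of the search; the function name `find_ll_decreases` is hypothetical.

```python
import re

# Matches a leading [hh:mm:ss] timestamp and the first ll=/LL= value on the line.
LL_PATTERN = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2})\].*?\bll=(-?\d+(?:\.\d+)?)", re.IGNORECASE
)


def find_ll_decreases(log_lines):
    """Yield (timestamp, previous_ll, new_ll) whenever a reported LL drops."""
    previous = None
    for line in log_lines:
        match = LL_PATTERN.search(line)
        if not match:
            continue
        timestamp, ll = match.group(1), float(match.group(2))
        if previous is not None and ll < previous:
            yield timestamp, previous, ll
        previous = ll


log_excerpt = [
    "[00:03:48]      better tree (transfers:0.125455, trial: 33, ll=-293451, hash=98729)",
    "[00:03:48] [SpeciesSearch] Optimizing model rates (light)",
    "[00:03:57] [Species search]   After model rate opt, ll=-297235",
]
decreases = list(find_ll_decreases(log_excerpt))
print(decreases)  # [('00:03:57', -293451.0, -297235.0)]
```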


I'd also like to ask you to pay attention one more time to the part III of the original letter of this thread:

1) The mentioned log typos are still there :(

2) I suppose, that the "mpi ranks used" and the "recommended number of cores" are referring to the same entities (?)
For a given dataset (1173 families and 59 spp) and run parameters, while running species tree reconstruction, I receive the following in the log:
[00:03:48] Recommended maximum number of cores: 473
Using 40 mpi ranks the run completes in 13:37 mins, and with 70 mpi ranks the running time is 14:33 mins. I don't have 473 cores on my HPC cluster, however, it is evident that using 70 is already superfluous here resulting in suboptimal time performance, so the recommended number of cores would fit even less.
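
Expressed as numbers, the two timings above give the following slowdown and relative per-rank efficiency. This is plain arithmetic on the reported wall-clock times, with a hypothetical helper `to_seconds`:

```python
def to_seconds(mm_ss):
    """Convert a 'MM:SS' wall-clock string to seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)


t40 = to_seconds("13:37")  # run with 40 MPI ranks
t70 = to_seconds("14:33")  # run with 70 MPI ranks

speedup = t40 / t70                       # below 1 means the 70-rank run is slower
relative_efficiency = speedup * 40 / 70   # per-rank efficiency relative to 40 ranks

print(t40, t70)                       # 817 873
print(round(speedup, 3))              # 0.936
print(round(relative_efficiency, 3))  # 0.535
```

In other words, going from 40 to 70 ranks makes the run about 7% slower while each rank does roughly half as much useful work.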

3) Concerning the maximum-depth root search. As far as I understand, the depth here means the radius, in number of branches from the current root, within which to search for the new root. Under this definition, a log line like Root search with depth=4294967295 makes no sense, as my tree doesn't have that many branches! Or is it just some very large number chosen in advance to cover all possible cases?

4) Ok, using fake transfers to facilitate species tree search is a smart move :) I thought about it in the first place, but, having not found anything about it for the case of the DL model in the SpeciesRax/AleRax papers, decided it'd be better to ask.
But still, I slightly doubt another feature of the search algorithm:
From the logs I see that the model optimization precedes every transfer-guided round, but is not undertaken before local SPR rounds. To me it might be problematic in the following scenario: ... -> successful local SPR round -> failed local SPR round -> model optimization -> failed transfer-guided round -> STOP. Consider the last model optimization, though not permitting any transfer-guided moves, yet permits a likelihood-increasing local SPR move -- this move won't be explored by the algorithm. However, if the model optimization were undertaken before every local SPR round, such a problem wouldn't occur.
In my tests I've come across a case where the search starting with a random tree follows the aforementioned scenario and ends up with a suboptimal species tree (compared to the MiniNJ-starting search result) which could be improved with a pair of local SPR moves.
Couldn't it be reasonable to add the model optimization step before each local SPR round or/and add a final step of model optimization and local SPR with radius=3 after the usual search stop?
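
The scenario can be reproduced with a toy model of the control flow. The sketch below is not AleRax code: the likelihood table, the two-valued "model", and all function names are invented purely to show how optimizing the model before each local SPR round can change where the search stops.

```python
# Toy log-likelihood surface: LL[(topology, model)]; all values are invented.
LL = {
    (0, "old"): -100.0, (1, "old"): -105.0,
    (0, "new"): -95.0,  (1, "new"): -90.0,
}


def optimize_model(topology):
    """Pick the model setting with the best LL for the current topology."""
    return max(("old", "new"), key=lambda m: LL[(topology, m)])


def local_spr_round(topology, model):
    """Try the single neighbouring topology; return (topology, improved)."""
    neighbour = topology + 1
    if (neighbour, model) in LL and LL[(neighbour, model)] > LL[(topology, model)]:
        return neighbour, True
    return topology, False


def transfer_guided_round(topology, model):
    """In this toy, transfer-guided moves never find anything better."""
    return topology, False


def search(optimize_before_spr):
    topology, model = 0, "old"
    while True:
        improved = True
        while improved:  # local SPR rounds until one fails
            if optimize_before_spr:
                model = optimize_model(topology)
            topology, improved = local_spr_round(topology, model)
        model = optimize_model(topology)
        topology, improved = transfer_guided_round(topology, model)
        if not improved:  # a failed transfer-guided round ends the search
            return topology, LL[(topology, model)]


print(search(optimize_before_spr=False))  # (0, -95.0)
print(search(optimize_before_spr=True))   # (1, -90.0)
```

With the schedule as I read it from the logs, the toy search stops at topology 0; with model optimization before every local SPR round, it reaches the better topology 1.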


My final concern is about the inferred species tree branch lengths.
When starting with a random tree, all the branch lengths end up being estimated as 0.1. That is true for both AleRax and SpeciesRax. But shouldn't the lengths be estimated from the inferred GFT reconciliations and, thus, be independent of the starting tree?
I'm also concerned with the branch lengths being somewhat large. Consider having a set of gene families which contain no in-paralogues and no xenologues, so every pair of leaves are either orthologues or paralogues from different species. Thus, for any 2 species the speciation event cannot occur deeper than the separation of the 2 closest leaves related to these species seen in the true GFTs. Also consider that we know that the ML trees for these gene families (inferred from alignments alone) don't have branches longer than 1.4. How come a good share of the inferred species tree's leaf branches have lengths exceeding 1.5 and sometimes 2.0? Can the true GFTs really be so different from the ML trees?


Best regards,
Stepan

Stepan Puhov

Dec 27, 2023, 3:15:13 AM
to GeneRax
Dear Benoit,

From checking the last AleRax commit it seems that the pruning bug is successfully fixed (thank you for that!), but a new bug with the family trimming options has somehow appeared. Now both the --max-clade-split-ratio and the --trim-ratio options apparently get stuck in an infinite loop: the last line printed in the log file is about trimming initiation, and then just nothing happens for too long.

I presume, you will be able to answer only after your vacation, but please consider also answering some points mentioned in the previous letter of this thread. There is a question mentioned about inferred branch lengths which is still quite relevant to me.

Happy holidays!

Best regards,
Stepan

Benoit Morel

Dec 27, 2023, 11:21:15 AM
to Stepan Puhov, GeneRax
Hi Stepan,
Yes I will look at this when I am back. I am a bit lost with the different points, could you just rewrite a list of the unanswered points?
Benoit

Stepan Puhov

Jan 22, 2024, 12:33:40 AM
to GeneRax
Hi Benoit,

Here I will summarize all the known unresolved issues, as you just asked in the previous letter:

I. Cosmetic defects in the log:
Nothing serious, just potential recommendations to consider. Do not require replying :)
1) A typo in the AleRax log file:
Despite running AleRax, the second line of the log file insists it is GeneRax:
[00:00:00] GeneRax was called as follow:
2) Recommended number of mpi ranks in the log file is totally senseless:
The log file in its line Recommended maximum number of cores suggests using 473 cores for a run with a dataset of 1000 families and 80 species and the DL model invoked, where ca. 40 cores have empirically been found to be the optimum for the dataset. Yet it suggests using 18 cores for a run with a dataset of 25 families and 250 species and the DTL model invoked, where 24 have empirically been proven to perform better, but still to be not enough for the dataset.
If it is hard to improve and it works really not so well, maybe it is worth considering to remove this recommendation from the log file?
3) Intimidating root search depth:
The following line in the log file might seem a bit confusing (my tree does not have enough branches for such a search radius!):
[00:11:32] [Species search] Root search with depth=4294967295
I understand that here you are just using a very large number to cover all the possible cases. But maybe it would be better to substitute the line in the log with a phrase like "Root search with maximal depth"? 
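
The number in that log line is not arbitrary: 4294967295 equals 2^32 - 1, the largest value of an unsigned 32-bit integer, which is also what -1 becomes when stored in such a type. The short Python check below only confirms this arithmetic; how exactly AleRax arrives at the value is not verified here.

```python
import ctypes

depth_in_log = 4294967295

# Largest value representable in an unsigned 32-bit integer.
print(depth_in_log == 2**32 - 1)                  # True
# Storing -1 in an unsigned 32-bit integer wraps around to the same value.
print(ctypes.c_uint32(-1).value == depth_in_log)  # True
```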

II. Serious bugs:
1) Both the --max-clade-split-ratio and the --trim-ratio options apparently get stuck in an infinite loop: the last line printed in the log file is about trimming initiation, and then just nothing happens for too long. The problem is data-independent and thus must be reproducible.
2) For the current version (22.01.24), the AleRax installation exits with an error after being 98% completed. Installation log file attached.

III. Possible bugs:
1) An occasional joint likelihood decrease:
[00:03:48]      better tree (transfers:0.125455, trial: 33, ll=-293451, hash=98729)
[00:03:48] [SpeciesSearch] Optimizing model rates (light)
[00:03:57] [Species search]   After model rate opt, ll=-297235
As this issue is quite rare, I think it doesn't actually affect performance, but, maybe, it is still worth noting.
2) Questions about estimated species tree branch lengths:
When starting with a random tree, all (!) the branch lengths end up being estimated as 0.1. That is true for both AleRax and SpeciesRax. But shouldn't the lengths be estimated from the inferred GFT reconciliations and, thus, be independent of the starting tree?
I'm also concerned with the branch lengths being somewhat large. Consider having a set of gene families which contain no in-paralogues and no xenologues, so every pair of adjacent leaves are either orthologues or paralogues from different species. Thus, for any 2 species the speciation event cannot occur deeper than the separation of the 2 closest leaves belonging to these species seen in the true GFTs. Also consider that we know that the ML trees for these gene families (inferred from alignments alone) don't have any branches longer than 1.4. How come a good share of the inferred species tree's leaf (!) branches have lengths exceeding 1.5 and sometimes 2.0?


Best regards,
Stepan

alerax_install_error.txt

Benoit Morel

Jan 23, 2024, 7:22:18 AM
to Stepan Puhov, GeneRax
Thanks Stepan, I'll go through the list and bugs.
Regarding the number of cores, it is the optimal number of cores assuming that your machine has enough cores (e.g., in your case, I guess your machine has 40 physical cores, but it would run faster on a cluster). I'll figure out a less misleading message :)

Stepan Puhov

Jan 26, 2024, 12:25:01 AM
to GeneRax
Hi Benoit,

Your fixing did change something, but didn't quite help -- now another installation error occurs :(
Or is something wrong on my part?
Installation log file attached.


Best regards,
Stepan

alerax_install_error_2.txt

Benoit Morel

Jan 26, 2024, 8:08:00 AM
to Stepan Puhov, GeneRax
Hi Stepan,
I think I fixed it, but could you give it one more try?
Best,
Benoit

Stepan Puhov

Jan 26, 2024, 8:26:50 AM
to GeneRax
No, doesn't work (
A little typo in your latest amendment. Here is a brief error context from the installation log:
-- Performing Test CXX_STANDARD_14_SUPPORT - Success
-- CMAKE_BUILD_TYPE not set, defaulting to Debug
CMake Error at ext/GeneRaxCore/CMakeLists.txt:33:
  Parse error.  Function missing ending ")".  End of file reached.

Best regards,
Stepan

Benoit Morel

Jan 26, 2024, 8:33:20 AM
to Stepan Puhov, GeneRax
I'm very sorry, I went too fast. Now it should compile

Stepan Puhov

Jan 26, 2024, 9:40:30 AM
to GeneRax
Benoit,

Thanks, now the installation works)
A quick question: you have just made the dependency on the GSL library optional. Could there be any difference in usage between compilations with and without this library preinstalled?

Best regards,
Stepan  

Stepan Puhov

Jan 26, 2024, 9:40:33 AM
to GeneRax
Sorry, yet another error:
[ 71%] Building CXX object ext/GeneRaxCore/src/CMakeFiles/generaxcore.dir/optimizers/DTLOptimizer.cpp.o
/home/flaumberg/gpfs/Scripts/AleRax-1.0.0/ext/GeneRaxCore/src/optimizers/DTLOptimizer.cpp:12:10: fatal error: gsl/gsl_multimin.h: No such file or directory
 #include <gsl/gsl_multimin.h>
          ^~~~~~~~~~~~~~~~~~~~
compilation terminated.



Benoit Morel

Jan 26, 2024, 9:50:53 AM
to Stepan Puhov, GeneRax
No, don't worry. I am testing new libraries for the optimization of the parameters, but this is experimental and can only be enabled with options that are not documented. I pushed it accidentally when fixing one of the issues, which caused this mess, but having gsl installed or not won't change anything for you

Stepan Puhov

Jan 26, 2024, 10:38:04 AM
to GeneRax
Benoit, and what could you tell about these two issues? As I have just checked the infinite loop problem is still very much there:

1) Both the --max-clade-split-ratio and the --trim-ratio options apparently get stuck in an infinite loop: the last line printed in the log file is about trimming initiation, and then just nothing happens for too long. The problem is data-independent and thus must be reproducible.
2) Questions about estimated species tree branch lengths:
When starting with a random tree, all (!) the branch lengths end up being estimated as 0.1. That is true for both AleRax and SpeciesRax. But shouldn't the lengths be estimated from the inferred GFT reconciliations and, thus, be independent of the starting tree?
I'm also concerned with the branch lengths being somewhat large. Consider having a set of gene families which contain no in-paralogues and no xenologues, so every pair of adjacent leaves are either orthologues or paralogues from different species. Thus, for any 2 species the speciation event cannot occur deeper than the separation of the 2 closest leaves belonging to these species seen in the true GFTs. Also consider that we know that the ML trees for these gene families (inferred from alignments alone) don't have any branches longer than 1.4. How come a good share of the inferred species tree's leaf (!) branches have lengths exceeding 1.5 and sometimes 2.0?


Best regards,
Stepan

Benoit Morel

Jan 26, 2024, 10:54:38 AM
to Stepan Puhov, GeneRax
I didn't have the time to look at these yet.

But I just fixed 1) (hopefully).

The branch lengths of the species tree inferred with AleRax are meaningless. Maybe they depend on the starting tree because we just reuse the initial branch lengths and then they get rearranged. I should set all of them to a hardcoded value at the end to avoid any confusion. There are ways to estimate those branch lengths but we never had the time to work on them. However, SpeciesRax does estimate those branch lengths (although the method is not that great...). I hope one of us will manage to work on this soon.

Stepan Puhov

Jul 20, 2024, 7:14:12 PM
to GeneRax
Hi Benoit,

There are 2 major bugs in AleRax that went unnoticed throughout the recent commits:

1. The --prune-species-tree option doesn't have any effect on the final likelihood. I've managed to trace the origin of this bug to the commit 788f488 (the first time where it happens). It could be tested only for the UndatedDTL model due to the second bug (see below).
However, from looking through the code I also anticipate another problem in the UndatedDL model, as, in contrast to the UndatedDTL model, it doesn't declare the onSpeciesTreeChange function (guess, this discrepancy between the models wasn't originally intended?).

2. Using the UndatedDL model for species tree search fails during the transfer-guided step with the following error message:
alerax-f0f94e0/src/ale/UndatedDTLMultiModel.hpp:533: void UndatedDTLMultiModel<REAL>::recomputeSpeciesProbabilities() [with REAL = ScaledValue]: Assertion `maxSpeciesId == transferRates.size()' failed.
The bug originates somewhere after the commit 350996c (this one works properly) and before the commit f0f94e0 (I haven't been able to compile the commits in between, hence cannot provide any further precision).

If needed, I can send some test data.

Hope it can be fixed. Thank you!

Best regards,
Stepan
