AleRax terminates with error at Initializing ccps and evaluators...


Varada Khot

Oct 14, 2024, 1:41:04 PM
to GeneRax
Hi there,

I'm running into the following issue with AleRax (on a trial run): it terminates at the [Initializing ccps and evaluators...] step.

This is the command with the error:
alerax -f ../50_gene_trees/gene_families_file_test.txt -s ../60_species_tree/20_iqtree/species_tree_renamed_rooted.tree -p 10_output_reconciliation_5 --gene-tree-samples 100 --fraction-missing-file fraction_missing.txt --highways --species-tree-search HYBRID

This command works:
alerax -f ../50_gene_trees/gene_families_file_test.txt -s ../60_species_tree/20_iqtree/species_tree_renamed_rooted.tree -p 10_output_reconciliation_5 --gene-tree-samples 100 --highways --species-tree-search HYBRID

I tested each of the parameters individually and found that the core dump occurs only when I include the --fraction-missing-file fraction_missing.txt option.

I am fairly positive it's not a memory issue (as many core dumps are), since the resource allocation is 200G and the command without the fraction-missing file runs with just 10G.

Please let me know if you have any ideas on how to fix this!

Thanks,

Varada

I'm not able to attach the fraction-missing file and the error, so both are pasted below.

error:
hey 8.39429 < 1 is wrong
alerax: /home/vmkhot/data/Programs/AleRax/src/ale/UndatedDTLMultiModel.hpp:543: void UndatedDTLMultiModel<REAL>::recomputeSpeciesProbabilities() [with REAL = double]: Assertion `proba < REAL(1.000001)' failed.
[mc172:2351572] *** Process received signal ***
[mc172:2351572] Signal: Aborted (6)
[mc172:2351572] Signal code:  (-6)
[mc172:2351572] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x7f37fba42cf0]
[mc172:2351572] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f37fb6b9acf]
[mc172:2351572] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f37fb68cea5]
[mc172:2351572] [ 3] /lib64/libc.so.6(+0x21d79)[0x7f37fb68cd79]
[mc172:2351572] [ 4] /lib64/libc.so.6(+0x47426)[0x7f37fb6b2426]
[mc172:2351572] [ 5] alerax(_ZN20UndatedDTLMultiModelIdE29recomputeSpeciesProbabilitiesEv+0xca5)[0x494775]
[mc172:2351572] [ 6] alerax(_ZN12AleEvaluator15resetEvaluationEjb+0x96e)[0x482f8e]
[mc172:2351572] [ 7] alerax(_ZN12AleEvaluatorC1ER12AleOptimizerR11SpeciesTreeRK12RecModelInfo20ModelParametrizationRSt6vectorI18AleModelParametersSaIS9_EEbbRKS8_I10FamilyInfoSaISD_EER16PerCoreGeneTreesRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESR_+0x2c5)[0x483855]
[mc172:2351572] [ 8] alerax(_ZN12AleOptimizerC2ENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorI10FamilyInfoSaIS7_EERK12RecModelInfo20ModelParametrizationRK10ParametersbbRKS5_SK_+0xafd)[0x4a47fd]
[mc172:2351572] [ 9] alerax(_Z3runR12AleArguments+0x48b)[0x467d5b]
[mc172:2351572] [10] alerax(_Z11alerax_mainiPPcPv+0x63)[0x468823]
[mc172:2351572] [11] alerax(_Z13internal_mainiPPcPv+0x4d)[0x46893d]
[mc172:2351572] [12] /lib64/libc.so.6(__libc_start_main+0xe5)[0x7f37fb6a5d85]
[mc172:2351572] [13] alerax(_start+0x2e)[0x46017e]
[mc172:2351572] *** End of error message ***
/var/spool/slurmd/job32844731/slurm_script: line 19: 2351572 Aborted                 (core dumped) alerax -f ../50_gene_trees/gene_families_file_test.txt -s ../60_species_tree/20_iqtree/species_tree_renamed_rooted.tree -p 10_output_reconciliation_5 --gene-tree-samples 100 --fraction-missing-file fraction_missing.txt --highways --species-tree-search HYBRID

fraction_missing.txt (sample):
g0003 0.04
g0004 0.01
g0006 1.08
g0007 4.72
g0008 0.71
g990065 0.2
g990066 1.79
g0009 2.92
g0010 2.79
g0011 2.78
g0012 2.48





Benoit Morel

Oct 14, 2024, 1:44:08 PM
to Varada Khot, GeneRax
Hello,
If I remember correctly, these numbers should be between 0 and 1. I'll add a proper error message.
Benoit
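
A quick pre-flight check for the range rule mentioned above could look like the following. This is a minimal sketch and not AleRax code: the function name findInvalidFractions is hypothetical, and it assumes the file holds one whitespace-separated name/value pair per line, as in the sample posted earlier in the thread.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of AleRax): returns the names of all entries
// whose value is missing, unparsable, or outside the range [0, 1].
// Assumed format: one "<name> <value>" pair per line.
std::vector<std::string> findInvalidFractions(std::istream &in) {
  std::vector<std::string> invalid;
  std::string line;
  while (std::getline(in, line)) {
    std::istringstream fields(line);
    std::string name;
    double fraction = 0.0;
    if (!(fields >> name)) {
      continue; // skip blank lines
    }
    if (!(fields >> fraction) || fraction < 0.0 || fraction > 1.0) {
      invalid.push_back(name);
    }
  }
  return invalid;
}
```

Called on an std::ifstream opened on fraction_missing.txt, this would flag entries such as g0006 (1.08) and g0007 (4.72) from the sample, which exceed 1.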


Varada Khot

Oct 30, 2024, 3:38:02 AM
to GeneRax
Hi Benoit,

Thanks for your response! I was able to run AleRax after fixing the fraction missing file, but have now encountered a new error. 

This is my command:
mpirun --oversubscribe -np 50 alerax -f ../50_gene_trees/gene_families_file.txt -s ../60_species_tree/20_iqtree/species_tree_renamed_rooted.tree -p 30_gene_tree_reconciliation_all --gene-tree-samples 100 --fraction-missing-file fraction_missing.txt --highways

With these resources:
#SBATCH --ntasks=50
#SBATCH --mem=250G
#SBATCH --time=48:00:00

And this is the error I get after it exports 5300 reconciliations out of 7884:

alerax: /home/vmkhot/data/Programs/AleRax/ext/GeneRaxCore/src/trees/PLLUnrootedTree.cpp:171: static std::__cxx11::string PLLUnrootedTree::buildConsensusTree(const std::vector<std::shared_ptr<PLLUnrootedTree> >&, double): Assertion `treePointers.size()' failed.
[mc100:400607] *** Process received signal ***
[mc100:400607] Signal: Aborted (6)
[mc100:400607] Signal code:  (-6)
[mc100:400607] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x7f439bfdbcf0]
[mc100:400607] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f439bc52acf]
[mc100:400607] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f439bc25ea5]
[mc100:400607] [ 3] /lib64/libc.so.6(+0x21d79)[0x7f439bc25d79]
[mc100:400607] [ 4] /lib64/libc.so.6(+0x47426)[0x7f439bc4b426]
[mc100:400607] [ 5] alerax(_ZN15PLLUnrootedTree18buildConsensusTreeB5cxx11ERKSt6vectorISt10shared_ptrIS_ESaIS2_EEd+0x2bf)[0x54991f]
[mc100:400607] [ 6] alerax(_ZN13PLLRootedTree18buildConsensusTreeERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS6_EEd+0x34f)[0x55a7ef]
[mc100:400607] [ 7] alerax(_ZN12AleOptimizer9reconcileEj+0x17de)[0x4a182e]
[mc100:400607] [ 8] alerax(_Z3runR12AleArguments+0x626)[0x467ef6]
[mc100:400607] [ 9] alerax(_Z11alerax_mainiPPcPv+0x63)[0x468823]
[mc100:400607] [10] alerax(_Z13internal_mainiPPcPv+0x4d)[0x46893d]
[mc100:400607] [11] /lib64/libc.so.6(__libc_start_main+0xe5)[0x7f439bc3ed85]
[mc100:400607] [12] alerax(_start+0x2e)[0x46017e]
[mc100:400607] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 11 with PID 400607 on node mc100 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

During my first run of this program, it timed out after 48h, so the program quit. I then restarted with the same command, thinking it would not need another 48h to export the reconciliations. Time doesn't seem to be the issue, though. I also got the same error when I ran the command with --species-tree-search HYBRID, but I did not use the species tree search argument this time.

Additionally, my command runs totally fine with and without the --species-tree-search argument with a smaller dataset (500 gene trees).

Let me know if you have any thoughts on how to fix this! 

Thanks kindly,

Varada

Varada

Nov 15, 2024, 2:25:34 AM
to GeneRax
Hi Benoit,

Just wondering if you had any thoughts on this error? I've tried my absolute best to debug it on my end, with multiple subsets of my data etc. None of my gene tree distributions are problematic either.

I have a feeling it might be some type of bug in the code related to the number of gene trees I can run? Unfortunately, I'm not familiar enough with C++ to look into this further, but I did see a very similar bug reported here https://github.com/BenoitMorel/AleRax/issues/8 (the first one).

This is the section of code that fails in PLLUnrootedTree.cpp (line 171):

169: auto weight = 1.0 / static_cast<double>(treePointers.size());
170: weights = std::vector<double>(treePointers.size(), weight);
171: assert(treePointers.size());
172: assert(weights.size());

I'm not sure why an assert on treePointers.size() should return False, if in the line above, there's a 1/treePointers.size() - which does not produce an error. 

I can understand you are probably very busy so your help is most appreciated!

Thanks,

Varada

Stefan Flaumberg

Nov 21, 2024, 12:53:24 PM
to GeneRax
Hi Varada,

What you're reporting here definitely should not happen on a normal run, but it's really hard to understand what exactly causes the error only from your description without being able to reproduce it.

To me it seems that the consensus-making subroutine (namely the PLLUnrootedTree::buildConsensusTree function) gets an empty input file with no gene trees, thus it cannot fill the treePointers vector. The input file here is always the reconciliations/all/<fam_name>.newick file with reconciliation samples for a given gene family. So the question is why this file is empty.
Line 169 is not surprising: division by 0 is not prohibited for the double type in C++, so no error is produced there. The bug from GitHub you mentioned doesn't seem to be related and has already been fixed.
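
To illustrate why lines 169 and 170 of the quoted snippet pass silently on an empty input, here is a minimal standalone sketch. The function names are hypothetical and only mirror the two expressions from the snippet. It assumes IEEE 754 doubles, which is what practically all current platforms provide.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Mirrors line 169: the size is converted to double before dividing,
// so a size of 0 gives 1.0 / 0.0, which is +inf under IEEE 754.
double uniformWeight(std::size_t treeCount) {
  return 1.0 / static_cast<double>(treeCount);
}

// Mirrors line 170: with a count of 0 this is simply an empty vector,
// so the infinite weight is never stored anywhere.
std::vector<double> uniformWeights(std::size_t treeCount) {
  return std::vector<double>(treeCount, uniformWeight(treeCount));
}

// True only for weights that are ordinary finite numbers.
bool isUsableWeight(double weight) {
  return std::isfinite(weight);
}
```

With a count of 0, uniformWeight returns an infinite value and uniformWeights returns an empty vector, and neither raises an error. The assert on line 171 is therefore the first statement that actually checks for an empty input.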

Although your dataset is large, it should be straightforward to find the family on which the reconciliation error occurs: it will have an empty gene tree consensus file (search in the reconciliations/summaries/ directory). For this family you can then check the input tree distribution and the output *.newick file mentioned above. As the error occurs during the reconciliation step, you already have the final species tree (in the species_trees/inferred_species_tree.newick file). So you can try to reconcile the single family on which the error occurred with this final species tree and see what happens (use --species-tree-search SKIP to run in reconciliation-only mode). If this single-family run reproduces the error, then we'll know that the error is specific to the data used, and finding the cause should not be difficult.
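
The search for empty output files described above could be automated roughly as follows. This is only a sketch under assumptions: the helper names hasNoContent and findEmptyFiles are hypothetical, the directory to scan (for example reconciliations/all or reconciliations/summaries inside the output prefix) is passed in by the caller, and it needs C++17 for std::filesystem.

```cpp
#include <algorithm>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// True if the file holds no non-whitespace characters (or cannot be opened).
bool hasNoContent(const fs::path &file) {
  std::ifstream in(file);
  char c;
  return !(in >> c); // operator>> skips whitespace and fails at end of file
}

// Returns the sorted names of regular files directly inside `dir`
// that are empty or contain only whitespace.
std::vector<std::string> findEmptyFiles(const fs::path &dir) {
  std::vector<std::string> names;
  for (const auto &entry : fs::directory_iterator(dir)) {
    if (entry.is_regular_file() && hasNoContent(entry.path())) {
      names.push_back(entry.path().filename().string());
    }
  }
  std::sort(names.begin(), names.end());
  return names;
}
```

The returned file names would then point to the families worth rerunning one at a time.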

Another guess I have is that the error has something to do with checkpointing, as you wrote that you had restarted the program after a timeout (though I don't see any clear connection here). The checkpointing is not perfect in the current version, but I've recently proposed an update to the code on github that fixes most of the bugs. Hopefully, it will be accepted soon.

Sorry that I couldn't be of more help yet. But it will be interesting to see if any new details appear.

Best,
Stefan

Varada

Nov 25, 2024, 12:38:48 AM
to GeneRax
Hi Stefan,

Thanks for your reply! Sorry, I got a private reply from Noah and emailed him back without realizing that it wasn't being posted here, so I'll just post the thread below. In short, the last full run I tried, without the checkpoint, fails but does not produce empty newick files. And when a run does produce empty newick files, I can run those families separately and they work fine. It seems like there's a limit around ~5354 gene trees?

If there are any other tests I can run, or logs I can send, I would be happy to, since a reproducible example might not be possible.

Thanks,

Varada


On 15. Nov 2024, at 13:04, Varada wrote:

Hi Noah,

Just an update, I now also tried the command without the --highways parameter and it still failed after processing 5352 reconciliations. So this parameter doesn’t seem to be the issue either?

Thanks,

Varada

On 15. Nov 2024, at 10:34, Varada wrote:

Hi Noah,

Thanks for your response!

The first time I had this error, it did produce empty newick files in the reconciliations/all folder. I also thought it might have to do with these empty files so I removed them and ran it again, and it failed again by producing other empty files. When I run the command on just those tree distributions that were empty, they don’t produce any errors and are no longer empty… so I’m inclined to think that it doesn’t have to do with the gene tree distributions. 

I reran the reconciliations command last night using the same command (but not from a checkpoint) to see if I could reproduce the error and indeed it produces the same error. It again failed after producing 5354 reconciliations but did not produce any empty newick files in the reconciliations/all folder. (see attached)
 
Is it the highways parameter that is causing the issue? I would have liked to have the transfer highways in my analysis but I will try to rerun it today without this. If this was the case, should it not also fail with a subset of 500 gene trees? 

Let me know if you have any other suggestions!!

Thanks a lot,

Varada


On 15. Nov 2024, at 09:01, Noah wrote:

Hi Varada,

From your description it seems like there were no reconciled trees to build the consensus tree from (which should not happen). 1/treePointers.size() will not produce an error but happily evaluate to +inf. Maybe this should not crash the program but simply produce a warning... Ideally, the reconciliation for that family happened and produced one empty <family>.newick file in reconciliations/all. That would tell you which family caused it. I will try to help from there. Also if you are not doing something related to highways, maybe skip that option for now as it is still in development and might lead to problems.

Best
Noah
Screenshot 2024-11-22 at 10.08.12.png

Noah Wahl

Nov 27, 2024, 1:55:12 PM
to GeneRax
Thanks for posting the thread. I didn't realize the messages would not be posted here. The origin of the problem is still unclear to me. Could it be that you run into RAM or storage limits? The number of successful reconciliations is too odd to be of real significance I think. To at least get the remaining reconciliations and see if any others fail, you can turn the `assert(treePointers.size())` into `if (treePointers.size())` before running the rest of `buildConsensusTree()` and return an empty string otherwise.
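
The workaround described above might look roughly like this. It is a sketch of the guard pattern only, not the actual AleRax/GeneRaxCore code: Tree is a hypothetical stand-in for PLLUnrootedTree, buildConsensusOrEmpty is a made-up name, and the consensus computation itself is replaced by a placeholder.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for PLLUnrootedTree, only here so the sketch compiles.
struct Tree {
  std::string newick;
};

// Guarded variant: an empty input yields an empty string instead of
// tripping an assert and aborting the whole run.
std::string buildConsensusOrEmpty(
    const std::vector<std::shared_ptr<Tree>> &treePointers) {
  if (treePointers.empty()) {
    // Replaces: assert(treePointers.size());
    return std::string();
  }
  // ... the original body (weights, consensus computation) would follow here.
  // Placeholder for illustration only: return the first input tree.
  return treePointers.front()->newick;
}
```

Placing the check before the weight computation also avoids the division by zero. Whatever writes the consensus file would then need to tolerate an empty string, and that part is not shown here.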

Best
Noah

Stefan Flaumberg

Nov 28, 2024, 5:42:56 AM
to GeneRax
Hi Varada,

"It again failed after producing 5354 reconciliations but did not produce any empty newick files"
No empty files were produced, but the error message was still the same?

"It seems like it's a limit around ~5354 gene trees?"
I find it quite implausible that such a limit would reveal itself only during the reconciliation step. Each of the specified 50 cores processes a single gene family at a time (50*100 gene trees in RAM), so there is no actual dependence of the load on the number of gene families.
I still suspect that the cause of the error lies in the interplay between some particular gene family tree distribution, some particular species tree and the reconciliation model (hence, there is a bug in the model).

"If there's any other tests I can do"
Actually, you can test whether the problem is in the number of families or in their interaction by removing this interaction:
Please consider launching a run with all the families with the --model-parametrization PER-FAMILY option (in the previous runs it seems that you used GLOBAL by default) and with no species tree search (which is also the default). The memory usage shouldn't be lower than with the GLOBAL parametrization, but the model parameters will be estimated separately for each family, much as in the single-family runs, which went without error. If the run fails, then the problem is most likely due to the number of families after all, and if the run terminates without error, we will know that the problem is caused by how the gene trees are reconciled with the species tree (and that changing the model parametrization affects it).

Best,
Stefan