Dear Dr. Morel (@BenoitMorel),
We recently ran gene tree reconciliation for a gene family with over 4000 members, and we encountered significant performance issues. Specifically, the process seems to stall during the "Final model rate optimization" step, which has been running for over 40 hours with the parameter --gene-tree-samples 100.
Below are the command, input files, and logs for your reference:
Command:
alerax -f IPR004000_families.txt \
-s species.cleaned.nwk \
-p alerax/reconciliation/IPR004000 \
--gene-tree-samples 100
Logs:
[00:00:00] AleRax v1.4.0
Logs will also be printed into alerax/reconciliation/IPR004000/alerax.log
[00:00:00] AleRax v1.4.0
[00:00:00] AleRax was called as follows:
alerax -f IPR004000_families.txt -s species.cleaned.nwk -p alerax/reconciliation/IPR004000 --gene-tree-samples 100
[00:00:00] Run settings:
Output directory: alerax/reconciliation/IPR004000
Families information: IPR004000_families.txt
Starting species tree: will be imported from the user file: species.cleaned.nwk
MPI ranks: 1
Random seed: 123
Reconciliation model: UndatedDTL
Transfer constraint: transfers to parents are forbidden
Memory savings: OFF
Model parametrization: rates are global to all species and families
Rate optimizer: LBFGSB
Prune species mode: OFF
Gene tree rooting: all gene tree root positions are considered with the same probability
Origination strategy: gene families can originate from each species with the same probability
Species tree search: skip species tree search
Number of reconciled gene trees to sample: 100
AleRax will exclude gene families covering less than 4 species
[00:00:00] Checking families...
[00:00:00] Generating ccp files...
[00:04:05] Families: 1
[00:04:05] Trimming families covering less than 4 species...
[00:04:05] Remaining families: 1
[00:04:05] Initializing starting species tree...
[00:04:05] Checking that ccps and mappings are valid...
[00:04:15] Input data information:
- Number of gene families: 1
- Number of species: 1936
- Total number of genes: 4032
- Average number of genes per family: 4032
- Maximum number of genes per family: 4032
- Species with the smallest family coverage: "GCA_038131915.1_SD3109_NP.scaffolds.fasta_genomic" (covered by 0/1 families)
- Average (over species) species family coverage: 0
[00:04:24] Initializing ccps and evaluators...
[03:05:40] Initializing ccps finished
[03:05:40] Total number of clades: 801703
[03:05:40] Load balancing: 1
[03:05:40] Recommended maximum number of cores: 1
[03:05:40] Initial ll=-28644.3
[03:05:40] Start the species tree optimization...
[03:05:40] Optimization skipped!
[03:05:40] End of the species tree optimization
[03:05:40] Final species tree topology: alerax/reconciliation/IPR004000/species_trees/inferred_species_tree.newick
[03:05:40] Final model rate optimization, non-thorough...
[03:05:40] [Species search] Optimizing model rates (light), ll=-28644.3
[03:05:40] [Species search] Free parameters: 3
We also tried reducing --gene-tree-samples to 20 and to 1, but the run is still very slow, with no noticeable progress after approximately 20 hours.
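To see where the wall-clock time goes in a log like the one above, the following is a minimal Python sketch that measures the gap between consecutive timestamped lines. It assumes only the `[HH:MM:SS] message` layout visible in the excerpt. The function name `step_durations` is illustrative and not part of AleRax.

```python
import re

# Matches lines such as "[03:05:40] Initializing ccps finished"
LINE_RE = re.compile(r"^\[(\d+):(\d{2}):(\d{2})\]\s+(.*)$")

def step_durations(log_text):
    """Return (message, seconds until the next timestamped line) pairs."""
    stamped = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            h, mi, s, msg = m.groups()
            stamped.append((int(h) * 3600 + int(mi) * 60 + int(s), msg))
    # Pair each timestamped line with the one that follows it
    return [(msg, nxt - t) for (t, msg), (nxt, _) in zip(stamped, stamped[1:])]

# Lines copied from the log excerpt above
sample = """\
[00:04:15] Input data information:
[00:04:24] Initializing ccps and evaluators...
[03:05:40] Initializing ccps finished
[03:05:40] Total number of clades: 801703
"""
durations = step_durations(sample)
slowest = max(durations, key=lambda d: d[1])
print(slowest)  # ('Initializing ccps and evaluators...', 10876)
```

On this excerpt, ccp initialization alone accounts for 10876 seconds (about 3 h 01 min). The final rate optimization step has no later timestamp, so its duration cannot be read from the log itself.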
Given the size of this gene family (4,032 genes across 1,936 species), we are concerned about the speed of the reconciliation process. Could you kindly provide guidance on how to speed up gene tree reconciliation for large gene families? Additionally, are there alternative strategies or tools that might be more efficient for this type of analysis?
Thank you in advance for your assistance.
Best regards,
Lily
Dear Dr. Morel,
Thank you very much for your helpful suggestion. We would like to follow your recommendation and first assess the runtime of gene tree sampling with parameter optimization disabled.
Could you please clarify the correct way to disable (or effectively bypass) model parameter optimization in AleRax? In particular, should this be done by explicitly fixing the DTL rates via command-line options, or by adjusting optimization-related parameters such as --rec-opt or the --origination strategy (e.g. UNIFORM, ROOT, LCA, OPTIMIZE)?
Any guidance on the recommended practice for this use case would be greatly appreciated.
Best regards,
Lily
We are able to run reconciliation analyses successfully with parameter optimization disabled (using the --fix-rates option). We would now like to seek your advice on how best to determine an appropriate set of DTL rates for a very large gene family comprising more than 4,000 gene copies.
Our current understanding and tentative strategy are as follows:
For gene families with more than roughly 4,000 genes, direct optimization of reconciliation parameters is often unreliable and computationally prohibitive. This is primarily due to
(i) the exponential growth of the gene tree search space, and
(ii) the limited phylogenetic signal available per gene when aligned across a very large number of species.
To mitigate these issues, we consider applying a downsampling strategy to construct one or more smaller, representative gene trees. In this context, we would appreciate guidance on what range of node counts (e.g., hundreds vs. low thousands) is generally considered reasonable for reliable rate estimation.
Using these downsampled gene trees, we would then estimate the optimal DTL rate parameters with AleRax.
Finally, we would apply the inferred rate parameters to the reconciliation analysis of the full, large-scale gene tree.
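As an illustration of the downsampling step, here is a minimal Python sketch that keeps every gene copy from a random subset of species. It covers only the selection of genes to retain. Pruning the gene tree samples and the species tree to match would still need a tree library and is not shown. The function name `downsample_family`, the in-memory `gene_to_species` dictionary, and the toy gene and species names are all hypothetical and do not reflect any AleRax file format.

```python
import random

def downsample_family(gene_to_species, n_species, seed=123):
    """Keep every gene copy from a random subset of n_species species."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    species = sorted(set(gene_to_species.values()))
    if n_species > len(species):
        raise ValueError("n_species exceeds the number of species in the family")
    kept_species = set(rng.sample(species, n_species))
    # Retain all copies from the kept species so within-species
    # duplications are not artificially removed
    return {g: s for g, s in gene_to_species.items() if s in kept_species}

# Toy mapping (hypothetical names): 6 genes across 4 species
toy = {
    "g1": "spA", "g2": "spA", "g3": "spB",
    "g4": "spC", "g5": "spD", "g6": "spD",
}
subset = downsample_family(toy, n_species=2)
print(sorted(subset))
```

Sampling whole species rather than individual genes is one possible design choice: it keeps paralogs from the same genome together, which seems relevant when the goal is to estimate duplication and loss rates. Whether it gives representative rates for the full family is an open question here.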
In addition, we would like to ask for your opinion on the following methodological question:
When estimating reconciliation rates on reduced datasets, is it preferable to
(a) estimate rates individually for each gene family (on a small-scale version) and then apply the resulting family-specific rates to the corresponding full-scale family, or
(b) estimate a single global set of rates using a diverse collection of small-scale gene families, and then apply this global parameter set to all large-scale families?
We would greatly appreciate any recommendations or best-practice guidance you could share regarding these approaches.
Thank you very much for your time and for developing and maintaining AleRax.
Best regards,
Lily