Slow Gene Tree Reconciliation for Large Gene Family (4000+ Members)

林为丽

Feb 1, 2026, 7:44:41 AM
to GeneRax

Dear Dr. Morel (@BenoitMorel),

We recently ran gene tree reconciliation for a gene family with over 4000 members, and we encountered significant performance issues. Specifically, the process seems to stall during the "Final model rate optimization" step, which has been running for over 40 hours with the parameter --gene-tree-samples 100.

Below are the command, input files, and logs for your reference:

Command:

alerax -f IPR004000_families.txt \
  -s species.cleaned.nwk \
  -p alerax/reconciliation/IPR004000 \
  --gene-tree-samples 100


Logs:

[00:00:00] AleRax v1.4.0
Logs will also be printed into alerax/reconciliation/IPR004000/alerax.log
[00:00:00] AleRax v1.4.0
[00:00:00] AleRax was called as follows:
alerax -f IPR004000_families.txt -s species.cleaned.nwk -p alerax/reconciliation/IPR004000 --gene-tree-samples 100

[00:00:00] Run settings:
        Output directory: alerax/reconciliation/IPR004000
        Families information: IPR004000_families.txt
        Starting species tree: will be imported from the user file: species.cleaned.nwk
        MPI ranks: 1
        Random seed: 123
        Reconciliation model: UndatedDTL
        Transfer constraint: transfers to parents are forbidden
        Memory savings: OFF
        Model parametrization: rates are global to all species and families
        Rate optimizer: LBFGSB
        Prune species mode: OFF
        Gene tree rooting: all gene tree root positions are considered with the same probability
        Origination strategy: gene families can originate from each species with the same probability
        Species tree search: skip species tree search
        Number of reconciled gene trees to sample: 100
        AleRax will exclude gene families covering less than 4 species

[00:00:00] Checking families...
[00:00:00] Generating ccp files...
[00:04:05] Families: 1
[00:04:05] Trimming families covering less than 4 species...
[00:04:05] Remaining families: 1
[00:04:05] Initializing starting species tree...
[00:04:05] Checking that ccps and mappings are valid...
[00:04:15] Input data information:
- Number of gene families: 1
- Number of species: 1936
- Total number of genes: 4032
- Average number of genes per family: 4032
- Maximum number of genes per family: 4032
- Species with the smallest family coverage: "GCA_038131915.1_SD3109_NP.scaffolds.fasta_genomic" (covered by 0/1 families)
- Average (over species) species family coverage: 0

[00:04:24] Initializing ccps and evaluators...
[03:05:40] Initializing ccps finished
[03:05:40] Total number of clades: 801703
[03:05:40] Load balancing: 1
[03:05:40] Recommended maximum number of cores: 1
[03:05:40] Initial ll=-28644.3

[03:05:40] Start the species tree optimization...

[03:05:40] Optimization skipped!

[03:05:40] End of the species tree optimization
[03:05:40] Final species tree topology: alerax/reconciliation/IPR004000/species_trees/inferred_species_tree.newick

[03:05:40] Final model rate optimization, non-thorough...
[03:05:40] [Species search] Optimizing model rates (light), ll=-28644.3
[03:05:40] [Species search]   Free parameters: 3


We also tried reducing --gene-tree-samples to 20 and to 1, but the run is still very slow, with no visible progress in the log after approximately 20 hours.

Given the large size of the gene family, with over 4000 nodes, we are concerned about the speed of the reconciliation process. Could you kindly provide guidance on how to speed up the gene tree reconciliation for large gene families? Additionally, are there alternative strategies or tools that might be more efficient for this type of analysis?

Thank you in advance for your assistance.

Best regards,
Lily

IPR004000.treelist

IPR004000_families.txt
IPR004000_mapping.link.txt
species.cleaned.nwk.txt

Benoit Morel

Feb 2, 2026, 2:27:26 AM
to 林为丽, GeneRax
Dear Lily,
The parameter optimization is actually the slowest step, and it happens before the sampling. I recommend first disabling parameter optimization and checking how long it takes to generate 1, 20, or more gene tree samples.
If that works: the parameters are global to all gene families (by default, at least), so you could estimate them on a subset of families and reuse them for the full run.
I hope this helps!
Benoit


Lily

Feb 2, 2026, 9:45:32 AM
to GeneRax

Dear Dr. Morel,

Thank you very much for your helpful suggestion. We would like to follow your recommendation and first assess the runtime of gene tree sampling with parameter optimization disabled.

Could you please clarify the correct way to disable (or effectively bypass) model parameter optimization in AleRax? In particular, should this be done by explicitly fixing the DTL rates via command-line options, or by adjusting optimization-related parameters such as --rec-opt or the --origination strategy (e.g. UNIFORM, ROOT, LCA, OPTIMIZE)?

Any guidance on the recommended practice for this use case would be greatly appreciated.

Best regards,
Lily


Lily

Feb 4, 2026, 6:55:50 PM
to GeneRax
Dear Dr. Morel (@BenoitMorel),

We are able to run reconciliation analyses successfully with parameter optimization disabled (using the --fix-rates option). We would now like to seek your advice on how best to determine an appropriate set of DTL rates for a very large gene family comprising more than 4,000 gene copies.
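
For anyone repeating the timing check suggested earlier in the thread, here is a minimal Python sketch of how the runs might be scripted. It uses only options that appear in this thread (-f, -s, -p, --gene-tree-samples, --fix-rates). It assumes --fix-rates is a bare switch that takes no value, which should be checked against the AleRax help output. The helper names build_alerax_command and time_run are hypothetical.

```python
import subprocess
import time


def build_alerax_command(families, species_tree, prefix, samples, fix_rates=True):
    """Assemble the AleRax call from this thread as an argument list."""
    cmd = [
        "alerax",
        "-f", families,
        "-s", species_tree,
        "-p", prefix,
        "--gene-tree-samples", str(samples),
    ]
    if fix_rates:
        # Assumption: --fix-rates is a bare switch that takes no value.
        cmd.append("--fix-rates")
    return cmd


def time_run(cmd):
    """Run the command and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start
```

Calling time_run(build_alerax_command("IPR004000_families.txt", "species.cleaned.nwk", prefix, n)) for n = 1, 20, and 100, with a separate output prefix for each run, would give the runtimes to compare.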

Our current understanding and tentative strategy are as follows:

  1. For gene families with more than ~4,000 nodes, direct optimization of reconciliation parameters is often unreliable and computationally prohibitive. This is primarily due to
    (i) the exponential growth of the gene tree search space, and
    (ii) the limited phylogenetic signal available per gene when aligned across a very large number of species.

  2. To mitigate these issues, we consider applying a downsampling strategy to construct one or more smaller, representative gene trees. In this context, we would appreciate guidance on what range of node counts (e.g., hundreds vs. low thousands) is generally considered reasonable for reliable rate estimation.

  3. Using these downsampled gene trees, we would then estimate the optimal DTL rate parameters with AleRax.

  4. Finally, we would apply the inferred rate parameters to the reconciliation analysis of the full, large-scale gene tree.
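
As an illustration of step 2, here is a minimal Python sketch of one possible downsampling scheme: pick a random subset of species and keep at most a fixed number of gene copies per selected species. This is uniform random sampling, chosen only for simplicity; it is not a method recommended in this thread and not necessarily the most representative choice. The function name downsample_genes and its parameters are hypothetical, and the input is assumed to be a plain gene-to-species dictionary already parsed from the mapping file.

```python
import random
from collections import defaultdict


def downsample_genes(gene_to_species, n_species, max_genes_per_species=1, seed=123):
    """Return a reduced gene -> species mapping.

    Picks n_species species at random, then keeps at most
    max_genes_per_species genes for each selected species.
    """
    rng = random.Random(seed)

    # Group genes by species; sort so the result depends only on the seed.
    by_species = defaultdict(list)
    for gene, species in sorted(gene_to_species.items()):
        by_species[species].append(gene)

    all_species = sorted(by_species)
    chosen = rng.sample(all_species, min(n_species, len(all_species)))

    kept = {}
    for species in chosen:
        genes = by_species[species]
        for gene in rng.sample(genes, min(max_genes_per_species, len(genes))):
            kept[gene] = species
    return kept
```

The gene trees in the tree list would then need to be pruned to the retained leaves with a tree library before rates are estimated on the reduced family; that step is not shown here.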

In addition, we would like to ask for your opinion on the following methodological question:

When estimating reconciliation rates on reduced datasets, is it preferable to

  • (a) estimate rates individually for each gene family (on a small-scale version) and then apply the resulting family-specific rates to the corresponding full-scale family, or

  • (b) estimate a single global set of rates using a diverse collection of small-scale gene families, and then apply this global parameter set to all large-scale families?

We would greatly appreciate any recommendations or best-practice guidance you could share regarding these approaches.

Thank you very much for your time and for developing and maintaining AleRax.

Best regards,
Lily

