Ask for suggestions on dealing with big gene trees

119 views
Skip to first unread message

Yiyan Yang

unread,
Aug 2, 2021, 11:44:22 AM8/2/21
to GeneRax
Dear Benoit and the staff of Generax,

I am writing to ask for some suggestions about dealing with big gene trees. If genes are highy duplicated in a gene family X and this family has a big gene tree with 2000 leaves, but the species tree has, for example, only 100 leaves. Will GeneRax give results regarding this situation?  I guess this may lead to a very high duplication rate or extra time for computation, so I am wondering if I need to use some specific options while running Generax? 

I guess I may figure it out by just trying yet I'd also appreciate any suggestions and advice.

Thanks,
Yiyan

Benoit Morel

unread,
Aug 2, 2021, 4:36:24 PM8/2/21
to GeneRax
Hi Yiyan,

(sorry for the double post, I forgot to reply to all)

That's a good question... I don't know what is the maximum size I was able to process during my experiments, but this is definitely a challenging pair of gene tree and species tree. Here are a few tips:
- parallelization will reduce the runtime a lot, so if you have a big machine or a cluster available, use it (even if there is only one gene tree to estimate).
- do not use generax to estimate the starting gene tree from the sequences, but rather raxml-ng or a similar tool. GeneRax reuses the code from raxml-ng for this step, but maybe raxml-ng is slightly better, and this step can already take quite some time. This is even more important if you have a "large" alignment (> 1000 sites) and want to use parallelization: for this step only, GeneRax is not parallelized for a single gene tree (while raxml-ng is).
- you might have observed from the logs that GeneRax runs 5 levels of search (radius=1, 2, 3, 4, 5). You can reduce this value with the option --max-spr-radius. This will make the search less exhaustive, but faster.
- check if some of the gene sequences mapped to the same species are identical. If the number of such duplicates is not negligible, I highly recommend removing them from the analysis to make the search faster. (you can add them back at the end).
- I would avoid the --per-species-rates option for such a huge gene tree (at least for the search), because this mode can be substantially slower than the default mode. Btw, the corresponding update is still waiting for bioconda's approval.


If GeneRax seems to be too slow to process your data, do not hesitate to share it. I can't promise miracles, but  it's good to have access to difficult data in order to make the tool faster, and we all love computational challenges in our lab :D

Best,
Benoit

Benoit Morel

unread,
Aug 2, 2021, 4:38:26 PM8/2/21
to GeneRax
Out of curiosity, how large is the input alignment?

Yiyan Yang

unread,
Aug 2, 2021, 5:55:37 PM8/2/21
to GeneRax
Dear  Benoit,

Thanks so much for your quick response. 

- parallelization will reduce the runtime a lot, so if you have a big machine or a cluster available, use it (even if there is only one gene tree to estimate).
- do not use generax to estimate the starting gene tree from the sequences, but rather raxml-ng or a similar tool. GeneRax reuses the code from raxml-ng for this step, but maybe raxml-ng is slightly better, and this step can already take quite some time. This is even more important if you have a "large" alignment (> 1000 sites) and want to use parallelization: for this step only, GeneRax is not parallelized for a single gene tree (while raxml-ng is).
- you might have observed from the logs that GeneRax runs 5 levels of search (radius=1, 2, 3, 4, 5). You can reduce this value with the option --max-spr-radius. This will make the search less exhaustive, but faster.
- check if some of the gene sequences mapped to the same species are identical. If the number of such duplicates is not negligible, I highly recommend removing them from the analysis to make the search faster. (you can add them back at the end).
These are all very great suggestions! I will definitely follow them.

- I would avoid the --per-species-rates option for such a huge gene tree (at least for the search), because this mode can be substantially slower than the default mode. Btw, the corresponding update is still waiting for bioconda's approval.
Really appreciate your updates and remembering me! I may have to use the "--per-species-rates" option since I want to compare the rates among different groups of species. And no worries about the bioconda, because I have just compiled the GeneRax from source and I am using the latest version i.e. v2.0.3.

Out of curiosity, how large is the input alignment?
I haven't formally get the alignment input data. But what I am sure is that some gene trees will have > 2000 leaves.

Many thanks,
Yiyan

Benoit Morel

unread,
Aug 3, 2021, 3:48:44 AM8/3/21
to Yiyan Yang, GeneRax
Regarding the per-species rates: the reason why this was not documented nor covered in our paper is that it is substantially slower (because there are more rates to estimate on the whole set of gene trees) without clearly improving the accuracy of gene trees. I am also not sure if the method used for optimizing the rates is adequate for a large number of species (because I haven't worked that much on it).

My suggestion for this would be to split the process into two steps: first infer/correct the gene trees, without the per-species rates, and save those gene trees. Then I would try to estimate the per-species rates on the fixed gene trees.

Do not hesitate to come back here if you need further help. This seems to be a very interesting use case for GeneRax that I haven't thought about, and I'd be happy to make the changes necessary to make this kind of analysis feasible.

Best,
Benoit


--
You received this message because you are subscribed to the Google Groups "GeneRax" group.
To unsubscribe from this group and stop receiving emails from it, send an email to generaxusers...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/generaxusers/4a87c6b1-2119-4dde-b122-1c4bf6f71c66n%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Yiyan Yang

unread,
Aug 3, 2021, 7:06:18 PM8/3/21
to GeneRax

Dear Benoit,

Thanks a million for your super valuable suggestion!

I would also like to introduce you more about what I am trying to achieve in my project and ask for your opinion on how GeneRax could be involved in it.
I have collected a group of genomes and annotated them with trait data (such as soil, maine etc habitats). A species tree was built with the genomic data (annotating trait with different colors):

Screenshot 2021-08-03 174920.png

My plan is to infer DTL rates for red and non-red branches so that a group of genes with significantly different DTL rates between red and non-red branches could be found and further analyzed. My final goal is to see how these red genomes/species evolve to adapt to their unique habitat. This can also be extended to other scenarios in Comparative Biology and various hypotheses could be tested. Then I found GeneRax and thought this new and fantastic tool may be very helpful.

That is why I am wondering [Question 1] if there is a single rate estimated for all the red branches (this is where my first question about branch-specific rates came from). After discussing with you, I think the "--per-species-rates" option (i.e. free rate model) is definitely a good choice, yet considering its complexity and my use of big species/gene trees, a “partially free rate” model may greatly reduce the computation time and seems more suitable in this case.
Naturally, I am also wondering [Question 2] if the DTL rates can be compared between different branches. In other words, is it reasonable or meaningful to compare them? 

At last, would you mind taking a very quick look at the following command lines I constructed according to your suggestion to make sure I understand you correctly?

STEP1:
mpirun -np $cores $generax --families $families_2 --species-tree $speciestree --rec-model UndatedDL --prefix $output_1 --max-spr-radius 3 --strategy SPR

STEP2:
# Save the output gene trees from STEP1 and change the gene tree paths in $familiesmpirun -np $cores $generax --families $families_2 --species-tree $speciestree --rec-model UndatedDL --per-species-rates --prefix $output_2 --max-spr-radius 3 --strategy SKIP

I may have asked many questions in a single conversation and I truly and always appreciate your patience and help. 

Best,
Yiyan

Yiyan Yang

unread,
Aug 3, 2021, 7:24:19 PM8/3/21
to GeneRax
oops.  $families_2 in STEP1 should be $families_1. Sorry for the mis-spelling.

Yiyan Yang

unread,
Aug 3, 2021, 9:08:09 PM8/3/21
to GeneRax
Update STEP2:

# Save the output gene trees from STEP1 and change the gene tree paths in $families
mpirun -np $cores $generax --families $families_2 --species-tree $speciestree --rec-model UndatedDL --per-species-rates --prefix $output_2 --max-spr-radius 3 --strategy SKIP

Benoit Morel

unread,
Aug 6, 2021, 3:35:25 AM8/6/21
to GeneRax
Hi Yiyan,

sorry, I am falling a bit behind other urgent tasks, but I haven't forgot you. I'll take the time to reply next week.
Just a quick note: I think that the DTL rates won't be computed with the --strategy SKIP mode. But this is unintentional, and I will fix it soon.

Best,
Benoit

Yiyan Yang

unread,
Aug 6, 2021, 11:18:17 AM8/6/21
to GeneRax
Thank you for your response and not forgetting me :D
Don't worry and give your urgent tasks priority. You have already helped me a lot. I also noticed that the  --strategy SKIP mode did skip the step of computing DTL rates, so I have modified the commands into:
1. generax --families $families --species-tree $speciestree --rec-model UndatedDTL --prefix $output --max-spr-radius 3 --si-strategy SKIP
2. generax --families $families --species-tree $speciestree --rec-model UndatedDTL --per-species-rates --prefix $output --max-spr-radius 1
 
Wish everything goes well with your tasks/projects.

Best,
Yiyan

Reply all
Reply to author
Forward
0 new messages