Basics of using SpeciesRax


Clifton Lewis

May 18, 2021, 9:27:52 AM
to GeneRax
Hi,

I was wondering if you could describe the basic command for running SpeciesRax, as I'm not 100% sure from the wiki what the command is. I have created a set of alignments and I am creating the families file as well. Do I need to create the gene trees before running SpeciesRax, or can it infer them through the MiniNJ pipeline using the alignments? Do I need anything else? I have the gene names formatted as GeneRax requires for automatic recognition, i.e. >SpeciesName_genename. Does that carry over to SpeciesRax as well? Sorry, I do realise that this is a rather basic question.

Kind regards,
Clifton

Benoit Morel

May 19, 2021, 2:00:20 PM
to GeneRax
Hi Clifton,

Sorry, I accidentally replied with a private message. Here is my answer to your first post:



Dear Clifton,

The gene trees need to be generated, either with generax or with an external tool.

To generate them with generax, you need to specify in the family file the input alignment of each family as well as the substitution model that will be used to infer the gene tree from the sequences.
In particular, this is what the sentence "alternatively, one could provide the alignments and substitution models instead of the gene trees, and the gene trees will be inferred from the alignments" refers to.
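
As an illustration of such a family file, here is a minimal Python sketch that writes one entry per family, each with an alignment and a substitution model. The layout (a "[FAMILIES]" header, "- name" entries, and the "alignment" and "subst_model" keys) is written from memory of the GeneRax wiki and should be checked against it; the family names, file paths, and the "LG+G" model are hypothetical placeholders.

```python
def families_file_text(alignments, subst_model):
    """Build the text of a families file with one entry per alignment.

    `alignments` maps a family name to the path of its alignment.
    The key names below are assumed from the GeneRax wiki; verify them there.
    """
    lines = ["[FAMILIES]"]
    for family, alignment_path in sorted(alignments.items()):
        lines.append(f"- {family}")
        lines.append(f"alignment = {alignment_path}")
        lines.append(f"subst_model = {subst_model}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical family names and paths, for illustration only.
    example = {
        "family_1": "alignments/family_1.fasta",
        "family_2": "alignments/family_2.fasta",
    }
    print(families_file_text(example, "LG+G"), end="")
```

A single model is applied to every family here for simplicity; per-family models would need one model per entry instead.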

To generate them with another tool, I would recommend ParGenes, which we developed for this purpose. Compared with using GeneRax alone, the advantages are that you can run model testing before the tree inference, and that you can specify the number of starting trees for the gene tree searches. Also, I would expect the gene tree inference to take longer than the SpeciesRax algorithm, so it may be worth generating the gene trees in advance, in case you want to run SpeciesRax with different parameters. ParGenes can be found here: https://github.com/BenoitMorel/ParGenes
The advantage of GeneRax is that you don't need any external tool. Both approaches (ParGenes and GeneRax) use the same inference method (they call raxml-ng).

What you said about the mappings should work. If not, do not hesitate to send me a subset of your data, so that I can see what's going wrong.
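
Before sending data, a quick sanity check of the gene names can help. The sketch below assumes that the species is the text before the first underscore of a ">SpeciesName_genename" label, which would mean species names must not contain underscores themselves; that rule is an assumption here and should be confirmed on the GeneRax wiki. The example headers are hypothetical.

```python
from collections import Counter


def species_of(header):
    """Return the species prefix of a 'SpeciesName_genename' label."""
    # Drop the FASTA '>' marker and any description after the first space.
    tokens = header.lstrip(">").split()
    label = tokens[0] if tokens else ""
    # Assumption: the species is everything before the first underscore.
    species, separator, gene = label.partition("_")
    if not (species and separator and gene):
        raise ValueError(f"cannot split {header!r} into species and gene")
    return species


def genes_per_species(headers):
    """Count how many gene labels map to each species."""
    return Counter(species_of(h) for h in headers)


if __name__ == "__main__":
    # Hypothetical headers, for illustration only.
    headers = [">HomoSapiens_BRCA1", ">HomoSapiens_BRCA2", ">MusMusculus_Brca1"]
    for species, count in sorted(genes_per_species(headers).items()):
        print(species, count)
```

Labels that cannot be split raise a ValueError, which makes malformed names easy to spot before a long run.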

GeneRax is not an easy tool to use, and I am still working on improving the documentation, so please ask questions if anything is unclear :-)

Best,
Benoit

Benoit Morel

May 19, 2021, 2:06:46 PM
to GeneRax
Now, to reply to your second question:


What you are doing looks fine to me. Did you encounter any problem when running this command?

When you talk about inferring gene trees with MiniNJ, I think there is some confusion.
- MiniNJ does not infer gene trees. It infers an initial species tree from the gene trees.
- The gene trees are inferred from the sequences with raxml-ng (or, to be more exact, GeneRax uses the code from raxml-ng). Raxml-ng is one of the most commonly used tools for maximum likelihood phylogenetic tree inference from sequences.

I plan to add a page on the wiki with a figure showing every step run (or that can be run) by generax, with the input/output of those steps. This should hopefully disambiguate the role of each step, and the effect of the arguments.

Best,
Benoit

Benoit Morel

May 20, 2021, 4:37:30 PM
to GeneRax
Dear Clifton,

this is a bug in GeneRax. MiniNJ is run before the gene tree inference step, which is absurd because it needs the gene trees to run correctly... The same issue occurs if you start from a random species tree, for the exact same reason. Thanks a lot for the report :-)

I will fix the bug, but I can't tell if it will be fixed before tomorrow, and after that I will have one week off. Even if it's fixed by tomorrow, you will need to wait until the fix reaches the bioconda package (I think you use the conda installation), which typically takes a few days.

After having looked at your data, here are two personal recommendations, not directly related to your issue:
- You have quite a large dataset, and just inferring the gene trees will be quite challenging. I would precompute the gene trees first, and save them carefully before running any other tool. This is exactly the case where ParGenes (see my post above) makes a lot of sense, if you have access to a cluster or a big machine. But even with ParGenes and a cluster, it's hard to tell how long this is going to take, because you have a few very big trees (at least one with more than 5000 taxa).
- You are using the GTR model for proteins. Unless you are doing this for a particular reason and you know what you are doing, we usually discourage this. The GTR model is fine for DNA alignments, but it is very problematic for protein alignments (in short, the protein GTR model has almost 200 free parameters, which is huge and will cause many issues). You can either pick another model, if you know which one makes sense for your data, or run model testing on each alignment. Note that ParGenes can run model testing before each gene tree inference automatically (sorry for the repeated advertisement, but ParGenes was developed for this exact purpose).
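
To make the "almost 200 free parameters" figure concrete, the following sketch counts the free parameters of a GTR model for a given number of states. It assumes the usual parameterization: one exchangeability rate per unordered pair of states, with one rate fixed for normalization, plus state frequencies that sum to one. Under that assumption the figure presumably refers to the 189 free exchangeability rates of the protein model.

```python
def gtr_free_parameters(n_states):
    """Return (free exchangeability rates, free state frequencies) for GTR.

    Assumes a symmetric exchangeability matrix with one rate fixed for
    normalization, and frequencies constrained to sum to one.
    """
    free_rates = n_states * (n_states - 1) // 2 - 1
    free_frequencies = n_states - 1
    return free_rates, free_frequencies


if __name__ == "__main__":
    for label, n_states in (("DNA", 4), ("protein", 20)):
        rates, freqs = gtr_free_parameters(n_states)
        print(f"{label}: {rates} free rates, {freqs} free frequencies")
```

For DNA this gives 5 free rates and 3 free frequencies; for proteins it gives 189 free rates and 19 free frequencies, all of which would have to be estimated from a single gene family alignment.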

I will keep you posted about the GeneRax bug fix.

Best,
Benoit

Benoit Morel

May 27, 2021, 3:58:27 PM
to GeneRax
Hi Clifton,

Sorry, I just saw your very last mail. Don't forget to reply to all, so that it ends up in the Google group.
To reply to your question:
"Is there any way of automatically recording the substitution model used for each MSA so I can add it to the families file? Otherwise this would be rather challenging."

-> I don't think that I have written any code to generate a families file from the ParGenes output with the correct inferred substitution models.
However, I can write such a script, since it could be quite useful for other users as well. I hope I can get this done next week, when I am back from vacation.

Best,
Benoit

Benoit Morel

May 28, 2021, 3:29:32 PM
to GeneRax
Hi Clifton,

There is now a script to generate the family file from a ParGenes run, which correctly handles the best-fit substitution models. See the end of this section in the wiki: https://github.com/BenoitMorel/GeneRax/wiki/GeneRax#families-file

I haven't tested it that much yet, so please let me know if it works for you.

Best,
Benoit

Clifton Lewis

May 28, 2021, 8:51:44 PM
to Benoit Morel, GeneRax
Hi Benoit,

That script looks brilliant and suits my needs exactly. Once I sort out the ParGenes installation issue on our HPC, I'll test it out and update you on its success. Thanks for your brilliant work.

Thanks,
Clifton
