What gene family should I use?

MULLER Héloïse

Oct 2, 2024, 1:37:22 PM

to GeneRax

Hi,

I have a large dataset of ~1000 isolates from several bacterial species of the same genus. I have two objectives: (i) build a species tree and (ii) infer the DTL events for some gene families of interest.

I see 2 possible strategies:

1) Either I first build a species tree with classic methods, using only core genes or ribosomal genes for example, and then, as a second step, I use a second program to reconcile my gene families of interest with that species tree.

2) Or I use AleRax to do both at once.

With strategy 1, I would be better off using GeneRax rather than AleRax, right?

With strategy 2, I am not sure which gene families I should use. I suppose I cannot use only my few gene families of interest, as there are not many of them and they may have undergone a lot of DTL events (and thus have a history quite different from that of the species tree). So should I use all annotated genes in the entire dataset for a better species tree inference? Or would a subset be enough (and in that case, how should I select it)? In the end, I could focus only on the outputs for my gene families of interest.

I think I am confused because the example with primates uses only 8 gene families, whereas the one with Archaea in the paper uses 5379 gene families. In both cases, I did not understand how these genes were selected.

Thank you,

Héloïse

Benoit Morel

Oct 6, 2024, 12:28:31 PM

to MULLER Héloïse, GeneRax

Hi Héloïse,

If you want to use all isolates, I would rather pick the first strategy, because GeneRax won't be able to generate a species tree with that many taxa. You could also try to use one isolate per species and see if you get a similar tree with GeneRax.
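The "one isolate per species" subsampling could be sketched roughly as below. This is only an illustration: the function name, the `isolate_to_species` dictionary and the random choice of representative are assumptions, not something GeneRax or AleRax provides. In practice you might prefer to pick the best-assembled genome of each species instead of a random one.

```python
import random


def pick_one_isolate_per_species(isolate_to_species, seed=0):
    """Return {species: isolate}, keeping one isolate per species.

    isolate_to_species: dict mapping an isolate name to its species name
    (hypothetical input; build it from your own metadata).
    The choice is random but reproducible for a given seed.
    """
    rng = random.Random(seed)

    # Group the isolates by species.
    by_species = {}
    for isolate, species in isolate_to_species.items():
        by_species.setdefault(species, []).append(isolate)

    # Sort before choosing so the result does not depend on input order.
    return {
        species: rng.choice(sorted(isolates))
        for species, isolates in sorted(by_species.items())
    }
```

The reduced taxon set can then be used to prune the gene families before the species tree run, and the resulting tree compared with the one obtained from core or ribosomal genes.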

To select the gene families: this is always a hot topic, but I would rather use as many families as possible. On the other hand, some families are very "noisy" because of low signal or too many HGTs, but it's always difficult to decide which ones to exclude. The example in the repository with 8 families is just to show how to run the tool on a very small dataset, but does not correspond to a real study.
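As one possible way to act on this, a simple coverage filter is sketched below: it keeps the families present in at least `min_species` species, and always keeps the families of interest. The criterion, the threshold and all names here are assumptions made for illustration only. This is not a rule from the GeneRax or AleRax documentation, and it does not detect families with too many HGTs.

```python
def select_families(family_to_genes, gene_to_species, min_species=4, always_keep=()):
    """Return a sorted list of family names to keep.

    family_to_genes: dict mapping a family name to a list of gene names.
    gene_to_species: dict mapping a gene name to its species name.
    min_species: minimum number of distinct species a family must cover
                 (arbitrary default, to be tuned on your data).
    always_keep: families of interest, kept whatever their coverage.
    """
    kept = set(always_keep) & set(family_to_genes)
    for family, genes in family_to_genes.items():
        # Count the distinct species represented in this family.
        species = {gene_to_species[gene] for gene in genes}
        if len(species) >= min_species:
            kept.add(family)
    return sorted(kept)
```

For example, `select_families(families, gene_to_species, min_species=10, always_keep=["my_family"])` would drop the families seen in fewer than 10 species while retaining `my_family` (a hypothetical name) for the reconciliation step.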

I hope this helps,

Benoit
