Should assemblies be removed if core gene alignment shows redundancies?


Chahat Upreti

Sep 21, 2022, 6:19:18 PM
to raxml

I have a set of 100 bacterial genomes. I annotated them with Prokka and performed a pangenome analysis with Roary. Roary outputs a core gene alignment file, which I then used to build a phylogenetic tree with RAxML. While running, RAxML printed this to the console:

`IMPORTANT WARNING - Found 13 sequences that are exactly identical to other sequences in the alignment. Normally they should be excluded from the analysis.`

My question is: should I remove these 13 sequences from my subsequent analyses based on this information? I first thought it would be obvious to remove these redundant/clonal sequences so that they don't distort the statistics for gene enrichment etc. But a counterargument is that these 13 sequences are being called exactly identical to other sequences in my dataset based only on the core gene alignment. What about any differences these 13 assemblies may have in the non-core genome, relative to the sequences they are supposedly identical to?

In other words, what if these sequences are actually unique, but their uniqueness lies in genes that are not core genes, i.e., genes that are present in only a subset of the assemblies?

Any insights would be greatly appreciated. I apologize that this is not a question directly about running RAxML but about analysis based on its output.

Alexandros Stamatakis

Sep 21, 2022, 11:55:04 PM
to ra...@googlegroups.com


> I have a set of 100 bacterial genomes. I annotated them with Prokka and
> performed pangenome analysis with Roary. Roary outputs the core gene
> alignment file which I then used to generate a phylogenetic tree using
> RAxML. While running RAxML, the console output said -
>
> `IMPORTANT WARNING - Found 13 sequences that are exactly identical to
> other sequences in the alignment. Normally they should be excluded from
> the analysis.`
>
> My question is, should I remove these 13 sequences from my subsequent
> analyses based on this information?

Yes, absolutely, as they will not contribute any additional signal to the tree inference.
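
To see which taxa are affected before dropping anything, the identical sequences can be grouped by hand. The following is only a minimal sketch, assuming the core gene alignment is a FASTA file and that "identical" means exact, case-insensitive string equality of the aligned sequences; RAxML's own check may treat gaps or ambiguity codes differently. The function names `read_fasta`, `group_identical` and `deduplicate` are made up for this illustration.

```python
from collections import defaultdict


def read_fasta(lines):
    """Parse FASTA lines into a dict of name -> aligned sequence."""
    parts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the taxon name.
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def group_identical(records):
    """Return lists of taxon names that share exactly the same sequence."""
    groups = defaultdict(list)
    for name, seq in records.items():
        groups[seq.upper()].append(name)
    return list(groups.values())


def deduplicate(records):
    """Keep one representative per group and record what it stands for."""
    groups = group_identical(records)
    kept = {names[0]: records[names[0]] for names in groups}
    collapsed = {names[0]: names[1:] for names in groups if len(names) > 1}
    return kept, collapsed


if __name__ == "__main__":
    # Tiny made-up alignment; replace with open("your_alignment.fasta").
    demo = [">A", "ACGT-ACGT", ">B", "ACGT-ACGT", ">C", "ACGTTACGT"]
    kept, collapsed = deduplicate(read_fasta(demo))
    print("kept:", list(kept))
    print("collapsed into representatives:", collapsed)
```

Keeping the `collapsed` map around makes it possible to add the removed assemblies back next to their representative when interpreting the tree.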

> I first thought that it would
> be obvious to remove these redundant/clonal sequences so that they don't
> mess up the statistics for gene enrichment etc. But a counterargument is
> that these 13 sequences are being called as exactly identical to other
> sequences in my database based on the *core gene alignment*. What about
> any differences these 13 assemblies may have (from the sequences these
> are supposedly identical to) in the non-core genome?

Well, in that case you may want to extend the alignment to also include
the non-core genome, so as to have some additional data/information that
may resolve the relationships among these sequences.

> In other words, what if these sequences are actually completely unique
> but their uniqueness lies in terms of those genes that are not core
> genes, but those that are present in a subset of the assemblies?

Yes, good point. I guess the key question here is how important those 13
identical sequences are for the biological question you want to answer.
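
One way to check whether core-identical assemblies really differ in their non-core genes is to compare their gene presence/absence profiles. The sketch below is only an illustration: it assumes a tab-separated 0/1 matrix with gene names in the first column and one column per genome (I believe Roary's `gene_presence_absence.Rtab` has this layout, but please verify against your own output). The function names and the demo table are hypothetical.

```python
import csv
import io


def read_presence_absence(handle):
    """Read a tab-separated 0/1 matrix into genome -> set of genes present.

    Assumed layout: header row "Gene<TAB>genome1<TAB>genome2...",
    then one row per gene with 1 (present) or 0 (absent).
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    genomes = header[1:]
    present = {genome: set() for genome in genomes}
    for row in reader:
        if not row:
            continue
        gene = row[0]
        for genome, value in zip(genomes, row[1:]):
            if value.strip() == "1":
                present[genome].add(gene)
    return present


def accessory_difference(present, a, b):
    """Return (genes only in a, genes only in b)."""
    return present[a] - present[b], present[b] - present[a]


if __name__ == "__main__":
    # Made-up table; replace with open("gene_presence_absence.Rtab").
    table = "Gene\tA\tB\ngeneX\t1\t1\ngeneY\t1\t0\ngeneZ\t0\t1\n"
    present = read_presence_absence(io.StringIO(table))
    only_a, only_b = accessory_difference(present, "A", "B")
    print("only in A:", sorted(only_a))
    print("only in B:", sorted(only_b))
```

If both sets are empty for a pair that is identical in the core alignment, treating one of the two as redundant is easier to justify; if not, the pair may still matter for gene enrichment questions.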

> Any insights would be greatly appreciated. I apologize for this not being
> a question directly about running RAxML but about analysis based on its
> output.

That's fine, we do try to answer these questions as well on here.

Alexis

>
> --
> You received this message because you are subscribed to the Google
> Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to raxml+un...@googlegroups.com
> <mailto:raxml+un...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/raxml/2f499359-492d-4826-9ddc-9bbff677fb1bn%40googlegroups.com
> <https://groups.google.com/d/msgid/raxml/2f499359-492d-4826-9ddc-9bbff677fb1bn%40googlegroups.com?utm_medium=email&utm_source=footer>.

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
Affiliated Scientist, Evolutionary Genetics and Paleogenomics (EGP) lab,
Institute of Molecular Biology and Biotechnology, Foundation for
Research and Technology Hellas

www.exelixis-lab.org

Chahat Upreti

Sep 26, 2022, 4:42:24 PM
to raxml
Thanks so much for that response! I have just a couple of follow-up questions:

1. How can I include non-core genes in the alignment? I was unable to figure out whether Roary does this.
2. Is it common for people to include non-core genes in the alignment? (I am new to this field.)

Thank you so much Dr. Stamatakis!
Chahat

Alexandros Stamatakis

Sep 27, 2022, 2:32:17 AM
to ra...@googlegroups.com
Dear Chahat,

> Thanks so much for that response! I have just a couple of follow up
> questions -
>
> 1. How can I include non-core genes in the alignment? I was unable to
> figure out if Roary does it.

Sorry, I am not an expert on using this program; we are method
developers, so we rarely do empirical dataset assembly.

> 2. Is it common for people to include non-core genes in alignment? (I am
> new to this field).


Well, why not? First of all, the question is whether the definition of
the core genes is debatable, that is, which criteria were used to define
these genes as core genes.

Second, the difficulty score implemented in Pythia, which predicts how
hard an alignment is to analyze phylogenetically, can give you very good
arguments to extend the alignment.

Alexis

Chahat Upreti

Sep 28, 2022, 4:08:39 PM
to raxml
Thanks so much!