How to get only 1 sequence from each gene cluster

andrea firrincieli

unread,

Jul 15, 2020, 3:59:01 AM7/15/20

to Anvi'o

Hi All,

I did a pangenomic analysis on 120 genomes. When I use 'anvi-get-sequences-for-gene-clusters' I get all sequence from each cluster included in the core genome. I was wondering if it is possible to get just one sequence from each gene cluster.

Thanks

Andrea

A. Murat Eren

unread,

Jul 15, 2020, 9:08:28 AM7/15/20

to Anvi'o

Hey Andrea,

As of today anvi'o does not include any protocol to select a representative sequence from a gene cluster. This is one of those things that sound very straightforward when it is actually extremely difficult to do right when one considers all the biological implications of using a single sequence to represent a non-identical set of homologous sequences.

If you wish, you can use the `anvi-summarize` output for your pangenome to randomly select a sequence for each gene cluster quite easily through R or EXCEL.

Best,

--

A. Murat Eren (Meren)
http://merenlab.org :: twitter

--
Anvi'o Paper: https://peerj.com/articles/1319/
Project Page: http://merenlab.org/projects/anvio/
Code Repository: https://github.com/meren/anvio
---
You received this message because you are subscribed to the Google Groups "Anvi'o" group.
To unsubscribe from this group and stop receiving emails from it, send an email to anvio+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/anvio/f7d9c725-04f8-4156-a53d-a7753259c4b9o%40googlegroups.com.

andrea firrincieli

unread,

Jul 26, 2020, 8:10:39 AM7/26/20

to Anvi'o

Hi Meren,

Thank you for your answer!

I understand the biological implications of this task so I did some 'quality' check for some of the gene cluster (GC) included in the core-genome to see if, using a representative sequence from each cluster is safe for the functional annotation of the protein included in a cluster. Here some backgroung information of my dataset; I am working with 117 strains belonging to the same species all sharing an ANI > 97.0% therefore, the --mcl-inflation parameter was set to 10.

Functionally speaking, I noticed that for most of the GC tested having a very high geometric homogeneity index (GHI), one random sequence truly represent all proteins in that cluster. However, things get tricky for clusters with poors GHI. For instance, one cluster included 233 protein; 117 were annotated as UDP-N-acetylmuramoyl-tripeptide--D-alanyl-D-alanine ligase (K01929) while the remaining were identified as ATP-dependent RNA helicase DeaD (K05592). These two class of proteins share an identity of the 20%.

Therefore, my question is, how is possible that proteins sharing a 20% sequence identity are included in the same GC? Pretty sure I am missing something.

Also, one weird thing is that for each strain, according to the gene_id, the genes respectively encoding for K01929 and K05592 are right next to each other.

Thank you for your time and thank you for Anvi'o!

Best,

Andrea

Il giorno mercoledì 15 luglio 2020 15:08:28 UTC+2, Meren ha scritto:

Hey Andrea,

As of today anvi'o does not include any protocol to select a representative sequence from a gene cluster. This is one of those things that sound very straightforward when it is actually extremely difficult to do right when one considers all the biological implications of using a single sequence to represent a non-identical set of homologous sequences.

If you wish, you can use the `anvi-summarize` output for your pangenome to randomly select a sequence for each gene cluster quite easily through R or EXCEL.

Best,
--

A. Murat Eren (Meren)
http://merenlab.org :: twitter

On Wed, Jul 15, 2020 at 2:59 AM andrea firrincieli <andres.f...@gmail.com> wrote:

Hi All,

I did a pangenomic analysis on 120 genomes. When I use 'anvi-get-sequences-for-gene-clusters' I get all sequence from each cluster included in the core genome. I was wondering if it is possible to get just one sequence from each gene cluster.

Thanks
Andrea

--
Anvi'o Paper: https://peerj.com/articles/1319/
Project Page: http://merenlab.org/projects/anvio/
Code Repository: https://github.com/meren/anvio
---
You received this message because you are subscribed to the Google Groups "Anvi'o" group.

To unsubscribe from this group and stop receiving emails from it, send an email to an...@googlegroups.com.

A. Murat Eren

unread,

Jul 27, 2020, 4:04:26 PM7/27/20

to Anvi'o

Hi Andrea,

On Sun, Jul 26, 2020 at 7:10 AM andrea firrincieli <andres.fi...@gmail.com> wrote:

how is possible that proteins sharing a 20% sequence identity are included in the same GC?

20% sequence identity at the AA sequence level is quite a lot, in fact :) But in this case they seem to be doing different things, so the gene cluster is functionally heterogeneous. They probably share a domain, hence MCL pulled them together. Can you please send a screenshot of the sequences and how they are aligned in the gene cluster? I am curious now.

Maybe it is time we start thinking about how we can make anvi-refine to support manual refinement of gene clusters in pangenomes.

Best,

Ana Belen Garcia Martin

unread,

Jul 28, 2020, 3:23:53 AM7/28/20

to an...@googlegroups.com

Hi Meren,

I was wondering if it would make sense to include twice the genome that we would like to use as a reference in a pangenome analysis.

I thought that having twice the layer and the GCs, those two "exact-the-same-genome" could be selected by setting both the parameters MIN and MAX number of genomes gene clusters occurs to 2.

I guess that this is possible for those of us that want to have a "Reference" genome in the analysis. What do you have to say about it? Does it make sense? Does anvi'o support it or the "exact-the-same-genome and layers" would be considered as duplicate by anvi'o?

Thank you for your time in advance.

Best,

Ana

To view this discussion on the web visit https://groups.google.com/d/msgid/anvio/CAK5_LadJEmHYqCmNGuHujTq0b%2BLbXdFR8%2BKGxsAjycJD4sjAAQ%40mail.gmail.com.

andrea firrincieli

unread,

Jul 28, 2020, 4:18:26 AM7/28/20

to an...@googlegroups.com

Hi Meren,

Here is a screenshot of the alignment. This is the full alignment but only from few of the 177 strains (the alignment pattern does not change across all other strains).

Thanks!

Andrea

--

Anvi'o Paper: https://peerj.com/articles/1319/
Project Page: http://merenlab.org/projects/anvio/
Code Repository: https://github.com/meren/anvio
---
You received this message because you are subscribed to the Google Groups "Anvi'o" group.

To unsubscribe from this group and stop receiving emails from it, send an email to anvio+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/anvio/CAK5_LacaMJ2S_TwKwHP_hxvKX5edAEvxYOhbOd8TH79XcG1HuA%40mail.gmail.com.

A. Murat Eren

unread,

Jul 28, 2020, 9:22:29 AM7/28/20

to Anvi'o

There are many matches, but also clear from the alignment that it would have been much better if this was split into two gene clusters. Thank you, Andrea.

--

A. Murat Eren (Meren)
http://merenlab.org :: twitter

To view this discussion on the web visit https://groups.google.com/d/msgid/anvio/CABvc1WLkoaP2iwafgSEdJSM9ZduD4K3%3D6_3HkzHSRhUJEHMeJw%40mail.gmail.com.

A. Murat Eren

unread,

Jul 28, 2020, 9:27:04 AM7/28/20

to Anvi'o

Hi Ana,

On Tue, Jul 28, 2020 at 2:23 AM Ana Belen Garcia Martin <abgar...@gmail.com> wrote:

I was wondering if it would make sense to include twice the genome that we would like to use as a reference in a pangenome analysis.

The utility of this is not clear to me.

I thought that having twice the layer and the GCs, those two "exact-the-same-genome" could be selected by setting both the parameters MIN and MAX number of genomes gene clusters occurs to 2.

You still can select all gene clusters associated with (or only found in) a single genome. There are more straightforward and accurate ways to do that (i.e., anvi-summarize, process the output table to find GCs associated with your genome, anvi-import-colleciton).

I guess that this is possible for those of us that want to have a "Reference" genome in the analysis. What do you have to say about it? Does it make sense? Does anvi'o support it or the "exact-the-same-genome and layers" would be considered as duplicate by anvi'o?

As I said, I don't see the utility, but anvi'o will not complain if you have two separate contigs databases for the same genome. As far as anvi'o is concerned, you're the boss :)

Best,

Ana Belen Garcia Martin

unread,

Jul 29, 2020, 12:04:14 PM7/29/20

to an...@googlegroups.com

Hi Meren,

thanks for your reply.

I had understood hat if we wanted to get the information for singletons upon the pangenome analysis, we would get each singleton that belong to whatever genome included in the pangenome analysis.

If I am correct, I kind of recall to had read in another discussion that information about singletons belonging to only a single genome (considered as a reference) could not be obtained.

For this reason, I thought about including the reference genome like twice (as different "layers" in the analysis) and retrieve only the singletons that were present in these 2 "layers".

After reading this:

You still can select all gene clusters associated with (or only found in) a single genome. There are more straightforward and accurate ways to do that (i.e., anvi-summarize, process the output table to find GCs associated with your genome, anvi-import-colleciton)

I think I am ready to do it correctly ;).

Thanks a lot,

Best,

Ana

--

Anvi'o Paper: https://peerj.com/articles/1319/
Project Page: http://merenlab.org/projects/anvio/
Code Repository: https://github.com/meren/anvio
---
You received this message because you are subscribed to the Google Groups "Anvi'o" group.
To unsubscribe from this group and stop receiving emails from it, send an email to anvio+un...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/anvio/CAK5_LaeqVttVS_Ske7Hk0EZhzoezxa317TJTAOHAc35VHf9pgQ%40mail.gmail.com.

sushee...@gmail.com

unread,

Oct 13, 2020, 2:53:41 AM10/13/20

to Anvi'o

Hey Meren, dear all,

I'm trying to do something similar, i.e. get one sequence per gene-cluster. I used the `anvi-summarize` with the `--report-DNA-sequences` option. I then looked into the "GUT_gene_clusters_summary.txt" file to choose a representative sequence for each GC based on the alignments (manually done).

Here's the preamble to my question - in the tail of the "GUT_gene_clusters_summary.txt" file, I have the following GCs

GC_00001661

GC_00001662

GC_00001664

GC_00001665

GC_00001666

GC_00001668

GC_00001669

If you look closely, "GC_00001663' and 'GC_00001667' are missing. I tried playing around with the '[--min-num-genomes-gene-cluster-occurs INTEGER]' and '[--max-num-genomes-gene-cluster-occurs INTEGER]' flags as described here: anvi-get-sequences-for-gene-clusters, thereafter but am getting an error saying that 0 gene_clusters were selected.

So, the questions are:

1. Are the missing GC_XXXXXX, singletons that are not reported or something else that I grossly overlooked?

2. How can I retrieve the sequences for all the GC's including the missing ones?