Annotation of biosynthetic gene clusters


Yi-Ming Shi

Sep 24, 2019, 11:58:33 AM
to Anvi'o
Hello,

I'm a biochemist and a beginner when it comes to computational science. I'm analyzing the pangenome of 40 strains and would like to compare their biosynthetic gene clusters for secondary metabolites. I've walked through the pangenomics workflow. I then exported the gene sequences (FASTA) of every strain, ran antiSMASH to find and annotate the biosynthetic gene clusters, and would like to import the results back into the contigs database. Here I ran into an issue: the assembled contigs (or genome) in the original FASTA file that was used to generate the contigs database are "split" into thousands of single genes in the exported gene-sequence FASTA. As a result, antiSMASH can only annotate those single genes rather than find and annotate gene clusters. Is it possible to export the originally assembled contigs together with the gene IDs or gene calls assigned by anvi'o? I would also appreciate it if you could suggest an alternative approach to functional annotation and visualization of biosynthetic gene clusters in anvi'o.

Best regards,

Yi-Ming

A. Murat Eren

Sep 24, 2019, 1:17:09 PM
to Anvi'o
Hi Yi-Ming,

If you used anvi'o to export gene sequences (e.g., via anvi-get-sequences-for-gene-calls or anvi-get-sequences-for-hmm-hits), then it is very easy to connect them back to the contigs they come from.

Here is the key information you need, and the rest will come together very quickly:

$ sqlite3 CONTIGS.db 'select * from genes_in_contigs limit 10;' -separator $'\t' -header | column -t
gene_callers_id  contig            start  stop  direction  partial  source    version
0                Day17a_QCcontig1  0      186   f          1        prodigal  v2.60
1                Day17a_QCcontig1  214    1219  f          0        prodigal  v2.60
2                Day17a_QCcontig1  1265   2489  f          0        prodigal  v2.60
3                Day17a_QCcontig1  2561   3452  f          0        prodigal  v2.60
4                Day17a_QCcontig1  3552   3783  f          0        prodigal  v2.60
5                Day17a_QCcontig1  4172   4613  f          0        prodigal  v2.60
6                Day17a_QCcontig1  4628   5594  f          0        prodigal  v2.60
7                Day17a_QCcontig1  5646   5874  f          0        prodigal  v2.60
8                Day17a_QCcontig1  6010   6967  f          0        prodigal  v2.60
9                Day17a_QCcontig1  6999   7929  f          0        prodigal  v2.60

I hope these column names make sense. The gene callers id is what anvi'o uses internally to uniquely identify each gene, and this table shows in which contig each gene appears and where it starts and stops. Then you can use anvi-import-misc-data to connect genes back to the interface.

The same command will work on your own contigs database, too.
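
To make the link between gene calls and contigs concrete, here is a minimal Python sketch that groups gene caller IDs by contig using the same `genes_in_contigs` table. The function names (`genes_by_contig`, `demo_connection`) are made up for illustration, and the in-memory table is only a stand-in filled with three rows from the output above; on a real project one would presumably open the database with `sqlite3.connect("CONTIGS.db")` instead.

```python
import sqlite3
from collections import defaultdict


def genes_by_contig(conn):
    """Return {contig: [(gene_callers_id, start, stop), ...]}, ordered by start."""
    rows = conn.execute(
        "SELECT gene_callers_id, contig, start, stop "
        "FROM genes_in_contigs ORDER BY contig, start"
    )
    grouped = defaultdict(list)
    for gene_id, contig, start, stop in rows:
        grouped[contig].append((gene_id, start, stop))
    return dict(grouped)


def demo_connection():
    """Hypothetical stand-in for CONTIGS.db holding a few rows from the output above."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE genes_in_contigs "
        "(gene_callers_id, contig, start, stop, direction, partial, source, version)"
    )
    conn.executemany(
        "INSERT INTO genes_in_contigs VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        [
            (0, "Day17a_QCcontig1", 0, 186, "f", 1, "prodigal", "v2.60"),
            (1, "Day17a_QCcontig1", 214, 1219, "f", 0, "prodigal", "v2.60"),
            (2, "Day17a_QCcontig1", 1265, 2489, "f", 0, "prodigal", "v2.60"),
        ],
    )
    return conn


if __name__ == "__main__":
    # Print every contig followed by the gene calls found on it.
    for contig, genes in genes_by_contig(demo_connection()).items():
        print(contig)
        for gene_id, start, stop in genes:
            print(f"  gene {gene_id}: {start}-{stop}")
```

With this mapping in hand, results reported per gene by an external tool can be looked up by gene caller ID and placed back on the contig they belong to.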


Best wishes,
--

A. Murat Eren (Meren)
http://merenlab.org :: twitter :: gpg


--
Anvi'o Paper: https://peerj.com/articles/1319/
Project Page: http://merenlab.org/projects/anvio/
Code Repository: https://github.com/meren/anvio

Yi-Ming Shi

Sep 25, 2019, 12:09:28 PM
to Anvi'o
Hi Meren,

Thanks for your reply. I'm getting there. 

Since I don't need all gene caller IDs, I'm wondering how I can export only the gene caller IDs that fall within a given range. For example, in Day17a_QCcontig1, how would I export the gene caller IDs between positions 1265 (start) and 6967 (stop)?

Thanks and best regards,

Yi-Ming

A. Murat Eren

Sep 25, 2019, 1:47:30 PM
to Anvi'o
Hi,

Unfortunately the program is not able to do that. But you can use the program `anvi-script-reformat-fasta` with the `--keep-ids` parameter to subset the genes you are interested in.
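
One possible way to build the list of IDs for `--keep-ids` is to query the `genes_in_contigs` table shown earlier in this thread for the range of interest. The Python sketch below is only an illustration: the function names and the output file name `gene_ids_of_interest.txt` are hypothetical, the in-memory table stands in for a real `CONTIGS.db`, and it assumes that the sequence names in the exported FASTA are the gene caller IDs, one ID per line in the list file, which should be checked against your own export.

```python
import sqlite3


def gene_ids_in_range(conn, contig, range_start, range_stop):
    """Gene caller IDs on `contig` that lie entirely within [range_start, range_stop]."""
    rows = conn.execute(
        "SELECT gene_callers_id FROM genes_in_contigs "
        "WHERE contig = ? AND start >= ? AND stop <= ? "
        "ORDER BY start",
        (contig, range_start, range_stop),
    )
    return [row[0] for row in rows]


def write_id_file(gene_ids, path):
    """Write one gene caller ID per line (assumed format for the ID list)."""
    with open(path, "w") as handle:
        for gene_id in gene_ids:
            handle.write(f"{gene_id}\n")


def demo_connection():
    """Hypothetical stand-in for CONTIGS.db with the ten gene calls shown earlier."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE genes_in_contigs (gene_callers_id, contig, start, stop)"
    )
    coordinates = [
        (0, 186), (214, 1219), (1265, 2489), (2561, 3452), (3552, 3783),
        (4172, 4613), (4628, 5594), (5646, 5874), (6010, 6967), (6999, 7929),
    ]
    conn.executemany(
        "INSERT INTO genes_in_contigs VALUES (?, ?, ?, ?)",
        [(i, "Day17a_QCcontig1", s, e) for i, (s, e) in enumerate(coordinates)],
    )
    return conn


if __name__ == "__main__":
    ids = gene_ids_in_range(demo_connection(), "Day17a_QCcontig1", 1265, 6967)
    print(ids)  # [2, 3, 4, 5, 6, 7, 8]
    write_id_file(ids, "gene_ids_of_interest.txt")
```

The query keeps only genes that fall completely inside the range; to also include genes that merely overlap its edges, the condition could be changed to `start <= range_stop AND stop >= range_start`.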


Best wishes,
--

A. Murat Eren (Meren)
http://merenlab.org :: twitter :: gpg
