pulling gene information out of anvio

830 views
Skip to first unread message

Rika Anderson

unread,
Jan 5, 2016, 10:27:19 AM1/5/16
to an...@googlegroups.com
[I emailed this to Meren earlier today and he asked me to post it to the group so that everyone could have this information.]

I am exploring ways to dig into the guts of anvi'o. I want to compare the average coverages of specific genes by a metagenome and a metatranscriptome. I did this:

[randerson@minnie FS856_anvio]$ anvi-get-db-table-as-matrix FS856_Ray_contig_vs_FS856.bam-ANVIO_PROFILE/PROFILE.db --table gene_coverages -o gene_coverages_DNA.txt

and I get something like this:

entry_id        prot    sample_id       mean_coverage
0       prot_000000014004       FS856_RAY_CONTIG_VS_FS856       47.8773006135
1       prot_000000008143       FS856_RAY_CONTIG_VS_FS856       40.7819947043
2       prot_000000011641       FS856_RAY_CONTIG_VS_FS856       107.414473684
3       prot_000000013311       FS856_RAY_CONTIG_VS_FS856       9.97860962567
4       prot_000000003728       FS856_RAY_CONTIG_VS_FS856       9.37168141593
5       prot_000000017427       FS856_RAY_CONTIG_VS_FS856       10.3382899628
6       prot_000000012962       FS856_RAY_CONTIG_VS_FS856       13.0669642857
7       prot_000000019040       FS856_RAY_CONTIG_VS_FS856       48.8092105263
8       prot_000000020775       FS856_RAY_CONTIG_VS_FS856       111.851197982

Is there a way I can dig into the database and get more information about what "prot_000000014004" is? i.e. what is its annotation, and what contig does it fall in to? 

Thanks!
Rika

A. Murat Eren

unread,
Jan 5, 2016, 10:39:31 AM1/5/16
to an...@googlegroups.com
Hi Rika,

There are multiple ways to do what you want.

One is to create a summary. Which should give you a huge matrix file with any and every information regarding each entry (it should be under 'functions' tab in the summary output --see the Functions Table link in this example summary: http://anvio.org/data/INFANT-CLC-SUMMARY-SUPERVISED/).

Inline image 1

I think that is the best way, but if you want to get your hands dirty, you can also inquiry with the contigs database:

$ sqlite3 CONTIGS.db '.tables'
collections_colors       genes_in_contigs         kmer_contigs
collections_info         genes_in_splits          kmer_splits
collections_of_contigs   genes_in_splits_summary  self
collections_of_splits    hmm_hits_in_contigs      splits_basic_info
contig_sequences         hmm_hits_in_splits
contigs_basic_info       hmm_hits_info
$ sqlite3 CONTIGS.db '.schema genes_in_contigs'
CREATE TABLE genes_in_contigs (prot text, contig text, start numeric, stop numeric, direction text, figfam text, function text, t_phylum text, t_class text, t_order text, t_family text, t_genus text, t_species text);
$ sqlite3 CONTIGS.db 'select * from genes_in_contigs where prot = "prot_00007";'
prot_00007|204_10M_MERGED.PERFECT.gz.keep_contig_878|7822|8409|f||Guanylate kinase (EC 2.7.4.8)||||||


​As you can imagine, you can also export the entire table to have a better output file:


$ anvi-get-db-table-as-matrix CONTIGS.db --table genes_in_contigs
Database .....................................: "CONTIGS.db" has been initiated with its 16 tables.
Table ........................................: "genes_in_contigs" has been read with 48 entries and 13 columns.
Output .......................................: genes_in_contigs.txt
$ head genes_in_contigs.txt
prot contig start stop direction figfam function t_phylum t_class t_order t_family t_genus t_species
prot_00001 204_10M_MERGED.PERFECT.gz.keep_contig_878 13 933 f
prot_00002 204_10M_MERGED.PERFECT.gz.keep_contig_878 1114 1677 f Translation elongation factor P
prot_00003 204_10M_MERGED.PERFECT.gz.keep_contig_878 1738 2142 f Transcription termination protein NusB
prot_00004 204_10M_MERGED.PERFECT.gz.keep_contig_878 2230 3447 f Carbamoyl-phosphate synthase small chain (EC 6.3.5.5)
prot_00005 204_10M_MERGED.PERFECT.gz.keep_contig_878 3440 6820 f Carbamoyl-phosphate synthase large chain (EC 6.3.5.5)
prot_00006 204_10M_MERGED.PERFECT.gz.keep_contig_878 6820 7761 f Orotidine 5'-phosphate decarboxylase (EC 4.1.1.23)
prot_00007 204_10M_MERGED.PERFECT.gz.keep_contig_878 7822 8409 f Guanylate kinase (EC 2.7.4.8)
prot_00008 204_10M_MERGED.PERFECT.gz.keep_contig_878 8497 10350 r Dihydroxy-acid dehydratase (EC 4.2.1.9)
prot_00009 204_10M_MERGED.PERFECT.gz.keep_contig_878 10409 10855 f DNA-directed RNA polymerase omega subunit (EC 2.7.7.6)​




Best wishes,



--

A. Murat Eren (meren)
http://merenlab.org :: gpg

--
Anvi'o Paper: https://peerj.com/articles/1319/
Project Page: http://merenlab.org/projects/anvio/
Code Repository: https://github.com/meren/anvio
---
You received this message because you are subscribed to the Google Groups "Anvi'o" group.
To unsubscribe from this group and stop receiving emails from it, send an email to anvio+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/anvio/CAKSgtzf7EYX8R9KAqoGx0akYYbq8RoSQWn6DvitTRGAKpDBPiw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Elizabeth Hénaff

unread,
Aug 16, 2017, 4:08:50 PM8/16/17
to Anvi'o
Hi Meren and Rika, 

Picking up this thread with a couple of questions --

1) what version of Anvio are you using that has the anvi-get-db-table-as-matrix tool?

2) I am using 2.1.0 installed with pip and the closest I see to that tool is  anvi-export-table. 

anvi-export-table --table genes_in_contigs -o export-genes-in-contigs.table anvio/final.contigs.simp.fa.db


gives me a table with the following columns: 

gene_callers_id contig  start   stop    direction       partial source  version

0       k99_6   0       312     f       1       prodigal        v2.6.2

1       k99_8   2       305     r       1       prodigal        v2.6.2

2       k99_9   0       177     r       1       prodigal        v2.6.2

3       k99_9   237     297     f       1       prodigal        v2.6.2

4       k99_10  2       353     r       1       prodigal        v2.6.2

5       k99_11  1       301     r       1       prodigal        v2.6.2

6       k99_13  1       379     f       1       prodigal        v2.6.2

7       k99_15  0       378     f       1       prodigal        v2.6.2

8       k99_19  0       282     r       1       prodigal        v2.6.2


Any ideas on how to get a table with information on gene calls with contig name, function and taxonomy would be great! 

Cheers, 

Elizabeth

A. Murat Eren

unread,
Aug 16, 2017, 4:46:48 PM8/16/17
to Anvi'o
Hi Elizabeth,

Please update your anvi'o installation to v2.4.0 when you have a chance :) v2.1.0 is very old now.

You are using the right program to do what you did. anvi-get-db-table-as-matrix became anvi-export-table at some point.

To get all the information, you should be using anvi-summarize. The output will include a gene calls file that describes everything per gene call.


Best,

--

A. Murat Eren (meren)
http://merenlab.org :: twitter :: gpg

To unsubscribe from this group and stop receiving emails from it, send an email to anvio+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/anvio/1922f90b-7818-4f76-96b4-7c39a2c57fa6%40googlegroups.com.

Elizabeth Hénaff

unread,
Aug 16, 2017, 5:06:46 PM8/16/17
to Anvi'o
Thanks for the quick response Meren! Will update version ASAP. 

I now have a new question: I only have a single sample, so I was not able to run anvi-merge, which means I do not have any collections to input to anvi-summarize. Is there a way around that? 

A bit of background on the analysis I'm doing: 

I have a number of environmental metagenomic samples that range in sequencing depth -- from 10M to 100M reads. I'd like to assemble and compare them with respect to taxon diversity, functional analysis and SNV distribution. 

My question, before I embark on this is: how does the input dataset size affect assembly and annotation? This question is spurred by some results obtained in our lab that show that input dataset size affects the results of many metagenomic classifiers (http://www.biorxiv.org/content/early/2017/06/28/156919.1), so I'm wary of comparing assemblies generated from samples with varying coverage depths. 

To start answering that question, I'm doing an experiment where I downsample the number of reads to different depths: 10M, 20M, 30M, 60M, 100M reads from the same sequencing run. 

I assembled them with MEGAHIT and found that with more reads, you get more contigs, higher avg length, and higher max length -- as to be expected. 

My question is how this affects the binning and functional analysis by anvio. 

So far, I'm treating each subset as a different sample, each with own contig DB and mapped bam. 

Is there another way that you would see to make a reasonable comparison of these datasets? 

(I realize this is diverging from original thread topic, glad to repost in new one)

Thanks!! 

Elizabeth

A. Murat Eren

unread,
Aug 16, 2017, 10:56:00 PM8/16/17
to Anvi'o
Hi Elizabeth,


On Wed, Aug 16, 2017 at 4:06 PM, Elizabeth Hénaff <elizabeth...@gmail.com> wrote:
I now have a new question: I only have a single sample, so I was not able to run anvi-merge, which means I do not have any collections to input to anvi-summarize. Is there a way around that? 

​You still can create a collection for all your contigs. For instance, let's say your single profile is called PROFILE.db and your contigs database is CONTIGS.db.

If you run this you will get an output file that you can use as a mock collection do describe everything in your profile database:

for split in `sqlite3 SAMPLE-01/PROFILE.db 'select contig from atomic_data_splits;'`; do echo $split | awk '{print $1 "\tall_contigs"}'; done​ > collection.txt

Then you can import this into your profile database like this (in this example I name it 'default'):

anvi-import-collection collection.txt -p SAMPLE-01/PROFILE.db -c CONTIGS.db -C default

​Then you can summarize this collection:

anvi-summarize -c CONTIGS.db -p SAMPLE-01/PROFILE.db -C default -o summary

And the resulting directory should contain everything.


My question is how this affects the binning and functional analysis by anvio. 

​Sequencing depth impacts pretty much everything. Your ability to recover population genomes directly linked to your ability assemble your data, and clearly sequencing depth has a huge influence on that as you already mentioned. Besides, more coverage means more reliable recovery of single-nucleotide variants from the environment, and lack of coverage makes those estimates much less reliable.

I think it is a good idea to work on subsampled data independently, and see the recovery of population genomes as a function of sequencing depth.​ But it is important to set a course based on a question to make the most out of the investment of time. Because these kinds of benchmarks can take a very long time and effort without any significant finding that will be true for all environments.

​My 2 cents.​


​Best,​

Elizabeth Hénaff

unread,
Aug 17, 2017, 8:44:33 PM8/17/17
to an...@googlegroups.com
Thanks Meren! worked like a charm. 

I have a question about the output of anvi-summarize: where can I find the taxonomic classification? 

The closest I found to that is in:

 $ less all_contigs-scg_domain.txt

archaea

$ less all_contigs-scg_domain_confidence.txt 

-14.26

But what I was looking for would be a taxonomic tree. Am I looking in the right place? 


Thanks! 


Elizabeth




--

Elizabeth Hénaff, PhD | elizabeth-henaff.net 

Postdoctoral Associate
Institute for Computational Biomedicine,
Weill Cornell Medical College

+1 512 299 3051


You received this message because you are subscribed to a topic in the Google Groups "Anvi'o" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/anvio/WGU85P7Gdao/unsubscribe.
To unsubscribe from this group and all its topics, send an email to anvio+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/anvio/CAK5_LafaureFaVwHkMJLBZu465PGGfd%2BW%2BkWEEBW9smPpjFNYg%40mail.gmail.com.

A. Murat Eren

unread,
Aug 17, 2017, 10:15:51 PM8/17/17
to Anvi'o
Hi Elizabeth,

Taking care of good taxonomy is one of the shortcomings of anvi'o we are hoping to address in the near future.

To see how we dealt with taxonomy, you can take a look at this one:


If you are interested in gene-level taxonomy, please let me know. If you imported results from centrifuge or another program into anvi'o, you can get those annotations at the gene-level back.


Best,

--

A. Murat Eren (meren)
http://merenlab.org :: twitter :: gpg

You received this message because you are subscribed to the Google Groups "Anvi'o" group.
To unsubscribe from this group and stop receiving emails from it, send an email to anvio+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/anvio/CAMj1pR%3DcgYTKoO6bnyLYXNV5UBU1Tsm5%2BCVVMD6Q9edzEAEvXA%40mail.gmail.com.

Elizabeth Hénaff

unread,
Aug 18, 2017, 5:51:53 PM8/18/17
to an...@googlegroups.com
Yeah taxonomy is a tricky issue in general. I'll peruse the link you gave me -- thank you. 

I did run centrifuge, and imported those results. As a sanity check -- is there a way to check what imports have been made into a contig database? I'm running a lot of tests in parallel and I want to make sure I've got all my steps straight :) 

Also, can I import taxonomy/functions after profiling a bam? Or does that need to be done before to be meaningful? 


Thanks for your continued help! 

Cheers, 

Elizabeth



--

Elizabeth Hénaff, PhD | elizabeth-henaff.net 

Postdoctoral Associate
Institute for Computational Biomedicine,
Weill Cornell Medical College

+1 512 299 3051


A. Murat Eren

unread,
Aug 19, 2017, 1:42:04 PM8/19/17
to Anvi'o
Hi Elizabeth,

On Fri, Aug 18, 2017 at 4:51 PM, Elizabeth Hénaff <elizabeth...@gmail.com> wrote:
I did run centrifuge, and imported those results. As a sanity check -- is there a way to check
​​
what imports have been made into a contig database?

​You can run these commands to have a quick idea and make sure the output is not empty:

$ sqlite3 CONTIGS.db 'select * from genes_taxonomy limit 10;'
0|1
1|1
2|1
3|1
4|1
5|1
6|1
7|1
8|1
9|1

$ sqlite3 CONTIGS.db 'select * from splits_taxonomy limit 10;'
204_10M_MERGED.PERFECT.gz.keep_contig_878_split_00001|1
204_10M_MERGED.PERFECT.gz.keep_contig_878_split_00002|1
204_10M_MERGED.PERFECT.gz.keep_contig_878_split_00003|1
204_10M_MERGED.PERFECT.gz.keep_contig_878_split_00004|1
204_10M_MERGED.PERFECT.gz.keep_contig_878_split_00005|1
204_10M_MERGED.PERFECT.gz.keep_contig_878_split_00006|1
204_10M_MERGED.PERFECT.gz.keep_contig_878_split_00007|1
204_10M_MERGED.PERFECT.gz.keep_contig_878_split_00008|1
204_10M_MERGED.PERFECT.gz.keep_contig_878_split_00009|1
204_10M_MERGED.PERFECT.gz.keep_contig_878_split_00010|1

$ sqlite3 CONTIGS.db 'select * from taxon_names limit 10;'
1|||||Bifidobacterium|Bifidobacterium adolescentis
(...)
 

I'm running a lot of tests in parallel and I want to make sure I've got all my steps straight :) 

​It is always a good idea :)​

 
Also, can I import taxonomy/functions after profiling a bam? Or does that need to be done before to be meaningful? 

​Yes. Anvi'o is very flexible in that sense, and will tell you when you can't do something in most situations.


Best,

Elizabeth Hénaff

unread,
Aug 22, 2017, 8:23:47 PM8/22/17
to Anvi'o
Awesome, this is super useful. Thanks! 

Centrifuge worked great. 
Reply all
Reply to author
Forward
0 new messages