Difference in taxonomic assignment: Greengenes vs. SILVA

baoanxh2006

Mar 27, 2016, 10:48:27 PM
to Qiime 1 Forum
Dear Qiime Users,

I have paired-end reads of the V1-V3 16S region. My sequence processing consists of assembling the reads (using PEAR), removing primers (using multiple_extract_barcodes.py in Qiime 1.9.1), then filtering and truncating the seqs to 240 bp (following the UPARSE pipeline).
When I did taxonomic assignment with the Greengenes database (gg_13_8_otus) and with Silva_119 in the Qiime virtual box (v.1.9.1), I found a significant difference in the assignments for Proteobacteria, as shown below.

Commands:
1) assign_taxonomy.py -i otus.fa
2) assign_taxonomy.py -i otus.fa -r ../../../Downloads/Silva119_release/rep_set/97/Silva_119_rep_set97.fna -t ../../../Downloads/Silva119_release/taxonomy/97/taxonomy_97_7_levels.txt -o Silva_tax_assign/

For example:

Using Greengenes:
Taxon                                                                    Sample1  Sample2  Sample3  Sample4
k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales      10.506   19.796   8.937    16.577
k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;Other               16.310   20.642   16.145   1.233
k__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales   1.764    40.493   3.217    19.899
k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Xanthomonadales  19.250   0.351    23.148   3.065
k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;Other               0.637    0.017    0.467    0.006

Using Silva database:
Taxon                                                                            Sample1    Sample2    Sample3    Sample4
D_0__Bacteria;D_1__Proteobacteria;D_2__Alphaproteobacteria;D_3__Rhizobiales      26.41278   39.74114   24.52475   16.96269
D_0__Bacteria;D_1__Proteobacteria;D_2__Alphaproteobacteria;Other                 0.304246   0.621684   0.387718   0.8775022
D_0__Bacteria;D_1__Proteobacteria;D_2__Betaproteobacteria;D_3__Burkholderiales   14.31116   40.46679   17.30375   7.773602
D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Xanthomonadales  8.102449   0.357522   10.28378   2.75844
D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;Other                 0.035862   0.111394   0.017779   0.101754

It seems that the seqs assigned to "Alphaproteobacteria;Other" with Greengenes were assigned to "Alphaproteobacteria;Rhizobiales" when using Silva.
Also, a large portion of the seqs assigned to "Gammaproteobacteria;Xanthomonadales" with Greengenes moved to "Betaproteobacteria;Burkholderiales" when using Silva.

Could you please advise me on the possible reasons for this difference? Which classification is better in this case?
I prefer the classification from the Greengenes database, except that a high portion of seqs was assigned to "Alphaproteobacteria;Other". Can I use the Silva classification as a basis for moving this portion to "Alphaproteobacteria;Rhizobiales"?
Could you also recommend a source or references that explain how these databases are structured? I find them hard to understand.

Thank you very much!

I am looking forward to your advice!

Yours sincerely,
An

Jenya Kopylov

Mar 28, 2016, 10:54:57 AM
to Qiime 1 Forum
Hi An,

Most likely, some abundant OTUs are matching different reference sequences in Greengenes vs. Silva with high %id (>90%), and both answers can be correct. This is a limitation of assigning short reads against a highly conserved 16S region.

However, you can do more in-depth investigation to verify or disprove this (thanks Tony!):

1. create a filtered OTU table with just the taxa in question (filter_taxa_from_otu_table.py, see this example)
2. convert the filtered OTU table to tab-delimited format
    $ biom convert -i filtered_otutable.biom --to-tsv --table-type="OTU table" -o filtered_otutable.txt --header-key taxonomy
3. sort the table by abundance in Excel
4. use the sorted table to see which OTUs/taxa are systematically different, and use the OTU ID to query the representative sequences (97_otus.fasta for Greengenes or Silva_119_rep_set97.fna for Silva)
   $ grep "OTU ID" -A 1 97_otus.fasta
5. BLAST this reference sequence on NCBI to see which of the two reference databases gives the more accurate result
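
If you would rather not use Excel for steps 3 and 4, here is a minimal Python sketch of the same idea. It assumes the TSV written by biom convert has a "#OTU ID" header line and a trailing "taxonomy" column; the function names rank_otus_by_abundance and fetch_sequence are made up for illustration and are not part of Qiime.

```python
def rank_otus_by_abundance(tsv_text):
    """Return (otu_id, total_abundance, taxonomy) tuples, most abundant first.

    Assumes a '#OTU ID' header line; other lines starting with '#' are skipped.
    """
    header = None
    rows = []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if fields[0].startswith("#"):
            if fields[0].startswith("#OTU ID"):
                header = fields
            continue
        if header is None:
            raise ValueError("no '#OTU ID' header line found")
        has_taxonomy = header[-1] == "taxonomy"
        counts = fields[1:-1] if has_taxonomy else fields[1:]
        taxonomy = fields[-1] if has_taxonomy else ""
        rows.append((fields[0], sum(float(c) for c in counts), taxonomy))
    return sorted(rows, key=lambda row: row[1], reverse=True)


def fetch_sequence(fasta_text, seq_id):
    """Return the sequence whose header ID equals seq_id exactly, or None."""
    found = False
    seq_lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if found:
                break  # reached the next record, so we are done
            tokens = line[1:].split()
            found = bool(tokens) and tokens[0] == seq_id
        elif found:
            seq_lines.append(line.strip())
    return "".join(seq_lines) if found else None
```

To use it, read filtered_otutable.txt and the FASTA file into strings (for example with open(path).read()) and pass them in. Unlike grep with -A 1, fetch_sequence matches the whole ID rather than a substring and also handles sequences wrapped over several lines.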

It would also be interesting to see if your results better converge between the two databases if you set "--similarity 0.97" (rather than default --similarity 0.90) or even --similarity 0.99.
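
One way to check whether the two databases converge is to compare the per-OTU assignment files from the two assign_taxonomy.py runs. The sketch below is only an illustration: it assumes each file is tab-separated with the OTU ID in the first column and the lineage in the second, and that rank prefixes look like "k__" (Greengenes) or "D_0__" (Silva). The function names parse_assignments and disagreements are hypothetical.

```python
import re
from collections import Counter

# Assumed prefix styles: 'k__', 'p__', ... (Greengenes) and 'D_0__', 'D_1__', ... (Silva).
_RANK_PREFIX = re.compile(r"^(?:[a-z]|D_\d+)__")


def parse_assignments(text):
    """Map OTU ID -> list of rank names with the database prefixes stripped."""
    assignments = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        names = [_RANK_PREFIX.sub("", part.strip()) for part in fields[1].split(";")]
        assignments[fields[0]] = [name for name in names if name]
    return assignments


def disagreements(first, second, depth):
    """Count (first_lineage, second_lineage) pairs that differ within `depth` ranks.

    Only OTUs present in both mappings are compared; depth=4 compares down to order.
    """
    diffs = Counter()
    for otu_id in first.keys() & second.keys():
        a = tuple(first[otu_id][:depth])
        b = tuple(second[otu_id][:depth])
        if a != b:
            diffs[(";".join(a), ";".join(b))] += 1
    return diffs
```

Running disagreements on the Greengenes and Silva assignments at each --similarity setting, and watching whether the counts shrink, gives a rough picture of convergence. Note that this counts OTUs, not reads, so it is not weighted by abundance.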

Let me know if you need more details regarding the steps listed above.

Jenya

baoanxh2006

Mar 29, 2016, 4:57:28 AM
to Qiime 1 Forum
Thank you very much, Jenya!

I will give it a try and see.


Kind regards,
An