SILVA database

Jia Zhou

Nov 22, 2016, 8:14:55 PM

to Qiime 1 Forum

Hi, Sir

I am trying to use the SILVA database for OTU picking in QIIME. I was wondering what the differences are between these taxonomy files:

raw_taxonomy.txt
taxonomy_7_levels.txt
taxonomy_all_levels.txt
consensus_taxonomy_7_levels.txt
consensus_taxonomy_all_levels.txt
majority_taxonomy_7_levels.txt
majority_taxonomy_all_levels.txt

Thanks

Jia

TonyWalters

Nov 23, 2016, 2:14:13 AM

to Qiime 1 Forum

Jia,

The different taxonomy mapping files are described in the SILVA_notes text file that comes with the 123 database. See the pasted sections below:

Taxonomy strings are available in the raw format (strings pulled directly from the SILVA fasta labels), in an expanded RDP-compatible format, and in a seven-level RDP format. The expanded RDP format contains a level for every level present in any of the taxonomy strings. As a consequence, the first 7 levels match domain through species for most Archaea, Bacteria, and many eukaryotes, but because many eukaryotes have extra levels, one will often have to look at deeper levels to get the species. When viewing taxonomy plots generated with these taxa strings, be aware that the expanded format may result in mismatched taxa levels (e.g. a species level for a bacterial taxon may be a family level for a fungal taxon). The 7-level taxonomy uses 7 levels if they are present. If more than 7 levels are present, the first 3 and last 4 levels of taxonomy are used. If fewer than 7 levels are present, all levels present are used, and the empty fields (e.g. d6__;d7__) are padded out to get 7 levels, with the text string of the last defined level replicated in the empty levels. The differences between these taxonomy strings (in the taxonomy/ folder) and those in the majority and consensus taxonomy folders are described at the end of this document.
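As a rough illustration of the 7-level rule described above, here is a minimal Python sketch. It is not the script that was used to build the SILVA files: the function name `to_seven_levels` is made up, and it assumes the taxonomy has already been split into a list of bare level names with any level prefixes stripped.

```python
def to_seven_levels(levels):
    """Force a list of taxonomy level names into exactly 7 levels.

    Hypothetical helper, assuming bare names with no level prefixes.
    """
    levels = [name for name in levels if name]  # drop empty fields
    if not levels:
        raise ValueError("need at least one defined taxonomy level")
    if len(levels) > 7:
        # More than 7 levels: keep the first 3 and the last 4.
        return levels[:3] + levels[-4:]
    # 7 or fewer: keep everything and repeat the last defined
    # level until there are 7 entries.
    return levels + [levels[-1]] * (7 - len(levels))
```

With this sketch, a 10-level eukaryote string keeps its levels 1-3 and 7-10, while a string that stops at family has the family name repeated in the genus and species slots.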

In the taxonomy/taxonomy_all/97 folder, there are these files, each listed with a brief description to use as a guide when choosing a file:

raw_taxonomy.txt - these are the sequence IDs followed by the raw taxonomy strings directly pulled from the SILVA NR fasta file (will work with the -m blast assignment method, but not uclust/RDP)

taxonomy_7_levels.txt - This is the raw taxa, forced into exactly 7 levels as described in the preceding paragraph. This will work with all assignment methods

taxonomy_all_levels.txt - This is the raw taxa, expanded out to all levels present in any of the taxonomy strings (14 total levels). Will work with all assignment methods, but will use more memory than the 7 level taxonomy. Deeper levels of taxonomy, which will mostly come from Eukaryotes, will require expanding the levels used with QIIME scripts such as summarize_taxa.py.

consensus_taxonomy_7_levels.txt - This file is the same as the 7 levels, but uses the 100% consensus taxonomy (this is described in the “Consensus and Majority Taxonomies” section).

consensus_taxonomy_all_levels.txt - This file is the same as the all levels taxonomy, but uses the 100% consensus taxonomy (this is described in the “Consensus and Majority Taxonomies” section).

majority_taxonomy_7_levels.txt - This file is the same as the 7 levels, but uses the 90% majority taxonomy (this is described in the “Consensus and Majority Taxonomies” section).

majority_taxonomy_all_levels.txt - This file is the same as the all levels taxonomy, but uses the 90% majority taxonomy (this is described in the “Consensus and Majority Taxonomies” section).
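If it helps when choosing between these files, one quick check is to count how many levels the strings in a given mapping file actually have. The sketch below assumes the usual QIIME layout of one sequence ID and one taxonomy string per line, separated by a tab, with levels separated by semicolons. The function name `level_counts` and the two example records are invented for illustration.

```python
import io
from collections import Counter

def level_counts(handle):
    """Count taxonomy strings by number of levels.

    Assumes tab-separated lines: sequence ID, then taxonomy string.
    """
    counts = Counter()
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        _seq_id, taxonomy = line.split("\t", 1)
        counts[len(taxonomy.split(";"))] += 1
    return counts

# Made-up records standing in for a real mapping file.
example = io.StringIO(
    "id1\tD_0__Bacteria;D_1__Firmicutes\n"
    "id2\tD_0__Archaea;D_1__Euryarchaeota;D_2__Methanobacteria\n"
)
print(level_counts(example))
```

For a real file, pass an open file handle instead of the `io.StringIO` stand-in.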

=================================

Consensus and Majority Taxonomies

=================================

Custom scripts (linked when described) and code from Mike Robeson (https://github.com/mikerobeson/Misc_Code/tree/master/SILVA_to_RDP) were used to generate taxonomy strings from the full NR Silva 119 fasta file, which is used by the custom scripts below when generating consensus/majority taxonomy strings. The clustered OTU mapping files were generated with QIIME 1.9.0 as described earlier in this document under the “Filtering raw fasta file, creation of representative sequence files” section.

Reason for these alternative taxonomy string files:

A user of the Silva119 data pointed out that the taxonomy with the SILVA119 release is based only upon the taxonomy string of the representative sequence for the cluster of reads, which could lead to incorrect confidence in taxonomy assignments at the fine level (genus/species). To address this, I have endeavored to create taxonomy strings that are either consensus (the taxonomy strings must match for every read that fell into the cluster) or majority (greater than or equal to 90% of the taxonomy strings for a given cluster must match). If the taxonomy strings fail to reach consensus or majority at a given level, that level becomes ambiguous, and the comparison moves up the levels of taxonomy until the consensus/majority criterion is met.

For example, if a cluster had two reads, and one taxonomy string was:

D_0__Archaea;D_1__Euryarchaeota;D_2__Methanobacteria;D_3__Methanobacteriales;D_4__Methanobacteriaceae;D_5__Methanobrevibacter;D_6__Methanobrevibacter sp. HW3

and the second taxonomy string was:

D_0__Archaea;D_1__Euryarchaeota;D_2__Methanobacteria;D_3__Methanobacteriales;D_4__Methanobacteriaceae;D_5__Methanobrevibacter;D_6__Methanobrevibacter smithii

Then for either consensus or majority strings, the seventh level (D_6, since numbering starts at D_0, the domain) would become ambiguous, as the species levels do not match. The string in the representative sequence taxonomy mapping file becomes:

D_0__Archaea;D_1__Euryarchaeota;D_2__Methanobacteria;D_3__Methanobacteriales;D_4__Methanobacteriaceae;D_5__Methanobrevibacter;Ambiguous_taxa
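The logic in this example can be sketched in a few lines of Python. This is only an approximation under stated assumptions, not the code in the gists linked further down: the function name `cluster_taxonomy` is hypothetical, all strings are assumed to use the same semicolon-separated levels, and every unresolved level is assumed to be written as `Ambiguous_taxa`.

```python
from collections import Counter

def cluster_taxonomy(tax_strings, threshold=1.0, n_levels=7):
    """Return a consensus (threshold=1.0) or majority (e.g. 0.9) string.

    Hypothetical sketch: compares exact prefixes, deepest level first.
    """
    split = [t.split(";") for t in tax_strings]
    total = len(split)
    for depth in range(n_levels, 0, -1):
        # Count identical prefixes of this depth across the cluster.
        prefixes = Counter(tuple(s[:depth]) for s in split if len(s) >= depth)
        if not prefixes:
            continue
        prefix, count = prefixes.most_common(1)[0]
        if count / total >= threshold:
            # Agreement reached: mark the deeper levels as ambiguous.
            return ";".join(list(prefix) + ["Ambiguous_taxa"] * (n_levels - depth))
    return ";".join(["Ambiguous_taxa"] * n_levels)
```

Called with the two Methanobrevibacter strings above, either threshold gives the string ending in `D_5__Methanobrevibacter;Ambiguous_taxa`. If nine of ten reads were Methanobrevibacter smithii, `threshold=0.9` would keep the species while `threshold=1.0` would not.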

Because the taxonomy strings are not perfectly matched in terms of names/depths across all of the SILVA data, this can lead to some taxonomies being more ambiguous with my approach (exact string matches) than they actually are, particularly for the eukaryotes. There are over 1.5 million taxonomy strings in the non-redundant SILVA 119 release, so I can’t fault the maintainers of SILVA for these taxonomy strings being imperfect from a parsing/bioinformatics perspective.

The scripts used to create the consensus and 90% majority taxonomy strings are located here:

https://gist.github.com/walterst/bd69a19e75748f79efeb

https://gist.github.com/walterst/f6f08f6583bb320bb10d

Jia Zhou

Nov 23, 2016, 2:50:58 AM

to Qiime 1 Forum

Hi, Tony

Those files were in the 16s_only directory, which means the Eukaryotes will not be considered. Does that mean it would be better to use the 7_levels files for the analysis?

Thanks

Jia

TonyWalters

Nov 23, 2016, 2:57:02 AM

to Qiime 1 Forum

Bacteria/Archaea tend to be more consistent in using 7 levels of taxonomy (although there are often unclassified species/genera/families) whereas Eukaryotes often have more than 7 levels. If you're doing 16S-only work, you probably won't see much difference between 7-levels and all levels.
