assign taxonomy through BLAST against NCBI 16S database

1,714 views
Skip to first unread message

Christian Milani

unread,
Oct 31, 2012, 12:33:01 PM10/31/12
to qiime...@googlegroups.com
Hi everyone,
i need to analyze my metagenomic data (200.000 200bp reads) and would like to assign taxonomy using the NCBI 16S database trying to reach the species level.

i've tryed to modify the parameters files setting:

# Taxonomy assignment parameters
assign_taxonomy:id_to_taxonomy_fp /home/qiime/Desktop/test_blas/blastdb/gi_taxid_nucl.bin 
assign_taxonomy:assignment_method blast
assign_taxonomy:blast_db /home/qiime/Desktop/test_blas/blastdb/16SMicrobial

Where 16SMicrobial is the NCBI 16S database (16SMicrobial.nhr , 16SMicrobial.nin etc.) and gi_taxid_nucl.bin is a tabular file i usualy use for MEGAN 4 analysis.

I tryed to exec this command to:

qiime@qiime-VirtualBox:~/Desktop/paper_metagenomics/test_blast$ assign_taxonomy.py -i seqs_rep_set.fasta -b blast/16SMicrobial -t blast/gi_taxid_nucl.bin -o prova/ok

and got this error:

Traceback (most recent call last):
  File "/home/qiime/qiime_software/qiime-1.5.0-release/bin/assign_taxonomy.py", line 226, in <module>
    main()
  File "/home/qiime/qiime_software/qiime-1.5.0-release/bin/assign_taxonomy.py", line 222, in main
    result_path=result_path,log_path=log_path)
  File "/home/qiime/qiime_software/qiime-1.5.0-release/lib/qiime/assign_taxonomy.py", line 359, in __call__
    taxonomy_file, training_seqs_file = self._generate_training_files()
  File "/home/qiime/qiime_software/qiime-1.5.0-release/lib/qiime/assign_taxonomy.py", line 394, in _generate_training_files
    seq_id, lineage_str = map(strip, line.split('\t'))
ValueError: need more than 1 value to unpack


Thankyou for any help that you'll be able to provide :)

Tony Walters

unread,
Oct 31, 2012, 1:44:11 PM10/31/12
to qiime...@googlegroups.com
Hello Christian,

The taxonomy mapping file needs to be tab separated, as shown here: http://qiime.org/documentation/file_formats.html#sequence-id-to-taxonomy-mapping-files

You should check the assign_taxonomy:id_to_taxonomy_fp /home/qiime/Desktop/test_blas/blastdb/gi_taxid_nucl.bin file as this probably is not in the right format.

-Tony


--
 
 
 

Christian Milani

unread,
Oct 31, 2012, 3:57:06 PM10/31/12
to qiime...@googlegroups.com
Hi tony,
you know if there's already a taxonomy mapping file for the 16S NCBI database?


I've tryed the modified greengenes taxonomy mapping file found in another topic that allow RPD classification to reach species level and it worked... but just 5% of the sequences reached species level, other stopped at genera or in worse cases to family... you think i can get anything more of this with 200bp based of amplified V3 region?

thank you so much for your help, i don't have a lot of experience in metagenomics.

Tony Walters

unread,
Oct 31, 2012, 4:30:29 PM10/31/12
to qiime...@googlegroups.com
Hello Christian,

I don't have a QIIME compatible version of the 16S NCBI database (it doesn't mean that someone hasn't made one, but I haven't seen one).

For the classification with RDP, failure to classify at the species level can either be due to 1.  an inability to distinguish between two different taxa with the sequence fragment or 2.  A lack of defined species/genus for a given sequence in the reference database (this should manifest as something like "g__" or "s__" at the end of a given assignment rather than an actual taxa string).

There are a few potential options to improve the assignments.

Make sure there aren't any primers or barcodes/adapters left on the ends of your sequences.  The presence of these will depend on your read lengths.  The truncate_reverse_primer.py script can help you to remove these if present (it's probably best to visually inspect your sequences for your reverse primer, and reverse complement of the reverse primer, near the ends of your sequences, although not necessarily at the very end).  This can throw off the RDP classifier if they are present, as the barcodes/adapters do not match the reference database.

The next option would be to use -m blast rather than RDP.    This has the potential of creating a false sense of certainty about assignment (e.g., multiple very good blast hits to different taxa could happen, but this will return the best hit).

Another option would be to try the newest version of Greengenes (see https://groups.google.com/group/qiime-forum/browse_thread/thread/b7be2ed47e18f8db/15368478b3c4296a?lnk=gst&q=GreenGenes+12_10+-+problem+retraining+RDP+classifier#15368478b3c4296a).  To use the species level assignments with this, you would have to get the development version of QIIME running though.

Finally, you could get a bit of improvement in assignments by slicing the training sets to only include the V3 region, see http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3217155/ from Werner et al.  This is automated in the svn version of Primer Prospector (http://pprospector.sourceforge.net/install/install.html#latest-development-version) with the slice_aligned_region.py script, which would take your V3 primer sequences as input and the aligned (won't work with non-aligned) sequences; the Greengenes 4 Feb. release has the 97_aligned file in the representative sequence folder.

I haven't done specific tests on the V3 region regarding assignments, but we've seen the best assignments from the V4 and V2 regions, which still had some species not assigned correctly during in silico tests, so even with the most optimal scenarios regarding sequence quality I would expect there to be a fair number of sequences not assigned to the species level with V3 reads.

-Tony

--
 
 
 

Christian Milani

unread,
Oct 31, 2012, 4:40:55 PM10/31/12
to qiime...@googlegroups.com
Hi Tony,
i will try everyhing you suggested and i'll let you know if i get any result.

Thank you so much!

Christian Milani

unread,
Nov 6, 2012, 4:20:48 AM11/6/12
to qiime...@googlegroups.com
Hi Tony,
i'm trying to run QIIME using Blast instead of RDP... I setted up the qiime_parameters file but i get stuck at this command:

# Assign taxonomy command 
/home/qiime/qiime_software/python-2.7.1-release/bin/python /home/qiime/qiime_software/qiime-1.5.0-release/bin/parallel_assign_taxonomy_blast.py -i otus/rep_set//seqs_rep_set.fasta -o otus/blast_assigned_taxonomy -T --jobs_to_start 8 --seconds_to_sleep 60 --reference_seqs_fp /home/qiime/current_GREENGENES_gg16S_unaligned.fasta --id_to_taxonomy_fp /home/qiime/greengenes_species.txt

i Don't get any error and in the top i can see at first the formatdb working but after that i can't see any process related to blast/qiime.

I tryed with a small dataset but i have the same problem.

Thanks for you help,

Christian

Tony Walters

unread,
Nov 6, 2012, 6:24:42 AM11/6/12
to qiime...@googlegroups.com
Hello Christian,

It may be worth running the single process (assign_taxonomy -m blast) rather than the parallel script to help determine if the problem is coming from the blast setup or from the parallel jobs.  There may be a discrepancy between the reference sequences used and the taxonomy file.  I would guess that the greengenes_species.txt file is from previous QIIME forum threads for the 4 Feb. Greengenes release, but I am not sure about the current_GREENGENES_gg16S_unaligned.fasta file that you're using.

--
 
 
 

Christian Milani

unread,
Nov 6, 2012, 12:37:49 PM11/6/12
to qiime...@googlegroups.com
Running the non parallel script made it work!
As i have time i'll check the results!

Greg Caporaso

unread,
Nov 6, 2012, 10:21:33 PM11/6/12
to Qiime Forum
Hi Christian,
You should try the parallel version with absolute rather than relative
paths. That's a known issue that we're currently working on fixing.

Greg

On Tue, Nov 6, 2012 at 9:37 AM, Christian Milani
> --
>
>
>

xzron...@gmail.com

unread,
Aug 23, 2015, 8:26:45 AM8/23/15
to Qiime Forum

Hi Greg,

I run Qiime1.9.0 assign taxonomy to the OTUs with assign_taxonomy.py script, species level is so little, only several otus assigned several species in the otus_tax_assignments.txt, and the latter summarize_taxa_through_plots.py script produce .txt document only from out_table_sorted.L2.txt to out_table_sorted.L6.txt without species level  

The command:

assign_taxonomy.py -i ./usearch_qf_results_80/otus.fa -r /usr/local/lib/python2.7/dist-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta -t /usr/local/lib/python2.7/dist-packages/qiime_default_reference/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt

and the latter command:

summarize_taxa_through_plots.py -o taxa_summary/plots -i ./usearch_qf_results_80/otu_table.biom -m samples_Map_all.txt -s

 

The 16s sequences amplified V3,V4,V5 regions, ~450bp, sequencing company analysis the data reached the species level. I don’t know what can I do to reach the species level. I hope you can give me some suggestion about this question, thank you! 


在 2012年11月7日星期三 UTC+8上午11:21:33,Greg Caporaso写道:

xzron...@gmail.com

unread,
Aug 23, 2015, 8:45:08 AM8/23/15
to Qiime Forum


在 2015年8月23日星期日 UTC+8下午8:26:45,xzron...@gmail.com写道:
otus_tax_assignments.log.txt
otus_tax_assignments.xlsx
out_table_sorted.L6.xlsx
summarize taxa log.txt

Greg Caporaso

unread,
Aug 24, 2015, 11:00:32 AM8/24/15
to Qiime Forum
Hello,
Short 16S sequencing generally does not achieve species-level resolution, so I think you're likely just not getting high confidence taxonomy assignments at the species level. All of your results look correct to me, so I don't think that it's an error in your workflow, for example.

Greg

Mauro Truglio

unread,
Jun 7, 2016, 10:10:33 PM6/7/16
to qiime...@googlegroups.com
Hi all,
probably a little late on this one, but after dealing with the same issue, I wrote a bash script that creates a QIIME-compatible taxonomy file starting from any set of sequences with a GI id in a fasta file (e.g. NCBI 16S database). It allows QIIME to assign taxonomy based on the original NCBI dictionary, and it often reaches species-level accuracy.
If you think that this is a useful tool for other users, let me know and I'll open a new thread.

Here's the link to the repo: https://github.com/mtruglio/QIIME_utilities

Cheers

Mauro

xzron...@gmail.com

unread,
Oct 6, 2016, 4:24:26 AM10/6/16
to Qiime 1 Forum
Thank Mauro. I try do as your suggestion. I download 16SMicrobial database from NCBI and convert it blast_db format to fasta format using the following command:
qiime@qiime-190-virtual-box:~/mywork/tools$blastdbcmd -entry all -db 16SMicrobial -out 16SMicrobial.fasta
BLAST Database error: No alias or index file found for nucleotide database [16SMicrobial] in search path [/home/qiime/mywork/tools::]
16SMicrobial and ncbi-blast-2.2.31+ in the same file (tools).
How can i sovle the problem?

Best wishes,

Zhenrong Xie

Yoshiki Vázquez Baeza

unread,
Oct 7, 2016, 1:51:11 PM10/7/16
to Qiime 1 Forum
Can you try using absolute paths?

Thanks!

Yoshiki.
Reply all
Reply to author
Forward
0 new messages