change Min percent identity for Blast taxonomy assignment

950 views
Skip to first unread message

Lu

unread,
Sep 11, 2012, 12:41:00 PM9/11/12
to qiime...@googlegroups.com
Hi,

I'm using assign_taxonomy.py -m blast. I would like to know if it is possible to increase the Min percent identity, which is 0.9 by default. Thanks

Adamrp

unread,
Sep 12, 2012, 4:22:57 PM9/12/12
to qiime...@googlegroups.com
Hi Lu,

This step is not configurable.  If you would like to submit a feature request, please visit http://sourceforge.net/tracker/?group_id=272178&atid=1157167

Thanks,
Adam Robbins-Pianka

Lu

unread,
Sep 12, 2012, 5:18:04 PM9/12/12
to qiime...@googlegroups.com
Thanks. I already posted a feature request in Sourceforge.

Yevgeniy Marusenko

unread,
Feb 18, 2013, 12:33:39 AM2/18/13
to qiime...@googlegroups.com
Are we able to lower the Min percent identity yet? Too many OTUs are lost because they are not classified at this step...

Thanks,
Yev

Kyle Bittinger

unread,
Feb 18, 2013, 10:36:57 AM2/18/13
to qiime...@googlegroups.com
Yevgeniy,

Would you be able to provide me with some more information on your experiment?  Have you tried raising the e-value via the "--e_value" option?

I would be interested to know if the pre-set value of minimum percent identity was preventing valid assignments after setting the "--e_value" option very high.

Best,
Kyle


--
 
---
You received this message because you are subscribed to the Google Groups "Qiime Forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to qiime-forum...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Yevgeniy Marusenko

unread,
Feb 18, 2013, 1:05:48 PM2/18/13
to qiime...@googlegroups.com
Hi Kyle,

here is some more info. I have found a solution but the same question still stands if the 90% similarity threshold can be changed using Blast method. I will post what worked for me since it involves functional genes (amoA) and may apply to issues others are having.

Experiment:
There are 4 treatments, with 2 or 3 field replicates per treatment for a total of 10 samples. Pyrosequencing data is from 454 from Mr.DNA.
There are approximately 500-1000 sequences per samples (~7,500 total sequences).

Data processing on entire file after splitting libraries:
1. OTU clustering at 97% using cd-hit to obtain 200+ OTUs. (I also tried uclust using my own template database but that only returns <10 assigned OTUs and <50 new clusters of OTUs; this is fine but not for the downstream purpose that I want of detecting community shifts of individual phylotypes)
2. Pick representative OTUs. Although, I feel like if I want to show an abundance-weighted tree, the representative sequences are pointless (if the treeing program does not account for OTU table) so at some point I will re-do all downstream steps but without picking representatives.
3. Align with PyNAST using 75% similarity. I've played around with lowering this to make sure that I don't miss relevant groups because downstream I need the PyNAST failures file (haven't decided yet how low I should go 70%, 60%, etc).  I created my own database of known amoA sequences (at the moment only using 20 sequences that I added created manually while I'm developing this pipeline; later I will add more manually but I was having trouble finding a simple explanation of how to do this automatically from finding amoA sequences online in either BLAST, fungene of RDP, ARB from Pester et al 2012, or Dryad from Fernandez-Guerra and Casamayor 2012 [any ideas?]).  
4. Assign taxonomy. I'm still trying to understand the differences in my final analyses depending whether I do this step on aligned or non-aligned representative sequences. I can successfully create a ID-to-taxonomy file and template-reference local database with whatever amoA sequences I want to add there (the few pure isolates, some enrichment cultures, etc), then running Blast automatically uses 90% similarity. Therefore, any sequence that doesn't match is later thrown out because it was not classified. My earlier issue of only being able to manually create reference database for functional gene plays a role here, I can only add a limited number of template sequences manually and many candidates do not get classified- but this 90% similarity issue would no longer be an issue if I could automatically add hundreds of reference sequences. I tried increasing/decreasing e value several magnitudes and it didn't seem to make any difference at all. I see that the matches that get classified have really low e values and rarely any matches have high e values, so it seems the 90% similarity is the limiting filter here.

Solution that works for me: I used identical files that I originally used for the blast option but instead used the rdp call:
assign_taxonomy.py -i [aligned representative candidate sequences] -r [local manual reference template sequences] -t [ID to taxonomy file] -c 0.7 -o rdp_taxonomy_aligned/

Notes: my use of brackets is only to describe the file that is used. -m blast or -m rdp is not used since rdp is the default. In the ID-to-taxonomy file, at the moment I'm dealing with archaeal amoA so I added in some fake placeholder levels to make sure there are 6 levels of taxonomy, and this works fine classifying to the lowest level that I have (species).

Now, all 200+ OTUs of my unknown sequences are used, more than half of them are classified to the last taxonomic level, the other half are not because I don't have a full reference template of all known sequences.

If anyone notices anything that seems phylogenetically inappropriate to do, please point it out.

Thanks

Yev

Kyle Bittinger

unread,
Feb 18, 2013, 4:05:58 PM2/18/13
to qiime...@googlegroups.com, ymar...@asu.edu
Yevgeniy,

Thanks for the details!  This does look like a legitimate issue and I will re-open the bug for you.

In the meantime, you can try to hack into QIIME directly to override the minimum confidence.

Try a python script like the one I have attached.  You'll need to edit the script and fill in valid file paths where I have typed "FILLMEIN".  I have not tested this out, so let me know if you have success with this method.

Best,
Kyle
assign_taxonomy_blast_override.py

Yevgeniy Marusenko

unread,
Feb 18, 2013, 5:20:41 PM2/18/13
to qiime...@googlegroups.com, ymar...@asu.edu
Ok thanks Kyle,


I edited the script where your wrote FILLMEIN.

In which folder/directory do I place the script file?
 I placed it in the folder that contains all other Qiime files here:
/home/qiime/qiime_software/qiime-1.5.0-release/

But when I call it, the command could not be found.

Yev

Yevgeniy Marusenko

unread,
Feb 18, 2013, 6:33:25 PM2/18/13
to qiime...@googlegroups.com, ymar...@asu.edu
To answer my own question, I'm not sure if this is appropriate but I copied the script file and placed it in every folder within Qiime (by the way I'm using ubuntu virtual box from windows, Qiime 1.5) whereever there is an occurrence of assign_taxonomy.py

After running assign_taxonomy_blast_override.pyI receive this message:
from: can't read /var/mail/qiime.assign_taxonomy
/home/qiime/qiime_software/qiime-1.5.0-release/bin/assign_taxonomy_blast_override.py: line 3: syntax error near unexpected token `('
/home/qiime/qiime_software/qiime-1.5.0-release/bin/assign_taxonomy_blast_override.py: line 3: `assigner = BlastTaxonAssigner({'



Yev

Kyle Bittinger

unread,
Feb 18, 2013, 9:40:02 PM2/18/13
to qiime...@googlegroups.com
Yevgeniy,

Place the script in your working directory, and run the script using the command

python assign_taxonomy_blast_override.py

Please let me know if you get additional errors.

--Kyle

Yevgeniy Marusenko

unread,
Feb 18, 2013, 9:56:35 PM2/18/13
to qiime...@googlegroups.com
For the 5 instances where you wrote FILLMEIN, do I put the path to the directory (ending with the last folder name) or for some of them I should add the name of the file (with extension for fasta, txt, etc.)?

Also, in addition to using the command you indicated in the last post, I assume that I don't need to also run -m blast, -i xx, etc, and other parts of the options that I normally would for the assign_taxonomy.py since all of these are specified within the modified script... 

Yevgeniy Marusenko

unread,
Feb 18, 2013, 9:59:34 PM2/18/13
to qiime...@googlegroups.com
This is my call:

qiime@qiime-VirtualBox:~/qiime_software/qiime-1.5.0-release/yevamoa/Arch/Analysis1$ python assign_taxonomy_blast_override.py

This is what is in the edited python script that you uploaded:

from qiime.assign_taxonomy import BlastTaxonAssigner

assigner = BlastTaxonAssigner({
    'Min percent identity': 90.0,
    'Max E value': 1e-30,
    'id_to_taxonomy_filepath': "/home/qiime/qiime_software/qiime-1.5.0-release/yevamoa/Arch/Analysis1/IDtaxonomy6lvl.txt",
    'reference_seqs_filepath': "/home/qiime/qiime_software/qiime-1.5.0-release/yevamoa/Arch/Analysis1/Arch_references_andsubclustersshortid.fas",
    })
assigner(
    seq_path="/home/qiime/qiime_software/qiime-1.5.0-release/yevamoa/Arch/Analysis1/pynast_aligned_70/rep_set1_aligned.fasta",
    result_path="/home/qiime/qiime_software/qiime-1.5.0-release/yevamoa/Arch/Analysis1/",
    log_path="/home/qiime/qiime_software/qiime-1.5.0-release/yevamoa/Arch/Analysis1/",
    )

And this is the error that I am receiving:

Traceback (most recent call last):
  File "assign_taxonomy_blast_override.py", line 12, in <module>
    log_path="/home/qiime/qiime_software/qiime-1.5.0-release/yevamoa/Arch/Analysis1/",
  File "/home/qiime/qiime_software/qiime-1.5.0-release/lib/qiime/assign_taxonomy.py", line 124, in __call__
    logger = self._get_logger(log_path)
  File "/home/qiime/qiime_software/qiime-1.5.0-release/lib/qiime/assign_taxonomy.py", line 234, in _get_logger
    handler = logging.FileHandler(log_path, mode='w')
  File "/home/qiime/qiime_software/python-2.7.1-release/lib/python2.7/logging/__init__.py", line 893, in __init__
    StreamHandler.__init__(self, self._open())
  File "/home/qiime/qiime_software/python-2.7.1-release/lib/python2.7/logging/__init__.py", line 912, in _open
    stream = open(self.baseFilename, self.mode)
IOError: [Errno 21] Is a directory: '/home/qiime/qiime_software/qiime-1.5.0-release/yevamoa/Arch/Analysis1'

Kyle Bittinger

unread,
Feb 18, 2013, 10:02:36 PM2/18/13
to qiime...@googlegroups.com
Try this:

from qiime.assign_taxonomy import BlastTaxonAssigner

assigner = BlastTaxonAssigner({
    'Min percent identity': 90.0,
    'Max E value': 1e-30,
    'id_to_taxonomy_filepath': "IDtaxonomy6lvl.txt",
    'reference_seqs_filepath': "Arch_references_andsubclustersshortid.fas",
    })
assigner(
    seq_path="pynast_aligned_70/rep_set1_aligned.fasta",
    result_path="rep_set1_aligned_blast_assignments.txt",
    log_path="rep_set1_aligned_blast_assignments.log",
    )

I am assuming that the forum does not mess up my indentation...
Message has been deleted

Yevgeniy Marusenko

unread,
Feb 18, 2013, 10:51:58 PM2/18/13
to qiime...@googlegroups.com
Ok great, I discovered a major issue in the process, but atleast now it is all resolved, yay! The script from your previous post works, the file names are sufficient since the script is placed in the working directory.

I ran the standard align_taxonomy.py which uses the standard 0.9 and then compared results to the new script entering "0.9" and the results are identical (same number of OTUs that have no hits). I played around with increasing/decreasing similarity and it makes a difference sometimes depending on what the e value is, similarly I played around increasing/decreasing e value and it makes a difference sometimes. For example, if the e value is high and positive 10000 (basically terrible quality) then changing from 90 to 70% similarity won't make any difference (but a change like 90 to 15% might have an effect [although thats not a good idea for real data, I'm just testing extreme values]), on the other hand, if e value is really low and negative 0.00001 then changing from 97% down to 60% similarity increases the number of OTUs that have no hits because they are out of the threshold for one of the parameters (e-value or similarity).

Now, the issue. In your script (from the last post) you wrote the similarity as "90.0" and thats what it appears as in the log file. When I run the align_taxonomy.py and look at log file, the similarity is outputted as "0.90", as in 0.9% and not 90%, so changing from 0.9 to 0.7 is incorrectly to assume that it is 90% to 70%. Does that make sense? Hopefully this decimal place issue is not a problem in many of the other Qiime scripts.

Thanks Kyle!

Yev

Kyle Bittinger

unread,
Feb 19, 2013, 7:38:55 AM2/19/13
to qiime...@googlegroups.com
Yev,

Glad it worked for you.

I still don't quite understand what the discrepancy is that produces the two values of 90.0 and 0.90.  Would you run my script and send me the full command plus the log file, then do the same for assign_taxonomy.py (using -m blast)?

Best,
Kyle


Jai Ram Rideout

unread,
Feb 19, 2013, 10:45:26 AM2/19/13
to qiime...@googlegroups.com
Hi Yev,

It looks like this bug was fixed for the QIIME 1.6.0 release (you're using QIIME 1.5.0):


-Jai

Yevgeniy Marusenko

unread,
Feb 19, 2013, 12:14:10 PM2/19/13
to qiime...@googlegroups.com
Oh perfect, thanks for pointing that out Jai,

Kyle, would you still like the log files or seems like it's all cleared up now?

Yev

Kyle Bittinger

unread,
Feb 19, 2013, 12:14:49 PM2/19/13
to qiime...@googlegroups.com
Nope, I'm good.  Glad we cleared this up!

Yevgeniy Marusenko

unread,
Jan 28, 2015, 6:58:04 PM1/28/15
to qiime...@googlegroups.com
Hi Kyle,

I am trying something similar to what we discussed in this old post.

I am having an error though. Now I am using Qiime 1.8 through a computing cluster. 

My goal is to change the 'min percent identity' for Blast assignments (e.g. 75% instead of 90%). Not sure what to modify in this script for it to run appropriately. I'm attaching the shell and python file, and the error message.

Any ideas?

Thanks
Yev
Assign_taxonomy.e3164781
assigntaxonomyblast75.sh
assign_taxonomy_blast_override.py

Kyle Bittinger

unread,
Jan 29, 2015, 3:22:12 PM1/29/15
to qiime...@googlegroups.com
The python script doesn't have the line

#!/usr/bin/env python

at the top, so Bash is trying to run it as a shell script.  Either add the line above or run the script like:

python assign_taxonomy_blast_override.py

Best,
Kyle

For more options, visit https://groups.google.com/d/optout.

Yevgeniy Marusenko

unread,
Jan 31, 2015, 1:16:41 PM1/31/15
to qiime...@googlegroups.com
Perfect, thanks. This feature now works for me.
Reply all
Reply to author
Forward
0 new messages