Hi Kyle,
Here is some more info. I have found a solution, but the original question still stands: can the 90% similarity threshold be changed when using the BLAST method? I will post what worked for me, since it involves a functional gene (amoA) and may apply to issues others are having.
Experiment:
There are 4 treatments, with 2 or 3 field replicates per treatment, for a total of 10 samples. The data are 454 pyrosequencing reads from Mr.DNA.
There are approximately 500-1000 sequences per sample (~7,500 sequences in total).
Data processing on the entire file after splitting libraries:
1. OTU clustering at 97% using cd-hit, which yields 200+ OTUs. (I also tried uclust with my own template database, but that returns only <10 assigned OTUs and <50 new OTU clusters. That is fine in itself, but not for my downstream purpose of detecting community shifts in individual phylotypes.)
2. Pick representative sequences for the OTUs. That said, I feel that if I want to show an abundance-weighted tree, the representative sequences are pointless (if the treeing program does not account for the OTU table), so at some point I will redo all downstream steps without picking representatives.
3. Align with PyNAST using 75% similarity. I've experimented with lowering this so that I don't miss relevant groups, because downstream I need the PyNAST failures file (I haven't decided yet how low to go: 70%, 60%, etc.). I created my own database of known amoA sequences. At the moment it contains only 20 sequences that I added manually while developing this pipeline. Later I will add more, but I had trouble finding a simple explanation of how to do this automatically from the amoA sequences available online in BLAST, FunGene at RDP, the ARB database from Pester et al. 2012, or the Dryad data from Fernandez-Guerra and Casamayor 2012 (any ideas?).
4. Assign taxonomy. I'm still trying to understand how my final analyses differ depending on whether I do this step on aligned or non-aligned representative sequences. I can successfully create an ID-to-taxonomy file and a local template-reference database with whatever amoA sequences I want to include (the few pure isolates, some enrichment cultures, etc.), but running BLAST then automatically uses 90% similarity. As a result, any sequence that doesn't match is thrown out later because it was not classified. My earlier issue of only being able to build the reference database for a functional gene manually plays a role here: I can add only a limited number of template sequences by hand, so many candidates do not get classified. The 90% similarity threshold would no longer be a problem if I could add hundreds of reference sequences automatically. I tried increasing and decreasing the E-value by several orders of magnitude and it made no apparent difference. The matches that do get classified have very low E-values, and hardly any matches have high E-values, so the 90% similarity threshold seems to be the limiting filter here.
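To show what I mean by the similarity threshold being the limiting filter, here is a minimal Python sketch. It assumes a hit is kept only if it passes both an E-value cutoff and a minimum percent identity; I have not checked that this is exactly how QIIME filters internally. The hit values and the function name `classified` are made up for illustration.

```python
# Hypothetical BLAST hits for candidate OTUs: (otu_id, percent_identity, e_value).
# The values are invented for illustration only.
hits = [
    ("OTU_1", 98.2, 1e-80),
    ("OTU_2", 91.0, 1e-65),
    ("OTU_3", 84.5, 1e-50),
    ("OTU_4", 78.0, 1e-40),
]


def classified(hits, max_e_value, min_percent_id=90.0):
    """Return the OTU ids whose hit passes both filters (assumed behaviour)."""
    return [
        otu
        for otu, pct_id, e_val in hits
        if e_val <= max_e_value and pct_id >= min_percent_id
    ]


# Changing the E-value cutoff by many orders of magnitude does not change
# the result, because every hit already has a very small E-value.
for cutoff in (1e-3, 1e-10, 1e-30):
    print(cutoff, classified(hits, cutoff))

# Only lowering the identity threshold lets the more divergent OTUs through.
print(classified(hits, 1e-3, min_percent_id=75.0))
```

With hits like these, the same two OTUs pass at every E-value cutoff, and the divergent ones only appear once the identity threshold drops.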
Solution that works for me: I used the same files that I originally used for the blast option, but called rdp instead:
assign_taxonomy.py -i [aligned representative candidate sequences] -r [local manual reference template sequences] -t [ID to taxonomy file] -c 0.7 -o rdp_taxonomy_aligned/
Notes: the brackets only describe the file that goes in each position. I did not pass -m blast or -m rdp, since rdp is the default. In the ID-to-taxonomy file, since I'm dealing with archaeal amoA at the moment, I added some fake placeholder levels to make sure there are 6 levels of taxonomy, and this works fine, classifying down to the lowest level that I have (species).
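In case it helps, the placeholder padding can be scripted rather than typed by hand. This is only a sketch: it assumes the ID-to-taxonomy file is tab-delimited with semicolon-separated levels, and it puts the placeholders just before the last level so the species stays the lowest rank. The reference ids, lineages, placeholder names, and function names are all made up for the example.

```python
# Pad each reference lineage to a fixed depth so every entry has 6 levels.
# The sequence ids, lineages, and placeholder names below are hypothetical.
N_LEVELS = 6


def pad_lineage(levels, n_levels=N_LEVELS):
    """Insert placeholder levels before the final level, keeping it last."""
    if len(levels) > n_levels:
        raise ValueError("lineage has more than %d levels: %r" % (n_levels, levels))
    missing = n_levels - len(levels)
    placeholders = ["placeholder_%d" % i for i in range(1, missing + 1)]
    return levels[:-1] + placeholders + levels[-1:]


def id_to_taxonomy_line(seq_id, levels):
    """Format one line: sequence id, a tab, then semicolon-separated levels."""
    return "%s\t%s" % (seq_id, ";".join(pad_lineage(levels)))


refs = {
    "ref_001": ["Archaea", "Thaumarchaeota", "Nitrosopumilus",
                "Nitrosopumilus_maritimus"],
    "ref_002": ["Archaea", "Thaumarchaeota", "Nitrososphaera",
                "Nitrososphaera_viennensis"],
}

for seq_id, levels in sorted(refs.items()):
    print(id_to_taxonomy_line(seq_id, levels))
```

The same function could be run over a longer list of references once there is a way to pull them in automatically.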
Now all 200+ OTUs from my unknown sequences are used. More than half of them are classified to the last taxonomic level; the rest are not, because I don't have a full reference template of all known sequences.
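A quick way to get that kind of tally is to count how deep each assignment goes. The sketch below assumes the assignments file has one tab-separated line per OTU (OTU id, lineage, confidence); the three example lines and the function name `depth_counts` are invented, so adjust it to whatever your output actually looks like.

```python
# Made-up assignment lines: OTU id, lineage, confidence (tab-separated).
lines = [
    "OTU_1\tArchaea;Thaumarchaeota;p1;p2;Nitrosopumilus;Nitrosopumilus_maritimus\t0.95",
    "OTU_2\tArchaea;Thaumarchaeota;p1;p2;Nitrososphaera\t0.81",
    "OTU_3\tArchaea;Thaumarchaeota\t0.74",
]


def depth_counts(lines, n_levels=6):
    """Count OTUs classified to the last level versus those that stop earlier."""
    full, partial = 0, 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Ignore empty levels so a trailing semicolon does not inflate the depth.
        levels = [lvl for lvl in fields[1].split(";") if lvl.strip()]
        if len(levels) >= n_levels:
            full += 1
        else:
            partial += 1
    return full, partial


full, partial = depth_counts(lines)
print("classified to last level:", full)
print("classified to a higher level only:", partial)
```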
If anyone notices anything that seems phylogenetically inappropriate to do, please point it out.
Thanks
Yev