Hello,
I'm performing open-reference OTU picking with 18s data using the SILVA_128_QIIME_release database, and I'm very confused with the taxonomy assigned to some OTUs.
I used the following command:
pick_open_reference_otus.py -i seqs_chimeras_filtered.fna -m usearch61 -r /users/SILVA_128_QIIME_release/rep_set/rep_set_all/97/97_otus.fasta -o otus_chimera_filtered/ -p params.txt
and the params.txt file contains:
pick_otus:enable_rev_strand_match True
assign_taxonomy:id_to_taxonomy_fp /users/SILVA_128_QIIME_release/taxonomy/taxonomy_all/97/taxonomy_all_levels.txt
assign_taxonomy:reference_seqs_fp /users/SILVA_128_QIIME_release/rep_set/rep_set_all/97/97_otus.fasta
align_seqs:template_fp /users/SILVA_128_QIIME_release/rep_set_aligned/97/97_otus_aligned.fasta
filter_alignment:suppress_lane_mask_filter True
filter_alignment:entropy_threshold 0.10
So I'm using the default uclust to assign_taxonomy. I understood from other posts in this forum that for each representative sequence (found in rep_set.fna) "uclust consensus taxonomy assigner searches sequences against a reference database, identifies the top hits (default --uclust_max_accepts 3) that are above a user-defined similarity threshold (default --similarity 0.9), and then assigns the deepest level of taxonomy that is common to --min_consensus_fraction (default 2 out of 3 = 0.51)."
A good number OTUs are getting "Unassigned" and having a look at the .uc files I could not understand the reason why some of these OTUs were unassigned. Here it is one example, the OTU id AF387159.1.1690:
H 114794 129 99 + 0 0 1589I64MI36M29D KT20_1394883;size=1; AF387159.1.1690
H 114794 130 99 + 0 0 1589I101M29D KT20_1395696;size=1; AF387159.1.1690
H 114794 130 99 + 0 0 1589I101M29D KT20_1371526;size=1; AF387159.1.1690
H 114794 130 98 + 0 0 1589I101M29D KT20_1376898;size=1; AF387159.1.1690
(From final_otu_map_mc2.txt file I have: AF387159.1.1690 KT20_1394883 KT20_1395696 KT20_1371526 KT20_1376898). The AF387159.1.1690 is also an ID from the SILVA_128_QIIME_release/taxonomy/taxonomy_all/97 database and corresponds to D_0__Eukaryota;D_1__Archaeplastida;D_2__Chloroplastida;D_3__Chlorophyta;D_4__Chlorophyceae;D_5__Chlamydomonadales;D_6__Cylindrocapsa;D_7__Cylindrocapsa geminella;D_8__;D_9__;D_10__;D_11__;D_12__;D_13__;D_14__. I played a bit with --uclust_max_accepts --similarity and --min_consensus_fraction, but always getting the same OTUs unassigned.
When I blast the AF387159.1.1690 rep sequence using the whole Silva reference sequences (downloaded from
https://www.arb-silva.de and formatted accordingly to use with local blast) the top 3 hits I get are:
subject_seq_id %ident length evalue bitscore
AF387159.1.1690_Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Chlamydomonadales;Cylindrocapsa;Cylindrocapsa 99.01 101 7.00E-44 180
AF387159.1.1690_Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Chlamydomonadales;Cylindrocapsa;Cylindrocapsa 99.01 101 7.00E-44 180
GQ375104.1.2539_Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Westella;Westella 87.31 134 3.00E-33 145
And in fact when I add assign_taxonomy:assignment_method blast to my parameters file, I get this same OTU assigned to D_0__Eukaryota;D_1__Archaeplastida;D_2__Chloroplastida;D_3__Chlorophyta;D_4__Chlorophyceae;D_5__Chlamydomonadales;D_6__Cylindrocapsa;D_7__Cylindrocapsa (I know that with the blast method taxonomy is assigned based only on the top 1 blast hit and this may not be ideal).
So my actual question is: Should this sequence really be unassigned? If yes, could someone help me to understand why this OTU is getting unassigned? I have a feeling that I'm missing a very simple point here, but I could not figure out alone, so I appreciate any comment.
Thank you,
Daniela