uclust assign_taxonomy with 18s data

dpaim

unread,

Oct 4, 2017, 3:46:26 PM10/4/17

to Qiime 1 Forum

Hello,

I'm performing open-reference OTU picking with 18s data using the SILVA_128_QIIME_release database, and I'm very confused with the taxonomy assigned to some OTUs.

I used the following command:

pick_open_reference_otus.py -i seqs_chimeras_filtered.fna -m usearch61 -r /users/SILVA_128_QIIME_release/rep_set/rep_set_all/97/97_otus.fasta -o otus_chimera_filtered/ -p params.txt

and the params.txt file contains:

pick_otus:enable_rev_strand_match True

assign_taxonomy:id_to_taxonomy_fp /users/SILVA_128_QIIME_release/taxonomy/taxonomy_all/97/taxonomy_all_levels.txt

assign_taxonomy:reference_seqs_fp /users/SILVA_128_QIIME_release/rep_set/rep_set_all/97/97_otus.fasta

align_seqs:template_fp /users/SILVA_128_QIIME_release/rep_set_aligned/97/97_otus_aligned.fasta

filter_alignment:suppress_lane_mask_filter True

filter_alignment:entropy_threshold 0.10

So I'm using the default uclust to assign_taxonomy. I understood from other posts in this forum that for each representative sequence (found in rep_set.fna) "uclust consensus taxonomy assigner searches sequences against a reference database, identifies the top hits (default --uclust_max_accepts 3) that are above a user-defined similarity threshold (default --similarity 0.9), and then assigns the deepest level of taxonomy that is common to --min_consensus_fraction (default 2 out of 3 = 0.51)."

A good number OTUs are getting "Unassigned" and having a look at the .uc files I could not understand the reason why some of these OTUs were unassigned. Here it is one example, the OTU id AF387159.1.1690:

H 114794 129 99 + 0 0 1589I64MI36M29D KT20_1394883;size=1; AF387159.1.1690

H 114794 130 99 + 0 0 1589I101M29D KT20_1395696;size=1; AF387159.1.1690

H 114794 130 99 + 0 0 1589I101M29D KT20_1371526;size=1; AF387159.1.1690

H 114794 130 98 + 0 0 1589I101M29D KT20_1376898;size=1; AF387159.1.1690

(From final_otu_map_mc2.txt file I have: AF387159.1.1690 KT20_1394883 KT20_1395696 KT20_1371526 KT20_1376898). The AF387159.1.1690 is also an ID from the SILVA_128_QIIME_release/taxonomy/taxonomy_all/97 database and corresponds to D_0__Eukaryota;D_1__Archaeplastida;D_2__Chloroplastida;D_3__Chlorophyta;D_4__Chlorophyceae;D_5__Chlamydomonadales;D_6__Cylindrocapsa;D_7__Cylindrocapsa geminella;D_8__;D_9__;D_10__;D_11__;D_12__;D_13__;D_14__. I played a bit with --uclust_max_accepts --similarity and --min_consensus_fraction, but always getting the same OTUs unassigned.

When I blast the AF387159.1.1690 rep sequence using the whole Silva reference sequences (downloaded from https://www.arb-silva.de and formatted accordingly to use with local blast) the top 3 hits I get are:

subject_seq_id %ident length evalue bitscore

AF387159.1.1690_Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Chlamydomonadales;Cylindrocapsa;Cylindrocapsa 99.01 101 7.00E-44 180

GQ375104.1.2539_Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Westella;Westella 87.31 134 3.00E-33 145

And in fact when I add assign_taxonomy:assignment_method blast to my parameters file, I get this same OTU assigned to D_0__Eukaryota;D_1__Archaeplastida;D_2__Chloroplastida;D_3__Chlorophyta;D_4__Chlorophyceae;D_5__Chlamydomonadales;D_6__Cylindrocapsa;D_7__Cylindrocapsa (I know that with the blast method taxonomy is assigned based only on the top 1 blast hit and this may not be ideal).

So my actual question is: Should this sequence really be unassigned? If yes, could someone help me to understand why this OTU is getting unassigned? I have a feeling that I'm missing a very simple point here, but I could not figure out alone, so I appreciate any comment.

Thank you,

Daniela

Kyle Bittinger

unread,

Oct 5, 2017, 12:18:19 AM10/5/17

to Qiime 1 Forum

Daniela,

A couple of things seem weird in the log file output. When I went back and looked at a recent experiment of my own, I see lines like:

H 43902 313 94.9 + 0 0 20I190M2I123M1066I denovo1 er.7366.1_2423323 177468

That is, the query ID shows up in column 9 (denovo1 er.7366.1_2423323), and the reference ID shows up in column 10 (177468). Your columns seem reversed from what I have. Would you mind sending the assign_taxonomy.py command you used?

The other thing that seems off is the semicolon after the ID in column 9. I would expect something like "KT20_1394883" rather than "KT20_1394883;size=1;" -- would you check your files to see where this ";size=1;" business is in the input file? Apologies if I'm missing something here.

Best,

Kyle

dpaim

unread,

Oct 5, 2017, 10:31:25 AM10/5/17

to Qiime 1 Forum

Hi Kyle,

Thanks for your message. In fact my log file seems weird based on what you described, and I've checked some other results from other data and I always have the same pattern in the log files.

The assign_taxonomy.py command from the log file of pick_open_reference_otus.py is

# Assign taxonomy command

assign_taxonomy.py -o otus_usearch_chimera_filtered//uclust_assigned_taxonomy -i otus_usearch_chimera_filtered//rep_set.fna --reference_seqs_fp /users/SILVA_128_QIIME_release/rep_set/rep_set_all/97/97_otus.fasta --id_to_taxonomy_fp /users/SILVA_128_QIIME_release/taxonomy/taxonomy_all/97/taxonomy_all_levels.txt

I've checked the input files and the ";size=1;" is not there, it only appears in the log files. I am attaching a sample of my input file (i.e. rep_set.fna) and also the output of print_qiime_config.py -t.

All the best,

Daniela

qiime_config_output.txt

sample_rep_set.fna

Kyle Bittinger

unread,

Oct 5, 2017, 10:53:06 PM10/5/17

to Qiime 1 Forum

I'll try assigning these against the same reference set and let you know what I find -- might be over the weekend when I get back to you.

Best,

Kyle

dpaim

unread,

Oct 10, 2017, 3:31:27 PM10/10/17

to Qiime 1 Forum

Hi Kyle,

Sorry to disturb you again, but did you have the chance assign taxonomy to my sample data? I've tried also to use rdp to assign taxonomy to the sequences and even though the log files still look like:

H 65347 128 98.4 + 0 0 1637I66MI59M3D KT18_1250754;size=23731; AB081640.1.1763

N * * * . * * * KT6_433270;size=23230; *

H 14287 129 98.4 + 0 0 1644I129M19I KT1_69219;size=23151; GU067954.1.1798

I got better results, meaning few OTUs were unclassified and most of those unassigned OTUs now have at least a domain classification. However I still find cases where an OTU is classified only at the domain level but when blasting the representative sequence against the same Silva_db all the top hits agreed at the family level.