High percent of "unassigned" taxa from Bacterial data (16S)

Rachel C.

Sep 1, 2016, 9:37:59 AM

to Qiime 1 Forum

Hi all,

After running the pick_open_reference_otus.py script with the Greengenes database on some 16S MiSeq data, I looked at the taxa summaries and found a very high percentage of unassigned taxa in most samples: 50% to 90%, with most around 70%. I have not worked with bacterial data before, but my understanding is that this is unusually high.

I do not think it is due to the samples themselves or the MiSeq run because ITS primers were used on the same samples and the fungal data looks pretty good, with low numbers of unassigned reads. Additionally, this sample is not from soil or ocean (which I know are more likely to have unassigned taxa with this database); it is from a controlled processing experiment.

Does anyone have any insight as to what the issue might be, or if this is actually a typical amount of unassigned reads for 16S?

Thanks

Jose Antonio Navas Molina

Sep 1, 2016, 10:17:28 AM

Hi Rachel,

Your understanding is correct. However, if your samples come from a poorly characterized environment, this would not be surprising. What environment are they from?

Also, I would suggest grabbing some of the representative sequences of those unassigned taxa and running BLAST on them to see what they actually are. This will help identify whether the reads contain non-16S data.
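One way to pull out those representative sequences is sketched below. This is a hypothetical helper, not part of QIIME. It assumes a tab-separated taxonomy assignment file whose first two columns are the OTU ID and the taxonomy string, that OTUs without a taxonomy are labelled "Unassigned", and that each FASTA header in the representative set starts with the OTU ID. Check these assumptions against your own files before relying on it.

```python
def read_fasta(lines):
    """Yield (seq_id, sequence) pairs from FASTA-formatted lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Keep only the first token of the header as the ID.
            seq_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def unassigned_ids(tax_lines, label="Unassigned"):
    """Return the set of OTU IDs whose taxonomy column equals `label`."""
    ids = set()
    for line in tax_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1].strip() == label:
            ids.add(fields[0])
    return ids


def select_unassigned(fasta_lines, tax_lines, limit=20):
    """Return up to `limit` (id, sequence) pairs for unassigned OTUs."""
    wanted = unassigned_ids(tax_lines)
    picked = []
    for seq_id, seq in read_fasta(fasta_lines):
        if len(picked) >= limit:
            break
        if seq_id in wanted:
            picked.append((seq_id, seq))
    return picked


# Small in-memory example (made-up data, for illustration only).
example_fasta = [">otu1 s1_1", "ACGT", "ACGT", ">otu2 s1_2", "GGGG"]
example_tax = ["otu1\tUnassigned\t1.00\t1", "otu2\tk__Bacteria\t0.67\t3"]
for seq_id, seq in select_unassigned(example_fasta, example_tax):
    print(">%s\n%s" % (seq_id, seq))
```

Both functions accept any iterable of lines, so open file handles for your own rep_set.fna and taxonomy assignment file can be passed in directly; the printed FASTA records can then be pasted into a BLAST search.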

As another alternative, you can try the Silva database as a reference. I don't expect it to recover more sequences, but it is worth a shot. I would recommend running only closed-reference OTU picking, as it will be faster and will give you a sense of whether more of those sequences are recovered.

Thanks,

Rachel C.

Sep 1, 2016, 11:32:06 AM

Thanks Jose,

I ran BLAST on a handful of random sequences from my rep_set.fna file, and most of them came up with high % identity matches to "Uncultured bacteria clone" and in some cases "uncultured streptomyces/halomonas/etc/ clone". I'm assuming that these sequences would not be in the GG database? So far, none of the sequences has lacked a good BLAST hit or matched a non-bacterial species, so I do think the data is almost all 16S.

Oddly enough, I ran the OTU picking at the 94% level instead of 97% and the results were worse - over 90% unassigned for several samples.

I will try with a different database to see if it improves it at all.

Thanks,

Jose Antonio Navas Molina

Sep 2, 2016, 11:44:02 AM

Hi Rachel,

It is possible that those sequences are not in Greengenes. You can also try to change the assignment method in Qiime to use blast rather than uclust to see how that changes the results. If that improves the result, that could indicate that there is non-16S data in the reads, given that your blast matches are not 100% matches.
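To compare two assignment runs side by side, one option is to compute the read-weighted share of unassigned OTUs per sample for each run. The sketch below is a hypothetical helper, not part of QIIME. It assumes the OTU counts and taxonomy strings have already been loaded into plain dictionaries, and that unplaced OTUs carry the label "Unassigned" or "No blast hit"; those labels are assumptions, so check what your own assignment files contain.

```python
def percent_unassigned(otu_counts, taxonomy,
                       labels=("Unassigned", "No blast hit")):
    """Return {sample_id: percent of reads in OTUs lacking a taxonomy}.

    otu_counts: {otu_id: {sample_id: read_count}}
    taxonomy:   {otu_id: taxonomy_string}
    OTUs missing from `taxonomy` are counted as unassigned.
    """
    totals, missing = {}, {}
    for otu_id, per_sample in otu_counts.items():
        is_unassigned = taxonomy.get(otu_id, labels[0]) in labels
        for sample, count in per_sample.items():
            totals[sample] = totals.get(sample, 0) + count
            if is_unassigned:
                missing[sample] = missing.get(sample, 0) + count
    return {s: 100.0 * missing.get(s, 0) / totals[s]
            for s in totals if totals[s] > 0}


# Made-up counts and taxonomies, for illustration only.
counts = {"otu1": {"A": 70, "B": 10}, "otu2": {"A": 30, "B": 90}}
run_one = {"otu1": "Unassigned", "otu2": "k__Bacteria"}
run_two = {"otu1": "k__Bacteria; p__Actinobacteria", "otu2": "k__Bacteria"}
print(percent_unassigned(counts, run_one))
print(percent_unassigned(counts, run_two))
```

A large drop in the per-sample percentages between the two runs would suggest that the assignment method, rather than the reads themselves, accounts for much of the unassigned fraction.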

(Thanks Tony for the tips!)

Cheers,

Colin Brislawn

Sep 2, 2016, 1:31:17 PM

Hello Rachel,

I like Jose's suggestion to try different taxonomy assignment programs like blast. If many of your reads are not in the database, you could also consider trying another database, say Silva. https://www.arb-silva.de/download/archive/qiime/

I also wanted to comment on something you mentioned:

"Oddly enough, I ran the OTU picking at the 94% level instead of 97% and the results were worse - over 90% unassigned for several samples."

Keep in mind that OTU picking and taxonomy assignment are different steps. Lowering the similarity threshold for OTU picking means that each OTU has a larger radius around its centroid, so you will have fewer OTUs with more reads in each. This should not dramatically change taxonomy assignment, because these new OTUs won't be in the database.
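The effect of the threshold on OTU counts can be illustrated with a toy example. The sketch below is not the uclust algorithm used by QIIME; it is a simplified greedy centroid clustering that measures identity as the fraction of matching positions between equal-length sequences, and the reads are made up for illustration.

```python
SWAP = {"A": "C", "C": "A", "G": "T", "T": "G"}


def mutate(seq, positions):
    """Return `seq` with the base at each given position substituted."""
    chars = list(seq)
    for i in positions:
        chars[i] = SWAP[chars[i]]
    return "".join(chars)


def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / float(len(a))


def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first centroid it matches at >= threshold.

    A sequence that matches no existing centroid becomes a new centroid.
    Returns a list of (centroid, members) tuples.
    """
    clusters = []
    for seq in seqs:
        for centroid, members in clusters:
            if identity(seq, centroid) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters


# Four made-up 100-base reads: the base sequence, plus variants that are
# 98%, 95%, and 90% identical to it.
base = "ACGT" * 25
reads = [
    base,
    mutate(base, range(0, 2)),
    mutate(base, range(10, 15)),
    mutate(base, range(10, 20)),
]

for threshold in (0.97, 0.94):
    sizes = [len(members) for _, members in greedy_cluster(reads, threshold)]
    print(threshold, sizes)
# 0.97 [2, 1, 1]
# 0.94 [3, 1]
```

At 97% the four reads form three OTUs; at 94% the 95%-identical read is absorbed into the first OTU, leaving two. In both cases the most divergent read still forms its own OTU, and whether it receives a taxonomy depends on the reference database rather than on the clustering threshold.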