fundamental difference in open and closed reference otu picking for taxonomic assignment


Rittik Deb

Jun 6, 2016, 5:08:25 AM

to Qiime 1 Forum

Hi all,

First of all, I would like to apologize for a naive question, but I could not find a satisfactory answer to my problem on the web, hence this query.

So,

When we do open-reference OTU picking in QIIME, it first does a closed-reference OTU picking step (using the reference database). Then, in step two, it clusters the reads that failed to match, takes the cluster centroids as a new reference, and uses a matching algorithm to pick new OTUs against those centroids. This step is repeated. Finally, it concatenates the results of step 1 and the follow-up steps into a single file, which can then be used for taxonomy assignment.

Now my problem is this: if these sequences are not in the database, and hence their taxonomic ID is unknown, how do the new OTUs generated in step 2 and the follow-up steps get assigned a taxonomy? If the logic is that taxonomic assignment uses a higher matching threshold for step 1 and a reduced one from step 2 onwards, then how is that achieved? My other query: doesn't closed-reference OTU picking with a lower threshold give a similar outcome? (Theoretically it should at least do the same, to the best of my understanding, but in practice it does not; I have tried a few options to lower the threshold.) I think the results should be the same because closed-reference OTU picking may accept a lower threshold, but the taxonomic assignment is done using only the top 3 hits (which always have high identity unless the read/cluster is a novel sequence that does not match the database well).

So are these two steps (open and closed ref) done separately to reduce computation time (there I understand why open ref will be faster), or for cases where I am not interested in taxonomic assignment but rather in OTU maps and trees?

Please elucidate and pardon my naivety

Regards

Rittik

Colin Brislawn

Jun 6, 2016, 4:58:28 PM

to Qiime 1 Forum

Great questions Rittik! I'll respond in line.

On Monday, June 6, 2016 at 2:08:25 AM UTC-7, Rittik Deb wrote:

Hi all,

First of all, I would like to apologize for a naive question, but I could not find a satisfactory answer to my problem on the web, hence this query.

This is actually a very good question, because it highlights the differences between OTU picking and taxonomy assignment.

So,

When we do open-reference OTU picking in QIIME, it first does a closed-reference OTU picking step (using the reference database). Then, in step two, it clusters the reads that failed to match, takes the cluster centroids as a new reference, and uses a matching algorithm to pick new OTUs against those centroids. This step is repeated. Finally, it concatenates the results of step 1 and the follow-up steps into a single file, which can then be used for taxonomy assignment.

Correct! Very well said.
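To make the two stages concrete, here is a minimal toy sketch in Python. It is not QIIME's implementation: the `identity` metric (fraction of matching positions, no alignment), the function names, and the `New.N` OTU labels are all assumptions made for illustration, and the real workflow uses uclust and has additional steps.

```python
def identity(a, b):
    """Toy similarity: fraction of matching positions (no alignment)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def closed_ref(reads, refs, threshold):
    """Assign each read to its best reference at or above the threshold.

    Returns (otu_map, failures), where failures are the read IDs that
    matched no reference well enough.
    """
    otu_map, failures = {}, []
    for read_id, seq in reads.items():
        best_id, best_score = None, 0.0
        for ref_id, ref_seq in refs.items():
            score = identity(seq, ref_seq)
            if score > best_score:
                best_id, best_score = ref_id, score
        if best_id is not None and best_score >= threshold:
            otu_map.setdefault(best_id, []).append(read_id)
        else:
            failures.append(read_id)
    return otu_map, failures


def open_ref(reads, refs, threshold=0.97):
    """Closed-reference pass first, then greedy de novo clustering of failures."""
    otu_map, failures = closed_ref(reads, refs, threshold)
    centroids = {}  # new OTU ID -> centroid sequence (first read of the cluster)
    for read_id in failures:
        seq = reads[read_id]
        for cid, cseq in centroids.items():
            if identity(seq, cseq) >= threshold:
                otu_map[cid].append(read_id)
                break
        else:
            cid = f"New.{len(centroids)}"
            centroids[cid] = seq
            otu_map[cid] = [read_id]
    return otu_map, centroids


# Hypothetical 10-base "reads" so that one mismatch equals 90% identity.
refs = {"ref1": "AAAAAAAAAA"}
reads = {
    "r1": "AAAAAAAAAA",
    "r2": "AAAAAAAAAT",
    "r3": "CCCCCCCCCC",
    "r4": "CCCCCCCCCG",
}
otu_map, centroids = open_ref(reads, refs, threshold=0.9)
```

In this toy data, `r1` and `r2` land in the reference OTU, while `r3` and `r4` fail the closed-reference pass and form one new OTU whose centroid has no taxonomy attached yet. That missing label is exactly what the separate taxonomy assignment step has to supply.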

Now my problem is this: if these sequences are not in the database, and hence their taxonomic ID is unknown, how do the new OTUs generated in step 2 and the follow-up steps get assigned a taxonomy? If the logic is that taxonomic assignment uses a higher matching threshold for step 1 and a reduced one from step 2 onwards, then how is that achieved?

You are correct that OTU picking and taxonomy assignment use different minimum thresholds by default: 97% for OTU picking and 90% for taxonomy assignment. Reads that form new clusters in step 2 reached that step because they were not at least 97% similar to any OTU in step 1. However, these new OTUs are often more than 90% similar to something in the database, so taxonomy assignment will still work.
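As a small illustration of how the two cutoffs interact, here is a hedged sketch. The function name `read_fate` and the returned labels are made up for this example; only the 97% and 90% defaults come from the explanation above.

```python
def read_fate(best_ref_identity, otu_threshold=0.97, taxonomy_threshold=0.90):
    """Describe what happens to a sequence given its best identity to the database.

    Simplification: treats OTU picking and taxonomy assignment as if they
    used the same identity measure, which is only approximately true.
    """
    if best_ref_identity >= otu_threshold:
        return "reference OTU"
    if best_ref_identity >= taxonomy_threshold:
        return "new OTU, taxonomy assignable"
    return "new OTU, likely unassigned"


examples = {identity: read_fate(identity) for identity in (0.98, 0.93, 0.85)}
```

A sequence at 93% identity is too divergent to join a reference OTU, but it is still close enough for the taxonomy assigner to find hits above 90%.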

My other query: doesn't closed-reference OTU picking with a lower threshold give a similar outcome? (Theoretically it should at least do the same, to the best of my understanding, but in practice it does not; I have tried a few options to lower the threshold.) I think the results should be the same because closed-reference OTU picking may accept a lower threshold, but the taxonomic assignment is done using only the top 3 hits (which always have high identity unless the read/cluster is a novel sequence that does not match the database well).

While the final taxonomy percentages of open-ref and closed-ref at 90% could be similar, in most cases the composition of OTUs changes, for a few different reasons. If you compare the results of closed-ref at 90% to open-ref at 97%, closed-ref at 90% would...

- have fewer total OTUs. (The uclust algorithm would place each read into the first OTU it found above 90%, instead of looking for closer matches at higher thresholds.)

- have fewer de novo OTUs (because the minimum cutoff of 90% is lower than 97%).

- have unrealistically specific taxonomy assignments. Reads matching in step 1 (closed-ref picking) are given the taxonomy of their OTUs in the database, while de novo OTUs are run through the taxonomy assigner. Taxonomy assignment, which keeps only the levels on which the three best hits above 90% agree, is much more realistic. If you run closed-ref at 90%, you could have a read that is 92% similar to an OTU centroid and still claim that it is known to the species level. If you ran this read through the taxonomy assigner, you would find multiple hits at 92% and 91% to different species, and it would give a much more realistic classification at the family level, where all these hits agree.
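The "levels on which the best hits agree" idea can be sketched as follows. This is a simplified illustration, not the QIIME assigner itself: the function name, the hit format, and the example lineages are assumptions, and it requires unanimous agreement among the accepted hits, whereas the real tool has its own consensus settings.

```python
def consensus_taxonomy(hits, min_identity=0.90, max_hits=3):
    """Return the lineage prefix shared by the best database hits.

    hits: list of (identity, lineage) pairs, where lineage is a list of
    rank labels from kingdom downwards.
    """
    accepted = sorted(
        (hit for hit in hits if hit[0] >= min_identity),
        key=lambda hit: hit[0],
        reverse=True,
    )[:max_hits]
    if not accepted:
        return ["Unassigned"]
    consensus = []
    # Walk down the ranks and stop at the first level where the hits disagree.
    for names_at_level in zip(*(lineage for _, lineage in accepted)):
        if len(set(names_at_level)) != 1:
            break
        consensus.append(names_at_level[0])
    return consensus or ["Unassigned"]


# Hypothetical hits for a read that is about 92% similar to several species.
hits = [
    (0.92, ["k__Bacteria", "f__Lactobacillaceae", "g__Lactobacillus", "s__reuteri"]),
    (0.92, ["k__Bacteria", "f__Lactobacillaceae", "g__Lactobacillus", "s__casei"]),
    (0.91, ["k__Bacteria", "f__Lactobacillaceae", "g__Pediococcus", "s__acidilactici"]),
    (0.85, ["k__Bacteria", "f__Bacillaceae", "g__Bacillus", "s__subtilis"]),
]
family_level = consensus_taxonomy(hits)
```

Here the 85% hit is discarded, the three remaining hits disagree at genus level, and the read is classified only down to family. Closed-ref picking at 90% would instead have copied the full species label of whichever reference the read happened to match.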

So are these two steps (open and closed ref) done separately to reduce computation time (there I understand why open ref will be faster)

Step 1 of open-ref picking can be run in parallel, and the more computationally intensive step 2 then runs on a smaller data set.

or for cases where I am not interested in taxonomic assignment but rather in OTU maps and trees?

Please elucidate and pardon my naivety

Regards

Rittik

I hope this helps answer your questions. Let me know if you have any new questions!

Colin Brislawn

Rittik Deb

Jun 6, 2016, 11:47:13 PM

to Qiime 1 Forum

Hi Colin,

Thanks a lot for your detailed reply. This has cleared up most of my doubts. I will read a little more to understand this better and will get back to you if I have further questions.

With warm regards

Rittik
