Reference OTUs from biom table

381 views
Skip to first unread message

Vanessa V.

unread,
Jun 2, 2016, 1:53:02 PM6/2/16
to Qiime 1 Forum
Hi all, 

I have a few questions about OTUs and classification.  I converted (to txt file) and downloaded the biom otu table from OTU open reference picking step and wanted to analyze the data separate from the summarize taxa step.  

1) I see that before the OTU number, some of the OTUs say New.referenceOTU and others New.CleanUp.ReferenceOTU. What is the difference?

2) Also, one of the prevalent taxa in my data set is Lactobacillus reuteri (which make sense based on our experiment), however there are over 1500 OTUs classified to L. reuteri with 12 abundant OTUs with the majority of the sequences.  That to me seems unusual - especially to have 1500 different OTUs to L. reuteri.  Does anyone have any thoughts on this?  Is there something I'm missing here?

3) Is there a way to retrieve the reference sequences from the OTUs so I could do a manual analysis?


I used green genes as the default database for the otu picking and classifying.  Below is the script I used.  I did check for and filter out chimeras prior to running the pick open reference script.

pick_open_reference_otus.py -i  /gpfs0/home/resgoodman/vav002/qiime_NEC/seqs_chimeras_filtered_R1_q19.fna -o /gpfs0/home/resgoodman/vav002/qiime_NEC/pick_OTU_R1_q19 -f -a -O 40

abir...@gmail.com

unread,
Jun 2, 2016, 3:32:21 PM6/2/16
to Qiime 1 Forum
Hi, Vanessa,
A couple of answers for you:
1) Please take a look at the "Discussion of subsampled open-reference OTU picking in QIIME" page at http://qiime.org/tutorials/open_reference_illumina_processing.html , which explains new reference OTUs and clean-up reference OTUs.  Basically, "new reference OTUs" are the output of de novo clustering of subsampled failures, and "clean-up OTUs" are de novo OTUs on all reads that failed to hit the reference collection after it was expanded to include the new reference OTUs.  
3) In your OTU-picking output folder, there should be a file called "new_refseqs.fna", which contains the OTU ids and their reference sequences for all OTUs observed in this dataset (including the new ones that were added).

I'll let someone more knowledgeable about the taxa assignments weigh in on #2 :)
Thanks,
Amanda

Vanessa V.

unread,
Jun 2, 2016, 3:54:42 PM6/2/16
to Qiime 1 Forum
Thank for your answers to 1) and 3).  Yes, if anyone has any thoughts with my question for 2) that would be great as I'm quite confused by the amount of OTUs for one species.

Two more questions to follow up with you:  

1a) If the read does not initially hit to the reference data base and so is clustered de novo, how is the de novo OTU clustered able to be assigned to via the reference db in the end (taxonomically)?  

1b) Given the explanation on new reference and clean up OTUs, I'm surprised with my OTU table.  Every OTU is either new reference or clean up.  None are just reference OTUs.  So does that mean all of my reads failed to align and all OTUs had to be clustered de novo?  Is that typical?

Thanks, 
Vanessa

Colin Brislawn

unread,
Jun 2, 2016, 7:41:04 PM6/2/16
to Qiime 1 Forum
Hello Vanessa,

I'll take a shot at 1a)
1a) If the read does not initially hit to the reference data base and so is clustered de novo, how is the de novo OTU clustered able to be assigned to via the reference db in the end (taxonomically)?  

During clustering, a read must be above 97% to a single item in the database. However, taxonomic assignment is different; a reasonable taxonomic assignment is inferred from the the top three hits which are 90% or up. Because the cutoffs are different, a read may not be able to find a hit during clustering, but could find several hits for taxonomic assignment. 

I hope that helps!
(I'll let a qiime dev comment on the steps in the open-ref picking workflow.)

Colin 

Vanessa V.

unread,
Jun 3, 2016, 7:39:06 AM6/3/16
to Qiime 1 Forum
Hi Colin, 

Thanks.  However, I'm still unclear as to why I have New.Reference and New.CleanUp.Refernece OTUs being assigned down to a genus and species level (e.g. Lactobacillus reuteri)? Because the reads in these New and CleanUp OTUs presumably did not match anything in the reference db at 97% in the first round of reference OTU picking and had to be clustered de novo, so then how were these OTUs then taxonomically assigned to a species at 97% sequence identity?  Wouldn't these OTUs only be able to be assigned to a genus (or family or higher) level at best?  

Sorry if I'm missing something here; just really confused and want to make sure I'm understanding this accurately!

Thanks, 
Vanessa

Vanessa V.

unread,
Jun 3, 2016, 8:33:57 AM6/3/16
to Qiime 1 Forum
Hi all, 

Sorry for the confusion but I realized part of my issue is due to the fact I used the wrong database for the reference step in the open reference otu picking step.  When I originally ran the script I had the -r as the SILVA db (and it was in the wrong QIIME incompatible format too) thinking I was picking OTUs and assigning taxonomy to SILVA.  However, I think what happened was that it was unable to cluster any of the reads at 97% to the ref db because the db was wrong and incompatible.  So it had all of the otus as new or clean up and then it assigned all of those via greengenes.  If that makes sense!

I re-ran it correctly with using greengenes as the default reference (I didn't specify a -r) and assigning taxonomy and the OTU table is more reasonable in terms of the OTUs.  Some have OTU numbers (e.g. 433676) and a portion are New or Clean up.  

However, I still see the same large number of OTUs (thousands) to Lactobacillus reuteri.  The majority of the sequences fall into 12 OTUs that do have OTU numbers (they are not New or Clean up).  In terms of this, I'm still not sure I understand how I can get 12 different OTUs assigning to the same species.  Is there something else I have wrong here?

Thanks your your help in advance, 
Vanessa

Vanessa V.

unread,
Jun 3, 2016, 12:35:35 PM6/3/16
to Qiime 1 Forum
Would it make sense to see this effect if my sequences are different lengths?  I checked two of the L. reuteri OTU reference sequences and each are identical in sequence except one is 262 bp and the other is 223 bp.  

Daniel McDonald

unread,
Jun 3, 2016, 5:08:39 PM6/3/16
to Qiime 1 Forum
Hi Vanessa,

It isn't unusual for a species to map to multiple OTUs. For instance, L. reuteri maps to 19 OTUs at the 97% level in Greengenes 13_8:

$ grep -c "g__Lactobacillus; s__reuteri" gg_13_8_otus/taxonomy/97_otu_taxonomy.txt
19

There are a few reasons for this. First, it is important to remember that 97% is not equivalent to any specific taxonomic rank. Second, taxonomy is hazy particularly with older lineages (as in when they were first noted by humans) as a reasonable portion of the annotations were originally done based off of morphological or biochemical characteristics as opposed to molecular sequence. Third, 16S reads to do not reliably differentiate at increased levels of resolution, so you want to be careful particularly below genus. 

Best,
Daniel

Vanessa V.

unread,
Jun 3, 2016, 7:11:23 PM6/3/16
to Qiime 1 Forum
Hi Dan, 

Thanks for your explanation.  However, does it make sense that I would have 1500 OTUs all for Lactobacillus reuteri?  I have ~12 major OTUs (~98% of the L. reuteri sequences), 15 smaller OTUs (1% or so), and then the rest are thousands with just a few sequences each (none are singletons). This seems bizarre to me?!  Based on our experimental design, we should only have 1 strain of Lactobacillus present.  I can see there might be a few others at very low abundance, but looking at my OTU table, it still doesn't make sense to me.  I went back and rechecked my chimera filtering (I tested both union and intersection methods) and I get the same answer (there is less, but still over 1000).  

One thing to note is that I analyzed two of the OTU reference sequences (from rep_set.fna) and they were the identical sequence, just with different lengths (both hit to L. reuteri in NCBI).  What I'm concerned with is whether or not all of these OTUs are real?  If they're are an artifact, could these be artifactually increasing or influencing my diversity analyses (either alpha or beta)?  Do I need to make sure all of the my sequences are the same length - would that correct the problem?  I am only analyzing my reads from one end (R1) of the paired end sequencing (the coverage was not good enough to join them; I'm still trying to work on that).  

Thanks for your help!!
Vanessa

Daniel McDonald

unread,
Jun 6, 2016, 3:48:57 PM6/6/16
to Qiime 1 Forum
Hi Vanessa,

Please keep in mind that an OTU isn't real. Some of the expansion of OTUs that you're observing may fall down to sequencing error. Others are going to be attributed to sequence lengths as you note. The impact on alpha and beta diversity will (likely) be small for phylogenetic measures (e.g., PD and UniFrac). As you note, you might be able to reduce the number of OTUs you have by filtering for sequence length prior to OTU picking, however, it may very well be the case that the conclusions you draw from alpha or beta diversity remain unchanged. 

Two things to note with joining reads, first is that for V4, R1 is generally more informative. And Second, R2 tends to be lower quality, so while you can increase your sequence length you may be introducing additional error. 

Best,
Daniel
Message has been deleted

Vanessa V.

unread,
Jun 7, 2016, 3:06:30 PM6/7/16
to Qiime 1 Forum
Hi Daniel (and Colin), 

Thanks. I appreciate it.  I'm using the V1-V3 region for my project.  So with using the R1 reads, I'm mostly looking at the V1-2.  I'm still trying to wrap my head around how so many OTUs are being classified to L . reuteri.  So sequence length can affect whether two identical sequences (except for length) falls into the same or different OTU - is that just how uclust works, is there a way to check distribution of sequence length within OTUs?  Can that affect taxonomy assignment as well?  Is there a way/method to filter sequence length prior to OTU picking (in QIIME or other program)?  

Also, I have a question about how taxonomy is assigned.  Previous in this post, Colin commented on the taxonomy stating: 
"During clustering, a read must be above 97% to a single item in the database. However, taxonomic assignment is different; a reasonable taxonomic assignment is inferred from the the top three hits which are 90% or up. Because the cutoffs are different, a read may not be able to find a hit during clustering, but could find several hits for taxonomic assignment."  I'm not sure that I understand that.  What does "the top three hits which are 90% are up" mean?  Sorry I'm not understanding this.  

Thanks very much for your help!!!

Vanessa

Colin Brislawn

unread,
Jun 7, 2016, 5:35:08 PM6/7/16
to qiime...@googlegroups.com
Hello Vanessa,

So sequence length can affect whether two identical sequences (except for length) falls into the same or different OTU - is that just how uclust works,
Yes. This boils down to a difference between local alignment and global alignment. Imagined you aligned this read to a referance:
read:          AAGTCGCTGATAGCTAGTCG
ref:  TAGCCCTAGAAGTCGCTGATAGCTAGTCGCGTATTAGCTAGACTA 
A global alignment will see the gaps on both sides of the read and say that this alignment sucks, while a local alignment will see the 100% match in the area of overlap and say that alignment is perfect! Depending on which alignment definition you use (local or global) you will score your alignment differently and choose the best OTU differently. 

 is there a way to check distribution of sequence length within OTUs?
Not within qiime. You could theoretically parse this out from the uclust / usearch / vsearch alignments if you wanted to. 
Another standard option is to trim your reads to a fixed length so that you avoid this problem entirely; when your reads are all the same lengths local alignment == global alignment. 



What does "the top three hits which are 90% are up" mean?  Sorry I'm not understanding this.  
My bad. I should probably give a little more background on how the uclust taxonomy assignment algorithm works.
During taxonomy assignment, an OTU centroid is searched against a database and the top three hits (over 90%) are recorded. Each OTU is given a taxonomy based on the multiple, good OTUs it matches to. The process of assigning a tax label is called LCA for lowest common ancestor. More info on LCA here:

Another qiime user gives a great example of how this works in qiime: https://groups.google.com/d/msg/qiime-forum/Kq_v9XOXEYk/kk9U2lygCwAJ

Does that help answer your question Vanessa?
Colin 

Vanessa V.

unread,
Jun 8, 2016, 8:55:25 AM6/8/16
to Qiime 1 Forum
Hi Colin, 

Thanks!  That makes sense regarding the length and OTUs.  Can you recommend a program or script for trimming reads to the same lengths?

Let me see if I understand this.  Basically, reads are first clustered into a 97% OTU if they match (at 97% sequence similarity) to the reference in the reference db and are 97% similar to the centroid of that OTU and the same length as the centroid in that OTU. Then the representative sequence (centroid) of each OTU is then assigned taxonomy based on searching the reference database for the top 3 hits (>90% sequence similarity).  The OTU is then assigned taxonomy based on these 3 top hits which match the centroid.  Based on the LCA, if these form a line in the taxonomy tree then it can assign it to a specific level (for eg. Lactobacillus reuteri) but if not, then just the most specific level it can (genus, family, order, etc.).  

So, if I manually analyze the centroid (i.e. the representative sequence from rep_set.fna) and see that it is 100% identical to Lactobacillus reuteri and not others, can I confidently say that the OTU is Lactobacillus reuteri.  Or least the centroid is very representative of all the reads in that OTU?

Thanks!  I really appreciate your help with this.

Vanessa

Colin Brislawn

unread,
Jun 8, 2016, 1:48:16 PM6/8/16
to Qiime 1 Forum
Hello Vanessa,


On Wednesday, June 8, 2016 at 5:55:25 AM UTC-7, Vanessa V. wrote:
Hi Colin, 

Thanks!  That makes sense regarding the length and OTUs.  Can you recommend a program or script for trimming reads to the same lengths?
Both vsearch and usearch offer read trimming commands. I've had a good experience using vsearch, which is fast and open source: https://github.com/torognes/vsearch  

Let me see if I understand this.  Basically, reads are first clustered into a 97% OTU if they match (at 97% sequence similarity) to the reference in the reference db and are 97% similar to the centroid of that OTU and the same length as the centroid in that OTU. Then the representative sequence (centroid) of each OTU is then assigned taxonomy based on searching the reference database for the top 3 hits (>90% sequence similarity).  The OTU is then assigned taxonomy based on these 3 top hits which match the centroid.  Based on the LCA, if these form a line in the taxonomy tree then it can assign it to a specific level (for eg. Lactobacillus reuteri) but if not, then just the most specific level it can (genus, family, order, etc.).
Well said!  

So, if I manually analyze the centroid (i.e. the representative sequence from rep_set.fna) and see that it is 100% identical to Lactobacillus reuteri and not others, can I confidently say that the OTU is Lactobacillus reuteri.  Or least the centroid is very representative of all the reads in that OTU?
Correct!
Once you do this, you may find that certain OTUs are a 100% match to different things in the database, which is why LCA uses the top three by default. This is a limitation related to the total length of your read; if you had full length 16S, you might be able to resolve subspecies, but V1-2 could be identical between similar organisms. I would encourage you to try this search for yourself and see what you get! 

Vanessa V.

unread,
Jun 8, 2016, 6:24:57 PM6/8/16
to Qiime 1 Forum
OK, thanks!!! This was incredibly helpful for me.

One last thing, I looked at the Vsearch site and I didn't see anything about trimming commands?  Can you give me more detail on that?  Or direct me to where I could find how to do that?  

Thanks a ton!
Vanessa

Colin Brislawn

unread,
Jun 8, 2016, 7:22:14 PM6/8/16
to Qiime 1 Forum
Hello Vanessa,

Oh yeah, the vsearch documentation is in the built PDF file and not on the website. 

You may already know this, but vsearch is a clean-room implementation of usearch. This means that some of the usearch commands will also work with vsearch, and you can refer to the usearch documentation. 

The command may be something like:
vsearch --fastq_filter reads.fastq --fastq_trunclen 250 --fastqout truncated_reads.fastq
In this example, you could replace 250 with a good minimum length for your reads. I'm not sure how long the V1-2 region is. 

Great work Vanessa. We have made some awesome progress in this thread. 
Colin 

Vanessa V.

unread,
Jun 9, 2016, 5:17:09 PM6/9/16
to Qiime 1 Forum
Awesome, thanks again!  I'm learning a lot!!!

Vanessa V.

unread,
Jun 10, 2016, 10:13:05 AM6/10/16
to Qiime 1 Forum
One more thing that I realized I should have asked - so perhaps I'm still not quite there on understanding this fully.  

How is it that I have "new clean up" OTUs being assigned to Genus and Species (L. reuteri) level?  The reads in these OTUs presumably did not hit anything in the reference database at 97% and so had to be clustered de novo.  So then how could they have been assigned to a species level once clustered during the assign taxonomy step?  I could see them being assigned to the Genus, Family or higher but I'm not sure I understand how they could not hit L. reuteri in the database during OTU picking (so had to be de novo) but then they could be assigned to L. reuteri during taxonomy assign?

Colin Brislawn

unread,
Jun 10, 2016, 10:55:14 AM6/10/16
to Qiime 1 Forum
Good morning Vanessa,

So then how could they have been assigned to a species level once clustered during the assign taxonomy step?
Great question! This will serve as a good case study highlighting the difference between OTU picking and taxonomy assignment. 

Just like you described, all the new clean up OTUs were picked de novo because they are >3% different from anything in greengenes. When taxonomy assignment happens, the top three hits over 90% are recorded. So maybe you get three hits like this:
hit   93.1%    k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae; g__Lactobacillus; s__reuteri
hit   90.9%    k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae; g__Lactobacillus; s__reuteri
hit   91.9%    k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae; g__Lactobacillus; s__reuteri


So none of these are over 97%, but they all happen to agree down to the species level. Looks like some pretty different OTUs are all in the s__reuteri, so the LCA algorithm concludes that your equally different OTU could be part of this species too!
(I did a quick 'grep' search on the greengenes taxonomy, and found 19(!) OTUs with the 'reuteri' as the species. The example I gave is very possible with 19 members in the reference.)

3) Is there a way to retrieve the reference sequences from the OTUs so I could do a manual analysis?
You can extract this OTU centroid from your rep_set.fna file, and try manually blasting in on NCBI and see what you get. 


This might be a good time for me to mention that I have a pretty dim view of taxonomy assignment. I view microbe names like the points on "whose line is it anyway" and caution my traditional microbiologists to view amplicon taxonomy as a reasonable guess, not ground truth. Robert Edgar is also pretty critical of taxonomy, and has been working on the UTAX algorythm to improve things: http://www.drive5.com/usearch/manual/tax_bench.html 

Keep in touch! 
Colin

Vanessa V.

unread,
Jun 10, 2016, 8:37:47 PM6/10/16
to Qiime 1 Forum
OK, this is really good to know about the taxonomy assignments.  So if the top 3 hits were L. reuteri, and two other Lactobacillus species, (instead of all 3 being to L. reuteri) the OTU would be classified as just "Lactobacillus" to the genus level?  

That's so interesting there are 19 OTUs of L. reuteri in greengenes.  I know there are a lot of strains of L. reuteri but I would think the 16S sequences would be >97% to have put them together into 1 OTU.  Especially since the reference db is full length 16S?

Thanks again!!  I will be sure to be cautious and manually verify the reference/representative sequences in my OTUs.  

Vanessa

Colin Brislawn

unread,
Jun 13, 2016, 1:12:35 PM6/13/16
to Qiime 1 Forum
Hello Vanessa,

 
OK, this is really good to know about the taxonomy assignments.  So if the top 3 hits were L. reuteri, and two other Lactobacillus species, (instead of all 3 being to L. reuteri) the OTU would be classified as just "Lactobacillus" to the genus level?  
Yep!  
 
That's so interesting there are 19 OTUs of L. reuteri in greengenes.  I know there are a lot of strains of L. reuteri but I would think the 16S sequences would be >97% to have put them together into 1 OTU.  Especially since the reference db is full length 16S?
Because the referance does include the full length sequence, it's possible that the full sequence is 97% different, but 100% the same in the V1 region, for example. You can look in your log file to see how similar the top three hits were. Sometimes I see that an OTU is an 100% match to three different things, which tells me I could have separated this species given their full 16S gene, but can't with my region.

Thanks again!!  I will be sure to be cautious and manually verify the reference/representative sequences in my OTUs.  

Vanessa

Keep in touch!
Colin  
Reply all
Reply to author
Forward
0 new messages