Picking OTUs and open reference OTUs returns huge number of similar OTUs

Alex Crits-Christoph

unread,

Apr 20, 2016, 7:34:46 PM4/20/16

to Qiime 1 Forum

Hello all,

I recently ran the command below on a quality filtered FASTA file of 808,184 Illumina reads from ~10 low diversity environmental samples.

pick_otus.py -i ../seqs.fasta -r ./97_otus.fasta -s 0.97 -m usearch61 --threads 8 -o otu_output_pick_manual_usearch

The command completes successfully, and the log file shows:

Usearch610DeNovoOtuPicker parameters:

Application:usearch61

minlen:64

output_dir:otu_output_pick_manual_usearch

percent_id:0.97

remove_usearch_logs:False

rev:False

save_intermediate_files:True

sizeorder:False

threads:8

usearch61_maxaccepts:1

usearch61_maxrejects:32

usearch61_sort_method:abundance

usearch_fast_cluster:False

verbose:False

wordlength:8

Num OTUs:98403

98,000 OTUs? How is that possible? Upon building the reference set, I then aligned a portion of them against each other using MUSCLE and found that many were more than 97% (many 98 or 99.xx% identical representative sequences) similar to other OTU references - what is going on?

I encounter a similar issue when running pick_open_reference_otus.py:

pick_open_reference_otus.py -i ./seqs.fasta -r ./97_otus.fasta -o ./otu_output_pipeline/ -m usearch61 -s 0.1

In this case, we get the following OTU numbers per step:

Step1: 280

Step2: 104

Step3: 104

Step4: 88364

Anyone understand why I'd be getting so many OTUs at the denovo level, or why the OTUs created are not distinct at the 97% threshold? I've been able to replicate this on a couple of different datasets. Using UCLUST also gives an OTU figure far too high (~2,000).

Here is my QIIME config:

http://pastie.org/10806141#9

Thanks!

Alex

TonyWalters

unread,

Apr 20, 2016, 9:49:21 PM4/20/16

to Qiime 1 Forum

Hello,

The number of reads that are clustering against the reference database (Greengenes?) are low. Can reverse complementing the reads (adjust_seqs_orientation.py) or enabling the pick_otus:enable_rev_strand_match in a QIIME parameters (http://qiime.org/documentation/qiime_parameters_files.html) file to see if more of the reads will cluster against the reference database?

Are these reads covering the same region of the 16S gene? What primer set were they generated with? Most of the sequences being clustered in the last step of open-reference OTU picking, which indicates something odd about the sequences that is making the usearch algorithm (written by the same person as muscle btw) fail to cluster them together, even as it subsamples from the fasta file to create new reference reads in the intermediate steps. Did uclust give a total of 2000 clusters versus ~88000 for usearch61?

Second, are there any non-16S sequence data left in the reads, such as adapters/barcodes/pads? If you blast some of the reads on NCBI, and you see overhangs at the end(s) of the reads that aren't hitting 16S genes, it could indicate this issue. This could also alter clustering. Also, are there gaps in the reads when you cluster them, or are the continuously hitting reference genes?

For both uclust and usearch61 you can increase the max_accepts and max_rejects (e.g. 20 and 500) for pick_otus, in the qiime parameters file, which could decrease the number of OTUs, although I don't think it will be that huge of an effect.

Oftentimes, low abundance OTUs, classically singletons, but can be higher for Illlumina datasets, are filtered out for being suspected noise or chimeras. For open-reference OTU picking, this will happen later, so you'll need to see the OTU count from the OTU table itself. biom summarize-table for the details.

What do the taxa look like? Are there known taxa that should be present showing up? What about Pseudomonas or other taxa that can be common contaminants?

Alex Crits-Christoph

unread,

Apr 20, 2016, 11:19:17 PM4/20/16

to Qiime 1 Forum

Hey Tony,

Thanks for your response.

The final number of OTUs that end up in the OTU table with pick_open_reference_otus with usearch61 is 6,713 and 2,823 for uclust, so the overwhelming majority of these step4 OTUs are being filtered out. Both of these numbers still seem way higher than they should be. It's worth noting that both uclust and usearch give similarly high numbers of OTUs in step4 - uclust gives 90,000 there (compared to 88,000), the difference seems to be in filtering for the OTU table. I re-ran the pipeline with usearch61 and included a params file with enable_rev_strand_match, and the number of OTUs is now down to 5,607 - which still seems far too high. Additionally, an all vs all alignment of some of these representative sequences shows that many are 98%+ or 99%+ identical, which as far as I know, really shouldn't be happening with a usearch algorithm.

The taxa are exactly as expected and seem very accurate, except there is a large number of very low abundance (near singletons) OTUs assigned to taxa also found in high abundance in the samples. No signs of contaminants. The reads fully map to the 16S reference genes and there's no sign of any barcodes or adapters in them, which were removed when the FASTQ's were converted to FASTAs.

Running the pick_open_reference_otus pipeline with usearch61 and suppressing step4 gives a total of 289 OTUs in the OTU table- that's exactly about the number I'd expect from these low diversity samples.

Are there any other qiime parameters I should include in a params file? Before you suggested pick_otus:enable_rev_strand_match, I wasn't using one at all and was just going with defaults. I could also share some of the FASTA file.

Thanks for your help,

Alex

TonyWalters

unread,

Apr 20, 2016, 11:29:11 PM4/20/16

to Qiime 1 Forum

Hello again,

I would add these as well and see how much it limits the low abundance OTU creation:

pick_otus:max_accepts 20

pick_tous:max_rejects 500

Note that with these and the reverse strand matching, it will use more memory and take longer to run.

You might also try running vsearch (https://github.com/torognes/vsearch), which you can plug in as a usearch61 replacement (rename the current usearch61 executable, and name the vsearch executable usearch61 and put it in the $PATH). It won't run as fast as usearch61, and you probably don't need to use the qiime_parameters file with it.

Alex Crits-Christoph

unread,

Apr 21, 2016, 5:25:17 PM4/21/16

to Qiime 1 Forum

Hi Tony,

Thanks, this helps reduce the OTU count.

I noticed that vsearch only identifies closed reference OTUs when running this pipeline as a drop-in usearch61 replacement.

I also noticed that the moving pictures Illumina tutorial has a default output of a few thousand OTUs with the open reference picking pipeline - is 2,000-4,000 OTUs just a standard number of OTUs for deeply sequenced data with the QIIME open reference picking pipeline?

Thanks!

Alex

TonyWalters

unread,

Apr 21, 2016, 5:39:02 PM4/21/16

to Qiime 1 Forum

Hello Alex,

There is arguably OTU inflation in most data sets, with some various approaches (filtering OTUs of certain size, chimera-checking, and/or discarding reads that fail to match a known reference set, and some other efforts and implementing a form of denoising (deblurring) for Illumina data) used to minimize this. For human gut samples, there does tend to be thousands of OTUs.

Just to clarify, did the vsearch clustering lead to everything clustering against the reference data, or did it just not cluster the reads that failed to match?

Alex Crits-Christoph

unread,

Apr 21, 2016, 5:42:39 PM4/21/16

to Qiime 1 Forum

Hi Tony,

What I meant is that vsearch only ran step1 of the open reference OTU picking pipeline when I used it. That could also have been my error, but I think I ran it similarly to the command above with usearch61 replaced by vsearch in my path.

Thanks!

Alex

Colin Brislawn

unread,

Apr 21, 2016, 6:07:17 PM4/21/16

to Qiime 1 Forum

Hello Alex,

While you can rename vseach to usearch61 so that qiime ends up using vsearch for some commands, vsearch is not technically through qiime yet. It's possible that some qiime scripts work great with vsearch, but trip up. This could be happening here.

Here is the github issue tracking vsearch integration into qiime, if you want to subscribe for updates or contribute code.

https://github.com/biocore/qiime/issues/1962

Keep in touch!

Colin

Reply all

Reply to author

Forward