Question to experts related to non-singletons


shriram patel

Oct 16, 2016, 12:19:16 AM
to Qiime 1 Forum
Hello Experts,

I want to ask a question related to 16S amplicon analysis.

We obtained Illumina MiSeq amplicon sequences from the V3-V4 hypervariable region. The total number of sequences was around 6 billion. During closed-reference OTU picking, only about 65% of the sequences could be assigned, so we moved to a de novo approach (using vsearch).

However, during the dereplication step, when we set the parameter to remove singletons (sequences with fewer than 2 exact matches), almost 90% of the sequences get discarded. How can we justify that? What could be the possible explanation?

Thank you in advance,

Best   

Colin Brislawn

Oct 17, 2016, 1:15:19 PM
to Qiime 1 Forum
Hello Shriram,

Thanks for getting in touch with us. I also use de novo clustering with vsearch, so hopefully I can answer your question.

almost 90% of sequences gets discarded
Help me understand what that 90% refers to. If 90% of the total reads from your input (seqs.fna) are discarded, that's a problem! If 90% of your dereplicated reads are singletons, that makes a lot more sense: most reads appear many times and end up in a dereplicated read of high abundance, while the remainder end up as singletons. 90% singletons may only account for 5% of the total reads in seqs.fna.

Depending on your version of vsearch, the output will tell you whether that 90% refers to output reads or input reads.
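If it helps, here is roughly the dereplication command I use (filenames are placeholders for your own). The on-screen/log output reports both the number of input sequences and the number of unique sequences, so you can see directly which side that 90% refers to:

```shell
# Dereplicate the demultiplexed reads. --sizeout records the abundance
# of each unique sequence in its FASTA header; --minuniquesize 2 drops
# singletons. Run once WITHOUT --minuniquesize if you want to count
# how many uniques are singletons.
vsearch --derep_fulllength seqs.fna \
    --output derep.fna \
    --sizeout \
    --minuniquesize 2 \
    --log derep.log
```

Compare the "unique sequences" count in the log against the input read count, rather than counting entries in the output file.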

Also, I once had problems with singletons when I accidentally left my barcodes in my clustered sequences. Those unique barcodes prevented identical reads from different samples from combining, and I got lots of singletons too.

I hope some of that helps. Let me know what you find,
Colin

shriram patel

Oct 18, 2016, 6:25:59 AM
to Qiime 1 Forum
Hello Colin, 

Thank You very much for the quick reply.

You are right. I checked the dereplicated file against my original file and found that 90% of my dereplicated reads are singletons.
And my sequences are already demultiplexed.

Moreover, I am planning to keep minotu_size=2 during the clustering step to apply additional filtering of low-abundance OTUs. Is that right, or is it not required?

Best,

Shriram

Colin Brislawn

Oct 18, 2016, 11:01:18 AM
to Qiime 1 Forum
Hello Shriram,

Glad you got this working. You can keep minotu_size; I don't think it's needed, but it won't hurt.

When making de novo OTUs with vsearch, the way to make a .biom file is to remap your input seqs.fna reads to your clustered, chimera-checked OTU centroids, say using vsearch --usearch_global. This step tells you how many of your seqs.fna reads were able to map to your centroids. This is a good sanity check to make sure that your OTU centroids do a good job of describing your overall seqs.fna file. I usually get 85-95% of my reads to map to my de novo OTU centroids. If only 10% can map, there could be a problem!
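A rough sketch of that remapping step (filenames are placeholders, and --id should match the radius you clustered at; depending on your vsearch version, --otutabout may or may not be available, in which case the .uc file plus a small script gets you the same table):

```shell
# Map every original read (singletons included) back onto the
# chimera-checked OTU centroids at 97% identity.
vsearch --usearch_global seqs.fna \
    --db otu_centroids.fna \
    --id 0.97 \
    --uc readmap.uc \
    --otutabout otu_table.txt
```

The screen output reports the fraction of reads that matched a centroid; that percentage is the sanity check described above.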

Keep in touch,
Colin

shriram patel

Nov 13, 2016, 2:09:37 AM
to Qiime 1 Forum
Hello Colin, 

Adding to my question: as I said, we obtained data sequenced with 2x250 chemistry on an Illumina MiSeq. After that, 10 bases were trimmed from the end of both the forward and reverse reads (to get rid of low quality at the ends of the sequences), and the reads were merged using PANDAseq (you can suggest better alternatives).
What I am planning, before de novo OTU picking with vsearch, is to also trim the first 10 bases from both the forward and reverse reads (after that, all sequences will have an average size of around 400-410 bases). Do you think this additional trimming is required, and can it reduce the number of singletons generated during dereplication and OTU picking?

Best,

Shriram

Colin Brislawn

Nov 13, 2016, 4:35:26 PM
to Qiime 1 Forum
Hello Shriram,

Merging is important because it corrects errors and dramatically improves read quality. I like vsearch --fastq_mergepairs, but most methods work well.
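For what it's worth, a minimal merge command looks something like this (filenames are placeholders; the stagger option is only needed if your reads overlap past each other):

```shell
# Merge forward and reverse reads into full-length amplicons.
# Merging corrects errors in the overlap region, which is why
# quality improves so much at this step.
vsearch --fastq_mergepairs fwd.fastq \
    --reverse rev.fastq \
    --fastqout merged.fastq \
    --fastq_allowmergestagger
```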

Trimming the ends could improve both pairing and dereplication. (We can prove that it must improve dereplication: by reducing the number of base pairs, we reduce the number of positions that could differ, so we are guaranteed to have fewer unique reads in that smaller region.)
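If you want to do that leading trim with vsearch itself rather than a separate tool, something like this should work (filenames are placeholders; --fastq_stripleft removes bases from the 5' end, and the matching 3'-end option is not present in every vsearch version, so check yours or use cutadapt for the tail):

```shell
# Strip the first 10 bases from each forward read before merging;
# repeat the same command for the reverse reads.
vsearch --fastq_filter fwd.fastq \
    --fastq_stripleft 10 \
    --fastqout fwd.trimmed.fastq
```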

I'm not sure that it will improve clustering that much. Keep in mind that while I only use reads that appear >= 2 times for clustering, I include all reads (including singletons!) when I remap reads with --usearch_global. You can see how many of your reads were remapped to an OTU during this step. If you throw away lots of singletons but 95% of your reads still remap with --usearch_global, that means those singletons still managed to find their way into your OTUs, and omitting them from clustering did not remove anything unique.

Colin

shriram patel

Nov 14, 2016, 12:14:42 PM
to Qiime 1 Forum
Hello Colin, 

I will certainly try the vsearch --fastq_mergepairs function to merge sequences.

And yes, it is obvious that reducing the sequence length will decrease the number of unique reads.

It was really helpful.


With Best