Advice needed on pick de novo parameters and sequence quality threshold


sha...@vt.edu

Nov 10, 2016, 4:19:52 PM
to Qiime 1 Forum

Hello All,

I have a dataset that I am classifying based on a non-marker functional gene, so I am using the pick_de_novo workflow here. UCLUST is run at 97% similarity to pick OTUs (default parameters). When I use the data at a Phred quality threshold of 30, I have ~400,000 reads total across 50 samples, and the number of observations in the summary biom file is ~12,000. When I checked the alpha_rarefaction curves, it seems that the sequencing depth is not enough.
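
In case the exact commands matter, the OTU picking step looks roughly like this (file names here are placeholders):
pick_de_novo_otus.py -i seqs.fna -o denovo_otus/ -p params.txt
where params.txt contains:
pick_otus:otu_picking_method uclust
pick_otus:similarity 0.97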


I decided to reanalyse the dataset with a Phred quality threshold of 25 to increase the number of available reads. After that, I ran the same pick_de_novo workflow with the same parameters as before. The new biom summary shows nearly 2 million reads, but with 335,000 observations (of which 267,000 are singletons!!). I went on with the core_diversity analysis. The taxonomy assignments have a similar profile even at the species level (a few additional taxa appear in the summary plots for the run with more reads at Phred quality 25).
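
(For reference, if this filtering were done with QIIME's split_libraries_fastq.py, a Phred 25 threshold would correspond to something like the call below, since -q takes the maximum unacceptable score; the file names are placeholders.)
split_libraries_fastq.py -i reads.fastq -b barcodes.fastq -m map.txt -q 24 -o slout_q25/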


The number of uclust clusters increased by nearly 28 times, whereas the number of reads increased by just 5 times, when I decreased the Phred quality threshold from 30 to 25 for the initial filtering of the reads. I see similar behaviour when clustering with cd-hit-est as well. What would you suggest here?


I am using QIIME 1.8 with all associated executables properly configured.


Thanks,
Hazem


Colin Brislawn

Nov 10, 2016, 5:15:33 PM
to Qiime 1 Forum
Hello Hazem,

Thanks for getting in touch with us! You mentioned:
I decided to reanalyse the dataset with a phred quality threshold of 25 to increase number of available reads.
Your plan makes sense: including more reads in the analysis will let you detect more diversity. But there is a catch: lowering that q-score threshold also gives you more low-quality reads, which can form spurious OTUs.

The number of uclust clusters increased by nearly 28 times whereas the number of reads increased by just 5 times when I decreased the phred quality threshold from 30 to 25 for the initial filtering of the reads. A similar behaviour is obtained when clustering with cd-hit-est as well.
Wow. Having 28x more OTUs shows you how important that quality filtering threshold was. (267,000 singletons sounds pretty high to me, too!)


I would recommend trying to increase the quality, not just quantity, of your data set. Is this paired-end Illumina data? If so, you could try joining the ends together, which would allow the reads to correct each other in the area of overlap and increase overall quality.
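
If you want to try that in Qiime, join_paired_ends.py should do it; a rough sketch with placeholder file names:
join_paired_ends.py -f forward_reads.fastq -r reverse_reads.fastq -o joined_reads/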

Are you performing chimera checking? When using a fully de novo approach, I find that many of my less common OTUs are chimeras formed during PCR. Removing them gives me a more reasonable number of OTUs. Your alpha rarefaction curves could level off, once you have a smaller number of higher quality OTUs.
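
One rough way to do that check on your representative set in Qiime (file names are placeholders, and the reference is whatever database fits your gene):
identify_chimeric_seqs.py -m usearch61 -i rep_set.fna -r reference.fna -o chimera_check/
filter_fasta.py -f rep_set.fna -o rep_set_no_chimeras.fna -s chimera_check/chimeras.txt -n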


You have a lot of good options, Hazem. Let me know what you try next!
Colin

Hazem

Nov 11, 2016, 9:33:23 AM
to Qiime 1 Forum
Good Morning Colin,

Thanks for your response.

Well, I already join the reads after quality filtering and adapter/primer removal. However, I did not do chimera checking; I will certainly do that. I already did it on the reference database that I am using for taxonomy assignment, through identify_chimeric_seqs and usearch61. I am thinking of using vsearch for chimera removal this time (following this thread). Actually, I need one more piece of advice here: should I remove the singletons before running the pick_de_novo workflow as well?


Thanks!
Hazem

Colin Brislawn

Nov 11, 2016, 1:14:21 PM
to Qiime 1 Forum
Hello Hazem,

That's another great question. Removing chimeras before or after OTU picking has been debated, and both approaches are possible. Check out this discussion: 

You will have to make sure that your OTU picking step and chimera checking step 'match up' with each other; some combinations are harder to perform. In that thread, I outline the pipeline I use in this post:
So in this example, I choose to check for chimeras afterwards.

Singleton removal is common in OTU picking. I think this is a safe choice. You can easily do this during the dereplication step. 
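With vsearch, for example, it's one extra option at dereplication (file names are just placeholders):
vsearch --derep_fulllength seqs.fna --output seqs.derep.min2.fna --sizeout --minuniquesize 2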

Keep in touch,
Colin

Hazem

Nov 11, 2016, 4:41:55 PM
to Qiime 1 Forum
Hello Colin,

Thanks again!

I will proceed with both approaches. Apparently, removing the chimeras before OTU picking, on the whole set of reads, will keep running through the weekend. For the second method, it seems more feasible to perform it with VSEARCH and then do the diversity analysis in QIIME.

I just have one last question regarding the second approach: according to the workflow you mentioned in the link in the discussion with Sanjeev, the chimera-cleaned file is equivalent to the rep_set.fna, right? Also, I will be using the same dataset, the one with the merged paired-end files, right?

Have a nice weekend!

--
Cheers,
Hazem

Colin Brislawn

Nov 11, 2016, 7:59:22 PM
to Qiime 1 Forum
Hello Hazem,

Like you said, I use the merged pairs as the input for this data set. 

During OTU picking in Qiime, the final list of OTU centroids is called rep_set.fna. Exactly which file matches up to that depends on your pipeline.

In my pipeline, my naming convention looks something like this:
seqs.fna
seqs.derep.fna
seqs.derep.min2.fna
otus.fna
otus_no_chimeras.fna
seqs_mapped_to_otus_no_chimera.uc
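
Very roughly, those files come out of vsearch steps like these (the options and thresholds here are just examples, not a prescription):
vsearch --derep_fulllength seqs.fna --output seqs.derep.fna --sizeout
vsearch --sortbysize seqs.derep.fna --output seqs.derep.min2.fna --minsize 2
vsearch --cluster_size seqs.derep.min2.fna --id 0.97 --sizein --sizeout --centroids otus.fna
vsearch --uchime_denovo otus.fna --sizein --nonchimeras otus_no_chimeras.fna
vsearch --usearch_global seqs.fna --db otus_no_chimeras.fna --id 0.97 --uc seqs_mapped_to_otus_no_chimera.uc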

I hope this is helping... There are many reasonable ways to do this process, and I worry that I'm creating more questions than answers. 

Colin

Hazem

Nov 14, 2016, 4:09:30 PM
to Qiime 1 Forum
Hello Colin,

This is great help. It worked very smoothly.

Actually, the de novo chimera detection on the raw reads file did not work; it ran for about 36 hours and did not output any chimeric reads from the 2-million-read dataset. However, when I ran it on the clustered OTU set, it worked perfectly and removed about 7% of the clusters. So chimera removal is better done after picking OTUs.

I am getting used to VSEARCH. I know I should ask this question on their forum, but I am a bit worried because the program runs really fast. The dereplication step took about 10 seconds on the 2-million-read dataset (mean length of about 300 bp). Is this normal?

Thanks again for your help.

Colin Brislawn

Nov 15, 2016, 5:26:48 PM
to Qiime 1 Forum
Hello Hazem,

I'm glad you are taking the leap to vsearch, and some more 'manual' processing of your data. Let's see if I can help answer these questions.

Yes, vsearch is super fast! Dereplication, especially, is super quick. Searching and clustering are somewhat slower than usearch, but more exact (high-quality alignments).

Glad you got reference-based chimera detection up and running. That step is often faster because you can pass --threads 10 to parallelize it across 10 threads. Just like you discovered, de novo chimera detection does NOT work on raw data; --uchime_denovo requires your reads to be dereplicated, with size annotations. So you might run:
vsearch --derep_fulllength seqs.fna --output seqs.derep.fna --sizeout
vsearch --uchime_denovo seqs.derep.fna --sizein --xsize --nonchimeras seqs.derep.checked.fna

Notice how --derep_fulllength adds on the size annotations that --uchime_denovo needs. Bonus: because your data set is dereplicated, it will run in much less than 36 hours. 
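
For the reference-based check, the equivalent call would be something like this (the database name is just a placeholder for whatever reference fits your gene):
vsearch --uchime_ref otus.fna --db reference.fna --nonchimeras otus_no_chimeras.fna --threads 10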

I've really appreciated your detailed questions, Hazem. Keep in touch,
Colin Brislawn

