SILVA vs GreenGenes discrepancy

Tatyana Zamkovaya

Jan 5, 2018, 11:53:18 AM
to Qiime 1 Forum
Hi, 
I recently ran pick_open_reference_otus.py on my data, first using the default GreenGenes (GG) reference database and then using the SILVA 128 16S-only database, with the following settings in my parameters file:

align_seqs:template_fp /home/tatyanaz/core_alignment_SILVA128.fna

assign_taxonomy:reference_seqs_fp /home/tatyanaz/97_otus_16S.fasta

assign_taxonomy:id_to_taxonomy_fp /home/tatyanaz/consensus_taxonomy_7_levels.txt


In both cases (with GreenGenes and with SILVA), I had prefilter_percent_id set to 0.0.

For GG, I did not include a parameters file and just ran pick_open_reference_otus.py as is:

pick_open_reference_otus.py -i /home/tatyanaz/fastq1_newsplit/seqs.fna -o /home/tatyanaz/fastq1_newpickotu --prefilter_percent_id 0.0 


When I compared the overall OTU counts per sample from the two runs, I noticed that certain samples had substantially different results depending on whether I had used GreenGenes or SILVA. For instance, one sample (37) had 113554.0 Counts/Sample with GG but only 79936.0 Counts/Sample with SILVA.
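
To put the gap for sample 37 in numbers, here is a small sketch that uses only the two counts quoted above. The variable names are just for illustration.

```shell
#!/bin/sh
# Counts/Sample for sample 37, as reported in this post.
gg_count=113554
silva_count=79936

# Absolute difference in counts (integer arithmetic).
diff=$((gg_count - silva_count))

# Relative drop with respect to the GG run, to one decimal place.
# LC_ALL=C keeps the decimal separator a dot regardless of locale.
pct=$(LC_ALL=C awk -v g="$gg_count" -v s="$silva_count" \
    'BEGIN { printf "%.1f", (g - s) / g * 100 }')

echo "SILVA run has $diff fewer counts ($pct% fewer than the GG run)"
```

So the SILVA run retains roughly 70% of the counts that the GG run reports for this sample.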

Is this normal? What could be causing this discrepancy? Is it due to the different databases or is there something I am missing in the parameters for either SILVA or GG? Please let me know if you know how to solve this issue or have come across the same issue.


Thanks! 

Colin Brislawn

Jan 15, 2018, 12:30:44 AM
to Qiime 1 Forum
Hello Tatyana,

That is strange! Open-reference OTU picking should use all of your sequences, especially if you set the prefilter to 0. My best guess is that the --percent_subsample option, which is 0.001 (0.1%) by default, is reducing the number of sequences being used in the de novo step. If Greengenes covers more of your reads than the SILVA database does, more of your reads would end up in the de novo clustering step in the SILVA run, and the subsampling would reduce the total read count. More info here: http://qiime.org/scripts/pick_open_reference_otus.html

To avoid this problem, you could change percent_subsample, or use fully de novo OTU clustering and then use the two different databases to assign taxonomy in a second, separate step.
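
As a rough sketch of those two options (not a tested recipe): the input and SILVA file paths are the ones quoted earlier in this thread, the output directory names are hypothetical placeholders, and 0.1 is an arbitrary illustrative value for --percent_subsample. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Paths taken from the thread.
SEQS=/home/tatyanaz/fastq1_newsplit/seqs.fna
SILVA_REF=/home/tatyanaz/97_otus_16S.fasta
SILVA_TAX=/home/tatyanaz/consensus_taxonomy_7_levels.txt
# Hypothetical output directories (not from the thread).
OUT_OPEN=open_ref_subsample_0.1
OUT_DENOVO=denovo_otus
OUT_TAX=denovo_silva_taxonomy

# DRY_RUN=1 (default) only prints each command; DRY_RUN=0 executes it
# and needs the QIIME 1 scripts on PATH.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Option 1: open-reference picking with a larger failure subsample.
# 0.1 is an arbitrary example; the default is 0.001.
# Add your usual -p parameters file (and -r reference, if you use one).
option_1() {
    run pick_open_reference_otus.py -i "$SEQS" -o "$OUT_OPEN" \
        --prefilter_percent_id 0.0 --percent_subsample 0.1
}

# Option 2: cluster de novo once, then assign taxonomy to the
# representative sequences as a separate step (SILVA files shown here;
# repeat with the other database's files for the comparison).
option_2() {
    run pick_de_novo_otus.py -i "$SEQS" -o "$OUT_DENOVO"
    run assign_taxonomy.py -i "$OUT_DENOVO/rep_set/seqs_rep_set.fasta" \
        -r "$SILVA_REF" -t "$SILVA_TAX" -o "$OUT_TAX"
}

option_1
option_2
```

With the second option the OTUs themselves are identical for both databases, so any remaining difference comes only from the taxonomy assignment.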

Let me know if that helps, or if you have already solved this problem,
Colin
