Clarification on denoising questions

Serena Thomson

Sep 24, 2012, 7:32:23 AM

to qiime...@googlegroups.com

Hi there,

A while ago you helped me with some queries regarding the denoising of my sequences (I have attached part of the thread as a Word document as a reminder). I am now writing up and want to confirm a couple of points so that I fully understand the process:

When running denoise_wrapper.py, the advice was that the reverse primer could not be removed during split libraries. It is therefore expected that all sequences will still contain the reverse primer after quality filtering, denoising and chimera checking. Why is this the case? Why could I not just specify -z during the split libraries step and then run the denoiser on this filtered data (with the reverse primer removed)? The way the question was phrased before implied that I wanted to remove the reverse primer during the denoising step, which wouldn't make sense.

Secondly, I wanted to confirm that, as per the last thread in the attached document, leaving the reverse primers in the sequences would be fine if they didn't result in artificial separation of OTUs (checked by alignments). What would the options have been had they affected the alignment and resulted in differences? Would that have meant denoising wasn't recommended?

Thank you for the clarification on these two points.

Serena

Tony Walters

Sep 24, 2012, 10:14:26 AM

to qiime...@googlegroups.com

Hello Serena,

denoise_wrapper.py does not actually pull the sequence itself from the output of split_libraries.py, but just uses the fasta labels present in the file to get the subset of sequences that passed all the filters in split_libraries.py (to save computational time later). To remove the reverse primers, after doing all of the steps for denoise_wrapper.py, you would want to take the resulting fasta file and run truncate_reverse_primer.py (http://qiime.org/scripts/truncate_reverse_primer.html).
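As a rough illustration of the label-based selection described above, the sketch below collects the read IDs from the labels of a filtered fasta file and uses them to pick out the untrimmed raw reads. It is not QIIME's actual code: the function names are hypothetical, the raw reads are represented as fasta purely for simplicity, and the label layout (sample label first, original read ID second) is assumed from the example label quoted later in this thread.

```python
def read_fasta(lines):
    """Yield (label, sequence) pairs from an iterable of FASTA lines."""
    label, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if label is not None:
                yield label, "".join(chunks)
            label, chunks = line[1:], []
        else:
            chunks.append(line)
    if label is not None:
        yield label, "".join(chunks)


def passing_read_ids(filtered_fasta_lines):
    """Collect original read IDs (second label field) from the filtered file."""
    return {label.split()[1] for label, _ in read_fasta(filtered_fasta_lines)}


def select_raw_reads(raw_fasta_lines, keep_ids):
    """Return the unmodified raw reads whose ID (first label field) is kept."""
    selected = {}
    for label, seq in read_fasta(raw_fasta_lines):
        read_id = label.split()[0]
        if read_id in keep_ids:
            selected[read_id] = seq
    return selected


# Hypothetical data: the filtered read was trimmed, the raw read was not.
filtered = [
    ">Soils.LoddBcrop.crop8_8323 GV2FPVH01ECN49 orig_bc=CGTACTCAGA new_bc=CGTACTCAGA bc_diffs=0",
    "ACGT",
]
raw = [">GV2FPVH01ECN49", "ACGTACGTTTTT", ">GV2FPVH01AAAAA", "GGGG"]

keep = passing_read_ids(filtered)
selected = select_raw_reads(raw, keep)
print(selected)
```

Because only the IDs are carried over, any trimming applied to the sequences in the filtered file is not inherited by the selected reads, which is why the reverse primer is still present afterwards.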

The largest concern is not so much the primer itself, although this means you're adding ~15-20 highly conserved bases to your sequences which could alter clustering, but rather what follows the primer, like barcodes and/or adapter sequences. Those can throw off clustering (particularly reference based clustering) as well as taxonomic assignments with RDP (BLAST would be less affected).
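To make the point about trailing sequence concrete, here is a minimal sketch of truncating a read at the reverse primer so that the primer and everything after it are dropped. This is an illustration of the idea only, not the implementation used by truncate_reverse_primer.py; the function name, the exact-match search, and the flanking bases in the example are assumptions.

```python
def truncate_at_primer(seq, primer_in_read):
    """Cut seq at the first exact occurrence of primer_in_read.

    Returns (sequence, found). If the primer is absent, the sequence is
    returned unchanged (an assumed behaviour for this sketch).
    """
    pos = seq.find(primer_in_read)
    if pos == -1:
        return seq, False
    return seq[:pos], True


# Hypothetical read: amplicon bases, then the primer as it appears in the
# read, then trailing adapter-like bases.
primer_in_read = "GTAGGTGAACCTGCAGAAGGATCA"
read = "AAAACCCC" + primer_in_read + "TTTTGGGG"

trimmed, found = truncate_at_primer(read, primer_in_read)
print(trimmed, found)
```

Truncating at the primer start removes both the conserved primer bases and whatever barcode or adapter sequence follows them.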

Hope this helps,

Tony Walters

--

Serena Thomson

Sep 24, 2012, 11:48:36 AM

to qiime...@googlegroups.com

Hi Tony,

Thanks for your fast response. I don't think truncate_reverse_primer was available back in December 2011/January 2012, was it? Would you recommend going back through the data and re-running the analysis from this point to include truncate...? Would this be done immediately after denoise_wrapper or after inflate_denoise_wrapper? Is this much of an issue on short sequences? (The histogram for one of my data sets seems to tail off after 200 bp.)

(Presumably split libraries dealt with the forward primer, so I don't need to worry about that one?)

Many thanks

Serena

--

Tony Walters

Sep 24, 2012, 12:59:47 PM

to qiime...@googlegroups.com

Hello again Serena,

It was implemented fairly recently (primarily to handle cases such as denoised fasta data). It may not be an issue if your reads are fairly short compared to the size of the amplicon construct (for example, if you have 200 base pair reads on average and you'd have to read 300 to hit the reverse primer). Usually problems will manifest themselves as a lot of "Root" or "Root;Bacteria" assignments with RDP, so if you have decent assignments, it probably isn't worth redoing. The forward primer is removed by default with split_libraries.py (and denoise_wrapper.py).
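The read-length reasoning can be sketched as a quick check of how many reads are long enough to run into the reverse primer at all. The function name, the list of read lengths, and the 300-base distance below are hypothetical and only mirror the example figures above; the real distance depends on the amplicon.

```python
def fraction_reaching_primer(read_lengths, distance_to_reverse_primer):
    """Fraction of reads longer than the distance to the reverse primer.

    A read longer than that distance contains at least one primer base.
    """
    if not read_lengths:
        return 0.0
    reaching = sum(1 for n in read_lengths if n > distance_to_reverse_primer)
    return reaching / len(read_lengths)


# Hypothetical trimmed read lengths (bases), with the primer 300 bases in.
lengths = [180, 200, 220, 310]
print(fraction_reaching_primer(lengths, 300))
```

If this fraction is close to zero, truncating at the reverse primer would change very few reads.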

-Tony

--

Serena Thomson

Sep 24, 2012, 3:44:42 PM

to qiime...@googlegroups.com

Thanks Tony, as always I do appreciate the help!

Serena

--

Serena Thomson

Sep 24, 2012, 3:57:59 PM

to qiime...@googlegroups.com

Sorry, one last question: where in the workflow would this step fit? After inflate_denoise and before pick otus?

Thanks

Tony Walters

Sep 24, 2012, 3:59:14 PM

to qiime...@googlegroups.com

Hello Serena,

That is the correct place to do it.

-Tony

--

Serena Thomson

Sep 25, 2012, 10:28:05 AM

to qiime...@googlegroups.com

Hi Tony,

Following our emails yesterday, I experimented with the command on some of my datasets. For one of them I would NOT have expected to find any reverse primers, because of the short read lengths. However, the log output (below) shows that, out of about 505,000 sequences, only about 18,000 contained no reverse primer.

Reverse primer written as TGATCCTTCTGCAGGTTCACCTAC in mapping file, found as GTAGGTGAACCTGCAGAAGGATCA in inflated fasta file.

Details for removal of reverse primers

Original fasta filepath: denoised_seqs_inflated.fna

Total seqs in fasta: 505695

Mapping filepath: /Users/serena/Documents/Academia/E_Soi/Archive/Mapping_Files/Soi_Map2.txt

Truncation option: truncate_only

Mismatches allowed: 0

Total seqs written: 505695

SampleIDs not found: 0

Reverse primers not found: 18007

Am I right in thinking, then, that this file will need to be reprocessed, given that in the majority of cases the reverse primer has been located and removed (albeit on very short read lengths)? (Also, the reverse primer is written as TGATCCTTCTGCAGGTTCACCTAC in the mapping file, but it was removed as its reverse complement, GTAGGTG..., without me changing it.)

Can I do much in terms of assignments with read lengths this short (see sample at bottom)? Would it be better to leave the primer in?

For most of my datasets the truncate_reverse_primer command has failed to find and remove any reverse primers (showing the following):
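The relationship between the two primer strings quoted above can be checked with a short reverse-complement helper. The helper name is hypothetical and handles only plain A/C/G/T bases (no IUPAC ambiguity codes); it is a sketch for checking orientation by hand, not part of QIIME.

```python
# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


mapping_primer = "TGATCCTTCTGCAGGTTCACCTAC"  # as written in the mapping file
print(reverse_complement(mapping_primer))
# prints GTAGGTGAACCTGCAGAAGGATCA
```

The printed string is the form reported as found in the inflated fasta file, so the primer in the mapping file and the sequence matched in the reads are the same primer in opposite orientations.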

Details for removal of reverse primers

Original fasta filepath: Denoised_seqs.fna

Total seqs in fasta: 5064

Mapping filepath: /Users/serena/Documents/Academia/Cerco/mapping_info_Cerco/Cerco_only_corrected.txt

Truncation option: truncate_only

Mismatches allowed: 0

Total seqs written: 5064

SampleIDs not found: 0

Reverse primers not found: 5064

I suspected it might be a problem with the orientation of the primer specified in the mapping file (although it found the reverse complement in the above example). I therefore experimented with the primer orientation by reverse complementing and complementing it, saving each version as a new mapping file and re-running.

In most cases some reverse primers were then found, in that the number of sequences reported as 'reverse primer not found' was less than the total. Can I presume that the script has therefore worked?

Would you expect it to have removed the reverse primer from the majority of sequences? For example, one dataset has ~5000 sequences and 4,400 instances where the reverse primer was not found. Do you think it is worth re-running the rest of the processing steps given that only roughly 10% of the data appear to have had the reverse primer removed?
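For comparing datasets, the share of sequences in which the primer was found can be worked out from the "Total seqs in fasta" and "Reverse primers not found" figures quoted in this message. The helper and the dataset names are hypothetical, and the third entry uses the approximate figures (~5000 and 4,400) given above.

```python
def primer_found_fraction(total_seqs, primers_not_found):
    """Fraction of sequences in which the reverse primer was found."""
    if total_seqs <= 0:
        raise ValueError("total_seqs must be positive")
    return (total_seqs - primers_not_found) / total_seqs


# (total seqs, reverse primers not found) as quoted in this message.
reports = {
    "soil dataset": (505695, 18007),
    "Cerco dataset": (5064, 5064),
    "approximate example": (5000, 4400),
}

for name, (total, not_found) in reports.items():
    print(f"{name}: {primer_found_fraction(total, not_found):.1%}")
```

With these figures the primer was found in about 96.4% of the soil sequences, in none of the Cerco sequences, and in about 12% of the approximate example.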

Did I even need to change the orientation of the primer in the mapping file? Presumably it is ok to have changed this in order to find the reverse primers? It is just strange that it changed it automatically in one dataset and failed in another until I manually changed the mapping file.

Many thanks for your help

Serena

#Sequences after inflated_denoise_wrapper but before truncate_reverse_primer:

>Soils.LoddBcrop.crop8_8323 GV2FPVH01ECN49 orig_bc=CGTACTCAGA new_bc=CGTACTCAGA bc_diffs=0