Clarification on denoising questions

Serena Thomson

Sep 24, 2012, 7:32:23 AM
to qiime...@googlegroups.com
Hi there,

A while ago you helped me with some queries regarding the denoising of my sequences (I have attached part of the thread in a Word document as a reminder). I am now writing up and want to confirm a couple of points so that I fully understand the process:

When running denoise_wrapper.py, the advice was that the reverse primer could not be removed during split libraries, so all sequences would still contain the reverse primer after quality filtering, denoising and chimera checking. Why was this the case? Why could I not just specify -z during the split libraries step and then run the denoiser on this filtered data (with the reverse primer removed)? The way my earlier question was phrased implied that I wanted to remove the reverse primer during the denoising step, which wouldn't make sense.

Secondly, I wanted to confirm that, as per the last thread in the attached document, leaving the reverse primers in the sequences would be fine as long as they didn't result in artificial separation of OTUs (checked by alignments). What would the options have been had they affected the alignment and resulted in differences? Would that have meant denoising wasn't recommended?

Thank you for the clarification on these two points. 

Serena
 

Tony Walters

Sep 24, 2012, 10:14:26 AM
to qiime...@googlegroups.com
Hello Serena,

denoise_wrapper.py does not actually pull the sequence itself from the output of split_libraries.py, but just uses the fasta labels present in the file to get the subset of sequences that passed all the filters in split_libraries.py (to save computational time later).  To remove the reverse primers, after doing all of the steps for denoise_wrapper.py, you would want to take the resulting fasta file and run truncate_reverse_primer.py (http://qiime.org/scripts/truncate_reverse_primer.html).
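As a rough illustration of the label-based selection described above, here is a minimal Python sketch. It is not the code denoise_wrapper.py uses; the function names and the `raw_reads` dictionary are hypothetical, and it assumes fasta labels of the form shown later in this thread (sample ID, then the original read ID, then the barcode fields).

```python
def read_ids_from_fasta_labels(fasta_lines):
    """Collect original read IDs from split_libraries-style fasta labels.

    Assumes labels look like:
    >SampleID_n READ_ID orig_bc=... new_bc=... bc_diffs=0
    """
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if len(fields) >= 2:
                ids.add(fields[1])  # second field is the original read ID
    return ids


def subset_reads(raw_reads, keep_ids):
    """Keep only the raw records whose read ID passed the earlier filters."""
    return {rid: rec for rid, rec in raw_reads.items() if rid in keep_ids}


# Filtered fasta: only the labels matter here, the sequence line is ignored.
filtered_fasta = [
    ">Soils.LoddBcrop.crop8_8323 GV2FPVH01ECN49 orig_bc=CGTACTCAGA new_bc=CGTACTCAGA bc_diffs=0",
    "GCTACTACCGATTGAATGG",
]

# Hypothetical raw records keyed by read ID (stand-ins for the unfiltered data).
raw_reads = {"GV2FPVH01ECN49": "raw record A", "GV2FPVH01XXXXX": "raw record B"}

kept = subset_reads(raw_reads, read_ids_from_fasta_labels(filtered_fasta))
print(sorted(kept))
```

Because only the labels are consulted, any trimming applied to the sequences in the filtered fasta (such as reverse primer removal) would not carry through to the selected records.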

The larger concern is not so much the primer itself, although it does mean you're adding ~15-20 highly conserved bases to your sequences, which could alter clustering, but rather what follows the primer, such as barcodes and/or adapter sequences. Those can throw off clustering (particularly reference-based clustering) as well as taxonomic assignments with RDP (BLAST would be less affected).

Hope this helps,
Tony Walters

Serena Thomson

Sep 24, 2012, 11:48:36 AM
to qiime...@googlegroups.com
Hi Tony,

Thanks for your fast response. I don't think truncate_reverse_primer was available back in December 2011/January 2012, was it? Would you recommend going back through the data and re-running the analysis from this point to include truncate...? Would this be done immediately after denoise_wrapper or after inflate_denoise_wrapper? Is this much of an issue for short sequences? (The histogram for one of my data sets seems to tail off after 200 bp.)

Presumably split libraries dealt with the forward primer, so I don't need to worry about that one?

Many thanks

Serena

Tony Walters

Sep 24, 2012, 12:59:47 PM
to qiime...@googlegroups.com
Hello again Serena,

It was implemented fairly recently (primarily to handle cases such as denoised fasta data). It may not be an issue if your reads are short relative to the size of the amplicon construct (for example, if your reads average 200 base pairs and you'd have to read 300 to hit the reverse primer). Problems usually manifest as a lot of "Root" or "Root;Bacteria" assignments with RDP, so if you have decent assignments, it probably isn't worth redoing. The forward primer is removed by default with split_libraries.py (and denoise_wrapper.py).
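One rough way to judge whether assignments are "decent" is to tally how many stop at Root or Root;Bacteria. The sketch below assumes you have already pulled the lineage strings into a Python list; the function name and the example lineages are hypothetical, and how you extract the lineages depends on your own assignment output.

```python
def fraction_shallow(lineages, shallow=("Root", "Root;Bacteria")):
    """Return the share of lineages that stop at an uninformative rank."""
    if not lineages:
        return 0.0
    # Tolerate stray whitespace and a trailing semicolon.
    cleaned = [lin.strip().rstrip(";") for lin in lineages]
    hits = sum(1 for lin in cleaned if lin in shallow)
    return hits / len(lineages)


# Hypothetical lineage strings, for illustration only.
example = [
    "Root;Bacteria;Proteobacteria;Gammaproteobacteria",
    "Root;Bacteria",
    "Root",
    "Root;Bacteria;Firmicutes",
]

share = fraction_shallow(example)
print(f"{share:.0%} of assignments stop at Root or Root;Bacteria")
```

There is no fixed cutoff implied here; a large share of shallow assignments is simply a hint that trailing primer, barcode or adapter sequence may be interfering.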

-Tony

Serena Thomson

Sep 24, 2012, 3:44:42 PM
to qiime...@googlegroups.com
Thanks Tony, as always I do appreciate the help!

Serena

Serena Thomson

Sep 24, 2012, 3:57:59 PM
to qiime...@googlegroups.com
Sorry, one last question: where in the workflow would this step fit? After inflate_denoise and before pick otus?

Thanks

Tony Walters

Sep 24, 2012, 3:59:14 PM
to qiime...@googlegroups.com
Hello Serena,

That is the correct place to do it.

-Tony

Serena Thomson

Sep 25, 2012, 10:28:05 AM
to qiime...@googlegroups.com
Hi Tony,

Following our emails yesterday, I experimented with the command on some of my datasets. For one of them I would have expected NOT to find any reverse primers, because of the short read lengths. However, the log output (below) shows that, out of about 505,000 sequences, only about 18,000 contained no reverse primer.

The reverse primer is written as TGATCCTTCTGCAGGTTCACCTAC in the mapping file and was found as GTAGGTGAACCTGCAGAAGGATCA in the inflated fasta file.

Details for removal of reverse primers
Original fasta filepath: denoised_seqs_inflated.fna
Total seqs in fasta: 505695
Mapping filepath: /Users/serena/Documents/Academia/E_Soi/Archive/Mapping_Files/Soi_Map2.txt
Truncation option: truncate_only
Mismatches allowed: 0
Total seqs written: 505695
SampleIDs not found: 0
Reverse primers not found: 18007

Am I right in thinking that this file will need to be re-processed, given that in the majority of cases the reverse primer has been located and removed (albeit on very short read lengths)? (Also, the reverse primer is written as TGATCCTTCTGCAGGTTCACCTAC in the mapping file, but it was removed as the reverse complement of this, GTAGGTG..., without me changing it.)

Can I do much in terms of assignments with read lengths this short (see sample at bottom)? Would it be better to leave the primer in?

For most of my datasets, the truncate_reverse_primer command failed to find and remove any reverse primers (showing the following):

Details for removal of reverse primers
Original fasta filepath: Denoised_seqs.fna
Total seqs in fasta: 5064
Mapping filepath: /Users/serena/Documents/Academia/Cerco/mapping_info_Cerco/Cerco_only_corrected.txt
Truncation option: truncate_only
Mismatches allowed: 0
Total seqs written: 5064
SampleIDs not found: 0
Reverse primers not found: 5064

I suspected it might be a problem with the orientation of the primer specified in the mapping file (although it found the reverse complement in the above example). I therefore experimented with the reverse primer orientation by reverse complementing and complementing the primer, saving the results as new mapping files and re-running.

In most cases some reverse primers were then found, i.e. the number of sequences reported as 'reverse primer not found' was less than the total. Can I presume that the script has therefore worked?
Would you expect it to have removed the reverse primer from the majority of sequences? For example, one dataset has ~5000 sequences and 4,400 instances where the reverse primer was not found. Do you think it is worth re-running the rest of the processing steps given that only about 10% of the data appear to have had the reverse primer removed?

Did I even need to change the orientation of the primer in the mapping file? Presumably it is OK to have changed this in order to find the reverse primers? It is just strange that it was handled automatically in one dataset and failed in another until I manually changed the mapping file.

Many thanks for your help

Serena

#Sequences after inflated_denoise_wrapper but before truncate_reverse_primer:

>Soils.LoddBcrop.crop8_8323 GV2FPVH01ECN49 orig_bc=CGTACTCAGA new_bc=CGTACTCAGA bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCAGAAGGATCACTGAGACTACCAAGGCACACAGGGGATAGG
>Soils.ClaxMcrop.crop9_251032 GV2Q3QI01BW9UE orig_bc=CTATAGCGT new_bc=CTATAGCGT bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCAGAAGGATCACTGAGACTACCAAGGCACACAGGGGATAGG
>Soils.LoddAcrop.crop7_314059 GV2Q3QI01CH43A orig_bc=CGCGTATA new_bc=CGCGTATA bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCAGAAGGATCACTGAGACTACCAAGGCACACAGGGGATAGG
>Soils.Oward.crop.crop12_324340 GV2Q3QI01C8JOX orig_bc=TCGATAGTGA new_bc=TCGATAGTGA bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCAGAAGGATCACTGAGACTACCAAGGCACACAGGGGATAGG
>Soils.LoddAcrop.crop7_387555 GV2Q3QI01D13VF orig_bc=CGCGTATA new_bc=CGCGTATA bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCAGAAGGATCACTGAGACTACCAAGGCACACAGGGGATAGG
>Soils.EdgBaside.aside6_486595 GW10GNM03DBVAN orig_bc=CGCAGTACG new_bc=CGCAGTACG bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCAGAAGGATCACTGAGACTACCAAGGCACACAGGGGATAGG
>Soils.Yettcrop.crop17_228673 GV2Q3QI01EA8BB orig_bc=ACGACAGC new_bc=ACGACAGC bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCAGAAGGATCACTGAGACTACCAAGGCACACAGGGGATAGG
>Soils.ClaxMcrop.crop9_243228 GV2Q3QI01CPN7Y orig_bc=CTATAGCGT new_bc=CTATAGCGT bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCAGAAGGATCACTGAGACTACCAAGGCACACAGGGGATAGG


#Sequences after truncate_reverse_primers

>Soils.LoddBcrop.crop8_8323 GV2FPVH01ECN49 orig_bc=CGTACTCAGA new_bc=CGTACTCAGA bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC
>Soils.ClaxMcrop.crop9_251032 GV2Q3QI01BW9UE orig_bc=CTATAGCGT new_bc=CTATAGCGT bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC
>Soils.LoddAcrop.crop7_314059 GV2Q3QI01CH43A orig_bc=CGCGTATA new_bc=CGCGTATA bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC
>Soils.Oward.crop.crop12_324340 GV2Q3QI01C8JOX orig_bc=TCGATAGTGA new_bc=TCGATAGTGA bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC
>Soils.LoddAcrop.crop7_387555 GV2Q3QI01D13VF orig_bc=CGCGTATA new_bc=CGCGTATA bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC
>Soils.EdgBaside.aside6_486595 GW10GNM03DBVAN orig_bc=CGCAGTACG new_bc=CGCAGTACG bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC
>Soils.Yettcrop.crop17_228673 GV2Q3QI01EA8BB orig_bc=ACGACAGC new_bc=ACGACAGC bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC
>Soils.ClaxMcrop.crop9_243228 GV2Q3QI01CPN7Y orig_bc=CTATAGCGT new_bc=CTATAGCGT bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC


Tony Walters

Sep 25, 2012, 10:51:52 AM
to qiime...@googlegroups.com
Hello Serena,

The primers should always be written in 5'->3' orientation in the mapping file. So for the reverse primer, if you have GTAGGTGAACCTGCAGAAGGATCA in your reads, you would want its reverse complement (TGATCCTTCTGCAGGTTCACCTAC) in the mapping file.
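To make the orientation concrete, here is a minimal Python sketch that reverse complements the mapping-file primer and cuts a read where that sequence starts. It is only an illustration, not the code in truncate_reverse_primer.py: the function names are made up, matching is exact (no mismatches, no degenerate bases), and the read is a short excerpt from the tail of one of the sequences posted above.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a plain A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def truncate_at_reverse_primer(read, mapping_primer):
    """Drop the reverse primer and everything after it.

    mapping_primer is the primer as written 5'->3' in the mapping file;
    its reverse complement is what appears in the read.
    Returns (sequence, found).
    """
    target = reverse_complement(mapping_primer)
    pos = read.find(target)
    if pos == -1:
        return read, False  # primer not found, read left untouched
    return read[:pos], True


mapping_primer = "TGATCCTTCTGCAGGTTCACCTAC"

# Excerpt from the end of a read posted earlier in this thread.
read_tail = "GTCGTAACAAGGTTTCC" + "GTAGGTGAACCTGCAGAAGGATCA" + "CTGAGACTACC"

print(reverse_complement(mapping_primer))
trimmed, found = truncate_at_reverse_primer(read_tail, mapping_primer)
print(trimmed, found)
```

With the primer in this orientation the excerpt is cut back to the bases ending in GGTTTCC, which is consistent with the truncated sequences shown earlier in the thread.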

As for taxonomic assignments, remember that the primer sequence is highly conserved, so you're very unlikely to add useful information by retaining it. If all of your sequences end with AAATACCAGATT, for instance, there isn't anything in that stretch to discriminate the sequences by for clustering. I suppose with degenerate primers you might get some bases that differ, but there is always the danger of primer mismatch during PCR followed by what would effectively be primer-directed mutagenesis in the resulting amplicon, so I wouldn't rely on it.
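A small worked example of why a shared primer adds nothing to discriminate by and can still shift clustering. This is a sketch with made-up sequences and a simple ungapped identity calculation, not what any clustering tool actually computes; the 110-base regions and the 97% threshold mentioned afterwards are assumptions for illustration.

```python
def percent_identity(a, b):
    """Ungapped percent identity between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


# Hypothetical 110-base variable regions that differ at 4 positions.
region_a = "A" * 110
region_b = "C" * 4 + "A" * 106

# Reverse primer as it appears in the reads (identical in both sequences).
primer_in_read = "GTAGGTGAACCTGCAGAAGGATCA"

without_primer = percent_identity(region_a, region_b)
with_primer = percent_identity(region_a + primer_in_read,
                               region_b + primer_in_read)

print(f"without primer: {without_primer:.2f}%")
print(f"with primer:    {with_primer:.2f}%")
```

The four real differences are unchanged, but the 24 identical primer bases dilute them: identity goes from roughly 96.4% to roughly 97.0%, which in this toy case is enough to move the pair across a 97% similarity threshold.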

-Tony