just not cutting it - help with extract_barcodes.py for dual-barcoded reads

96 views
Skip to first unread message

Devon O'Rourke

unread,
Jun 29, 2016, 11:50:16 AM6/29/16
to Qiime 1 Forum
Hi folks,

Looking for reasons why past suggestions from Tony Walters and Sophie have failed to work(see here and here, for example).

An explanation of the data:
  • I have amplicon sequence data from a paired-end Illumina HiSeq run; I've already quality trimmed and joined paired end reads. 
  • These amplicons contain barcodes at both ends as well as primers internal to those barcodes.
  • While there are multiple libraries, let's just assume that I have one for the moment.
What I have as a result is a fastq file with reads of this structure:

forwardbarcodeFORWARDPRIMERsequencesequencesequenceREVERSEPRIMERreversebarcode

  • Forward and reverse barcodes are both 8bp. 
    • Note that the 8bp sequence for a given forwardbarcode is identical to the reversebarcode (though there are 96 unique barcodes used across 96 independent samples).
  • Forward primer is 24 bp.
  • Reverse primer is 30 bp.

However, it's clear that some reads are in the reverse orientation, such that this is the following structure (...technically it's the complement of what's listed below...):
reversebarcodeREVERSEPRIMERsequencesequencesequenceFORWARDPRIMERforwardbarcode

I thought I could try using the following qiime script to solve the issue:

extract_barcodes.py --input_type barcode_paired_stitched \
-f /pathto/testforqiime.fastq \
--bc1_len 8 \
--bc2_len 8 \
-o parsed_barcodes \
--attempt_read_reorientation \
--mapping_fp /pathto/mapping.txt \
--rev_comp_bc1 \
--rev_comp_bc2

The program runs just swell. I get the following files generated:
barcodes.fastq  barcodes_not_oriented.fastq  reads1_not_oriented.fastq  reads2_not_oriented.fastq  reads.fastq

The resulting reads.fastq file contains reads with the following sequence structure:
FORWARDPRIMERsequencesequencesequenceREVERSEPRIMERreversebarcode
As you can see, just the forward barcode is removed. Recall that the original sequence was:
forwardbarcodeFORWARDPRIMERsequencesequencesequenceREVERSEPRIMERreversebarcode 
Primers are both still there, and the reverse barcode is still present. Interestingly, the program does seem to orient the sequences in the proper direction despite the input fastq file having sequences in both directions.

In addition, I get lots of sequences in the newly created barcodes.fastq file. These appear to be in the expected orientation regardless of which mapping file I feed it.

So what was that mapping file I used? Actually, I've fed through mapping files of four different flavors, altering the sequence in the "BarcodeSequence" field. Oddly, it doesn't matter what mapping file I give it as the result is always the same - I get the forward barcode removed, while both primers and the reverse barcode are still there. 

An example of this mapping file is below. Note that only the forward barcode sequence was included initially:

#SampleID   BarcodeSequence LinkerPrimerSequence    ReversePrimer   Description
s1  AGACAGTG    AGATATTGGAACWTTATATTTTATTTTTGG  WACTAATCAATTWCCAAATCCTCC    553
s13 AGACGAGT    AGATATTGGAACWTTATATTTTATTTTTGG  WACTAATCAATTWCCAAATCCTCC    264
..........93 more samples..........
s96 TCTCGAGA    AGATATTGGAACWTTATATTTTATTTTTGG  WACTAATCAATTWCCAAATCCTCC    393

It made sense to me that by feeding just this mapping file I would only remove the forward primer - the "BarcodeSequence" field was filled with the forward primer sequence only.

I then changed the mapping file to contain altered fields. Here were the other flavors:
- alternate1: added a concatenated forward and reverse barcode sequence. Both the forward and reverse barcode sequences are in their  5' --> 3' orientation... 

#SampleID   BarcodeSequence LinkerPrimerSequence    ReversePrimer   Description
s1  AGACAGTGAGACAGTG    AGATATTGGAACWTTATATTTTATTTTTGG  WACTAATCAATTWCCAAATCCTCC    553
s13 AGACGAGTAGACGAGT    AGATATTGGAACWTTATATTTTATTTTTGG  WACTAATCAATTWCCAAATCCTCC    264
..........93 more samples..........
s96 TCTCGAGATCTCGAGA    AGATATTGGAACWTTATATTTTATTTTTGG  WACTAATCAATTWCCAAATCCTCC    393


- alternate2: added a concatenated forward and reverse barcode sequence, however, forward barcode sequence was in 5 --> 3' orientation, while reverse barcode sequence was in reverse complement orientation 
#SampleID   BarcodeSequence LinkerPrimerSequence    ReversePrimer   Description
s1  AGACAGTGCACTGTCT    AGATATTGGAACWTTATATTTTATTTTTGG  WACTAATCAATTWCCAAATCCTCC    553
s13 AGACGAGTACTCGTCT    AGATATTGGAACWTTATATTTTATTTTTGG  WACTAATCAATTWCCAAATCCTCC    264
..........93 more samples..........
s96 TCTCGAGATCTCGAGA    AGATATTGGAACWTTATATTTTATTTTTGG  WACTAATCAATTWCCAAATCCTCC    393


To go back to the beginning, the goal here is to run extract_barcodes.py to extract the barcodes, but also to trim the primers, right? Am I mistaken in that the primers shouldn't be removed? Is this program actually acting normally and I'm making a big fuss about nothing?

Thanks for any insights you can offer.
Reply all
Reply to author
Forward
0 new messages