Hi folks,
Looking for reasons why past suggestions from Tony Walters and Sophie have failed to work(see
here and
here, for example).
An explanation of the data:
- I have amplicon sequence data from a paired-end Illumina HiSeq run; I've already quality trimmed and joined paired end reads.
- These amplicons contain barcodes at both ends as well as primers internal to those barcodes.
- While there are multiple libraries, let's just assume that I have one for the moment.
What I have as a result is a fastq file with reads of this structure:
forwardbarcodeFORWARDPRIMERsequencesequencesequenceREVERSEPRIMERreversebarcode
- Forward and reverse barcodes are both 8bp.
- Note that the 8bp sequence for a given forwardbarcode is identical to the reversebarcode (though there are 96 unique barcodes used across 96 independent samples).
- Forward primer is 24 bp.
- Reverse primer is 30 bp.
However, it's clear that some reads are in the reverse orientation, such that this is the following structure (...technically it's the complement of what's listed below...):
reversebarcodeREVERSEPRIMERsequencesequencesequenceFORWARDPRIMERforwardbarcode
I thought I could try using the following qiime script to solve the issue:
extract_barcodes.py --input_type barcode_paired_stitched \
-f /pathto/testforqiime.fastq \
--bc1_len 8 \
--bc2_len 8 \
-o parsed_barcodes \
--attempt_read_reorientation \
--mapping_fp /pathto/mapping.txt \
--rev_comp_bc1 \
--rev_comp_bc2
The program runs just swell. I get the following files generated:
barcodes.fastq barcodes_not_oriented.fastq reads1_not_oriented.fastq reads2_not_oriented.fastq reads.fastq
The resulting reads.fastq file contains reads with the following sequence structure:
FORWARDPRIMERsequencesequencesequenceREVERSEPRIMERreversebarcode
As you can see, just the forward barcode is removed. Recall that the original sequence was:
forwardbarcodeFORWARDPRIMERsequencesequencesequenceREVERSEPRIMERreversebarcode
Primers are both still there, and the reverse barcode is still present. Interestingly, the program does seem to orient the sequences in the proper direction despite the input fastq file having sequences in both directions.
In addition, I get lots of sequences in the newly created barcodes.fastq file. These appear to be in the expected orientation regardless of which mapping file I feed it.
So what was that mapping file I used? Actually, I've fed through mapping files of four different flavors, altering the sequence in the "BarcodeSequence" field. Oddly, it doesn't matter what mapping file I give it as the result is always the same - I get the forward barcode removed, while both primers and the reverse barcode are still there.
An example of this mapping file is below. Note that only the forward barcode sequence was included initially:
#SampleID BarcodeSequence LinkerPrimerSequence ReversePrimer Description
s1 AGACAGTG AGATATTGGAACWTTATATTTTATTTTTGG WACTAATCAATTWCCAAATCCTCC 553
s13 AGACGAGT AGATATTGGAACWTTATATTTTATTTTTGG WACTAATCAATTWCCAAATCCTCC 264
..........93 more samples..........
s96 TCTCGAGA AGATATTGGAACWTTATATTTTATTTTTGG WACTAATCAATTWCCAAATCCTCC 393
It made sense to me that by feeding just this mapping file I would only remove the forward primer - the "BarcodeSequence" field was filled with the forward primer sequence only.
I then changed the mapping file to contain altered fields. Here were the other flavors:
- alternate1: added a concatenated forward and reverse barcode sequence. Both the forward and reverse barcode sequences are in their 5' --> 3' orientation...
#SampleID BarcodeSequence LinkerPrimerSequence ReversePrimer Description
s1 AGACAGTGAGACAGTG AGATATTGGAACWTTATATTTTATTTTTGG WACTAATCAATTWCCAAATCCTCC 553
s13 AGACGAGTAGACGAGT AGATATTGGAACWTTATATTTTATTTTTGG WACTAATCAATTWCCAAATCCTCC 264
..........93 more samples..........
s96 TCTCGAGATCTCGAGA AGATATTGGAACWTTATATTTTATTTTTGG WACTAATCAATTWCCAAATCCTCC 393
- alternate2: added a concatenated forward and reverse barcode sequence, however, forward barcode sequence was in 5 --> 3' orientation, while reverse barcode sequence was in reverse complement orientation
#SampleID BarcodeSequence LinkerPrimerSequence ReversePrimer Description
s1 AGACAGTGCACTGTCT AGATATTGGAACWTTATATTTTATTTTTGG WACTAATCAATTWCCAAATCCTCC 553
s13 AGACGAGTACTCGTCT AGATATTGGAACWTTATATTTTATTTTTGG WACTAATCAATTWCCAAATCCTCC 264
..........93 more samples..........
s96 TCTCGAGATCTCGAGA AGATATTGGAACWTTATATTTTATTTTTGG WACTAATCAATTWCCAAATCCTCC 393
To go back to the beginning, the goal here is to run extract_barcodes.py to extract the barcodes, but also to trim the primers, right? Am I mistaken in that the primers shouldn't be removed? Is this program actually acting normally and I'm making a big fuss about nothing?
Thanks for any insights you can offer.