Hi Stacks users,
I apologize for the length of this post, but refresh your coffee and read on.
We are using STACKs to analyze ddRAD input. Background is that 96 samples have been prepared using EcoR1 and Msp1. Samples are uniquely tagged with a combinatorial approach using 48 forward adapters (5-mers) and 2 reverse (6-mers). The MiSeq does an initial crude demultiplexing, producing a R1 and R2 file for sample sets 1-48 and 49-96 using the reverse adapter barcode as a guide. During this process, the reverse barcode is removed from the sequence meaning that the first bases in the R2 sequence are the residual Msp1 recognition site (more on this later).
We then perform a clipping and trimming step (using FastX) that trims all reads the same length (in this case 295bp) and excludes reads shorter than this.
Now we are ready for process radtags but since R2 reads lacking their original barcode we reintroduce the sequence (6bp) into each R2 read along with 6 corresponding artificial quality scores (G). Samples 1-48 receive one barcode, while 49-96 receive another. This is done with this command:
sed -i '2~4s/^/GCCAAT/' sample1to48_S1_L001_R2_001.fastq
sed -i '0~4s/^/GGGGGG/' sample1to48_S1_L001_R2_001.fastq
At this point we have data that looks like this:
Read in Forward Read file “sample1to48_S1_L001_trimmed_R1_001.fastq”
@M02210:18:000000000-A5BL2:1:1101:19335:2058 1:N:0:1
GGTTGAATTCTCTCACAAGTCACATTTCCTCTTTTTCACTGTGGTGAGTTTCAGTATATCTTTACTTGAAGGAAAGAACTGTGTAGTTCAAGGTAGTTTTTTTTATGCTGTTTAAAAAGCTCAGCTGATTCATTTTGTGTGGTTATAAAATCATTGTGATTTTATATATTTGGCTATAAAATGATTTTGTGACAGACGGAGAGCGAGGAAGGGCTACATTCCCTTTCCTCCACTTGTTTATGTGCCCATATGTCTCCCAGATGAATTTCTGGAGATAAAAAGTTCGAGTCTTTCC
+
9CCCC9ECFFEFGGFGCFGGGGGGGGGGGGGFGFGGGCGGAFF8FD9CFEEF99FCFAECFGGFGGC6,<C8E88,;6FE@EFFF9FFF,C98FCCGC<E@CCFFFCFE,CFFFC9<9BE<C,6BFC<FGE9?FEEEFCFE+,B<FABA,FEDFFAEE<FGFGGGCE<FGFECFGGEEAA9FFGGGGGFG,,A,,,,++++>:+8+88+@,++@DDFCGGGAFGFGE;CCG6DFG7;DFFFAGAEGGGG6E?FGGGEC4C,9?2:C:CGFG77D4DBFF8?CFFGF)=FFFFFFF
Matching read in Reverse Read file “sample1to48_S1_L001_trimmed_R2_001.fastq”
@M02210:18:000000000-A5BL2:1:1101:19335:2058 2:N:0:1
GCCAATCGGACCACAGTTCCCTCCCACTCAATCTAACTCCGTTTTACATCCACGGAAAGACTCGACCTTTTTATCTCCAGAAATTCATCTGGGAGACATATGGGCACATAAACAAGTCGAGGAAAGGGAATGTAGCCCTTCCTCTCTCTCCTTCTGTCACAAAATCATTTTATAGCCAAATATATAAAATCACAATGATTTTATATCCACACAAAATGAATCAGCTGAGCTTTTTAAACAGCATAAAAAAAACTACCTTGAACTACACAGTTCTTTCCTTCAAGTAAAGATATACTGAAAC
+
GGGGGGA--6CCEDFGGGGFGGFGGGGGGG8CFEC@EFGGGGGGGGGCGGGFD;CFGGGGGFGDE+@F9CCFAFCDFA,EC,<FFGFGFGF,,<F@FCGGG,@,C@FGE?9FFCFCE,B:7++,,<?F8<E@FEFEDEGGGGEFEGFG@BFGGGGFFGFFGFGGGGGGGGGGGG99=E?>DADCFGFACFFDGG?F8DEAAFGGFGGGG8EGGDG?D@2C7CFG?7=?9?BFFFFF569FA*98@9@6F2?>AEFAAEE*56)1.>:=@677<CEEEFFFEFEC7).)/))).)..)))47
The barcodes are highlighted in RED the residual restriction sites in GREEN. .
We then run process radtags in a non-standard way, but rather within a slightly more complex system of scripts that is intended to make routine STACKS analysis easier and less prone to command-line errors and inconsistencies. One “main script” defines variables ($folders etc) for a particular run and controls the execution of “sub-scripts” which include the trimming and clipping, process radtags, denovo, database preparation etc. The process radtags sub-script follows:
#!/bin/bash
#start process_radtags
for samples in ${GROUP_SAMPLES1}_S1_L001 ${GROUP_SAMPLES2}_S2_L001;
do
$MKDIR -p $PROCESS_RADTAGS_OUT/$samples;
process_radtags -P -1 $TRIM_WORK_DIR/${samples}_trimmed_R1_001.fastq -2 $TRIM_WORK_DIR/${samples}_trimmed_R2_001.fastq -b $BARCODE_FILE -o $PROCESS_RADTAGS_OUT/$samples -q -E phred33 -r --inline_inline --renz_1 ecoRI --renz_2 mspI -i fastq
done
This script then works to demultiplex the reads according to the barcodes file (tab delimited, 2 columns). The result should be (as I understand) to find the F and R reads from a single cluster in files called sample_GGTTG-GCCAAT.1.fq and sample_GGTTG-GCCAAT.2.fq respectively. Files with these names (and the names of every other possible 96-barcode combinations) are indeed generated in the output folder but, unfortunately, I do not find the reads as I expect.
By running grep using the 5 and 4 digit numbers are the end of the header (19335_2058) on all forward files (*.1.fq) I identify the above example read in the correctly demultiplexed file (ie GGTTG-GCCAAT.1.fq; grep output below) :
Sample_GGTTG-GCCAAT.1.fq:@1_1101_19335_2058_1
AATTCTCTCACAAGTCACATTTCCTCTTTTTCACTGTGGTGAGTTTCAGTATATCTTTACTTGAAGGAAAGAACTGTGTAGTTCAAGGTAGTTTTTTTTATGCTGTTTAAAAAGCTCAGCTGATTCATTTTGTGTGGTTATAAAATCATTGTGATTTTATATATTTGGCTATAAAATGATTTTGTGACAGACGGAGAGCGAGGAAGGGCTACATTCCCTTTCCTCCACTTGTTTATGTGCCCATATGTCTCCCAGATGAATTTCTGGAGATAAAAAGTTCGAGTCTTTCC
+
9ECFFEFGGFGCFGGGGGGGGGGGGGFGFGGGCGGAFF8FD9CFEEF99FCFAECFGGFGGC6,<C8E88,;6FE@EFFF9FFF,C98FCCGC<E@CCFFFCFE,CFFFC9<9BE<C,6BFC<FGE9?FEEEFCFE+,B<FABA,FEDFFAEE<FGFGGGCE<FGFECFGGEEAA9FFGGGGGFG,,A,,,,++++>:+8+88+@,++@DDFCGGGAFGFGE;CCG6DFG7;DFFFAGAEGGGG6E?FGGGEC4C,9?2:C:CGFG77D4DBFF8?CFFGF)=FFFFFFF
But searching the reverse read files (*.2.fq) in the process radtags output, I find the header term appearing in a demultiplexed file called TACCG-GCCAAT.2.fq which is wrong. Remember we know the F+R read pair was tagged with GGTTG-GCCAAT.
sample_TACCG-GCCAAT.2.fq:@1_1101_19335_2058_2
CGGACCACAGTTCCCTCCCACTCAATCTAACTCCGTTTTACATCCACGGAAAGACTCGACCTTTTTATCTCCAGAAATTCATCTGGGAGACATATGGGCACATAAACAAGTCGAGGAAAGGGAATGTAGCCCTTCCTCTCTCTCCTTCTGTCACAAAATCATTTTATAGCCAAATATATAAAATCACAATGATTTTATATCCACACAAAATGAATCAGCTGAGCTTTTTAAACAGCATAAAAAAAACTACCTTGAACTACACAGTTCTTTCCTTCAAGTAAAGATATACTGAAAC
+
6CCEDFGGGGFGGFGGGGGGG8CFEC@EFGGGGGGGGGCGGGFD;CFGGGGGFGDE+@F9CCFAFCDFA,EC,<FFGFGFGF,,<F@FCGGG,@,C@FGE?9FFCFCE,B:7++,,<?F8<E@FEFEDEGGGGEFEGFG@BFGGGGFFGFFGFGGGGGGGGGGGG99=E?>DADCFGFACFFDGG?F8DEAAFGGFGGGG8EGGDG?D@2C7CFG?7=?9?BFFFFF569FA*98@9@6F2?>AEFAAEE*56)1.>:=@677<CEEEFFFEFEC7).)/))).)..)))47
This is just one example, but the problem is very consistent. For any read pair (F+R) the F read is correctly demultiplexed, while the reverse read is not. I have struggled to understand this and would be very grateful for any words of advice.
Has anyone else experienced (or checked) for read-pair consistency in demultiplexed, double-digested, PE data?
Thanks for your time and attention.
Matthew
AAGGA DUK_14_001
ACACA DUK_14_003
AGCTA DUK_14_002
ATACG DUK_14_008
CAACC DUK_14_007
CTGAT DUK_14_014
CTTGG DUK_14_011
GAGTC DUK_14_389
GCCGT DUK_14_388
GGCCA DUK_14_048
GTAGT AMI_14_376
TCTGC AMI_14_083
GTCGA AMI_14_387
TAGTA AMI_14_689
TCAGT AMI_14_587
TCCGG AMI_14_691
And input looks like this:
$ zcat Norah_pool10_001_ATCACG_R1.fastq.gz | head
@HWI-D00536:268:C9WCBANXX:7:1101:1210:1966 1:N:0:NTCACG
NTCGACATGCAAACCGGCGGTCAATGGCGGATTTCTGGGCAATAGCAGTCTTCTTCACATCCTCCCCGAGAATAGA
+
#<BBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFBFFFFFFF
@HWI-D00536:268:C9WCBANXX:7:1101:1244:1973 1:N:0:ATCACG
NGCTACATGTGTAGCCTTTTTGTAATGTAAGGCTCCTTCAGTTCTTTGGATCATCTCCTAATGAAACCGAACGCGG
+
#<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFBFFFFFFFFF
@HWI-D00536:268:C9WCBANXX:7:1101:1214:1987 1:N:0:NTCACG
NCTGCCATGCCAATGCTAAGTCGTAGATATTGGTCTGATCCGTTAGAGTCTTAATGTTTAAGTTAAAAATTAGATC
... the output doesn't appear to have matching pairs, for example:
$ zcat DUK_14_002.1.fq.gz | head
@1_1101_2205_1997/1
CATGAAGATCAAAAACCACATCTGCTTTGCGAATAAGAAGGAGGATCGTGAAAATATATATGTTTATGCCA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@1_1101_10548_1977/1
CATGTAAGTGCATATAAGTGTATATGTTAGAAGAAAAGCAAGTGTAGCATATACAACTTATTTCGTTTTCA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFB
@1_1101_20486_1985/1
CATGCCTCTATTATCTATTTGGTTTGTTGTACCATATCGCGAGGAGAAAATAGCGAAACAACAATCACTTT
$ zcat DUK_14_002.2.fq.gz | head
@1_1101_2552_1966/2
AATTAGAAAAGCTCGGTGCACAGGTCGAGAGCCCATTCTGGGGAATCAGTAATGCGAAATAAATAGGTAACGTGCC
+
DCDDDIIHIHHH<CEDECEHHHHCEE?/EGE?FHHHHC@GHI<CHEHCHC@CF1FE0C<DHCGEFGHHHC?GC11<
@1_1101_9787_1989/2
AATTTTATTAGATTTTTAGTTTAAAATGAAATAGTGGCCCGGCATTCGTCATTACATACCATCACCATCACGTTAC
+
CDDDDIIEEEHIHIIIIIIHIIIIHIIHFHHHHCHE?G@CFHHHCH@@HHGHGH@GHCE1<F@1DHHIIEH@1DG@
@1_1101_18676_1987/2
AATTCGCCGACCGTTCAGATTGAGCAAACAGCATTTCGTCACATAAATGTACTGGTGTTCAAGTGCGGGTCACCTT
I realized this when trying to modify the output (removed "/1 and /2" to make the headers match exactly) for BWA. I reran the process_radtags step, and this problem output consistently occurs in all of the runs.
It would be difficult for me to change the version of Stacks that I am using because I do not have enough permissions on the system I use to re-install Stacks myself. Let me know if there is a known remedy for this.
Thank you in advance for your time, it is much appreciated!
Norah Saarman