Demultiplexing ddRAD data with process-radtags doesn't correctly assign F and R reads.

758 views

Skip to first unread message

Matthew

unread,

Jul 18, 2014, 2:06:39 AM7/18/14

to stacks...@googlegroups.com

Hi Stacks users,

I apologize for the length of this post, but refresh your coffee and read on.

We are using STACKs to analyze ddRAD input. Background is that 96 samples have been prepared using EcoR1 and Msp1. Samples are uniquely tagged with a combinatorial approach using 48 forward adapters (5-mers) and 2 reverse (6-mers). The MiSeq does an initial crude demultiplexing, producing a R1 and R2 file for sample sets 1-48 and 49-96 using the reverse adapter barcode as a guide. During this process, the reverse barcode is removed from the sequence meaning that the first bases in the R2 sequence are the residual Msp1 recognition site (more on this later).

We then perform a clipping and trimming step (using FastX) that trims all reads the same length (in this case 295bp) and excludes reads shorter than this.

Now we are ready for process radtags but since R2 reads lacking their original barcode we reintroduce the sequence (6bp) into each R2 read along with 6 corresponding artificial quality scores (G). Samples 1-48 receive one barcode, while 49-96 receive another. This is done with this command:

sed -i '2~4s/^/GCCAAT/' sample1to48_S1_L001_R2_001.fastq

sed -i '0~4s/^/GGGGGG/' sample1to48_S1_L001_R2_001.fastq

At this point we have data that looks like this:

Read in Forward Read file “sample1to48_S1_L001_trimmed_R1_001.fastq”

@M02210:18:000000000-A5BL2:1:1101:19335:2058 1:N:0:1

GGTTGAATTCTCTCACAAGTCACATTTCCTCTTTTTCACTGTGGTGAGTTTCAGTATATCTTTACTTGAAGGAAAGAACTGTGTAGTTCAAGGTAGTTTTTTTTATGCTGTTTAAAAAGCTCAGCTGATTCATTTTGTGTGGTTATAAAATCATTGTGATTTTATATATTTGGCTATAAAATGATTTTGTGACAGACGGAGAGCGAGGAAGGGCTACATTCCCTTTCCTCCACTTGTTTATGTGCCCATATGTCTCCCAGATGAATTTCTGGAGATAAAAAGTTCGAGTCTTTCC

9CCCC9ECFFEFGGFGCFGGGGGGGGGGGGGFGFGGGCGGAFF8FD9CFEEF99FCFAECFGGFGGC6,<C8E88,;6FE@EFFF9FFF,C98FCCGC<E@CCFFFCFE,CFFFC9<9BE<C,6BFC<FGE9?FEEEFCFE+,B<FABA,FEDFFAEE<FGFGGGCE<FGFECFGGEEAA9FFGGGGGFG,,A,,,,++++>:+8+88+@,++@DDFCGGGAFGFGE;CCG6DFG7;DFFFAGAEGGGG6E?FGGGEC4C,9?2:C:CGFG77D4DBFF8?CFFGF)=FFFFFFF

Matching read in Reverse Read file “sample1to48_S1_L001_trimmed_R2_001.fastq”

@M02210:18:000000000-A5BL2:1:1101:19335:2058 2:N:0:1

GCCAATCGGACCACAGTTCCCTCCCACTCAATCTAACTCCGTTTTACATCCACGGAAAGACTCGACCTTTTTATCTCCAGAAATTCATCTGGGAGACATATGGGCACATAAACAAGTCGAGGAAAGGGAATGTAGCCCTTCCTCTCTCTCCTTCTGTCACAAAATCATTTTATAGCCAAATATATAAAATCACAATGATTTTATATCCACACAAAATGAATCAGCTGAGCTTTTTAAACAGCATAAAAAAAACTACCTTGAACTACACAGTTCTTTCCTTCAAGTAAAGATATACTGAAAC

GGGGGGA--6CCEDFGGGGFGGFGGGGGGG8CFEC@EFGGGGGGGGGCGGGFD;CFGGGGGFGDE+@F9CCFAFCDFA,EC,<FFGFGFGF,,<F@FCGGG,@,C@FGE?9FFCFCE,B:7++,,<?F8<E@FEFEDEGGGGEFEGFG@BFGGGGFFGFFGFGGGGGGGGGGGG99=E?>DADCFGFACFFDGG?F8DEAAFGGFGGGG8EGGDG?D@2C7CFG?7=?9?BFFFFF569FA*98@9@6F2?>AEFAAEE*56)1.>:=@677<CEEEFFFEFEC7).)/))).)..)))47

The barcodes are highlighted in RED the residual restriction sites in GREEN. .

We then run process radtags in a non-standard way, but rather within a slightly more complex system of scripts that is intended to make routine STACKS analysis easier and less prone to command-line errors and inconsistencies. One “main script” defines variables ($folders etc) for a particular run and controls the execution of “sub-scripts” which include the trimming and clipping, process radtags, denovo, database preparation etc. The process radtags sub-script follows:

#!/bin/bash

#start process_radtags

for samples in ${GROUP_SAMPLES1}_S1_L001 ${GROUP_SAMPLES2}_S2_L001;

$MKDIR -p $PROCESS_RADTAGS_OUT/$samples;

process_radtags -P -1 $TRIM_WORK_DIR/${samples}_trimmed_R1_001.fastq -2 $TRIM_WORK_DIR/${samples}_trimmed_R2_001.fastq -b $BARCODE_FILE -o $PROCESS_RADTAGS_OUT/$samples -q -E phred33 -r --inline_inline --renz_1 ecoRI --renz_2 mspI -i fastq

done

This script then works to demultiplex the reads according to the barcodes file (tab delimited, 2 columns). The result should be (as I understand) to find the F and R reads from a single cluster in files called sample_GGTTG-GCCAAT.1.fq and sample_GGTTG-GCCAAT.2.fq respectively. Files with these names (and the names of every other possible 96-barcode combinations) are indeed generated in the output folder but, unfortunately, I do not find the reads as I expect.

By running grep using the 5 and 4 digit numbers are the end of the header (19335_2058) on all forward files (*.1.fq) I identify the above example read in the correctly demultiplexed file (ie GGTTG-GCCAAT.1.fq; grep output below) :

Sample_GGTTG-GCCAAT.1.fq:@1_1101_19335_2058_1

AATTCTCTCACAAGTCACATTTCCTCTTTTTCACTGTGGTGAGTTTCAGTATATCTTTACTTGAAGGAAAGAACTGTGTAGTTCAAGGTAGTTTTTTTTATGCTGTTTAAAAAGCTCAGCTGATTCATTTTGTGTGGTTATAAAATCATTGTGATTTTATATATTTGGCTATAAAATGATTTTGTGACAGACGGAGAGCGAGGAAGGGCTACATTCCCTTTCCTCCACTTGTTTATGTGCCCATATGTCTCCCAGATGAATTTCTGGAGATAAAAAGTTCGAGTCTTTCC

9ECFFEFGGFGCFGGGGGGGGGGGGGFGFGGGCGGAFF8FD9CFEEF99FCFAECFGGFGGC6,<C8E88,;6FE@EFFF9FFF,C98FCCGC<E@CCFFFCFE,CFFFC9<9BE<C,6BFC<FGE9?FEEEFCFE+,B<FABA,FEDFFAEE<FGFGGGCE<FGFECFGGEEAA9FFGGGGGFG,,A,,,,++++>:+8+88+@,++@DDFCGGGAFGFGE;CCG6DFG7;DFFFAGAEGGGG6E?FGGGEC4C,9?2:C:CGFG77D4DBFF8?CFFGF)=FFFFFFF

But searching the reverse read files (*.2.fq) in the process radtags output, I find the header term appearing in a demultiplexed file called TACCG-GCCAAT.2.fq which is wrong. Remember we know the F+R read pair was tagged with GGTTG-GCCAAT.

sample_TACCG-GCCAAT.2.fq:@1_1101_19335_2058_2

CGGACCACAGTTCCCTCCCACTCAATCTAACTCCGTTTTACATCCACGGAAAGACTCGACCTTTTTATCTCCAGAAATTCATCTGGGAGACATATGGGCACATAAACAAGTCGAGGAAAGGGAATGTAGCCCTTCCTCTCTCTCCTTCTGTCACAAAATCATTTTATAGCCAAATATATAAAATCACAATGATTTTATATCCACACAAAATGAATCAGCTGAGCTTTTTAAACAGCATAAAAAAAACTACCTTGAACTACACAGTTCTTTCCTTCAAGTAAAGATATACTGAAAC

6CCEDFGGGGFGGFGGGGGGG8CFEC@EFGGGGGGGGGCGGGFD;CFGGGGGFGDE+@F9CCFAFCDFA,EC,<FFGFGFGF,,<F@FCGGG,@,C@FGE?9FFCFCE,B:7++,,<?F8<E@FEFEDEGGGGEFEGFG@BFGGGGFFGFFGFGGGGGGGGGGGG99=E?>DADCFGFACFFDGG?F8DEAAFGGFGGGG8EGGDG?D@2C7CFG?7=?9?BFFFFF569FA*98@9@6F2?>AEFAAEE*56)1.>:=@677<CEEEFFFEFEC7).)/))).)..)))47

This is just one example, but the problem is very consistent. For any read pair (F+R) the F read is correctly demultiplexed, while the reverse read is not. I have struggled to understand this and would be very grateful for any words of advice.

Has anyone else experienced (or checked) for read-pair consistency in demultiplexed, double-digested, PE data?

Thanks for your time and attention.

Matthew

Julian Catchen

unread,

Jul 22, 2014, 3:16:21 PM7/22/14

to stacks...@googlegroups.com, mpke...@gmail.com

Hi Matthew,

We'll need to see the contents of your barcodes file, if you can send it
along. A few comments though:

1) Why are you using fastx to trim/clip your reads, as process_radtags
can handle this for you?

2) Why are you going to all the trouble to reinsert an inline barcode on
the second read? Was this barcode originally inline, or was it
originally an Illumina index barcode? Anyway, you should be able to run
process_radtags once on each half of the data set (1-48 and 49-96)
simply giving it the remaining, single-end barcodes.

Best,

julian

Matthew wrote:
> Hi Stacks users,
>
> I apologize for the length of this post, but refresh your coffee and
> read on.
>
> We are using STACKs to analyze ddRAD input. Background is that 96
> samples have been prepared using EcoR1 and Msp1. Samples are uniquely
> tagged with a combinatorial approach using 48 forward adapters (5-mers)
> and 2 reverse (6-mers). The MiSeq does an initial crude demultiplexing,
> producing a R1 and R2 file for sample sets 1-48 and 49-96 using the
> reverse adapter barcode as a guide. During this process, the reverse
> barcode is removed from the sequence meaning that the first bases in the
> R2 sequence are the residual Msp1 recognition site (more on this later).
>
> We then perform a clipping and trimming step (using FastX) that trims
> all reads the same length (in this case 295bp) and excludes reads
> shorter than this.
>
> Now we are ready for process radtags but since R2 reads lacking their
> original barcode we reintroduce the sequence (6bp) into each R2 read
> along with 6 corresponding artificial quality scores (G). Samples 1-48
> receive one barcode, while 49-96 receive another. This is done with this
> command:
>
> sed -i '2~4s/^/GCCAAT/' sample1to48_S1_L001_R2_001.fastq
>
> sed -i '0~4s/^/GGGGGG/' sample1to48_S1_L001_R2_001.fastq
>
> At this point we have data that looks like this:
>

> Read in *Forward Read* file “sample1to48_S1_L001_trimmed_R1_001.fastq”

>
> @M02210:18:000000000-A5BL2:1:1101:19335:2058 1:N:0:1
>
> GGTTGAATTCTCTCACAAGTCACATTTCCTCTTTTTCACTGTGGTGAGTTTCAGTATATCTTTACTTGAAGGAAAGAACTGTGTAGTTCAAGGTAGTTTTTTTTATGCTGTTTAAAAAGCTCAGCTGATTCATTTTGTGTGGTTATAAAATCATTGTGATTTTATATATTTGGCTATAAAATGATTTTGTGACAGACGGAGAGCGAGGAAGGGCTACATTCCCTTTCCTCCACTTGTTTATGTGCCCATATGTCTCCCAGATGAATTTCTGGAGATAAAAAGTTCGAGTCTTTCC
>
> +
>
> 9CCCC9ECFFEFGGFGCFGGGGGGGGGGGGGFGFGGGCGGAFF8FD9CFEEF99FCFAECFGGFGGC6,<C8E88,;6FE@EFFF9FFF,C98FCCGC<E@CCFFFCFE,CFFFC9<9BE<C,6BFC<FGE9?FEEEFCFE+,B<FABA,FEDFFAEE<FGFGGGCE<FGFECFGGEEAA9FFGGGGGFG,,A,,,,++++>:+8+88+@,++@DDFCGGGAFGFGE;CCG6DFG7;DFFFAGAEGGGG6E?FGGGEC4C,9?2:C:CGFG77D4DBFF8?CFFGF)=FFFFFFF
>

> Matching read in *Reverse Read* file

> called /sample_GGTTG-GCCAAT.1.fq/ and /sample_GGTTG-GCCAAT.2.fq/

> respectively. Files with these names (and the names of every other
> possible 96-barcode combinations) are indeed generated in the output
> folder but, unfortunately, I do not find the reads as I expect.
>
> By running grep using the 5 and 4 digit numbers are the end of the
> header (19335_2058) on all forward files (*.1.fq) I identify the above
> example read in the correctly demultiplexed file (ie GGTTG-GCCAAT.1.fq;
> grep output below) :
>
> Sample_GGTTG-GCCAAT.1.fq:@1_1101_19335_2058_1
>
> AATTCTCTCACAAGTCACATTTCCTCTTTTTCACTGTGGTGAGTTTCAGTATATCTTTACTTGAAGGAAAGAACTGTGTAGTTCAAGGTAGTTTTTTTTATGCTGTTTAAAAAGCTCAGCTGATTCATTTTGTGTGGTTATAAAATCATTGTGATTTTATATATTTGGCTATAAAATGATTTTGTGACAGACGGAGAGCGAGGAAGGGCTACATTCCCTTTCCTCCACTTGTTTATGTGCCCATATGTCTCCCAGATGAATTTCTGGAGATAAAAAGTTCGAGTCTTTCC
>
> +
>
> 9ECFFEFGGFGCFGGGGGGGGGGGGGFGFGGGCGGAFF8FD9CFEEF99FCFAECFGGFGGC6,<C8E88,;6FE@EFFF9FFF,C98FCCGC<E@CCFFFCFE,CFFFC9<9BE<C,6BFC<FGE9?FEEEFCFE+,B<FABA,FEDFFAEE<FGFGGGCE<FGFECFGGEEAA9FFGGGGGFG,,A,,,,++++>:+8+88+@,++@DDFCGGGAFGFGE;CCG6DFG7;DFFFAGAEGGGG6E?FGGGEC4C,9?2:C:CGFG77D4DBFF8?CFFGF)=FFFFFFF
>
> But searching the reverse read files (*.2.fq) in the process radtags
> output, I find the header term appearing in a demultiplexed file called

> _*TACCG*_-GCCAAT.2.fq which is wrong. Remember we know the F+R read pair
> was tagged with _*GGTTG*_-GCCAAT.

Norah Saarman

unread,

Sep 13, 2018, 11:36:23 AM9/13/18

to Stacks

Dear Julian,

I am having a similar issue as Matthew, using Stacks v1.46. I did not notice this problem until I tried to use the process_radtags.pl output in BWA rather than Bowtie2 (because BWA looks for exact matches in the headers of paired reads). Now I am finding that after I run process_radtags in this way...

$ process_radtags -1 ./Norah_pool1_001_ATCACG_R1.fastq.gz -2 ./Norah_pool1_001_ATCACG_R2.fastq.gz -b ./barcodes_Norah_pool1_001_ATCACG.txt -o ../processed -q -r --renz_1 nlaIII --renz_2 mluCI -i gzfastq

With a barcode file that looks like this:

AAGGA DUK_14_001

ACACA DUK_14_003

AGCTA DUK_14_002

ATACG DUK_14_008

CAACC DUK_14_007

CTGAT DUK_14_014

CTTGG DUK_14_011

GAGTC DUK_14_389

GCCGT DUK_14_388

GGCCA DUK_14_048

GTAGT AMI_14_376

TCTGC AMI_14_083

GTCGA AMI_14_387

TAGTA AMI_14_689

TCAGT AMI_14_587

TCCGG AMI_14_691

And input looks like this:

$ zcat Norah_pool10_001_ATCACG_R1.fastq.gz | head

@HWI-D00536:268:C9WCBANXX:7:1101:1210:1966 1:N:0:NTCACG

NTCGACATGCAAACCGGCGGTCAATGGCGGATTTCTGGGCAATAGCAGTCTTCTTCACATCCTCCCCGAGAATAGA

#<BBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFBFFFFFFF

@HWI-D00536:268:C9WCBANXX:7:1101:1244:1973 1:N:0:ATCACG

NGCTACATGTGTAGCCTTTTTGTAATGTAAGGCTCCTTCAGTTCTTTGGATCATCTCCTAATGAAACCGAACGCGG

#<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFBFFFFFFFFF

@HWI-D00536:268:C9WCBANXX:7:1101:1214:1987 1:N:0:NTCACG

NCTGCCATGCCAATGCTAAGTCGTAGATATTGGTCTGATCCGTTAGAGTCTTAATGTTTAAGTTAAAAATTAGATC

... the output doesn't appear to have matching pairs, for example:

$ zcat DUK_14_002.1.fq.gz | head

@1_1101_2205_1997/1

CATGAAGATCAAAAACCACATCTGCTTTGCGAATAAGAAGGAGGATCGTGAAAATATATATGTTTATGCCA

FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

@1_1101_10548_1977/1

CATGTAAGTGCATATAAGTGTATATGTTAGAAGAAAAGCAAGTGTAGCATATACAACTTATTTCGTTTTCA

FFFFFFFFFFFFFFFFFFFFFFFFFFFBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFB

@1_1101_20486_1985/1

CATGCCTCTATTATCTATTTGGTTTGTTGTACCATATCGCGAGGAGAAAATAGCGAAACAACAATCACTTT

$ zcat DUK_14_002.2.fq.gz | head

@1_1101_2552_1966/2

AATTAGAAAAGCTCGGTGCACAGGTCGAGAGCCCATTCTGGGGAATCAGTAATGCGAAATAAATAGGTAACGTGCC

DCDDDIIHIHHH<CEDECEHHHHCEE?/EGE?FHHHHC@GHI<CHEHCHC@CF1FE0C<DHCGEFGHHHC?GC11<

@1_1101_9787_1989/2

AATTTTATTAGATTTTTAGTTTAAAATGAAATAGTGGCCCGGCATTCGTCATTACATACCATCACCATCACGTTAC

CDDDDIIEEEHIHIIIIIIHIIIIHIIHFHHHHCHE?G@CFHHHCH@@HHGHGH@GHCE1<F@1DHHIIEH@1DG@

@1_1101_18676_1987/2

AATTCGCCGACCGTTCAGATTGAGCAAACAGCATTTCGTCACATAAATGTACTGGTGTTCAAGTGCGGGTCACCTT

I realized this when trying to modify the output (removed "/1 and /2" to make the headers match exactly) for BWA. I reran the process_radtags step, and this problem output consistently occurs in all of the runs.

It would be difficult for me to change the version of Stacks that I am using because I do not have enough permissions on the system I use to re-install Stacks myself. Let me know if there is a known remedy for this.

Thank you in advance for your time, it is much appreciated!

Norah Saarman

unread,

Sep 15, 2018, 8:17:53 PM9/15/18

to Stacks

Dear Julian and other users with a similar R1/R2 headers question,

I have figured out that the headers in my input file were the problem. The R2 headers do not match the R1 headers in the pre-demultiplexed fastq.gz files as I downloaded them from my sequencing service, and this is only true for a subset of the lanes I got sequenced. From the example I posted about above:

R1 headers in order they appear in the input files:
@HWI-D00536:214:C8752ANXX:1:1101:1309:1981 1:N:0:ATCACG
@HWI-D00536:214:C8752ANXX:1:1101:1336:1981 1:N:0:ATCACG
@HWI-D00536:214:C8752ANXX:1:1101:1863:1983 1:N:0:ATCACG
@HWI-D00536:214:C8752ANXX:1:1101:1957:1988 1:N:0:ATCACG
@HWI-D00536:214:C8752ANXX:1:1101:2105:1973 1:N:0:ATCACG

R2 headers in order they appear in the input files:
@HWI-D00306:689:HK7TNBCXX:1:1101:1236:1980 2:N:0:ATCACG
@HWI-D00306:689:HK7TNBCXX:1:1101:1178:1992 2:N:0:ATCACG
@HWI-D00306:689:HK7TNBCXX:1:1101:1473:1966 2:N:0:ATCACG
@HWI-D00306:689:HK7TNBCXX:1:1101:1614:1974 2:N:0:ATCACG
@HWI-D00306:689:HK7TNBCXX:1:1101:1576:1974 2:N:0:AGCACG

I do not know why this has happened with the raw data I downloaded, and if it means that the data itself is not the correct match, but I thought would let you all know that the problem is not with the process_radtags script.

I do suggest that in process_radtags an error get thrown when the R1 headers do not match the R2 headers, as BWA does... only for selfish (and obvious) reasons.

Regardless, it is great that the Stacks script is not the issue!

Thanks for reading.

Sincerely,

Norah Saarman

Reply all

Reply to author

Forward

0 new messages