process_radtags Qs. Trim/exclude low quality Mspl RAD cutsite? Barcode rescue? Variable read length? Parallelization?

894 views
Skip to first unread message

Roseanna Gamlen-Greene

unread,
Mar 27, 2021, 1:38:25 AM3/27/21
to Stacks
Hi

I've got a few questions about using process_radtags to demultiplex and clean GBS data in stacks 2.53. I'd really appreciate some help as I'm still stuck after many trials and searches (including the manual, journal articles and messages in this group). I'm new to stacks and haven't gone past process_radtags yet. 

We have double-digested (Sbfl and Mspl), paired-end (2x101bp), Illumina HiSeq data with an individual level barcode on the forward read only. The data provided from the sequencing platform (NovaSeq) came back to us as demultiplexed plates with the Illumina adapter sequences and plate barcodes trimmed off (just leaving the individual barcode and cutsite on the forward read, and just the cutsite on the reverse read). Now we need to demultiplex (and clean) the reads from the plates into individual sample files. And eventually we'll run all these files through the rest of the stacks pipeline. 

-------------------------- I have 5 questions, the first one is long ------------------------------

1) Trim or ignore cutsite on reverse read? My reverse reads have poor "per base sequence quality" for the first three bases (i.e. the Mspl RAD cutsite) for most samples. We spoke to the lab that did the work and they said that's normal because it's a low complexity region and so I said I could either ignore or trim off the cutsite. I'm wondering how best to go about this because I know process_radtags is discarding lots of reverse reads if I don't deal with it. I also want make sure that however I deal with the reverse reads now will be compatible with the rest of the stacks pipeline. 

a) The default situation (do nothing): if I specify renz 2 (but also rescue) - only get 67% reads retained
- makes sense so many would be discarded as the mstI site often has more than one base incorrect so rescue isn't enough

process_radtags -1 /project/../plate1_R1.fastq.gz -2  /project/../plate1_R2.fastq.gz   -b /scratch/../stacks_barcode_D701.txt -o /scratch/../D701 -w 0.25 -s 20 -y gzfastq --inline_null --renz_1 sbfI --renz_2 mspI --rescue --quality --barcode_dist_1 1 -D --numCPU 48 &> process_radtags_standoutputerror_D701.oe 

log output
Total Sequences 534119336
Barcode Not Found       989162
Low Quality     3249572
RAD Cutsite Not Found   169965372
Retained Reads  359915230 (67%)

b) If I try not specifying renz 2 - then I get 97% of my reads retained (yay)
       - I want to know, would there be any negative ramifications of doing method b) (not specifying renz2) further down the stacks pipeline? Anything else I should consider with this? E.g. since the bases at the mspl cutsite won't match for many reads (not rescued), later in the pipeline, will stacks misidentify the mismatches in the mspl cutsite as SNPs? So, would it be better to just trim the first three bases off the reverse reads before(? or after?) process_radtags rather than no renz2?
     
process_radtags -1 /project/../plate1_R1.fastq.gz -2  /project/../plate1_R2.fastq.gz   -b /scratch/../stacks_barcode_D701.txt -o /scratch/../D701 -w 0.25 -s 20 -y gzfastq --inline_null --renz_1 sbfI --rescue --quality --barcode_dist_1 1 -D --numCPU 48 &> process_radtags_standoutputerror_D701.oe 

log output
Total Sequences 534119336
Barcode Not Found 989162
Low Quality 6363120
RAD Cutsite Not Found 9425479
Retained Reads 517341575 (97%)

If it's not correct to do method b) (not specifying renz 2), what should I do to stop stacks throwing away so many reverse reads? 
c) Trim off the first three bases of the reverse read somehow?
     i) If so, should I do it before process_radtags? If so, how - what about Trimmomatic (which is already installed on my HPC?)
    ii) or should I do it after process_radtags? With something like trimmomatic?
d) disable_rad_check? This might be problematic - as it would also stop checking the forward read cutsite which I don't think would be good idea. 

P.S. here is an example of a couple of reverse reads after running process_radtags - first example with both renz2 and rescue, second example with no renz2. I highlighted two identical reads (one yellow and other red) between method 1a) and 1b) so you can see how process_radtags is dealing with not specifying renz2.

Method 1a). using both renz2 = mspl and --rescue (full process_radtags code in Q 1a) above)
 - corrects everything to CGG if there's only 1 base incorrect - drops reads that don't match at least 2 mstl cutsite bases

zcat R02-CW-KA-02.2.fq.gz | head 
@184_2_1101_29803_1000/2
CGGGGTGTCCAGGGGAGAACTGCTGCTCTCCCTCAAGCCACTGTCTCCACCGCTGCTGAACCCCCTAGCAAGCGTCCCCGTAGCTACGCAAGTGTGGGGCA
+
:FF::F,FFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF,
@184_2_1101_5466_1031/2
CGGCCGATATCCGCGACGCCCGTGTGAAAGAGGCCTTACAGCTTTGTGCCCCCTTCTACAGATATGTGAATATCCTATAAAGCTGAGTACATGCCTGGAAT
+
,F:FFFF,FFFFFFFFFFFFFF:F:FFFFFFF:FFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:F,FFFFFFFFFFF

Method 1b) NO renz2 specified (full code in Q 1b) above)
  - ignores mspl cutsite and would only remove read if the Phred quality score was low
 - means there are many more reverse reads kept than the previous one!
- But for further down the pipeline, do we need to either trim the off the first three bases in all of these reads - otherwise will stacks have trouble making loci? Especially since we're not treating it as a cutsite? Or should we replace the first three bases with CGG?

zcat R02-CW-KA-02.2.fq.gz | head -20
@184_2_1101_29803_1000/2
GGGGGTGTCCAGGGGAGAACTGCTGCTCTCCCTCAAGCCACTGTCTCCACCGCTGCTGAACCCCCTAGCAAGCGTCCCCGTAGCTACGCAAGTGTGGGGCA
+
:FF::F,FFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF,
@184_2_1101_5466_1031/2
GGGCCGATATCCGCGACGCCCGTGTGAAAGAGGCCTTACAGCTTTGTGCCCCCTTCTACAGATATGTGAATATCCTATAAAGCTGAGTACATGCCTGGAAT
+
,F:FFFF,FFFFFFFFFFFFFF:F:FFFFFFF:FFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:F,FFFFFFFFFFF
.
. other reads in between -- wanted to highlight the ones below b/c as you can see, all three bases mismatch for mstl cutsite CGG
.
@184_2_1101_19298_1063/2
GCGGGTCGTCACTTTAGAGCTTGCGGACCTGTGGCCAGAGAATTGGAATTTTACTTTGAGTGGGGCAAACATTCATGCTTTCCGTAAATCAGTGTATATTA
+
F,,FFFFFFFF,F:,:,FF:FF,,:FF,F:FFF,::,FF,,FF:FFF:F,FFFF,F,F,F,:,F:FFFFFFF:,:F,F:FFF:FFFF:FF,,:F::F:FFF
@184_2_1101_27977_1063/2
AACTCCTTAAGGAGTTAAAATTTGATCATTTTCAGTTTTACCTGCTAATAACTCCCACAAATGCATGCTACTTTCCTATTCAGCAGCTCTCATCTCATAGT
+
,::FF:FFFFFFFFF:FFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF::FFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

ALSO something else to consider: with method 1b) (no renz2), process_ratags is retaining many more reverse reads in the "remainder" files than forward reads (below). And the opposite is true for method 1a) (specifying renz2). This makes sense to me but would more rem.2 than rem.1 be a problem for later on in the stacks pipeline if I use method 1b)? I guess this would also happen if we trimmed off the first three bases?

E.g. Read counts of demultiplexed files from 1b) run of norenz2 and 1a) specificying renz2

File name            number of reads       number reads 
                                                1b) no renz2            1a) specifying renz2
R01-CR-CE-01.1.fq.gz 1971117 796780
R01-CR-CE-01.2.fq.gz 1971117 796780
R01-CR-CE-01.rem.1.fq.gz 26580 1200917
R01-CR-CE-01.rem.2.fq.gz 150626 54101
R01-CR-CE-02.1.fq.gz 1708873 697828
R01-CR-CE-02.2.fq.gz 1708873 697828
R01-CR-CE-02.rem.1.fq.gz 23912 1034957
R01-CR-CE-02.rem.2.fq.gz 42286 15127
R01-CR-CE-03.1.fq.gz 637564 256243
R01-CR-CE-03.2.fq.gz 637564 256243
R01-CR-CE-03.rem.1.fq.gz 7635 388956
R01-CR-CE-03.rem.2.fq.gz 31518 11265
R01-CR-CE-04.1.fq.gz 1012871 420087
R01-CR-CE-04.2.fq.gz 1012871 420087
R01-CR-CE-04.rem.1.fq.gz 14640 607424
R01-CR-CE-04.rem.2.fq.gz 41498 15052

So, in conclusion, what do you think I should do about the low quality mspl cutsite? Thank you!
-------------------------------------------------------------------------------------------------------------------------
                                         END Q1, begin Qs 2 - 5
-------------------------------------------------------------------------------------------------------------------------
2) .discards output - not rescuing barcodes: I took a look at the .discards file and noticed that forward reads are being thrown away with only one mismatch in the barcode (due to an "N") despite me specifying both rescue and barcode_dist_1. There is an "N" in location 2 of many reads. Is stacks throwing it away because of the low Phred score for that base? If so, how do I stop it doing that? I want to retain reads with barcodes with one mismatch (as I specified in process_radtags). I have checked my list of barcodes and there is no ambiguity if one base of the barcode is incorrect/N. Is stacks discarding the reads because the sliding window is looking at quality in the barcode? I just want the sliding window to look at quality in the sequence after the barcode. Does the order of rescue and barcode_dist_1 vs "w" and "s" and "quality" in process_radtags matter?  I.e. if I moved "rescue" and "barcode_dist_1" before,"w" and "s" and "quality", would that stop stacks deleting reads with a barcode with an N in it? 

Sample of two forward reads in the .discards file. Barcodes in bold

@A00516:184:HLLKFDSXY:2:1101:25880:1000 1:N:0:ATTACTCG
GNAAGATGCAGATTTCTGTGTGACACACAGTGCATCCATCGCCTGGTTATAGTCTAACAGTGTAAATCCAGCGTTATTTGCCAAAAAAGTAACTTTTGGCA
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF::FFFFFFFFFFFFFFFFFF:FFFFFFFF
@A00516:184:HLLKFDSXY:2:1101:26133:1000 1:N:0:ATTACTCG
CNATTATGCAGAAATAGTAATCCTTGCCTGTAACTACAGCAGGATTCTGATAACCACTCAGGTTGCAAACTCCATTCCGACCTATCAATATGTCTTCTATA
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF:FFF:FFFFFFFFFF::F

-------------------------------------------------------------------------------------------------------------------------
3) Variable length forward reads. Once demultiplexed, the forward reads vary from 93 - 97 bases between individuals (due to variable barcode lengths of 4 to 8 bases between individuals). Do I need to trim my forward reads to all be the same length between individuals to use in the rest of the stacks pipeline (e.g .for making loci) - i.e. trim the 3' end? If I do need to trim the ends off in process_radtags, do I use -t 93 or --len-limit 93 or something else? Or is it ok to have variable forward read lengths between individuals because all forward reads within a single individual will have the same length (because all reads from one individual have the same barcode)? 
-------------------------------------------------------------------------------------------------------------------------
4)  parallelization. I'm running process_radtags on a compute canada server and it's taking a lot of time and resources despite me requesting 48 cpus, and I'm wondering if stacks doesn't know to use 48 cpus and is just trying one cpu. Does stacks automatically detect the number of cpus available? Or do I need to add a term for number of cpus? I tried adding --numCPU 48 to the list of process_radtags terms (as I saw that in a previous version) but it failed b/c it couldn't recognise the term so that clearly wasn't it. Any ideas?
-------------------------------------------------------------------------------------------------------------------------
5) barcode_dist. Just wanted to double check. Do I use barcode_dist_1 or barcode_dist_2 for my barcode rescue? I think it should be barcode_dist_1 because I have paired-end reads but only one barcode (just on the forward read). 
---------------------------------------------------------------------------------

Thank you so much for reading all of that and thank you in advance if you can help with any of it! 

Many thanks,
Roseanna

Roseanna Gamlen-Greene (she/her)
Forest and Conservation Sciences
University of British Columbia, Vancouver


Julian Catchen

unread,
Mar 29, 2021, 10:53:16 AM3/29/21
to stacks...@googlegroups.com, Roseanna Gamlen-Greene
Hi Roseanna,

Questions 1+2:

You have described the issue and options well. My only question would
be: are your throwing away paired-end reads because the MspI cutsite is
low quality (phred score) or because there are 'N' sequences present? It
doesn't appear, from the command line you included, that you have
specified the '--clean' option, if you had, it would discard reads with
Ns in them.

Assuming you did not use --clean, then I think what you outline is fine:
you can ignore the cutsite on the paired-end reads, and if you are
concerned the software will mistake errors for SNPs, then you can trim
that part of the read off with trimmomatic. (Again, if there were Ns in
the MspI cutsite I would not be worried about errant SNP calls.) Keeping
the cutsite overhang will not help you infer any additional biology in
the downstream data, it will slightly help in the assembly process.

If you can't decide which way to go, try it both ways and look at the
resulting SNP calls.

#2: For the barcode, process_radtags will try to correct it first,
before it does any quality checking (because it is pointless to continue
if it cannot be assigned to a sample). Beyond the barcode, if you are
not specifying --clean, then reads are being dropped based only on phred
scores.

#3: process_radtags understands variable length barcodes and will trim
them to be a uniform length.

#4: process_radtags is not parallelized, asking for additional threads
has no effect (other parts of the pipeline are parallelized). The
process_radtags program is dominated by input/output (reading the raw
sequencing files from disk -- very slow on most clusters) which is
orders of magnitude slower than the computation it is doing once the
reads are in memory.

As an aside, it is almost never useful to specify 48 threads for any
threaded program (there are some rare exceptions for specific kinds of
computation), in fact, adding too many threads will often slow down your
program as the threads start to conflict with each other in accessing
the shared portions of the code/data.

#5: --barcode_dist_1 is the only relevent command line option for
barcode distance as you only have a single barcode.

Best,

julian



Roseanna Gamlen-Greene wrote on 3/27/21 12:38 AM:
> Hi
>
> I've got a few questions about using process_radtags to demultiplex and
> clean GBS data in stacks 2.53. I'd really appreciate some help as I'm
> still stuck after many trials and searches (including the manual,
> journal articles and messages in this group). I'm new to stacks and
> haven't gone past process_radtags yet.
>
> We have double-digested (Sbfl and Mspl), paired-end (2x101bp), Illumina
> HiSeq data with an individual level barcode on the forward read only.
> The data provided from the sequencing platform (NovaSeq) came back to us
> as demultiplexed plates with the Illumina adapter sequences and plate
> barcodes trimmed off (just leaving the individual barcode and cutsite on
> the forward read, and just the cutsite on the reverse read). Now we need
> to demultiplex (and clean) the reads from the plates into individual
> sample files. And eventually we'll run all these files through the rest
> of the stacks pipeline.
>
> -------------------------- I have 5 questions, the first one is long
> ------------------------------
>
> *1) Trim or ignore cutsite on reverse read? *My reverse reads have poor
> "per base sequence quality" for the first three bases (i.e. the Mspl RAD
> cutsite) for most samples. We spoke to the lab that did the work and
> they said that's normal because it's a low complexity region and so I
> said I could either ignore or trim off the cutsite. I'm wondering how
> best to go about this because I know process_radtags is discarding lots
> of reverse reads if I don't deal with it. I also want make sure that
> however I deal with the reverse reads now will be compatible with the
> rest of the stacks pipeline.
>
> _a) The default situation (do nothing): if I specify renz 2 (but also
> rescue) - only get 67% reads retained
> _
> - makes sense so many would be discarded as the mstI site often has more
> than one base incorrect so rescue isn't enough
>
> process_radtags -1 /project/../plate1_R1.fastq.gz
> -2  /project/../plate1_R2.fastq.gz   -b
> /scratch/../stacks_barcode_D701.txt -o /scratch/../D701 -w 0.25 -s 20 -y
> gzfastq --inline_null --renz_1 sbfI*--renz_2 mspI *--rescue --quality
> --barcode_dist_1 1 -D --numCPU 48 &>
> process_radtags_standoutputerror_D701.oe
>
> log output
> Total Sequences 534119336
> Barcode Not Found       989162
> Low Quality     3249572
> RAD Cutsite Not Found   169965372
> Retained Reads  359915230 (67%)
>
> _b) If I try not specifying renz 2 - then I get 97% of my reads retained
> (yay)
> _
>  -*I want to know, would there be any negative ramifications of doing
> method b) (not specifying renz2) further down the stacks pipeline?
> Anything else I should consider with this? E.g. since the bases at the
> mspl cutsite won't match for many reads (not rescued), later in the
> pipeline, will stacks misidentify the mismatches in the mspl cutsite as
> SNPs? So, would it be better to just trim the first three bases off the
> reverse reads before(? or after?) process_radtags rather than no renz2?*
> process_radtags -1 /project/../plate1_R1.fastq.gz
> -2  /project/../plate1_R2.fastq.gz   -b
> /scratch/../stacks_barcode_D701.txt -o /scratch/../D701 -w 0.25 -s 20 -y
> gzfastq --inline_null --renz_1 sbfI --rescue --quality --barcode_dist_1
> 1 -D --numCPU 48 &> process_radtags_standoutputerror_D701.oe
>
> log output
> Total Sequences534119336
> Barcode Not Found989162
> Low Quality6363120
> RAD Cutsite Not Found9425479
> Retained Reads517341575 (97%)
>
> *If it's not correct to do method b) (not specifying renz 2), what
> should I do to stop stacks throwing away so many reverse reads? *
> *_c) Trim off the first three bases of the reverse read somehow?_*
>      i) If so, should I do it before process_radtags? If so, how - what
> about Trimmomatic (which is already installed on my HPC?)
>     ii) or should I do it after process_radtags? With something like
> trimmomatic?
> d) disable_rad_check? This might be problematic - as it would also stop
> checking the forward read cutsite which I don't think would be good idea.
>
> *P.S. here is an example of a couple of reverse reads after running
> process_radtags - first example with both renz2 and rescue, second
> example with no renz2. I highlighted two identical reads (one
> **yellow**and other **red**) between method 1a) and 1b) so you can see
> how process_radtags is dealing with not specifying renz2.*
>
> *Method 1a). using both renz2 = mspl and --rescue (full process_radtags
> code in Q 1a) above)*
>  - corrects everything to CGG if there's only 1 base incorrect - drops
> reads that don't match at least 2 mstl cutsite bases
>
> zcat R02-CW-KA-02.2.fq.gz | head
> @184_2_1101_29803_1000/2
> CGGGGTGTCCAGGGGAGAACTGCTGCTCTCCCTCAAGCCACTGTCTCCACCGCTGCTGAACCCCCTAGCAAGCGTCCCCGTAGCTACGCAAGTGTGGGGCA
> +
> :FF::F,FFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF,
> @184_2_1101_5466_1031/2
> CGGCCGATATCCGCGACGCCCGTGTGAAAGAGGCCTTACAGCTTTGTGCCCCCTTCTACAGATATGTGAATATCCTATAAAGCTGAGTACATGCCTGGAAT
> +
> ,F:FFFF,FFFFFFFFFFFFFF:F:FFFFFFF:FFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:F,FFFFFFFFFFF
>
> *Method 1b) NO renz2 specified (full code in Q 1b) above)*
> *ALSO something else to consider: with method 1b) (no renz2),
> process_ratags is retaining many more reverse reads in the "remainder"
> files than forward reads (below). And the opposite is true for method
> 1a) (specifying renz2). This makes sense to me but would more rem.2 than
> rem.1 be a problem for later on in the stacks pipeline if I use method
> 1b)?* I guess this would also happen if we trimmed off the first three
> bases?
>
> /E.g. Read counts of demultiplexed files from 1b) run of norenz2 and 1a)
> specificying renz2/
>
> *File name           number of reads **number reads *
> *                                                1b) no renz2
>  1a) **specifying renz2*
> R01-CR-CE-01.1.fq.gz1971117796780
> R01-CR-CE-01.2.fq.gz1971117796780
> R01-CR-CE-01.rem.1.fq.gz265801200917
> R01-CR-CE-01.rem.2.fq.gz15062654101
> R01-CR-CE-02.1.fq.gz1708873697828
> R01-CR-CE-02.2.fq.gz1708873697828
> R01-CR-CE-02.rem.1.fq.gz239121034957
> R01-CR-CE-02.rem.2.fq.gz4228615127
> R01-CR-CE-03.1.fq.gz637564256243
> R01-CR-CE-03.2.fq.gz637564256243
> R01-CR-CE-03.rem.1.fq.gz7635388956
> R01-CR-CE-03.rem.2.fq.gz3151811265
> R01-CR-CE-04.1.fq.gz1012871420087
> R01-CR-CE-04.2.fq.gz1012871420087
> R01-CR-CE-04.rem.1.fq.gz14640607424
> R01-CR-CE-04.rem.2.fq.gz4149815052
>
> So, in conclusion, what do you think I should do about the low quality
> mspl cutsite? Thank you!
> -------------------------------------------------------------------------------------------------------------------------
>                                          END Q1, begin Qs 2 - 5
> -------------------------------------------------------------------------------------------------------------------------
> *2) .discards output - not rescuing barcodes: *I took a look at the
> .discards file and noticed that forward reads are being thrown away with
> only one mismatch in the barcode (due to an "N") despite me specifying
> both rescue and barcode_dist_1. There is an "N" in location 2 of many
> reads. Is stacks throwing it away because of the low Phred score for
> that base? If so, how do I stop it doing that? I want to retain reads
> with barcodes with one mismatch (as I specified in process_radtags). I
> have checked my list of barcodes and there is no ambiguity if one base
> of the barcode is incorrect/N./Is stacks discarding the reads because
> the sliding window is looking at quality in the barcode?/ I just want
> the sliding window to look at quality in the sequence after the barcode.
> /Does the order of rescue and barcode_dist_1 vs "w" and "s" and
> "quality" in process_radtags matter?/  I.e. if I moved "rescue" and
> "barcode_dist_1" before,"w" and "s" and "quality", would that stop
> stacks deleting reads with a barcode with an N in it?
>
> /Sample of two forward reads in the .discards file.*Barcodes in bold*/
> /*
> */
> @A00516:184:HLLKFDSXY:2:1101:25880:1000 1:N:0:ATTACTCG
> *GNAAGA*TGCAGATTTCTGTGTGACACACAGTGCATCCATCGCCTGGTTATAGTCTAACAGTGTAAATCCAGCGTTATTTGCCAAAAAAGTAACTTTTGGCA
> +
> F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF::FFFFFFFFFFFFFFFFFF:FFFFFFFF
> @A00516:184:HLLKFDSXY:2:1101:26133:1000 1:N:0:ATTACTCG
> *CNATTA*TGCAGAAATAGTAATCCTTGCCTGTAACTACAGCAGGATTCTGATAACCACTCAGGTTGCAAACTCCATTCCGACCTATCAATATGTCTTCTATA
> +
> F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF:FFF:FFFFFFFFFF::F
>
> -------------------------------------------------------------------------------------------------------------------------
> *3) Variable length forward reads. *Once demultiplexed, the forward
> reads vary from 93 - 97 bases between individuals (due to variable
> barcode lengths of 4 to 8 bases between individuals). Do I need to trim
> my forward reads to all be the same length between individuals to use in
> the rest of the stacks pipeline (e.g .for making loci) - i.e. trim the
> 3' end? If I do need to trim the ends off in process_radtags, do I use
> -t 93 or --len-limit 93 or something else? Or is it ok to have variable
> forward read lengths between individuals because all forward reads
> within a single individual will have the same length (because all reads
> from one individual have the same barcode)?
> -------------------------------------------------------------------------------------------------------------------------
> *4) * parallelization*.* I'm running process_radtags on a compute canada
> server and it's taking a lot of time and resources despite me requesting
> 48 cpus, and I'm wondering if stacks doesn't know to use 48 cpus and is
> just trying one cpu. Does stacks automatically detect the number of cpus
> available? Or do I need to add a term for number of cpus? I tried adding
> --numCPU 48 to the list of process_radtags terms (as I saw that in a
> previous version) but it failed b/c it couldn't recognise the term so
> that clearly wasn't it. Any ideas?
> -------------------------------------------------------------------------------------------------------------------------
> *5) barcode_dist. *Just wanted to double check. Do I use barcode_dist_1
> or barcode_dist_2 for my barcode rescue? I think it should be
> barcode_dist_1 because I have paired-end reads but only one barcode
> (just on the forward read).
> ---------------------------------------------------------------------------------
>
> Thank you so much for reading all of that and thank you in advance if
> you can help with any of it!
>
> Many thanks,
> Roseanna
>
> Roseanna Gamlen-Greene (she/her)
> PhD Candidate
> <https://urldefense.com/v3/__https://www.grad.ubc.ca/campus-community/meet-our-students/gamlen-greene-roseanna__;%21%21DZ3fjg%21sLIh4d-zlKQwuZ1yQ0lp8NVF7GdsVqNQGkrOOlGy17a6sOfIPnVdgDyHC3_lg1LAmhw$>
> | Vanier Canada Scholar | National Geographic Explorer
> <https://urldefense.com/v3/__https://www.nationalgeographic.org/find-explorers/roseanna-gamlen-greene__;%21%21DZ3fjg%21sLIh4d-zlKQwuZ1yQ0lp8NVF7GdsVqNQGkrOOlGy17a6sOfIPnVdgDyHC3_lhr-rwIs$>

Jordan Garcia

unread,
Dec 9, 2021, 3:45:53 PM12/9/21
to Stacks
Hello Roseanna,

I have an almost identical problem that you had. For my paired-end ddRAD (sbfI and pspXI) data using NovaSeq 6000, Fast QC shows good quality on the entire read 1 but poor quality for the first 5 bp in read 2. My R1 starts with barcodes followed by the TGCAGG sbfI cut site, while my R2 should start with the TCGA pspXI cut site (if I understand correctly). However, many reads seem to have at least two bp differences and I am loosing 34.4% of my reads because of "RAD cut site not found", retaining only 60% of total reads when I included the second restriction enzyme and used rescue (your method 1a).

Could you please let us know if you decided to try it both ways and whether it made an impact on your SNP calling? What did you decide to do and why?

Thank you,
Jordan 

Roseanna Gamlen-Greene

unread,
Dec 9, 2021, 5:14:02 PM12/9/21
to jg2...@cornell.edu, Stacks
Hi Jordan,

I trimmed the Mspl cut site from all the reverse reads using Trimmomatic v0.39 (flag= HEADCROP:3). As Julien said, retaining the cut site does not help infer any additional biology in the downstream data and only helps with the assembly - and since I was going to lose so many reads by keeping the reverse cut site, I opted to remove it. I ran Stacks v2.53, process_radtags with flags: --inline_null --renz_1 sbfI --quality --rescue --barcode_dist_1 s = 20 w = 0.25. This solved the issue for me (I retained >95% of reads - the RAD cut site was not found in <2% of reads). My downstream SNP calling worked well. I did not bother to call SNPs without removing the reverse reads because my whole pipeline took 3 weeks to run and I already knew that I would end up with so few reads without removing the cut sites of the reverse reads.

I hope this fix works for you! 

Best,
Roseanna

Roseanna Gamlen-Greene (she/her)
PhD Candidate | Vanier Canada Scholar | Killam Laureate | National Geographic Explorer | UBC Public Scholar
Forest and Conservation Sciences
University of British Columbia, Vancouver

--
Stacks website: http://catchenlab.life.illinois.edu/stacks/
---
You received this message because you are subscribed to a topic in the Google Groups "Stacks" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/stacks-users/dO4Qw-4RrUc/unsubscribe.
To unsubscribe from this group and all its topics, send an email to stacks-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/stacks-users/50f99a26-b9e2-4891-8383-9692caf66c94n%40googlegroups.com.

betsy.s...@gmail.com

unread,
Sep 22, 2023, 2:03:20 PM9/22/23
to Stacks
Hi all, 

I have a very nearly identical problem with my Novaseq library, with a slight twist. Instead of a ddrad library with a barcode only on R1 like in your library prep,  I have a ddrad library with dual indexed reads that both have a barcode of 7-10 bp in length before the enzyme cut site. So R1 and R2 have a 7-10 bp barcode + 5 bp enzyme cut site on the 5' end of the read. Luckily my barcodes sequenced well, but there is a very consistent dip in sequence quality on R2 at 10 bp into the sequence ( right at the enzyme cut site) that I see in all my fastqc reports (for each of my 49 pools!). And when I examine my raw fastq files I can see that many reads have sequencing errors in the cut site on the R2s (hard data on this below from one sample so you can see the actual cut site sequences). 

Unsurprisingly, when I run process rad tags with the rad check enabled for each of my enzymes (XbaI and EcoRI), it can't find the cut site in 8-10% of my reads for each of my 49 pools (this is with rescue enabled). This is high enough for me to not want to throw away my data for no reason other than some sequencing errors, but I also understand the importance of checking the rad cut site is intact (to prevent including erroneous sequence data). I have confirmed that it is in fact the R2 causing the problem and not the R1 (R1 cut sites are nearly 100% intact). 

In the case of your datasets, you could trim off the rad cut site on the R2 and run the pipeline with rad check enabled for the first enzyme. However,  because I have a variable length barcode (7-10 bp) before the rad cute site on R2, I can't just cut off the rad cut site before doing process rad tags (or maybe I could extract those 5 bp, but I am hesitant to do it with my basic programming skills). I would love to just leave the cut site on the R2 and run process radtags with the rad check for cut site r1 only. However, I am concerned this could cause problems downstream. 

Jordan, I am interested to know if you ran the pipeline with the rad cut site on R2 remaining, and if leaving it on created any unwanted artifacts? I COULD trim off the first 5 bp of each read AFTER running process rad tags since the barcodes will be removed, but I am a bit hesitant to mess with the sample files and .rem files created by process rad tags. Maybe I shouldn't be hesitant? Obviously it would be much simpler just to do process rad tags with one enzyme check enabled and then proceed with the pipeline with the cut sites left on each of the reads. 


Here is a demonstration of the problem: 
I ran process rad tags with rad check enabled for my two enzymes for one of my pools with the following script: 
/Applications/stacks-2.64/process_radtags -1 out.A1.r1.fastq.gz -2 out.A1.r2.fastq.gz -i gzfastq -b Pool_A1_barcodes.txt -o /Output_files/A1 
-c -q -r -w 0.10 -s 10 -t 140 -D --inline_inline --barcode-dist-2 2 --renz_1 xbaI --renz_2 ecoRI --filter-illumina --retain_header

(note that I didn't include the adapter checking because I already removed adapter sequences with cut adapt prior to running process rad tags because I had some slightly complex, but low%, adapter contamination on the R2 ends)

BEGIN total_raw_read_counts
Total Sequences 139991652
Failed Illumina filtered reads 0 0.0%
Barcode Not Found 1960430 1.4%
Low Quality 109552 0.1%
RAD Cutsite Not Found 12691987 9.1%
Retained Reads 125229683 89.5%
END total_raw_read_counts

I ran process rad tags with rad check disabled for one of my pools with the following script (I plan to run it again with rad check enabled for enzyme 1):
/Applications/stacks-2.64/process_radtags -1 out.A1.r1.fastq.gz -2 out.A1.r2.fastq.gz \
-i gzfastq -b /Pool_A1_barcodes.txt -o /Test_A1_processradtags/test_disableradcheck \
-c -q -r -w 0.10 -s 10 -t 140 -D --inline_inline  --barcode-dist-2 2 --disable_rad_check --filter-illumina --retain_header 


(note that I didn't include the adapter checking because I already removed adapter sequences with cut adapt prior to running process rad tags because I had some slightly complex, but low%, adapter contamination on the R2 ends). 

Output:
BEGIN total_raw_read_counts
Total Sequences 139991652
Failed Illumina filtered reads 0 0.0%
Barcode Not Found 1960430 1.4%
Low Quality 110233 0.1%
RAD Cutsite Not Found 0 0.0%
Retained Reads 137920989 98.5%
END total_raw_read_counts

I checked the output of one of my sample R2 files produced from process radtags commands above (with rad check disabled) with this script to count the first 5 bp of each sequence (to see if the first 5 bp were in fact the EcoRI cut site, or close to it). 

cat mysampleR2file.fq |

awk '(NR%4==2)' |

cut -c 1-5  |

sort |

uniq -c |

sort -n -r


Results of the above script are what I expect. I see the EcoRI cut site (AATTC) as the top result, followed by a lot of sequences very close to this. You can see that some of the top results have 2 bp mistakes, though, which is why 8-10% aren't passing rad check. 

11328151 AATTC

7432863 AAGTC

1829733 AAGGC

793777 AATGC

10409 AATTA

6964 AAGTA

5983 AACTC

5894 AAATC

3094 ATTCT

2811 AATCC

2790 AAATA

2638 AAGGA

2548 AATAC

2101 AACGC

1841 AACTA

1270 AAGCC

1250 AAAAA

1224 AATAA

1199 AAGAC

1114 AATGA

 .... cont 


Thanks for any insights or tips you all can provide!
Betsy Collins
PhD candidate, George Mason University
Reply all
Reply to author
Forward
0 new messages