Hi Roseanna,
Questions 1+2:
You have described the issue and options well. My only question would
be: are your throwing away paired-end reads because the MspI cutsite is
low quality (phred score) or because there are 'N' sequences present? It
doesn't appear, from the command line you included, that you have
specified the '--clean' option, if you had, it would discard reads with
Ns in them.
Assuming you did not use --clean, then I think what you outline is fine:
you can ignore the cutsite on the paired-end reads, and if you are
concerned the software will mistake errors for SNPs, then you can trim
that part of the read off with trimmomatic. (Again, if there were Ns in
the MspI cutsite I would not be worried about errant SNP calls.) Keeping
the cutsite overhang will not help you infer any additional biology in
the downstream data, it will slightly help in the assembly process.
If you can't decide which way to go, try it both ways and look at the
resulting SNP calls.
#2: For the barcode, process_radtags will try to correct it first,
before it does any quality checking (because it is pointless to continue
if it cannot be assigned to a sample). Beyond the barcode, if you are
not specifying --clean, then reads are being dropped based only on phred
scores.
#3: process_radtags understands variable length barcodes and will trim
them to be a uniform length.
#4: process_radtags is not parallelized, asking for additional threads
has no effect (other parts of the pipeline are parallelized). The
process_radtags program is dominated by input/output (reading the raw
sequencing files from disk -- very slow on most clusters) which is
orders of magnitude slower than the computation it is doing once the
reads are in memory.
As an aside, it is almost never useful to specify 48 threads for any
threaded program (there are some rare exceptions for specific kinds of
computation), in fact, adding too many threads will often slow down your
program as the threads start to conflict with each other in accessing
the shared portions of the code/data.
#5: --barcode_dist_1 is the only relevent command line option for
barcode distance as you only have a single barcode.
Best,
julian
Roseanna Gamlen-Greene wrote on 3/27/21 12:38 AM:
> Hi
>
> I've got a few questions about using process_radtags to demultiplex and
> clean GBS data in stacks 2.53. I'd really appreciate some help as I'm
> still stuck after many trials and searches (including the manual,
> journal articles and messages in this group). I'm new to stacks and
> haven't gone past process_radtags yet.
>
> We have double-digested (Sbfl and Mspl), paired-end (2x101bp), Illumina
> HiSeq data with an individual level barcode on the forward read only.
> The data provided from the sequencing platform (NovaSeq) came back to us
> as demultiplexed plates with the Illumina adapter sequences and plate
> barcodes trimmed off (just leaving the individual barcode and cutsite on
> the forward read, and just the cutsite on the reverse read). Now we need
> to demultiplex (and clean) the reads from the plates into individual
> sample files. And eventually we'll run all these files through the rest
> of the stacks pipeline.
>
> -------------------------- I have 5 questions, the first one is long
> ------------------------------
>
> *1) Trim or ignore cutsite on reverse read? *My reverse reads have poor
> "per base sequence quality" for the first three bases (i.e. the Mspl RAD
> cutsite) for most samples. We spoke to the lab that did the work and
> they said that's normal because it's a low complexity region and so I
> said I could either ignore or trim off the cutsite. I'm wondering how
> best to go about this because I know process_radtags is discarding lots
> of reverse reads if I don't deal with it. I also want make sure that
> however I deal with the reverse reads now will be compatible with the
> rest of the stacks pipeline.
>
> _a) The default situation (do nothing): if I specify renz 2 (but also
> rescue) - only get 67% reads retained
> _
> - makes sense so many would be discarded as the mstI site often has more
> than one base incorrect so rescue isn't enough
>
> process_radtags -1 /project/../plate1_R1.fastq.gz
> -2 /project/../plate1_R2.fastq.gz -b
> /scratch/../stacks_barcode_D701.txt -o /scratch/../D701 -w 0.25 -s 20 -y
> gzfastq --inline_null --renz_1 sbfI*--renz_2 mspI *--rescue --quality
> --barcode_dist_1 1 -D --numCPU 48 &>
> process_radtags_standoutputerror_D701.oe
>
> log output
> Total Sequences 534119336
> Barcode Not Found 989162
> Low Quality 3249572
> RAD Cutsite Not Found 169965372
> Retained Reads 359915230 (67%)
>
> _b) If I try not specifying renz 2 - then I get 97% of my reads retained
> (yay)
> _
> -*I want to know, would there be any negative ramifications of doing
> method b) (not specifying renz2) further down the stacks pipeline?
> Anything else I should consider with this? E.g. since the bases at the
> mspl cutsite won't match for many reads (not rescued), later in the
> pipeline, will stacks misidentify the mismatches in the mspl cutsite as
> SNPs? So, would it be better to just trim the first three bases off the
> reverse reads before(? or after?) process_radtags rather than no renz2?*
> process_radtags -1 /project/../plate1_R1.fastq.gz
> -2 /project/../plate1_R2.fastq.gz -b
> /scratch/../stacks_barcode_D701.txt -o /scratch/../D701 -w 0.25 -s 20 -y
> gzfastq --inline_null --renz_1 sbfI --rescue --quality --barcode_dist_1
> 1 -D --numCPU 48 &> process_radtags_standoutputerror_D701.oe
>
> log output
> Total Sequences534119336
> Barcode Not Found989162
> Low Quality6363120
> RAD Cutsite Not Found9425479
> Retained Reads517341575 (97%)
>
> *If it's not correct to do method b) (not specifying renz 2), what
> should I do to stop stacks throwing away so many reverse reads? *
> *_c) Trim off the first three bases of the reverse read somehow?_*
> i) If so, should I do it before process_radtags? If so, how - what
> about Trimmomatic (which is already installed on my HPC?)
> ii) or should I do it after process_radtags? With something like
> trimmomatic?
> d) disable_rad_check? This might be problematic - as it would also stop
> checking the forward read cutsite which I don't think would be good idea.
>
> *P.S. here is an example of a couple of reverse reads after running
> process_radtags - first example with both renz2 and rescue, second
> example with no renz2. I highlighted two identical reads (one
> **yellow**and other **red**) between method 1a) and 1b) so you can see
> how process_radtags is dealing with not specifying renz2.*
>
> *Method 1a). using both renz2 = mspl and --rescue (full process_radtags
> code in Q 1a) above)*
> - corrects everything to CGG if there's only 1 base incorrect - drops
> reads that don't match at least 2 mstl cutsite bases
>
> zcat R02-CW-KA-02.2.fq.gz | head
> @184_2_1101_29803_1000/2
> CGGGGTGTCCAGGGGAGAACTGCTGCTCTCCCTCAAGCCACTGTCTCCACCGCTGCTGAACCCCCTAGCAAGCGTCCCCGTAGCTACGCAAGTGTGGGGCA
> +
> :FF::F,FFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF,
> @184_2_1101_5466_1031/2
> CGGCCGATATCCGCGACGCCCGTGTGAAAGAGGCCTTACAGCTTTGTGCCCCCTTCTACAGATATGTGAATATCCTATAAAGCTGAGTACATGCCTGGAAT
> +
> ,F:FFFF,FFFFFFFFFFFFFF:F:FFFFFFF:FFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:F,FFFFFFFFFFF
>
> *Method 1b) NO renz2 specified (full code in Q 1b) above)*
> *ALSO something else to consider: with method 1b) (no renz2),
> process_ratags is retaining many more reverse reads in the "remainder"
> files than forward reads (below). And the opposite is true for method
> 1a) (specifying renz2). This makes sense to me but would more rem.2 than
> rem.1 be a problem for later on in the stacks pipeline if I use method
> 1b)?* I guess this would also happen if we trimmed off the first three
> bases?
>
> /E.g. Read counts of demultiplexed files from 1b) run of norenz2 and 1a)
> specificying renz2/
>
> *File name number of reads **number reads *
> * 1b) no renz2
> 1a) **specifying renz2*
> R01-CR-CE-01.1.fq.gz1971117796780
> R01-CR-CE-01.2.fq.gz1971117796780
> R01-CR-CE-01.rem.1.fq.gz265801200917
> R01-CR-CE-01.rem.2.fq.gz15062654101
> R01-CR-CE-02.1.fq.gz1708873697828
> R01-CR-CE-02.2.fq.gz1708873697828
> R01-CR-CE-02.rem.1.fq.gz239121034957
> R01-CR-CE-02.rem.2.fq.gz4228615127
> R01-CR-CE-03.1.fq.gz637564256243
> R01-CR-CE-03.2.fq.gz637564256243
> R01-CR-CE-03.rem.1.fq.gz7635388956
> R01-CR-CE-03.rem.2.fq.gz3151811265
> R01-CR-CE-04.1.fq.gz1012871420087
> R01-CR-CE-04.2.fq.gz1012871420087
> R01-CR-CE-04.rem.1.fq.gz14640607424
> R01-CR-CE-04.rem.2.fq.gz4149815052
>
> So, in conclusion, what do you think I should do about the low quality
> mspl cutsite? Thank you!
> -------------------------------------------------------------------------------------------------------------------------
> END Q1, begin Qs 2 - 5
> -------------------------------------------------------------------------------------------------------------------------
> *2) .discards output - not rescuing barcodes: *I took a look at the
> .discards file and noticed that forward reads are being thrown away with
> only one mismatch in the barcode (due to an "N") despite me specifying
> both rescue and barcode_dist_1. There is an "N" in location 2 of many
> reads. Is stacks throwing it away because of the low Phred score for
> that base? If so, how do I stop it doing that? I want to retain reads
> with barcodes with one mismatch (as I specified in process_radtags). I
> have checked my list of barcodes and there is no ambiguity if one base
> of the barcode is incorrect/N./Is stacks discarding the reads because
> the sliding window is looking at quality in the barcode?/ I just want
> the sliding window to look at quality in the sequence after the barcode.
> /Does the order of rescue and barcode_dist_1 vs "w" and "s" and
> "quality" in process_radtags matter?/ I.e. if I moved "rescue" and
> "barcode_dist_1" before,"w" and "s" and "quality", would that stop
> stacks deleting reads with a barcode with an N in it?
>
> /Sample of two forward reads in the .discards file.*Barcodes in bold*/
> /*
> */
> @A00516:184:HLLKFDSXY:2:1101:25880:1000 1:N:0:ATTACTCG
> *GNAAGA*TGCAGATTTCTGTGTGACACACAGTGCATCCATCGCCTGGTTATAGTCTAACAGTGTAAATCCAGCGTTATTTGCCAAAAAAGTAACTTTTGGCA
> +
> F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF::FFFFFFFFFFFFFFFFFF:FFFFFFFF
> @A00516:184:HLLKFDSXY:2:1101:26133:1000 1:N:0:ATTACTCG
> *CNATTA*TGCAGAAATAGTAATCCTTGCCTGTAACTACAGCAGGATTCTGATAACCACTCAGGTTGCAAACTCCATTCCGACCTATCAATATGTCTTCTATA
> +
> F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF:FFF:FFFFFFFFFF::F
>
> -------------------------------------------------------------------------------------------------------------------------
> *3) Variable length forward reads. *Once demultiplexed, the forward
> reads vary from 93 - 97 bases between individuals (due to variable
> barcode lengths of 4 to 8 bases between individuals). Do I need to trim
> my forward reads to all be the same length between individuals to use in
> the rest of the stacks pipeline (e.g .for making loci) - i.e. trim the
> 3' end? If I do need to trim the ends off in process_radtags, do I use
> -t 93 or --len-limit 93 or something else? Or is it ok to have variable
> forward read lengths between individuals because all forward reads
> within a single individual will have the same length (because all reads
> from one individual have the same barcode)?
> -------------------------------------------------------------------------------------------------------------------------
> *4) * parallelization*.* I'm running process_radtags on a compute canada
> server and it's taking a lot of time and resources despite me requesting
> 48 cpus, and I'm wondering if stacks doesn't know to use 48 cpus and is
> just trying one cpu. Does stacks automatically detect the number of cpus
> available? Or do I need to add a term for number of cpus? I tried adding
> --numCPU 48 to the list of process_radtags terms (as I saw that in a
> previous version) but it failed b/c it couldn't recognise the term so
> that clearly wasn't it. Any ideas?
> -------------------------------------------------------------------------------------------------------------------------
> *5) barcode_dist. *Just wanted to double check. Do I use barcode_dist_1
> or barcode_dist_2 for my barcode rescue? I think it should be
> barcode_dist_1 because I have paired-end reads but only one barcode
> (just on the forward read).
> ---------------------------------------------------------------------------------
>
> Thank you so much for reading all of that and thank you in advance if
> you can help with any of it!
>
> Many thanks,
> Roseanna
>
> Roseanna Gamlen-Greene (she/her)
> PhD Candidate
> <
https://urldefense.com/v3/__https://www.grad.ubc.ca/campus-community/meet-our-students/gamlen-greene-roseanna__;%21%21DZ3fjg%21sLIh4d-zlKQwuZ1yQ0lp8NVF7GdsVqNQGkrOOlGy17a6sOfIPnVdgDyHC3_lg1LAmhw$>
> | Vanier Canada Scholar | National Geographic Explorer
> <
https://urldefense.com/v3/__https://www.nationalgeographic.org/find-explorers/roseanna-gamlen-greene__;%21%21DZ3fjg%21sLIh4d-zlKQwuZ1yQ0lp8NVF7GdsVqNQGkrOOlGy17a6sOfIPnVdgDyHC3_lhr-rwIs$>