Need for reverse complement of R2 reads in STACKS?

504 views
Skip to first unread message

Kerin Bentley

unread,
Aug 25, 2015, 11:29:44 AM8/25/15
to Stacks
Hi,

I have been in discussions with some colleagues who are also using STACKS and we are having a difficult time understanding whether or not there is a need to reverse complement R2 files in paired-end RADseq libraries in order to concatenate R2 reads with the R1 reads, and what pros/cons there may be to using the reverse complement. There are very few resources out there discussing this in detail for RADseq.

For example, I have done process_radtags on all my samples leading to the .1.fq.gz, .2.fq.gz, .1.rem.fq.gz and .2.rem.fq.gz files for each sample and would like to run ustacks etc.. So far, none of the .2 files are reverse complemented. But there are a couple ways in which to concatenate these paired end files together to make the single ustacks input file for that sample. One involves ignoring the fact they are paired end (Stacks would then treat stacks generated from R1 and R2 as 2 loci), and one involves merging paired end reads together (so two sequences become one longer sequence, so Stacks would treat this as a single locus) prior to concatenating that file to the unpaired reads in the .rem files.

So here's the question: I want to concatenate these files together to make a single input file for ustacks... but should I reverse complement the .2 files before hand before concatenating files together? Should I merge my paired end reads into a single locus? Which concatenation route is probably best and why? 

If you're confused as to what I'm asking, here is an example using the process_radtags output files from a "sample 1":

Route 1: Simply concatenate files together (No reverse complement or merging of paired end reads into one sequence line)
$cat sample1.1.fq sample1.2.fq sample1.1.rem.fq sample1.2.rem.fq > sample1_concatenated.fq

vs. 

Route 2: Reverse complement the R2 reads and concatenate files together with a simple cat (paired end reads remain unmerged)
$cat sample1.1.fq sample1_revcomp.2.fq sample1.1.rem.fq sample1_revcomp.2.rem.fq > sample1_concatenated.fq

vs. 

Route 3: Merge paired end reads from sample1.1.fq.gz with the reverse complemented paired reads from sample1.2.fq.gz into one file (.1 sequence is merged with revcomp.2 read into one sequence line) then concatenate in the .rem files:
$cat sample1_R1and2merged.fq sample1.1.rem.fq sample1.2.rem.fq > sample1_concatenated.fq

There was also discussion among us if different routes were more appropriate if you are using these reads for different purposes, e.g. calling SNPs for standard population genetic analysis vs. scans for selection, and if reads are expected to be overlapping vs. non-overlapping. For example, if I was doing a scan for selection, would Route 3 be inappropriate if R1 and R2 reads were expected to be non-overlapping due to large bp size selection (e.g. selection for 500bp fragments)? 

Any input you may have would be appreciated!
Kerin

Message has been deleted

Rick Overson

unread,
Dec 9, 2015, 11:58:47 PM12/9/15
to Stacks
Hi Kerin,

     Did you ever make any headway with this by chance? I am in the same boat now and wonder what route you took...

bbarker505

unread,
Dec 17, 2015, 2:53:56 PM12/17/15
to Stacks
I'm interested in an answer to this question too! 

I have PE data that I have quality trimmed and merged the reverse-complement of the R2 with the R1. I plan on running these PE data through Stacks, but I'm wondering if this approach isn't as good as running the R1 and R2 through the denovo pipeline separately?

Maybe this has already been brought up somewhere in this forum?

Thanks!
Brittany

Julian Catchen

unread,
Dec 17, 2015, 3:41:14 PM12/17/15
to stacks...@googlegroups.com, bbark...@gmail.com, ricko...@gmail.com
Hi All,

If you are dealing with paired-end data, say from double-digest RAD the
only time you have to worry about reverse complementing it is if you
want to merge the sets of paired-reads together. But in this case, the
program doing the merging should likely handle this for you.

If you are not merging the reads, then it doesn't matter what
orientation the RAD loci end up in, as long as it is consistent across
your data set (which it should be by default).

Brittany -- if you are merging your reads, be careful about the number
of reads being discarded because they can't be merged. For this reason,
I would not do any quality trimming until you have merged the reads.

Merging the reads will cause a drop in coverage (as some reads will fail
to merge). And, you will have to trim the merged reads uniformly across
the data set before running the pipeline.

So, merging reads will give you longer haplotypes, but you will have
lower coverage and some loci will drop out in some individuals because
of this. Not merging the reads will give you higher coverage and more
loci, but you will have independent SNPs instead of an integrated haplotype.

We are working on several of these issues for future versions of Stacks.

Best,

julian
Message has been deleted

walla296

unread,
Jun 27, 2017, 12:08:24 PM6/27/17
to Stacks, bbark...@gmail.com, ricko...@gmail.com, jcat...@illinois.edu
Hi All,
I found this thread and have a follow-up question regarding PE reads and whether to merge or not. Our libraries were created with the TruSeq kit (random shearing rather than RAD) with 550bp inserts. The Illumina data is 2x125, so there is likely a low probability of overlap. Given this scenario, it seems the best course of action is to not merge and obtain independent SNPs. Any suggestions on best approach?

Thanks,
Liz    
Reply all
Reply to author
Forward
0 new messages