Pre-STACKS PE-ddRAD QC and cleanout


Jamie

Mar 2, 2016, 3:04:42 PM3/2/16
to Stacks
Hi all

I'm interested in (a) people's thoughts on the need to clean out reads based on contaminant inclusion and low quality scores, and (b) some help with an issue I'm struggling to resolve.

We have around 600 samples that have undergone RAD (paired-end (125 bp) double digest). There's a small fraction of read 1s that may have read through into adapter sequence (AGATCGGAAGAGC etc.) and a small fraction of read 2s that may have read through first into the reverse complement of 1 of 24 barcodes and then into adapter sequence (AGATCGGAAGAGC etc.). In addition, as is the way with Illumina sequencing, we have variable Phred scores (+33), and since our initial sample set included some samples with varying levels of DNA degradation and starting yield, in some cases the Phred scores are quite variable.
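To make the read-2 contamination structure concrete, here is a minimal sketch (the 6 bp barcode below is hypothetical, not one of the study's actual 24): the expected contaminant suffix on read 2 is the reverse complement of the read-1 inline barcode followed by the adapter.

```python
# Sketch of the read-2 contaminant described above: when the insert is
# short, read 2 runs through the genomic fragment, then the reverse
# complement of the read-1 inline barcode, then into adapter sequence.
# The barcode used here is hypothetical.

COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMP)[::-1]

ADAPTER = "AGATCGGAAGAGC"

def read2_contaminant(barcode):
    """Expected contaminant suffix on read 2: RC(barcode) + adapter."""
    return revcomp(barcode) + ADAPTER

print(read2_contaminant("ACGTAC"))  # hypothetical 6 bp barcode
```

A per-barcode list of these sequences is what a trimmer would need to be given to target read 2 specifically.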

Analysis through Stacks and the subsequent outlier analysis is not an issue, as all this has been worked out already; but as a final step to get the best starting dataset, this QC issue is the last major hurdle.

So knowing that there is the potential for barcode and adapter sequence presence within some reads:

1. Is there a need to remove these, or will the natural progression of aligning reads against a reference genome, or creating stacks de novo, naturally weed out reads that aren't true sequence?
2. Have people removed reads based on Phred score, or just used all reads?


If adapter-based removal is required, read 1 appears relatively simple: instruct a programme like cutadapt to remove reads containing the AGATCGGAAGAGC sequence and retain reads whose length (-m) remains at 125, since Stacks won't align reads of different lengths. In read 2, however, any read-through will run first into the reverse complement of 1 of 24 index barcodes and then into adapter sequence. We want to remove any reads with RC barcode sequence and adapter read-through, but I can find no way of doing this without losing huge amounts of data (up to 34% using cutadapt, whether specifying the 24 RC barcodes in a file or the 24 RC barcodes followed by adapter sequence). A lot of this seems to come down to cutadapt's minimum overlap between barcode/adapter and read being only 3 base pairs, which means reads whose true sequence happens to end in a 3, 4 or 5 bp match to the adapter get falsely trimmed and removed. Increasing the minimum overlap from 3 to, say, 8 then risks not removing reads that do contain barcode/adapter, because only a partial barcode is present in the read.

I'm not sure if I've made that clear, but (3) if anyone could advise on what's going wrong, or on whether QC trimming is needed at all, I'd really appreciate the help.

Cheers

Jamie

Jamie

Mar 5, 2016, 2:03:06 PM3/5/16
to Stacks
Any thoughts at all?

many thanks

Julian Catchen

Mar 6, 2016, 1:30:41 PM3/6/16
to stacks...@googlegroups.com, jt...@kent.ac.uk
Hi Jamie,

I would consider it standard to run process_radtags on your data before
pushing it through Stacks. This will filter the data according to PHRED
scores (as well as demultiplexing and checking for the existence of the
restriction enzyme cut site). However, PHRED scores are a very coarse
measure of sequencing quality and therefore should not be used as a
means for heavy filtering. Approaches where all data below Q20 or Q30
are discarded are wrong in my opinion (and throw out lots of good data).
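The window-based alternative Julian alludes to can be sketched as follows (this illustrates the general sliding-window idea process_radtags applies, not its exact implementation; the window fraction and score limit below are assumptions):

```python
# Illustration of window-based quality filtering (a sketch of the general
# idea, not process_radtags' actual code). A read is dropped only if the
# mean Phred score within some sliding window falls below a limit, rather
# than hard-filtering on every base below Q20/Q30.

def passes_window_filter(phred_scores, win_frac=0.15, score_limit=10):
    win = max(1, int(len(phred_scores) * win_frac))
    for i in range(len(phred_scores) - win + 1):
        window = phred_scores[i:i + win]
        if sum(window) / win < score_limit:
            return False  # sustained low quality, not a one-off bad base
    return True

good = [38] * 125             # uniformly high-quality 125 bp read
bad = [38] * 100 + [2] * 25   # quality crash at the 3' end
print(passes_window_filter(good))  # True
print(passes_window_filter(bad))   # False
```

The point of the averaging is that an isolated low-scoring base does not sink an otherwise good read, which is why this is gentler than a hard Q20/Q30 cutoff.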

As for the adaptor contamination, if your read 1 contains adaptor
sequence at the end, then the corresponding read 2 should also have
adaptor at its end. This means that the genomic DNA that you are
sequencing must be less than 125bp in length and therefore your original
two enzymes were less than 125bp apart in the genome they originated in.
One side question would be why didn't your size selection step eliminate
these molecules?
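The geometry here can be made concrete with a small sketch (125 bp reads as in this dataset): an insert shorter than the read length forces read-through, and the adapter begins right where the insert ends.

```python
# With fixed-length reads, a genomic insert shorter than the read means
# the sequencer reads through the insert and into adapter, so the
# adapter begins at an offset equal to the insert length.

READ_LEN = 125

def adapter_offset(insert_len, read_len=READ_LEN):
    """Offset where adapter begins in the read, or None if no read-through."""
    return insert_len if insert_len < read_len else None

print(adapter_offset(90))   # adapter from base 90; trimmed read would be 90 bp
print(adapter_offset(150))  # insert longer than the read: no read-through
```

This is why adapter-trimmed reads necessarily come out shorter than the rest of the dataset, as noted below.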

Therefore, if you remove adaptor, the affected reads will be shorter
than the other reads in your dataset.

I would ask how many of these sites you accidentally captured that
should have been excluded by size selection? Do you need them, or do you
have enough other RAD sites that you can proceed without much problem?

You can do 25 adaptor filtering steps to remove them (one for the read
1's, 24 for the read 2's, each a concatenation of a barcode and the
adaptor). The process_radtags program will filter your reads for adaptor
if you tell it to.

But, to use these data in Stacks, you will need to trim your other data
so that the reads within each sample are the same length (length can
vary between samples).

Is it really worth it to save these short RAD sites (and their likely
smaller number of SNPs captured within them)?

Best,

julian

Jamie

Mar 7, 2016, 8:38:16 AM3/7/16
to Stacks, jt...@kent.ac.uk, jcat...@illinois.edu

Hi Julian,


Many thanks for the reply - very much appreciated


Very interesting to hear your thoughts on Phred score filtering, and I'm actually pleased you agree this shouldn't be done too harshly, so I'll leave it to process_radtags rather than do any Phred-based trimming in an external programme beforehand.


Regarding adapter contamination, it's not an issue with all samples, just a small minority that I assume size selection didn't deal with and which thus slipped through into sequencing. The majority of reads are fine. The question was whether this small minority of contaminated reads should be removed before process_radtags/Stacks, as I assume they could impact results, or whether they can be left in. The vast majority of read 1s and read 2s will be OK (no contamination) and will be kept as the 125 bp paired-end reads that Stacks requires (i.e. all the same length); we won't trim them down at all. Reads trimmed prior to process_radtags because of barcode/adapter presence would fall below the 125 bp length of the majority and so would be excluded from the Stacks run.


This leaves options as:

1. Trim no reads: run them all through process_radtags and assume that those containing barcode/adapter (not the barcode used for demultiplexing, which will of course remain in read 1) will subsequently either fail to align to a reference sequence or, if de novo, not be present in sufficient quantity to form stacks, and thus contaminated reads will be removed automatically.


2. Use a programme such as cutadapt or Trimmomatic prior to process_radtags to remove some of the contamination. Naturally, partial-match contamination below the minimum overlap threshold will remain in the reads, and in all likelihood some true sequence will be removed through chance matches between read sequence and barcode/adapter nucleotides.

 
Given your experience of RAD, Stacks and read loss, could I ask which of these options sounds more appropriate for our dataset? Either way there will be a loss of a small fraction of reads.

I understand your comments on using process_radtags for adapter filtering, and the fact that if R1 has adapter sequence then R2 will as well, but I wonder whether running the 24/25 rounds on 30 libraries would be problematic.

Cheers Julian - almost there :)

Jamie

Mar 7, 2016, 4:58:13 PM3/7/16
to Stacks, jt...@kent.ac.uk, jcat...@illinois.edu
Hi Julian,

Having been trying various software for QC prior to process_radtags and Stacks, it seems that, given what you've said and what I've read, process_radtags can actually do exactly what I need on its own.

As discussed, the Phred score issue is dealt with by the default process_radtags window settings over a 125 bp read, which is great, with no need for any harsh filters.

As for the adapter issue, you rightly mention that for any reads with read-through into read 1, there will also be read-through into read 2. If I run

process_radtags -1 read1.fastq.gz -2 read2.fastq.gz -b .pools -o ./RADlibs -c -q -r --inline_null --renz_1 ecoRI --renz_2 sphI -i gzfastq --adapter_1 AGATCGGAAGAGC --adapter_2 AGATCGGAAGAGC --adapter_mm 0

this would most likely deal with most of the read-through, bar the bit of reverse-complement barcode before the adapter read-through on read 2. Since I assume the -q function will remove those reads, all we may be left with is some read 2s carrying just RC barcode sequence in the remainder file. I'll be combining remainder and paired files, so I can't just discard the unpaired remainder files, which would solve the issue. BUT is there any function in Stacks to say, effectively, 'if an adapter occurs in read 1, discard read 2 as well, and don't place it in the remainder file'? That would have me sorted :)
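Absent such an option, pair-aware filtering could be done outside Stacks; a minimal sketch (a hypothetical helper, with FASTQ parsing omitted and sequences passed as plain strings) of discarding both mates whenever either read contains adapter, so no orphan read 2s are left behind:

```python
# Sketch (hypothetical helper, not part of Stacks): pair-aware adapter
# filtering that discards BOTH mates when either read contains adapter
# sequence, so no orphaned read 2s end up in a remainder file.

ADAPTER = "AGATCGGAAGAGC"

def filter_pairs(pairs, adapter=ADAPTER):
    """pairs: iterable of (read1_seq, read2_seq); keep only clean pairs."""
    return [(r1, r2) for r1, r2 in pairs
            if adapter not in r1 and adapter not in r2]

pairs = [
    ("ACGT" * 10, "TTGC" * 10),           # clean pair: kept
    ("ACGT" * 5 + ADAPTER, "TTGC" * 10),  # adapter in read 1: both mates dropped
]
print(len(filter_pairs(pairs)))  # 1
```

In a real pipeline the same drop-both logic would be applied while streaming the two FASTQ files in lockstep, keeping the mate files synchronised.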

Cheers