process_radtags and illumina phred 2

411 views
Skip to first unread message

Waplesry

unread,
Aug 1, 2012, 7:32:25 PM8/1/12
to stacks...@googlegroups.com
I've been trying to decide how best to apply process_radtags to filter illumina reads prior to downstream analysis in Stacks and I have a question.

The quality scores for illumina reads often end in a short string of characters that decode to a phred score of 2.  ("#" in the current phred33, "B" previously) 
To Illumina, this is a phred score with a special meaning.
From Illumina:
"
At the ends of some reads, quality scores are unreliable. Illumina has an algorithm for identifying these unreliable runs of quality scores, and we use a special indicator to flag these portions of reads A quality score of 2, encoded as a "B", is used as a special indicator. A quality score of 2 does not imply a specific error rate, but rather implies that the marked region of the read should not be used for downstream analysis. Some reads will end with a run of B (or Q2) basecalls, but there will never be an isolated Q2 basecall. 
"

More than a few of my sequencing reads have quality score lines that look something like this at the end (simplified for effect):

@@@@@@@@@@@@@#####

@ = phred score 31
# = phred score 2

With the default values in process_radtags, a read like this would be kept and used.  A relatively small run of 5 x phred 2 will not bring the window phred average down low enough to kick out the read, because the adjacent to region is of high quality. 

I'm looking for suggestions on what to do in this situation.  The options as I see them are below. For now I need to keep each sequence the same length:

1) Go with the default process_radtags setting and keep these reads in.  I would expect to find an excess of SNPs near the radtag ends if these reads actually have a higher error rate.  Not sure how much a problem this might be.  It might be possible to later exclude SNPs at these positions.

2) Increase the stringency of process_radtags, so that more of these reads are removed.  I could lessen the window size or increase the necessary phred average.  This would remove some, but the process_radtags framework doesn't seem to allow exclusion of all such reads easily, especially those with only 1 or 2 "#"s. 

3) I could exclude all reads with any trailing phred 2 scores, likely before process_radtags.  I could couple this with trimming all read ends, if desired.

Any other suggestions are welcome, how are others dealing with reads like these?



Thanks,
Ryan

Julian Catchen

unread,
Aug 7, 2012, 12:27:33 AM8/7/12
to stacks...@googlegroups.com
Hi Ryan,

The first thing I would do is run Stacks (and process_radtags) with the default
parameters. The reason is that process_radtags will remove the worst of the
reads, but then Stacks will require a certain depth of identical reads to
declare a stack, and the maximum likelihood SNP model can handle a reasonable
amount of additional error. So, unless you had a really bad Illumina run (which
can happen), the overall pipeline should be able to handle the error potentially
introduced by the quality scores of 2.

Whether it worked or not should be relatively obvious by looking at the pipeline
output in the web interface. If you have runs of SNPs at the 3' end of your
reads, then you may need to filter more, but in my experience, this is not a
problem. You can increase the stringency of process_radtags and this will
further reduce the number of reads, but you should do it while comparing the
results against a default run of Stacks to try to get a feel for how the data is
reacting.

If the community is interested, I could add a switch to process_radtags to
discard reads with any quality scores of 2, however, I think this may discard a
lot of valid reads, has anyone counted the percentage of their Illumina lanes
that contain quality scores of 2?

Cheers,

julian

Ryan Waples

unread,
Aug 7, 2012, 1:51:13 PM8/7/12
to stacks...@googlegroups.com
Julian,

Thanks for the reply.

Quick outline of percent reads containing at least one "#" character
after process_radtags for my current dataset:
I have about 52 million paired end reads (26M each P1 and P2) these
have been filtered by process_radtags, paired end mode with default
settings and all have matching or rescued barcodes. They have not
been trimmed at all, the RE is still present on P1.
About 5.5% of the retained P1 reads still have at least one "#"
character, and about 3.6% of the P2 reads do. I'm not sure if this is
'good' or 'bad', but there are over 2 million reads like this. This
may be small in percentage terms but they do represent a lot of data
to leave on the cutting floor if STACKS really is error tolerant.

I used the above data to assemble paired-end contigs outside of
stacks, so I don't have any cstacks results to report for that set,
but I have attached a .pdf histogram of # of SNPs per position for
another set of RAD data (plotting "SNP Column" from cstacks output).
You can certainly see a pattern of increasing SNP density near the
read end. Back of the envelope calculations put it at about 5%
'extra' SNPs spread across the last 10 positions. I have seen similar
plots in at least 3 other independent rad projects with different
setups and filtering steps. Does this pattern match what you or other
people are seeing? My sense is that this issue *might* be a bit more
acute when using a catalog constructed from many individuals, but I'm
not sure about that.

The graphed catalog was constructed from 24 individuals from 6
populations and was quality filtered with a (more stringent) script
than process_radtags. The script removed all the reads with phred2s,
so I don't think they are the only cause of this pattern. I'm trying
to figure out how 'clean' your data should be before running stacks,
and the best way to clean it.

-Ryan
> --
> For more options or to unsubscribe:
> http://groups.google.com/group/stacks-users
> Stacks website: http://creskolab.uoregon.edu/stacks/
batch_3.catalog.snps.pdf

bbarker505

unread,
Oct 6, 2013, 7:18:30 PM10/6/13
to stacks...@googlegroups.com, jcat...@uoregon.edu
Hi Julian and Ryan,

Not sure if either of you will read this since it's a response to a fairly old post, but I've been trying to learn more about why I seem to always get a lot of R2 reads that start out good but then quickly plummet in quality to the Phred score of 2 (# in the fastq file) for my ddRAD libraries that are run on an Illumina HiSeq. After talking with Illumina tech support I came to the conclusion that getting these low quality R2 reads is inevitable unless I use a MiSeq (because cluster density is lower) or spike a significant amount of PhiX into the sample. The reason for the low quality apparently has to do with cluster re-synthesis in R2, which makes the cluster larger, and the low diversity of the RAD library. Essentially the large size of the clusters make it difficult for the the machine to confidently make base calls because the clusters are difficult to discriminate from each other. Has anyone else had this issue?

I thought that a Phred score of 2 was horrible, so I'm not even sure if my R2 reads are usable.  Has anyone had this issue with RADseq and figured out a way to good quality paired-end data on Illumina HiSeq without spiking with PhiX?

We have been using a different cleaning pipeline prior to running Stacks so I can't comment on the process_radtags issue.

Thanks,
Brit

Ryan Waples

unread,
Oct 6, 2013, 7:21:58 PM10/6/13
to stacks...@googlegroups.com

Have you thought about trimming the reads such that most reads will only include a few 2s at most?

--
Stacks website: http://creskolab.uoregon.edu/stacks/
---
You received this message because you are subscribed to the Google Groups "Stacks" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stacks-users...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Brittany Barker

unread,
Oct 7, 2013, 10:01:33 AM10/7/13
to stacks...@googlegroups.com
Hey Ryan,
Yea, I tried your trimming idea but the issue was that there were a lot of R2's that were entirely poor quality, which led to them getting tossed out in my cleaning pipeline. Thus, many R1s were missing their mate. For this reason I have only been analyzing the R1 data. If you or anyone else who reads this have obtained good quality R2 data (i.e. the entire sequence has bases w/ Phred score > 25 or so) from a HiSeq without having to use PhiX I'd really like to hear what the trick is! I'm using a modified version of Peterson et al's (2012) ddRAD protocol.
Thanks,
Brit


You received this message because you are subscribed to a topic in the Google Groups "Stacks" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/stacks-users/W7qlQmthICU/unsubscribe.
To unsubscribe from this group and all its topics, send an email to stacks-users...@googlegroups.com.

For more options, visit https://groups.google.com/groups/opt_out.



--
Brittany Barker, Ph.D.
Department of Ecology and Evolutionary Biology 
The University of Arizona
Tucson, AZ 85721
Ph: 505-205-4251
Homepage: http://brittanysbarker.org
Reply all
Reply to author
Forward
0 new messages