Question about quality scores

JB

Feb 29, 2016, 6:36:04 PM
to AftrRAD
Hi,

First off, thank you for putting together an easy-to-understand/use pipeline! Running with default parameters, I am finding that too many reads are being thrown out due to low quality scores (almost 70% of reads). I ran FastQC on my data and noticed that base quality is generally high, except for the bases that would correspond to the restriction site, which are quite low. I suspect this is due to low complexity at these positions, since I know this is a problem for Illumina (most of the bases are exactly the same across reads). So, my question is: does AftrRAD remove the restriction site sequence before filtering by quality (i.e., when it removes the barcodes, does it also remove the restriction site sequence)? If not, then I'm guessing my problem is that reads are being tossed because of low-quality base calls in the restriction site, and I would need to remove these bases (any suggestions on how to do this would be welcome). Otherwise, I guess I need to adjust the minQual parameter: is there any way to adjust it so that, for example, it would only throw out a read if a certain number of bases fell below the threshold, rather than just one? It seems pretty stringent to throw out an entire read that might have only one bad base.
Thank you, any input is much appreciated.  

Jeremy  

Mike Sovic

Mar 2, 2016, 10:32:34 AM
to AftrRAD
Hi Jeremy,

Glad to hear the program is helpful!  The quality scores for the restriction site are not evaluated.  However, the program does check the identity of the restriction site in each read, and this can cause reads to be discarded, so I expect you're right that low sequence quality at the restriction sites due to a lack of complexity could cause a problem.  The details are a little different depending on whether your barcodes are all the same length or not…

If your barcodes are all the same length (or if you are running the dplexedData-1 flag)…

In this case, the default in the program is to allow up to one mismatch at the restriction site.  Reads containing RE sites with two or more mismatches are discarded.  If you want to eliminate this behavior, I think you could enter full stops (periods) for the 're' argument.  You'll need one for each base of the RE site in the reads.  For example, if your RE site is 'TGCAGG' (the default in the program), you could use the argument 're-......'.  Alternatively, if your RE site were 'CCAG', you could use 're-....'.  The '.' is basically a wildcard character and will match any base, so the program will not throw out any reads based on the identity of the RE site (but it still needs to know the length of the RE site, which is why the number of periods matters).
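
For illustration, here is a minimal Python sketch of this kind of matching, assuming the read starts at the RE site (i.e., the barcode has already been removed).  This is not AftrRAD's actual code, and the function name is hypothetical; it just captures the logic described above:

def re_site_ok(read, site, max_mismatch=1):
    # Compare the start of the read against the expected RE site.
    # A '.' in 'site' acts as a wildcard and matches any base.
    observed = read[:len(site)]
    mismatches = sum(1 for obs, exp in zip(observed, site)
                     if exp != '.' and obs != exp)
    return mismatches <= max_mismatch

print(re_site_ok("TGCAGGACGTAC", "TGCAGG"))  # True  (exact match)
print(re_site_ok("TGCTGGACGTAC", "TGCAGG"))  # True  (one mismatch, allowed)
print(re_site_ok("TACTGGACGTAC", "TGCAGG"))  # False (two mismatches, discarded)
print(re_site_ok("TACTGGACGTAC", "......"))  # True  (all-wildcard site keeps everything)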

If your barcodes are different lengths…

This makes things a bit more complicated.  I expect doing what I suggest above will lead to some errors in the program.  For reads containing a barcode that is an exact match to one of the barcodes you give the program to search for, things should be fine if you set something like 're-......'.  However, in cases where you have barcodes of different lengths and the barcode for a given read is not an exact match to any of the barcodes you give the program, I expect there will be some problems.  I don't have a great solution at the moment (I'm going to have to think about that a bit more), so if you're in this situation, let me know and I will give it more thought.

A couple other things…

1.)  The number of reads thrown out based on the identity of the RE site is recorded in the file Output/RunInfo/ReportX.txt.  This should give you some idea of how many you're currently losing because of the RE site.

2.)  Just to be sure, keep in mind that the RE site in the reads will not be the full recognition site for the enzyme, but only a portion of it, based on where the cut is made within the recognition sequence.  For example, the default RE site in AftrRAD is 'TGCAGG', which corresponds to SbfI; however, the full recognition sequence for SbfI is 'CCTGCAGG'.  Some folks have confused these in the past, and doing so results in several problems, including exactly the behavior you are seeing, with lots of reads being thrown out because of mismatches at the RE site.
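
If you're unsure which form your reads carry, a quick tally of the bases immediately after the barcode can settle it.  Here is a rough Python sketch, assuming single-end FASTQ data and barcodes of uniform length (the file name and the two lengths are placeholders to adjust for your data):

from collections import Counter
import itertools

FASTQ = "sample.fastq"   # placeholder path
BARCODE_LEN = 6          # adjust to your barcode length
SITE_LEN = 6             # length of the expected residual site (TGCAGG for SbfI)

sites = Counter()
with open(FASTQ) as fh:
    # FASTQ records are 4 lines each; the sequence is the 2nd line.
    for i, line in enumerate(itertools.islice(fh, 400000)):  # first 100,000 reads
        if i % 4 == 1:
            sites[line[BARCODE_LEN:BARCODE_LEN + SITE_LEN]] += 1

for site, n in sites.most_common(5):
    print(site, n)

If 'TGCAGG' dominates, your reads carry the residual SbfI site and the default 're' value is correct; if 'CCTGCA' dominates instead, the reads contain the full recognition sequence, and 're' would need to reflect that.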


             Mike

JB

Mar 2, 2016, 12:19:18 PM
to AftrRAD
Hi Mike, 

Thanks for the reply.  I'm pasting in the report below.  It looks like the problem is that reads are being thrown out because they contain a base below the quality cutoff.  Looking at the FastQC output, the reads seem to be of high quality, with all positions having average quality scores above 32 (most near 40), with the exception of the restriction site region.  So, the only thing I can figure is that although reads are generally of high quality, many reads have a single lower-quality base that is causing them to be kicked out.  As a point of comparison, I had previously run the same dataset through process_radtags in Stacks (which I believe uses a sliding window to assess quality), and 137,353,563 out of 144,307,453 reads were retained, whereas AftrRAD is keeping only 44,220,702.  I want to use AftrRAD, but I'm not sure of the best way to proceed.  Obviously I could lower the quality threshold value, but I'm not sure that will necessarily fix the problem, since a single low-quality base may still exist (I guess I can try and see).  Is there any way to alter the parameter so that it would take more than a single bad base to throw a read out?  Any suggestions would be welcome.  Thanks! Report follows:
Parameters used for this run...

minParalog 5
stringLength 15
dplexedData 0
Phred 33
minDepth 5
numIndels 3
MaxH 90
DataPath Data/
BarcodePath Barcodes/
minIden 90
minQual 20
P2 ATTAGATC
re TGCAGG
Help 0
The total number of reads in the fastq file is 144307453.

You entered ATTAGATC as the beginning of the P2 adaptor.  49278 reads contained this string and were discarded, leaving 144258175 reads for further analysis.

Of these 144258175 sequences, 137742489 reads were an exact match to one of your barcodes and 6515686 were not.

Of the 6515686 not matching a barcode exactly, 2549585 reads had barcodes that could be confidently assigned to one of your barcodes.

This left 140292074 sequences.

Of these 140292074 reads, 2367390 contained restriction enzyme recognition sites that were 2 or more bases away from your recognition site and were discarded, leaving 137924684 reads for further analysis.

Of these 137924684 sequences, 93703982 were removed because they contained at least one base that had a quality score below the minimum value of 20, leaving 44220702 sequences.

Of these 44220702, 15079 additional reads were removed from the dataset because they contained a homopolymer at least 15 bases long.
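
To make the difference concrete, here is a small Python sketch contrasting a per-base filter like minQual with the kind of threshold-count rule I'm asking about.  These are hypothetical functions, not code from either program:

def quals(qual_line, phred_offset=33):
    # Decode an ASCII quality string into Phred scores.
    return [ord(c) - phred_offset for c in qual_line]

def pass_strict(q, min_qual=20):
    # Discard the read if ANY base falls below min_qual.
    return min(q) >= min_qual

def pass_tolerant(q, min_qual=20, max_low=2):
    # Alternative rule: tolerate up to max_low bases below min_qual.
    return sum(s < min_qual for s in q) <= max_low

q = quals("IIIIIIIII(IIIIIIIII")  # one Q7 base in an otherwise Q40 read
print(pass_strict(q))    # False: the single bad base discards the read
print(pass_tolerant(q))  # True:  the read survives the more tolerant rule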
 

Mike Sovic

Mar 3, 2016, 1:00:56 PM
to AftrRAD
Hi Jeremy,

OK, yeah, I think you're right about all of this, and in that case it's not surprising that you'd retain more reads with Stacks, as yes, it does use a sliding window.  Indeed, as currently written, AftrRAD is probably overly conservative in terms of throwing things out.  I've intended to make this more flexible (we played around with a sliding-window approach in the past), but just haven't ever gotten anything formally incorporated yet.  I'll try to get back to that soon, but no promises.  In the meantime, a couple of things you could try…

1.)  Probably not your case, since you said in the first message that average base quality is generally high except at the restriction site, but we had one dataset in which one specific position in the read was consistently bad.  Fortunately, it was one of the last positions in the read, so we just trimmed the reads back to get rid of it, and that took care of the issue.  If you happen to be in this situation, let me know.  I can send along the script we used to trim the reads back; it's a pretty simple script and could easily be adapted for your case (the general idea is sketched after point 3 below).

2.)  As you say, you could lower the minQual value.  I've thought about this in the past, and I think you could actually set it to '1' (effectively making it irrelevant), and things might work OK.  This is because there are two separate approaches built into AftrRAD for trying to identify and get rid of error reads.  The traditional approach based on quality scores is the first.  The second was originally intended to deal specifically with artifacts that arise during PCR.  These look real to the sequencer, and so can have very high quality scores, but are still "error" reads.  We use the 'minDepth' parameter to try to identify these, assuming they occur at lower frequencies than true reads.  Since error reads arising from the sequencer (i.e., error reads associated with low quality scores) would presumably also occur at lower frequencies than true reads, these could be dealt with by the 'minDepth' parameter in the same way as PCR artifacts, theoretically eliminating the need for the 'minQual' parameter altogether.  The biggest downside of this approach is that it may somewhat increase the RAM requirements at a couple of points in the run, so that would be worth keeping an eye on.

Note that reducing the 'minQual' value and relying just on the 'minDepth' value may be a bit deceptive.  The Report file will tell you it's using essentially all of the reads for the analysis (probably a number similar to what you get from Stacks).  However, a lot of these reads will probably then be thrown out later, when 'minDepth' is evaluated.  Basically, for each read that has at least one low quality score (i.e., each read that was thrown out in your first run), if that low quality score reflects an incorrect base call, you'll end up with an error read, and it will probably be thrown out because its average depth doesn't exceed 'minDepth' (and we don't currently keep track of how many are thrown out at this step).  If the sequencer got the base right despite the low quality score, the read will be retained.  So, in your case, you'll end up keeping something between the 44 million reads and the 137 million from Stacks, but I'm not sure where in that range it would be.

3.)  A third option might be to set 'minQual' very low, and 'minDepth' also very low.  This would essentially retain all the reads for the analysis.  In this case, most of the error reads will be dealt with during genotyping, which should still be OK.  The downside to this is simply efficiency.  The main reason for trying to identify and get rid of the error reads early on is so we don't have to waste time aligning them.
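
For reference on point 1, here is a minimal stand-in for the kind of trimming script mentioned there.  This is not the actual script, just a sketch that drops the last N bases (and matching quality scores) from every read in a FASTQ file:

import sys

def trim_fastq(in_path, out_path, trim):
    # Remove the last `trim` bases and quality scores from every read.
    with open(in_path) as fin, open(out_path, "w") as fout:
        for i, line in enumerate(fin):
            line = line.rstrip("\n")
            if trim and i % 4 in (1, 3):  # sequence (2nd) and quality (4th) lines
                line = line[:-trim]
            fout.write(line + "\n")

if __name__ == "__main__":
    # Usage: python trim_fastq.py in.fastq out.fastq 5
    trim_fastq(sys.argv[1], sys.argv[2], int(sys.argv[3]))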

One more thought: it's probably worth taking a look at some of the files in TempFiles/RawReadCountFiles/, even from the run where you retained 44 million reads.  The number of reads you really need is of course very dependent on your specific project (mainly the number of loci and individuals).  If the read depths in these files look good, I personally wouldn't get too bogged down in trying to recover every last read possible.  If they look lower than you'd like, you can get a sense here of how many more reads you might need (i.e., 2X, 5X, 10X, etc.).
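
If it helps, something as simple as the following can summarize those files.  This sketch assumes each file holds one integer read count per line; check the actual format of the files in TempFiles/RawReadCountFiles/ first, as it may differ:

import glob
import statistics

for path in sorted(glob.glob("TempFiles/RawReadCountFiles/*")):
    # Assumes one integer count per line; adjust the parsing if the format differs.
    counts = [int(x) for x in open(path) if x.strip().isdigit()]
    if counts:
        print(path, "mean=%.1f" % statistics.mean(counts),
              "median=%s" % statistics.median(counts))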

Guess those are my thoughts/ramblings for now.  Hopefully some of that makes sense.  Like I said above, I'll try to look into adding some flexibility in how reads are screened based on 'minQual'; I can maybe look at it this weekend, and will let you know if I make any progress.

               Mike

JB

Mar 4, 2016, 12:48:41 PM
to AftrRAD
Hi Mike,

Thanks for all of the advice and time you put into answering my question.  I'm toying with using the reads retained by Stacks and then setting the minQual value low; I figure Stacks should get rid of the obviously bad reads, and AftrRAD may get rid of some more, but not nearly as many as before.  As you say, I should have plenty of SNPs and probably don't need every single read.  Initially, when only the small number of reads was retained, the coverage was too low, but I think this fix should take care of that problem.  I also like that AftrRAD can deal with indel variation, as my taxa sit on the border between a population genetic and a phylogenetic approach, so that capability is valuable to me.
Also, a quick question if you have a chance: is there a way to "kill" the analysis in the middle of a run?  Would this create any problems with files, etc., when I rerun (I assume it just overwrites)?  Thanks!

Jeremy  