Hi Jeremy,
OK, yeah, I think you're right about all of this, and in that case it's not surprising that you'd be retaining more reads with Stacks, as yes, it does use a sliding window. As currently written, AftrRAD is probably overly conservative about throwing reads out. I've intended to make this more flexible (we played around with a sliding-window approach in the past), but just haven't ever gotten anything formally incorporated yet. I'll try to circle back to that soon, but no promises. In the meantime, a few things you could try…
1.) Probably not the case here, since you said in your first message that average base quality is generally high except at the restriction site, but we did have one dataset in which one specific position in the read was consistently bad. Fortunately, it was one of the last positions in the read, so we just trimmed the reads back to get rid of it, and that took care of the issue. If you happen to be in this situation, let me know and I can send along the script we used to trim the reads back - it's a pretty simple script and could easily be adapted for your case. A rough sketch of that kind of trimming is below.
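Just for illustration (this isn't our actual script - 'TRIM_TO' and the file arguments here are placeholders), something like this Python would do it:

# Minimal sketch: trim every read in a FASTQ file to a fixed length,
# to drop a consistently low-quality position near the end of the read.
import sys

TRIM_TO = 90  # placeholder; keep the first 90 bases, adjust for your read length

def trim_fastq(in_path, out_path, length):
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            header = fin.readline()
            if not header:
                break
            seq = fin.readline().rstrip("\n")
            plus = fin.readline()
            qual = fin.readline().rstrip("\n")
            # Truncate both the sequence and its quality string.
            fout.write(header)
            fout.write(seq[:length] + "\n")
            fout.write(plus)
            fout.write(qual[:length] + "\n")

if __name__ == "__main__":
    trim_fastq(sys.argv[1], sys.argv[2], TRIM_TO)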
2.) As you say, you could lower the 'minQual' value. I've thought about this in the past, and I think you could actually set it to '1' (effectively making it irrelevant) and things might still work OK. This is because AftrRAD has two separate approaches built in for identifying and removing error reads. The first is the traditional approach based on quality scores. The second was originally intended to deal specifically with artifacts that arise during PCR - these look real to the sequencer, and so can have very high quality scores, but are still "error" reads. We use the 'minDepth' parameter to try to identify these, on the assumption that they occur at lower frequencies than true reads. Since error reads arising from the sequencer (i.e. error reads associated with low quality scores) would presumably also occur at lower frequencies than true reads, they could be dealt with by 'minDepth' alone, in the same way as PCR artifacts, theoretically eliminating the need for the 'minQual' parameter altogether. The biggest downside of this approach is that it may somewhat increase the RAM requirements at a couple of points in the run, so that would be worth keeping an eye on. The depth-filtering idea is sketched below.
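To make the idea concrete (this is just a sketch of the depth-filtering concept, not AftrRAD's actual code, and 'MIN_DEPTH' is a placeholder threshold):

# Sketch of depth-based filtering: collapse identical reads and treat
# any unique sequence seen fewer than MIN_DEPTH times as a likely
# error read (sequencing error or PCR artifact) and drop it.
from collections import Counter

MIN_DEPTH = 5  # placeholder; tune for your coverage

def filter_by_depth(reads, min_depth=MIN_DEPTH):
    counts = Counter(reads)  # depth of each unique sequence
    kept = {seq: n for seq, n in counts.items() if n >= min_depth}
    dropped = sum(n for seq, n in counts.items() if n < min_depth)
    return kept, dropped

# Example: a singleton artifact is removed, the true read is retained.
reads = ["ACGT"] * 20 + ["ACTT"]  # one error read among 20 true reads
kept, dropped = filter_by_depth(reads)
print(kept)     # {'ACGT': 20}
print(dropped)  # 1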
Note that reducing the 'minQual' value and relying just on 'minDepth' may be a bit deceptive. The Report file will tell you it's using essentially all of the reads for the analysis (probably a number similar to what you get from Stacks). However, a lot of those reads will then be thrown out later, when 'minDepth' is evaluated. Basically, for each read that has at least one low quality score (i.e. each read that was thrown out in your first run), if that low-quality position results in an incorrect base call, you'll end up with an error read, and it will probably be thrown out because its average depth doesn't exceed 'minDepth' (and we don't currently keep track of how many are thrown out at this step). If the sequencer got the base right despite the low quality score, the read will be retained. So, in your case, you'll end up keeping something between the 44 million reads and the 137 million from Stacks, but I'm not sure where in that range it would fall.
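If you want a rough sense of how many reads sit in that gray zone, you could count the reads with at least one base below the cutoff - those are the ones 'minQual' would discard but 'minDepth' might still rescue. A quick sketch, assuming standard 4-line FASTQ with Phred+33 quality encoding and a placeholder cutoff:

# Count reads containing at least one base below MIN_QUAL.
MIN_QUAL = 20  # placeholder cutoff

def count_low_qual_reads(fastq_path, min_qual=MIN_QUAL):
    low, total = 0, 0
    with open(fastq_path) as f:
        for i, line in enumerate(f):
            if i % 4 == 3:  # quality line of each 4-line record
                total += 1
                quals = [ord(c) - 33 for c in line.rstrip("\n")]
                if min(quals) < min_qual:
                    low += 1
    return low, total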
3.) A third option would be to set 'minQual' very low and 'minDepth' also very low. This would retain essentially all the reads for the analysis, and most of the error reads would instead be dealt with during genotyping, which should still be OK. The downside here is simply efficiency - the main reason for identifying and removing error reads early on is so we don't waste time aligning them.
One more thought - it's probably worth taking a look at some of the files in TempFiles/RawReadCountFiles/, even from the run where you retained 44 million reads. The number of reads you really need is of course very dependent on your specific project (mainly the number of loci and individuals). If the read depths in those files look good, I personally wouldn't get too bogged down trying to recover every last read possible. If they look lower than you'd like, you can get a sense there of how many more reads you might need (i.e. 2X, 5X, 10X, etc.). A quick way to summarize them is sketched below.
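Something like this would give you quick summary stats per file - note I'm assuming here that each file has one integer count per line, so check the actual format before relying on it:

# Summarize read depths across the count files (format assumed above).
import glob, statistics

def summarize_depths(pattern="TempFiles/RawReadCountFiles/*"):
    for path in glob.glob(pattern):
        counts = [int(x) for x in open(path) if x.strip()]
        if counts:
            print(path,
                  "mean:", round(statistics.mean(counts), 1),
                  "median:", statistics.median(counts),
                  "min:", min(counts), "max:", max(counts))

summarize_depths()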
Guess those are my thoughts/ramblings for now. Hopefully some of that makes sense. Like I said above, I'll try to look into adding some flexibility in screening reads based on 'minQual' - I can maybe look at it this weekend, and will let you know if I make any progress.
Mike