Hi Matthew,
A few thoughts:
Finally, once all of that is complete, you want to focus on coverage for your individual samples, and the amount of missing data. Only after all this looks good should you be focused on population genetic results (e.g. HWE).
julian
Hi Matt,
You cannot use the --rm-pcr-duplicates flag for ddRAD data, it will just cause all data to be thrown away (since identifying PCR duplicates involves finding pairs of reads with the exact same starting coordinates, and ddRAD reads at each locus all have the same starting coordinates.
The clone_filter program will remove the degenerate bases if you allow it to, we don’t typically remove the cutsite, but leave it to be used in the denovo assembly process. You can give the 3’ adapter sequence to process_radtags and allow it to remove reads that contain it. If you have adapter frequently, it may be better to trim all the reads uniformly, but I would try to avoid that in your first attempts.
julian
From: stacks...@googlegroups.com <stacks...@googlegroups.com> on behalf of Matthew Penney <matthew....@gmail.com>
Date: Wednesday, December 8, 2021 at 5:43 PM
To: Stacks <stacks...@googlegroups.com>
Subject: Re: [stacks] hapstats.tsv: All HWE values zero, Haplotype Diversity Greater Than One
Hi Dr. Catchen,
These are double-digested libraries (Msp1 and Pst1) with a forward adapter including a 4bp degenerate sequence and a reverse Y-adapter. The individual samples were identified using combinatorial barcodes, which were added via the PCR primers. Since clone_filter didn't seem to be doing much to remove PCR Duplicates based on FastQC results maybe it's better to keep the --rm-pcr-duplicates flag.
The reason I was using Trimmomatic is that Genome Quebec does not remove adapter sequence from the sequencing reads and my reads had adapter at the 5' end (the degenerate bases and cut site) and, unfortunately, adapter appearing at the 3' end according to FastQC. The intention was to use Trimmomatic to cut out adapters and any additional low-quality sequence that was showing up in the FastQC results. Low-quality sequence at the ends was also the reason for truncating the reads. It's possible that I could keep more, though. 75bp just seemed common in the literature.
That "m = 10" mistake was me jumping back and forth between Linux and R. I corrected that awhile ago but pulled the code from the wrong script. My bad. I'll take a look at the paper you mentioned.
One issue I do have is that the number of reads differs widely between some samples. I've considered taking the average and SD for total reads and then excluding anything that falls out of range. If you have any suggestions for dealing with this I'd appreciate any advice.
Cheers,
Matt Penney
On Wednesday, December 8, 2021 at 5:33:28 PM UTC-4 jcatchen wrote:
Hi Matthew,
A few thoughts:
1. It is not clear why you are using clone_filter, since these are paired-end data (what kind of RAD libraries are these?) you can let gstacks in the denovo_map.pl pipeline handle filtering PCR duplicates. Of course, if you have degenerate oligos marking your independent molecules, then using clone_filter makes sense. However, this command line option does not make sense: “--retain_oligo GANNNNGTTGCA”, as --retain_oligo is a flag and does not accept an argument. I don’t know what effect that would have, but not what you would like.
2. There is no need to run trimmomatic at all, let process_radtags handle that for you.
3. Why are you trimming to 75bp? Why not keep the entire length of the reads?
4. You specified the “--rm-pcr-duplicates” flag to denovo_map.pl anyway, when in principle, you accomplished this with clone_filter. I suggest using one method or the other, not both.
5. You specified “-X "ustacks: m = 10"” flag to denovo_map.pl and this is not formed properly. Moreover, it rarely, if ever would make sense to set -m to 10. Please see our optimization paper by Josie Paris, the stacks 2 paper, or our latest methods chapter on Biorxiv for a discussion.
Finally, once all of that is complete, you want to focus on coverage for your individual samples, and the amount of missing data. Only after all this looks good should you be focused on population genetic results (e.g. HWE).
julian