Re: Unsure on the quality of my stacks output


Angel Rivera-Colón

Mar 13, 2025, 8:33:11 PM
to Stacks
Hi Alex,

Just confirming a few things.

How did you remove PCR duplicates? For ddRAD, this can normally only be done using UMIs. If your library preparation did not include them, you might be incorrectly removing reads as PCR duplicates, since the traditional ddRAD protocol is incompatible with insert size-based PCR duplicate removal (as done in gstacks, Picard, etc.).
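To make the point about insert size-based removal concrete, here is a minimal Python sketch. In ddRAD both ends of a fragment are set by restriction cut sites, so independent molecules from the same locus share the same coordinates, and a coordinate-based method cannot tell them apart from PCR copies. The read records, coordinates, and UMI strings below are invented for illustration; this is not how gstacks or Picard are implemented.

```python
from collections import namedtuple

# Hypothetical read record: locus name, fragment start/end, and UMI sequence.
Read = namedtuple("Read", ["locus", "start", "end", "umi"])

# In ddRAD both fragment ends are fixed by restriction cut sites, so every
# read pair from a locus has the same start and end coordinates.
reads = [
    Read("locus_1", 100, 400, "AACG"),
    Read("locus_1", 100, 400, "AACG"),  # true PCR duplicate (same UMI)
    Read("locus_1", 100, 400, "GTTA"),  # independent molecule
    Read("locus_1", 100, 400, "CCAT"),  # independent molecule
]


def kept_by_coordinates(reads):
    """Number of reads kept when duplicates are defined by coordinates only."""
    return len({(r.locus, r.start, r.end) for r in reads})


def kept_by_umi(reads):
    """Number of reads kept when duplicates must also share a UMI."""
    return len({(r.locus, r.start, r.end, r.umi) for r in reads})


coordinate_kept = kept_by_coordinates(reads)
umi_kept = kept_by_umi(reads)
print(f"kept by coordinates: {coordinate_kept}, kept by UMI: {umi_kept}")
```

In this toy case the coordinate-based approach collapses three independent molecules into a single read, while the UMI-based approach removes only the one real duplicate.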

Regarding the misphasing rates, from what I see in your data the average misphasing rate is ~14%, which might not be that extreme. Overall, phasing can be affected by several issues, both technical and biological, meaning it might be impossible to phase 100% of loci (even for simulated data). For example, misphasing could be caused by over-merging loci when building a catalog de novo, as you get haplotypes from several loci stacked together. On the biological side, it could be a product of transposable elements and other repeats in the genome, resulting in recent paralogs. Do you have an expectation for how repetitive the genome of your species is? In other words, it is hard to tell from the misphasing rate alone whether there are any issues, and a relatively high number might be expected given the biology of the species. Regardless, Stacks takes the phasing of loci into account when exporting genotypes downstream, and the loci displaying phasing issues are flagged.
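As a rough sketch of how an average misphasing rate like this can be summarized per sample, the Python below assumes you have, for each sample, the number of multi-SNP heterozygous genotypes and how many of them were phased. The sample names, field names, and counts are made up for illustration and are not taken from your run.

```python
# Hypothetical per-sample phasing counts (not real data).
samples = {
    "sample_A": {"n_multisnp_hets": 1000, "n_phased": 880},
    "sample_B": {"n_multisnp_hets": 1200, "n_phased": 1020},
    "sample_C": {"n_multisnp_hets": 800, "n_phased": 680},
}


def misphasing_rate(n_hets, n_phased):
    """Fraction of multi-SNP heterozygous genotypes that could not be phased."""
    if n_hets == 0:
        return 0.0
    return 1.0 - n_phased / n_hets


rates = {
    name: misphasing_rate(c["n_multisnp_hets"], c["n_phased"])
    for name, c in samples.items()
}
mean_rate = sum(rates.values()) / len(rates)

# List samples from highest to lowest rate to spot individual outliers.
for name, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}\t{rate:.3f}")
print(f"mean\t{mean_rate:.3f}")
```

Looking at the per-sample values rather than only the mean can help separate a few problematic samples from a dataset-wide pattern.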

About the excess in relatedness, this is a complex issue and it might be harder to diagnose. Two things I would recommend checking are 1) the proportion of loci in and out of Hardy-Weinberg Equilibrium (HWE) and 2) whether sites in LD have been removed. This depends on the exact method used, but it might be good to check whether the method used to calculate relatedness is sensitive to either of these two factors. For example, some methods might expect neutral loci only and/or might be sensitive to non-independence of markers. Note that populations will do the HWE calculation, but will not explicitly remove loci that are out of HWE. Again, this is a more complex issue and no single filter might solve the problem, but these could be straightforward checks to make using Stacks.
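For the first check, a minimal Python sketch of tallying the proportion of sites out of HWE from genotype counts is below. It uses a plain chi-square goodness-of-fit test with one degree of freedom, which is a simplification chosen for illustration and is not a description of how populations performs its HWE calculation. The genotype counts and the 0.05 cutoff are assumptions.

```python
import math


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) p-value for deviation from HWE at a biallelic site."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic site, nothing to test
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 df.
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical genotype counts per site: (AA, AB, BB).
sites = {
    "site_1": (30, 60, 30),   # matches HWE expectations
    "site_2": (10, 100, 10),  # heterozygote excess
    "site_3": (49, 42, 9),    # matches HWE expectations
    "site_4": (60, 20, 20),   # heterozygote deficit
}

alpha = 0.05  # assumed significance cutoff
out_of_hwe = [s for s, g in sites.items() if hwe_chi2_pvalue(*g) < alpha]
proportion_out = len(out_of_hwe) / len(sites)
print(f"{len(out_of_hwe)} of {len(sites)} sites out of HWE ({proportion_out:.0%})")
```

With real data, many sites are tested at once, so some correction for multiple testing would be worth considering before filtering on these p-values.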

Hope this information is helpful.

Angel

On Wednesday, March 12, 2025 at 3:16:04 PM UTC-7 alex irvine wrote:

I performed ddRADseq on my samples and removed PCR duplicates before de novo analysis, which eliminated approximately 50% of reads before demultiplexing. My dataset consists of 122 individuals, with 116 from one geographic area and 6 from another, as reflected in my population map.

For my Stacks analysis, I followed a combined approach based on:

  1. Rivera-Colón, A.G. & Catchen, J. (2022) – Population Genomics Analysis with RAD, Reprised: Stacks 2 (Methods in Molecular Biology).
  2. Paris, J.R., Stevens, J.R., & Catchen, J.M. (2017) – Lost in parameter space: optimizing de novo RAD-seq assembly using the r80 approach.
Pipeline & Parameters

I ran denovo_map.pl using:

  • -m 3 -M 2 -n 3 as it yielded the most r80 loci.
Concerns & Observations
  1. High Misphasing Rate – My phasing rate summary suggests a high level of misphasing, and I’m unsure how to improve it.
  2. Unexpectedly High Relatedness –
    • After generating a populations VCF file, both CKMRsim and PLINK analyses indicate excessive relatedness.
    • From 122 individuals, ~300 parent-offspring relationships were identified, which seems unrealistic for my population.
  3. Filtering & Preprocessing –
    • I have tried adjusting filtering thresholds in populations (e.g., MAF, MAC, observed heterozygosity).
    • Despite this, the relatedness remains high.
Request for Guidance
  • Could my high misphasing rate be contributing to this?
  • Are there specific filtering steps or parameter adjustments I should try in populations to reduce artificial relatedness?
  • Has anyone encountered similar issues with CKMRsim or PLINK when working with Stacks-generated VCFs?

I've attached key summary statistics from my run, and I’d appreciate any insights on potential issues in my pipeline.

Thanks in advance!
