Depth/quality info for invariant sites

93 views
Skip to first unread message

Geoffrey Walker

unread,
May 19, 2025, 2:43:32 PMMay 19
to Stacks
Hello, 

I am hoping to run some analyses using full RAD loci in addition to the traditional "One SNP per locus" approach. Specifically, I have aligned my de novo stacks output to a related reference genome using stacks-integrate-alignments and would like to calculate a variety of 
genomic diversity statistics using
 Pixy (https://pixy.readthedocs.io/en/latest/about.html). 

However, invariant sites output with --vcf-all from populations do not include depth or quality information, leaving invariant RAD loci (loci with no SNPs present) fully without depth/quality information as well. I would ideally like to filter these invariant RAD loci alongside variant loci using the same depth/quality for each to avoid introducing any bias into my calculations. But without depth information output by stacks, I'm not sure how to go about this.  

I did see this question asked a couple years ago (https://groups.google.com/g/stacks-users/c/m_0cW15_TIc) without a definitive answer, so I was wondering if anyone has done something similar and figured out a way to extract this depth/quality information from the Stacks output. 

Thank you,
Geoffrey

John Swenson

unread,
Jun 23, 2025, 11:57:16 AMJun 23
to Stacks
Hello Stacks Team,

I am working on a similar analysis as Geoffrey and, following the recommendations of Schmidt et al. (2021), I would like to calculate genetic diversity (pi) of all sites, both monomorphic and polymorphic. I know that this information is included in output files from populations (e.g., under # All positions in the populations.sumstats_summary.tsv file) What I'm trying to figure out now is where/how much bias I might be introducing in our de novo workflow, in particular by treating SNPs or SNP-containing loci differently than monomorphic sites/loci. This will help me understand what metric I'm actually reporting in the upcoming manuscripts and offer the relevant caveats for if/when these values are used to inform policy.

A few questions that come to mind include:
  • Is there a way to include entirely monomorphic loci in calculations of pi (if populations doesn't already doing this)?
    • Otherwise it seems like we might be biasing our estimates of pi upward if we only include loci that have at least one SNP in the calculations (right?).
    • On the other hand, doesn't setting n to a maximum value necessarily introduce downward bias to calculations of pi? 
      • I know this is unavoidable to a degree, even after optimizing m, M, and n. Just checking my understanding.
  • How does Stacks deal with missingness in calculations of genetic diversity, etc.
    • If we specify a whitelist of SNPs that passed external filtering programs, what would happen to SNPs that did not pass filters i.e., SNPs that are present on the same locus as a SNP that did pass filters? Would Stacks keep the locus and remove the "bad SNPs" (which could deflate genetic diversity estimates), or would these sites be marked as missing and excluded from calculations of pi? Or other?
I have a zillion other nuanced questions but don't want to wander too far into the weeds. So I suppose my overarching question is whether any of you wise Stacks developers/maintainers have suggestions for how to go about obtaining minimally biased calculations of genetic diversity spanning all (reliably sequenced) sites, both monomorphic and polymorphic? Depending on the species, we are either using a de novo or integrated approach, as reference genomes are tough to come by. I understand that some bias is impossible to avoid, but wondering about suggestions for how to minimize it.

Many thanks for your time and for considering this inquiry.

Sincerely,
John
Reply all
Reply to author
Forward
0 new messages