Hello Stacks Team,
I am working on an analysis similar to Geoffrey's and, following the recommendations of Schmidt et al. (2021), I would like to calculate genetic diversity (pi) across all sites, both monomorphic and polymorphic. I know that this information is included in the output files from populations (e.g., under "# All positions" in the populations.sumstats_summary.tsv file). What I'm trying to figure out now is where, and how much, bias I might be introducing in our de novo workflow, in particular by treating SNPs or SNP-containing loci differently from monomorphic sites/loci. This will help me understand what metric I'm actually reporting in the upcoming manuscripts and offer the relevant caveats if and when these values are used to inform policy.
A few questions that come to mind include:
- Is there a way to include entirely monomorphic loci in calculations of pi (if populations isn't already doing this)?
- Otherwise it seems like we might be biasing our estimates of pi upward if we only include loci that have at least one SNP in the calculations (right?).
- On the other hand, doesn't setting n to a maximum value necessarily introduce downward bias to calculations of pi?
- I know this is unavoidable to a degree, even after optimizing m, M, and n. Just checking my understanding.
- How does Stacks deal with missingness in calculations of genetic diversity and related statistics?
- If we specify a whitelist of SNPs that passed external filtering programs, what would happen to SNPs that did not pass filters but sit on the same locus as a SNP that did? Would Stacks keep the locus and remove the "bad SNPs" (which could deflate genetic diversity estimates), or would these sites be marked as missing and excluded from calculations of pi? Or something else?
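To make the concern in the first two bullets concrete, here is a minimal Python sketch of the arithmetic I have in mind. It uses the standard unbiased per-site estimator of pi computed from allele counts; I am assuming, not asserting, that this matches what populations computes internally. The function names and the toy data are hypothetical, and sites with fewer than two sampled alleles are simply skipped as missing, which is my own simplification.

```python
from math import comb

def site_pi(allele_counts):
    """Unbiased per-site pi: 1 - sum_i C(n_i, 2) / C(n, 2)."""
    n = sum(allele_counts)
    if n < 2:
        return None  # too few sampled alleles; treat the site as missing
    same_allele_pairs = sum(comb(c, 2) for c in allele_counts)
    return 1.0 - same_allele_pairs / comb(n, 2)

def mean_pi(sites, variant_only=False):
    """Average per-site pi, optionally dropping monomorphic sites."""
    values = []
    for counts in sites:
        pi = site_pi(counts)
        if pi is None:
            continue  # missing site: excluded from numerator and denominator
        if variant_only and pi == 0.0:
            continue  # monomorphic site dropped from the average
        values.append(pi)
    return sum(values) / len(values) if values else float("nan")

# Hypothetical data: 100 sites, 10 sampled alleles each, only 2 sites variable.
sites = [(5, 5), (8, 2)] + [(10, 0)] * 98

all_sites_pi = mean_pi(sites)                        # averaged over all 100 sites
variant_only_pi = mean_pi(sites, variant_only=True)  # averaged over the 2 SNPs

print(f"pi over all sites:     {all_sites_pi:.5f}")
print(f"pi over variant sites: {variant_only_pi:.5f}")
```

In this toy case the variant-only average is 50 times the all-sites average, purely because the denominator shrinks from 100 sites to 2. That is the kind of upward bias I would like to avoid or at least quantify.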
I have a zillion other nuanced questions but don't want to wander too far into the weeds. So I suppose my overarching question is whether any of you wise Stacks developers/maintainers have suggestions for obtaining minimally biased calculations of genetic diversity spanning all (reliably sequenced) sites, both monomorphic and polymorphic. Depending on the species, we are using either a de novo or an integrated approach, as reference genomes are tough to come by. I understand that some bias is impossible to avoid, but I would welcome suggestions for how to minimize it.
Many thanks for your time and for considering this inquiry.
Sincerely,
John