Heterozygosity data

Peter G


Oct 20, 2021, 9:00:09 AM

to Stacks

I feel this topic may have come up before, but I'm unable to find anything in the search that covers it.

After using Stacks, we have been pulling the heterozygosity data from sumstats.tsv and comparing observed against expected (distributions for 3 species shown below).

I'm wondering why the expected values are restricted to between roughly 0.09 and 0.5, and whether we should be doing something with the observed values so that they fall within these bounds too?

We've been pulling every data point from the sumstats.tsv file, but perhaps we should be conditionally filtering out some rows first?

Any advice would be appreciated.

Catchen, Julian


Oct 20, 2021, 5:37:35 PM

to stacks...@googlegroups.com

The expected heterozygosity is calculated based on Hardy-Weinberg Equilibrium: 2pq. Since p and q (the frequencies of the two alleles at a nucleotide position) must sum to 1, 2pq is largest when both are 0.5, so the maximum value for expected heterozygosity is 2 * 0.5 * 0.5 = 0.5. Here is a random slide from Google (https://slideplayer.com/slide/15034095/), one of many.


If you have collapsed paralogous loci, you might have an observed heterozygosity that is higher than expected (nucleotide positions that are being called as SNPs but are really fixed differences between two or more collapsed loci). You might consider applying the --max-obs-het filter in the populations program to remove these potentially confounded SNPs.
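
As a rough numerical check of the 2pq bound described above, here is a minimal Python sketch. The function name `expected_het` and the example frequency of 0.05 are illustrative assumptions, not anything taken from Stacks or from the data in this thread. It only evaluates 2pq for a biallelic site and does not reproduce how the populations program computes its statistics.

```python
def expected_het(p: float) -> float:
    """Expected heterozygosity 2pq for a biallelic site, with q = 1 - p.

    Hypothetical helper for illustration only; not part of Stacks.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must be in [0, 1]")
    return 2.0 * p * (1.0 - p)


# Scan allele frequencies from 0 to 1 in steps of 0.01.
freqs = [i / 100 for i in range(101)]
values = [expected_het(p) for p in freqs]

# The frequency at which 2pq peaks, and the peak value itself.
best_p = freqs[values.index(max(values))]
print("2pq peaks at p =", best_p, "with value", max(values))

# A floor on the minor allele frequency also puts a floor on 2pq.
# 0.05 is an arbitrary example value, not a setting from this thread.
print("2pq at p = 0.05:", round(expected_het(0.05), 4))
```

The scan peaks at p = 0.5 with a value of 0.5, so an observed heterozygosity above 0.5 cannot come from 2pq at a biallelic site. The last line shows that any lower limit on allele frequency, whether from a filter or from sample size, also sets a lower limit on the expected value.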


Peter G


Oct 21, 2021, 6:03:28 AM

to Stacks

Thank you Julian,

That makes a lot of sense. We have used 0.7 for --max-obs-het (and --min-maf 0.014), but given what you say about HWE, do the loci with observed heterozygosity between 0.5 and 0.7 indicate a need for more stringent filtering, or are they likely to be biologically real, representing SNPs that are under selection and thus not in HWE? It's unclear to me whether, before testing for a significant difference between observed and expected heterozygosity in the populations, I need to consider all the SNPs here or impose further filtering/conditions.
