Hi Dan,
Your observations are quite sharp. As you probably know, there are two ways to
compute expected heterozygosity. The first is to invoke Hardy-Weinberg, as you
mention below, and you compute 2pq, or 2 times the frequency of the p allele
times the frequency of the q allele.
But, Pi is also considered equivalent to expected heterozygosity and we compute
it differently, using binomial coefficients:
pi = 1 - [(p choose 2) + (q choose 2) / (n choose 2)]
which is equivalent to the number of ways to choose a p homozygote plus the
number of ways to choose a q homozygote divided by the total number of ways to
select two alleles from the population.
Anyway, these two calculations are usually very similar, but can differ by a few
hundred thousandths.
When I originally wrote the summary statistics code, I wanted to have both types
of calculations in the output so we could sanity check our results, and hence
the code sets expected homozygosity = 1 - pi, instead of 1 - expected
heterozygosity.
In hindsight this seems confusing, so I have changed it and I will make a
release shortly.
All of these values can be recomputed from the raw data in the
batch_X.sumstats.tsv file. It is trivial to open this file in Excel or R and
recompute Pi, exp het, exp hom, from the inputs. You can also verify the
observed counts for p and q in the web interface.
I rechecked our calculations this morning and I believe everything is working
correctly beyond what I described above.
Best,
julian
On 1/27/14, 7:47 AM, Dan Wang wrote:
> Hi all,
>
> I'm using the population function of stacks, and got a question about the
> calculation of expected heterozygosity and homozygosity.
>
> According to the supporting information of the stacks population paper, the
> expected heterozygosity (/2pq/) and homozygosity (/1 - 2pq/) are calculated