expected heterozygosity and homozygosity

562 views
Skip to first unread message

Dan Wang

unread,
Jan 27, 2014, 10:47:45 AM1/27/14
to stacks...@googlegroups.com
Hi all,

I'm using the population function of stacks, and got a question about the calculation of expected heterozygosity and homozygosity. 

According to the supporting information of the stacks population paper, the expected heterozygosity (2pq) and homozygosity (1 - 2pq) are calculated under Hardy-Weinberg Equilibrium, meaning they should add up to 1. But when I looked into my batch_x.sumstats.tsv file, I found that in many cases, the sum of the two were not 1. On the other hand, the sum of Exp Hom and Pi equals 1 constantly. Is there a reason behind it?

Here are some examples:
# Batch ID  Locus ID Chr BP Col Pop ID P Nuc Q Nuc N P Obs Het Obs Hom Exp Het Exp Hom Pi
1 3 DS611871 12863 139 1 G A 4 0.5 0 1 0.5 0.428571 0.571429
1 3 DS611871 12863 139 2 A G 5 0.8 0 1 0.32 0.644444 0.355556
1 3 DS611871 12863 139 3 G A 8 0.5 0.25 0.75 0.5 0.466667 0.533333
1 3 DS611871 12864 138 2 G A 5 0.6 0.4 0.6 0.48 0.466667 0.533333
1 3 DS611871 12866 136 1 A 5 1 0 1 0 1 0
1 3 DS611871 12866 136 2 A T 5 0.8 0 1 0.32 0.644444 0.355556
1 3 DS611871 12866 136 3 A 10 1 0 1 0 1 0
1 3 DS611871 12871 131 1 T C 5 0.6 0.4 0.6 0.48 0.466667 0.533333
1 3 DS611871 12871 131 2 T C 5 0.6 0.4 0.6 0.48 0.466667 0.533333
1 3 DS611871 12871 131 3 C T 10 0.8 0.4 0.6 0.32 0.663158 0.336842
1 3 DS611871 12872 130 1 T C 5 0.6 0.4 0.6 0.48 0.466667 0.533333
1 3 DS611871 12872 130 2 T C 5 0.6 0.4 0.6 0.48 0.466667 0.533333
1 3 DS611871 12872 130 3 C T 10 0.8 0.4 0.6 0.32 0.663158 0.336842
1 3 DS611871 12883 119 1 A 5 1 0 1 0 1 0

Any insights would be great. Thanks!

Dan

Julian Catchen

unread,
Jan 28, 2014, 4:27:38 PM1/28/14
to stacks...@googlegroups.com, wangda...@gmail.com
Hi Dan,

Your observations are quite sharp. As you probably know, there are two ways to
compute expected heterozygosity. The first is to invoke Hardy-Weinberg, as you
mention below, and you compute 2pq, or 2 times the frequency of the p allele
times the frequency of the q allele.

But, Pi is also considered equivalent to expected heterozygosity and we compute
it differently, using binomial coefficients:

pi = 1 - [(p choose 2) + (q choose 2) / (n choose 2)]

which is equivalent to the number of ways to choose a p homozygote plus the
number of ways to choose a q homozygote divided by the total number of ways to
select two alleles from the population.

Anyway, these two calculations are usually very similar, but can differ by a few
hundred thousandths.

When I originally wrote the summary statistics code, I wanted to have both types
of calculations in the output so we could sanity check our results, and hence
the code sets expected homozygosity = 1 - pi, instead of 1 - expected
heterozygosity.

In hindsight this seems confusing, so I have changed it and I will make a
release shortly.

All of these values can be recomputed from the raw data in the
batch_X.sumstats.tsv file. It is trivial to open this file in Excel or R and
recompute Pi, exp het, exp hom, from the inputs. You can also verify the
observed counts for p and q in the web interface.

I rechecked our calculations this morning and I believe everything is working
correctly beyond what I described above.

Best,

julian

On 1/27/14, 7:47 AM, Dan Wang wrote:
> Hi all,
>
> I'm using the population function of stacks, and got a question about the
> calculation of expected heterozygosity and homozygosity.
>
> According to the supporting information of the stacks population paper, the
> expected heterozygosity (/2pq/) and homozygosity (/1 - 2pq/) are calculated
Reply all
Reply to author
Forward
0 new messages