Expected heterozygosity: --hardy vs --freq

Gabriele Sgarlata

Oct 7, 2025, 5:32:04 AM

to plink2-users

Hi there,

I am interested in estimating expected heterozygosity for each variant site of my dataset.

Since I am also interested in the F coefficient, I tested two plink2 options: --hardy and --freq.

I noticed that the expected heterozygosity, computed as 2*p*(1-p), is different between the two approaches.

I dug a bit further and figured out that the difference is perhaps because --hardy ignores individuals with missing data at a given variant site. Thus, it computes heterozygosity based only on the individuals that are called at that site.

I concluded that --freq does not do the same.

Did I understand correctly?

Thank you,

Gabriele

Chris Chang

Oct 7, 2025, 9:09:57 AM

to Gabriele Sgarlata, plink2-users

Please post full .log file(s) when asking for troubleshooting help.

In this case, you should also post the .hardy and .afreq output for one variant that illustrates what you’re talking about.

--
You received this message because you are subscribed to the Google Groups "plink2-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to plink2-users...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/plink2-users/c907963c-f53e-4d66-ac5b-595b0d156da8n%40googlegroups.com.

Gabriele Sgarlata

Oct 7, 2025, 10:24:07 AM

to plink2-users

Thank you Chris!

I am posting below the results of the --hardy option (to the left of the red vertical line) and of the --freq option (to the right of the red vertical line), which I imported into R.

I have added four fields: N_tot (HOM_A1_CT + HET_A1_CT + TWO_AX_CT), A1_freq ((2*HOM_A1_CT + HET_A1_CT) / (2*N_tot)), Vx_hardy (2*A1_freq*(1-A1_freq)), and Vx_afreq (2*ALT_FREQS*(1-ALT_FREQS)).
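
For readers who want to reproduce these derived fields outside R, here is a minimal Python sketch of the same arithmetic. The function names and the example counts are made up for illustration; only the column names (HOM_A1_CT, HET_A1_CT, TWO_AX_CT, ALT_FREQS) come from the post above.

```python
def hardy_derived_fields(hom_a1_ct, het_a1_ct, two_ax_ct):
    """Derive N_tot, A1_freq and Vx_hardy from the genotype counts in a .hardy row."""
    # Number of samples with a called (non-missing) genotype at this variant.
    n_tot = hom_a1_ct + het_a1_ct + two_ax_ct
    if n_tot == 0:
        raise ValueError("no called genotypes at this variant")
    # Each A1 homozygote carries two A1 copies, each heterozygote carries one.
    a1_freq = (2 * hom_a1_ct + het_a1_ct) / (2 * n_tot)
    # Expected heterozygosity under Hardy-Weinberg: 2 * p * (1 - p).
    vx_hardy = 2 * a1_freq * (1 - a1_freq)
    return n_tot, a1_freq, vx_hardy


def vx_from_freq(alt_freq):
    """Expected heterozygosity from a single allele frequency (e.g. ALT_FREQS)."""
    return 2 * alt_freq * (1 - alt_freq)


# Hypothetical counts, not taken from the attached screenshot.
n_tot, a1_freq, vx_hardy = hardy_derived_fields(hom_a1_ct=10, het_a1_ct=20, two_ax_ct=70)
print(n_tot, a1_freq, vx_hardy)
```

Comparing `vx_hardy` with `vx_from_freq(ALT_FREQS)` for the same variant is then a direct check of whether the two outputs agree.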

Essentially, Vx_hardy reproduces the expected heterozygosity "E(HET_A1)" obtained from --hardy, whereas Vx_afreq does not reproduce these results.

I suspect that this is because --freq also includes the individuals with missing genotypes.

I am also sending you the log files of the --hardy and --freq analyses.

Thanks,

Gabriele

Screenshot 2025-10-07 at 15.04.46 (2).png

test_hardy.log

test_AF.log

Chris Chang

Oct 7, 2025, 10:48:13 AM

to Gabriele Sgarlata, plink2-users

--freq uses dosages when available; --hardy only looks at hardcalls.
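
To illustrate why the two frequencies can differ, here is a small Python sketch. It is not plink2's implementation: the helper names and dosage values are hypothetical, and the 0.1 cutoff is an assumed hard-call threshold (a dosage farther than that from the nearest integer is treated as a missing hardcall).

```python
def hardcall(dosage, threshold=0.1):
    """Return 0, 1 or 2 if the dosage is within `threshold` of an integer, else None."""
    nearest = round(dosage)
    return nearest if abs(dosage - nearest) <= threshold else None


def freq_from_dosages(dosages):
    """Allele frequency using every sample's dosage (the dosage-based view)."""
    return sum(dosages) / (2 * len(dosages))


def freq_from_hardcalls(dosages, threshold=0.1):
    """Allele frequency using only samples whose dosage yields a hardcall."""
    calls = [c for c in (hardcall(d, threshold) for d in dosages) if c is not None]
    return sum(calls) / (2 * len(calls))


# Hypothetical dosages for six samples; the last two are too uncertain to hardcall.
dosages = [0.0, 0.0, 1.0, 2.0, 0.5, 1.4]
p_dosage = freq_from_dosages(dosages)      # uses all six samples
p_hardcall = freq_from_hardcalls(dosages)  # uses only the first four
print(p_dosage, p_hardcall)
print(2 * p_dosage * (1 - p_dosage), 2 * p_hardcall * (1 - p_hardcall))
```

In this toy case the uncertain samples still contribute to the dosage-based frequency but drop out of the hardcall-based one, so 2*p*(1-p) differs between the two.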

Gabriele Sgarlata

Oct 7, 2025, 10:59:28 AM

to plink2-users

Ok, thanks.

This clarifies my doubts.

Best,

Gabriele
