Understanding --hardy and --missing results for multi-allelic SNPs?

kel...@umich.edu

Jan 22, 2026, 1:48:39 AM
to plink2-users
Hello,

I recently attempted to run SNP QC on a sample of 343,981 people. The data is in .bed format, converted from .bgen to .bed with plink2 (unknown version, conversion happened in 2023). I ran the SNP QC using plink2_linux_amd_avx2_20260110.

I used the --keep and --extract flags to subset each chromosome to the people/SNPs of interest, and ran --afreq, --missing variant-only, and --hardy together in the same run, so the .afreq, .hardy, and .vmiss results for each chromosome are based on exactly the same sample.
I am confused by the results I am getting in the .hardy and .vmiss files (examples shown at the end of this message).

My questions:

1. How should I interpret the count columns in the .hardy file? I can see that HOM_A1_CT + HET_A1_CT + TWO_AX_CT + MISSING_CT (from .vmiss) always equals 343981, my sample size. But if I try to look at the number of CC homozygotes for rs300691, it appears to be 242,073 on the row where it's shown as a C/A SNP, and 333,319 CC homozygotes on the row where it’s shown as a C/T SNP.
(I understand the expected behavior of Plink is to do the HWE test separately for each A1/AX pair for a multi-allelic SNP, but I don't understand why the number of CC homozygotes differs across the two rows.)

2. How can I calculate a missingness rate that combines the rows for a multi-allelic variant? For example, if I want the overall "percent of people missing data on rs534306314", this variant has 4 people missing on the row where it’s shown as a G/T SNP and 672 people missing on the row where it’s shown as a G/A SNP. But these missingness numbers are not consistent with some of the missing genotypes being “people who have the allele not shown on this row”, so I’m not sure how to combine them with the count columns from .hardy and figure out how many people are actually missing data for each SNP.

3. Is it possible that these results indicate a problem with the .bed files or the .bgen -> .bed conversion, or is this just normal “sometimes Plink acts weird with multi-allelic SNPs” behavior?

Thank you,
Kristen

PS. Here are some examples, showing tri-allelic SNPs on chromosomes 2, 3, and 21. I also included a bi-allelic SNP from chromosome 22 for comparison.

Results in the .afreq files:

#CHROM  POS       ID           REF  ALT  ALT_FREQS    OBS_CT
2       180171    rs300691     C    A    0.159614     685516
2       180171    rs300691     C    T    0.0100552    680244
3       255395    rs331869     C    G    0.355103     303670
3       255395    rs331869     C    T    0.987245     687962
21      10790805  rs534306314  G    A    0.000667038  686618
21      10790805  rs534306314  G    T    0            687954
22      16052962  rs376238049  C    T    0.0469835    607298


Results in the .hardy files:

#CHROM  POS       ID           REF  ALT  A1  AX  HOM_A1_CT  HET_A1_CT  TWO_AX_CT  O(HET_A1)   E(HET_A1)   P
2       180171    rs300691     C    A    C   A   242073     91952      8733       0.268271    0.268275    0.99492
2       180171    rs300691     C    T    C   T   333319     6766       37         0.0198929   0.0199082   0.605503
3       255395    rs331869     C    G    C   G   60213      75410      16212      0.496658    0.458009    5.23581e-240
3       255395    rs331869     C    T    C   T   50         8675       335256     0.0252194   0.0251847   0.45652
21      10790805  rs534306314  G    A    G   A   342851     458        0          0.00133408  0.00133319  1
21      10790805  rs534306314  G    T    G   T   343977     0          0          0           0           1
22      16052962  rs376238049  C    T    C   T   275605     27555      489        0.0907462   0.0895521   1.97831e-14
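
As a cross-check on how the count columns relate to the heterozygosity columns, the short sketch below recomputes O(HET_A1) and E(HET_A1) from the three genotype counts on each row. It assumes the expected value is the plain 2pq formula with no small-sample correction; that is an assumption about the column definition, not something stated in the thread, although it does appear to reproduce the rows quoted above. The names `HARDY_ROWS` and `het_rates` are made up for this illustration.

```python
# Rows copied from the .hardy output quoted above:
# (ID, AX) -> (HOM_A1_CT, HET_A1_CT, TWO_AX_CT, reported O(HET_A1), reported E(HET_A1))
HARDY_ROWS = {
    ("rs300691", "A"): (242073, 91952, 8733, 0.268271, 0.268275),
    ("rs300691", "T"): (333319, 6766, 37, 0.0198929, 0.0199082),
    ("rs331869", "G"): (60213, 75410, 16212, 0.496658, 0.458009),
    ("rs331869", "T"): (50, 8675, 335256, 0.0252194, 0.0251847),
    ("rs534306314", "A"): (342851, 458, 0, 0.00133408, 0.00133319),
    ("rs534306314", "T"): (343977, 0, 0, 0.0, 0.0),
    ("rs376238049", "T"): (275605, 27555, 489, 0.0907462, 0.0895521),
}


def het_rates(hom_a1, het_a1, two_ax):
    """Return (observed, expected) heterozygosity for one A1/AX row."""
    n = hom_a1 + het_a1 + two_ax              # genotypes called on this row
    ax_freq = (het_a1 + 2 * two_ax) / (2 * n)  # frequency of the AX allele
    observed = het_a1 / n
    expected = 2 * ax_freq * (1 - ax_freq)     # assumed definition: plain 2pq
    return observed, expected


for (snp, ax), (hom, het, two_ax, rep_obs, rep_exp) in HARDY_ROWS.items():
    obs, exp = het_rates(hom, het, two_ax)
    print(f"{snp} AX={ax}: observed {obs:.6g} (reported {rep_obs}), "
          f"expected {exp:.6g} (reported {rep_exp})")
```

Each row is computed only from its own called genotypes, so two rows for the same ID can have different denominators.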


Results in the .vmiss files:

#CHROM  POS       ID           REF  ALT  MISSING_CT  OBS_CT  F_MISS       F_HETHAP
2       180171    rs300691     C    A    1223        343981  0.00355543   0
2       180171    rs300691     C    T    3859        343981  0.0112186    0
3       255395    rs331869     C    G    192146      343981  0.558595     0
3       255395    rs331869     C    T    0           343981  0            0
21      10790805  rs534306314  G    A    672         343981  0.0019536    0
21      10790805  rs534306314  G    T    4           343981  1.16285e-05  0
22      16052962  rs376238049  C    T    40332       343981  0.117251     0
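
The identity mentioned in question 1 (the three .hardy counts plus MISSING_CT equal the sample size) can be checked row by row with a small script. The sketch below also tests whether the .afreq OBS_CT equals twice the number of called genotypes, which is what the quoted numbers suggest; treating OBS_CT as an allele count is an inference from these rows, not a statement from the PLINK documentation. `ROWS` and `check_row` are illustrative names.

```python
SAMPLE_SIZE = 343981  # sample size stated in the post

# (ID, ALT) -> (HOM_A1_CT, HET_A1_CT, TWO_AX_CT, MISSING_CT, .afreq OBS_CT),
# copied from the three tables above.
ROWS = {
    ("rs300691", "A"): (242073, 91952, 8733, 1223, 685516),
    ("rs300691", "T"): (333319, 6766, 37, 3859, 680244),
    ("rs331869", "G"): (60213, 75410, 16212, 192146, 303670),
    ("rs331869", "T"): (50, 8675, 335256, 0, 687962),
    ("rs534306314", "A"): (342851, 458, 0, 672, 686618),
    ("rs534306314", "T"): (343977, 0, 0, 4, 687954),
    ("rs376238049", "T"): (275605, 27555, 489, 40332, 607298),
}


def check_row(hom, het, two_ax, missing, afreq_obs_ct):
    """Summarise how one row's counts relate to the sample size."""
    called = hom + het + two_ax
    return {
        "called": called,
        "sums_to_sample": called + missing == SAMPLE_SIZE,
        "afreq_obs_is_twice_called": afreq_obs_ct == 2 * called,
        "f_miss": missing / SAMPLE_SIZE,
    }


for key, counts in ROWS.items():
    print(key, check_row(*counts))
```

Every row is internally consistent on its own, but the two rows for one ID carry different missing counts, which is why they cannot simply be added to get a per-variant missingness rate.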


Attachments: vmiss.tsv, hardy.tsv, afreq.tsv

Chris Chang

Jan 22, 2026, 5:26:28 AM
to kel...@umich.edu, plink2-users
"Split" multiallelic SNPs don't look like this.

It looks like the original variant caller which produced the .bgen didn't actually know these were multiallelic SNPs.  So at rs331869, most of the genotypes are T/T, and when the variant caller was forced to call genotypes with REF=C and ALT=T it got confused.

Chris Chang

Jan 22, 2026, 5:37:19 AM
to kel...@umich.edu, plink2-users
(correction, that should read "ALT=G" at the end)

kel...@umich.edu

Jan 22, 2026, 1:01:40 PM
to plink2-users
Oh weird!!

The original .bgen files were from UK Biobank, the "version 3" imputed BGEN data.

I tried having a look at the original .bgen files with qctool, and got similarly weird results:

chromosome  rsid         alleleA  alleleB  alleleA_frequency  alleleB_frequency  missing_proportion  AA       AB       BB           total
21          rs534306314  G        A        0.998664           0.00133639         7.37789e-14         343062   918.498  0.443137     343981
21          rs534306314  G        T        0.999994           5.67747e-06        1.01531e-15         343977   3.90588  3.88578e-16  343981
21          rs8127052    T        A        0.580173           0.419827           -3.36743e-14        115942   167252   60786.4      343981
21          rs8127052    T        C        0.477352           0.522648           -6.26105e-15        78510.5  171379   94091.2      343981
21          rs454123     C        A        0.833725           0.166275           4.1966e-14          239068   95435.3  9477.71      343981
21          rs454123     C        T        0.511958           0.488042           -3.38435e-15        90188.7  171830   81961.9      343981
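
For reference, the qctool frequencies above can be reproduced from the fractional AA/AB/BB columns, and each row's genotype counts add up to roughly the full sample. The sketch below does that arithmetic. It assumes alleleB_frequency is (AB + 2·BB) divided by twice the summed genotype counts; that formula is an assumption that happens to match the quoted rows to rounding precision, not a definition taken from the qctool documentation. `QCTOOL_ROWS` and `allele_b_frequency` are illustrative names.

```python
# (rsid, alleleB) -> (AA, AB, BB, total, reported alleleB_frequency),
# copied from the qctool output above.
QCTOOL_ROWS = {
    ("rs534306314", "A"): (343062, 918.498, 0.443137, 343981, 0.00133639),
    ("rs534306314", "T"): (343977, 3.90588, 3.88578e-16, 343981, 5.67747e-06),
    ("rs8127052", "A"): (115942, 167252, 60786.4, 343981, 0.419827),
    ("rs8127052", "C"): (78510.5, 171379, 94091.2, 343981, 0.522648),
    ("rs454123", "A"): (239068, 95435.3, 9477.71, 343981, 0.166275),
    ("rs454123", "T"): (90188.7, 171830, 81961.9, 343981, 0.488042),
}


def allele_b_frequency(aa, ab, bb):
    """Allele B frequency from (possibly fractional) expected genotype counts."""
    return (ab + 2 * bb) / (2 * (aa + ab + bb))  # assumed formula


for (rsid, allele_b), (aa, ab, bb, total, reported) in QCTOOL_ROWS.items():
    freq = allele_b_frequency(aa, ab, bb)
    print(f"{rsid} B={allele_b}: genotype sum {aa + ab + bb:.1f} of {total}, "
          f"alleleB_frequency {freq:.6g} (reported {reported})")
```

Because each row accounts for essentially the whole sample by itself, the two rows for one rsid describe the same people twice under different allele pairs.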


So this is a UK Biobank issue and not a Plink issue, and would exist even when accessing the original .bgen files with a different tool.

Thanks for getting me pointed towards the right problem!

Thank you,
Kristen