PLINK 1.9 --set-hh-missing


Sander W. van der Laan

Mar 19, 2024, 2:20:42 PM
to plink2-users
Hi,

I have three datasets.

First, some background.
I ran QC and checked sex. For the latter, I first filtered my data down to high-quality genotypes and then determined sex using --check-sex, and I removed the erroneous samples. Also, the chromosomes are already split in the original (raw) binary PLINK data, so --split-x b37 errors out.

Ultimately I want VCF files. So, per these descriptions, I ran --set-hh-missing to set all remaining haploid calls to missing, since I am 100% sure that the males are males and the females are females.
After running this, the same number of haploid calls is reported. What am I misinterpreting? Shouldn't that warning message have disappeared after running --set-hh-missing?

Thanks

Sander


Below are some further details of the three cleaned datasets.

dataset1 - Affymetrix SNP 5
chrX - 9004
chrXY - 138
chrY - 10
chrMT - 0
2269 het. haploid genotypes present (see dataset1/dataset1_finalQC_cleaned.hh)
I checked the number of unique variants in that .hh file and where they are: 15 SNPs in total, 3 on chromosome Y and 12 on X.

dataset2 - Affymetrix Axiom CEU
chrX - 14437
chrXY - 506
chrY - 266
chrMT - 67
5721 het. haploid genotypes present (see dataset2/dataset2_finalQC_cleaned.hh)
I checked the number of unique variants in that .hh file and where they are: 2933 unique SNPs, all on chromosome X.

dataset3 - Illumina GSA MD1 
chrX - 16869
chrXY - 535
chrY - 1337
chrMT - 134
53064 het. haploid genotypes present (see dataset3/dataset3_finalQC_cleaned.hh)
I checked the number of unique variants in that .hh file and where they are: 2565 unique SNPs, of which 2494 are on chromosome X and 71 are on chromosome Y.
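For reference, the per-chromosome tally of unique .hh variants described above can be sketched in shell. This is a minimal sketch, not necessarily how the counts in this post were produced. It assumes each .hh line reads "FID IID variant_ID" and that the matching .bim file has the chromosome code in column 1 and the variant ID in column 2. The function name hh_by_chrom is made up for illustration.

```shell
# hh_by_chrom PREFIX
# Reads PREFIX.bim and PREFIX.hh and prints "chromosome_code count" lines,
# counting each variant listed in the .hh file only once.
# Assumed layouts: .bim = "chrom variant_ID cM bp A1 A2", .hh = "FID IID variant_ID".
hh_by_chrom() {
    awk '
        NR == FNR   { chrom[$2] = $1; next }   # first file (.bim): variant ID -> chromosome
        !seen[$3]++ { n[chrom[$3]]++ }         # second file (.hh): first sighting of a variant
        END         { for (c in n) print c, n[c] }
    ' "$1.bim" "$1.hh" | sort
}

# Example call (path taken from the post):
#   hh_by_chrom dataset1/dataset1_finalQC_cleaned
```

Chromosome codes are printed exactly as they appear in the .bim file, so X and Y may show up as 23 and 24.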


> set haploids to missing
PLINK v1.90b7 64-bit (16 Jan 2023)  www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang  GNU General Public License v3
Logging to dataset1/dataset1_finalQC.log.
Options in effect:
  --bfile dataset1/dataset1_finalQC_cleaned
  --make-bed
  --out dataset1/dataset1_finalQC
  --set-hh-missing

16384 MB RAM detected; reserving 8192 MB for main workspace.
409037 variants loaded from .bim file.
780 people (535 males, 245 females) loaded from .fam.
780 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 0 founders and 780 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.997156.
409037 variants and 780 people pass filters and QC.
Among remaining phenotypes, 780 are cases and 0 are controls.
--make-bed to dataset1/dataset1_finalQC.bed + dataset1/dataset1_finalQC.bim + dataset1/dataset1_finalQC.fam ... 4%
Warning: 2269 het. haploid genotypes present (see aegs1/dataset1_finalQC.hh ); many commands treat these as missing.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands treat these as missing.
done.

> set haploids to missing
PLINK v1.90b7 64-bit (16 Jan 2023)  www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang  GNU General Public License v3
Logging to dataset2/dataset2_finalQC.log.
Options in effect:
  --bfile dataset2/dataset2_finalQC_cleaned
  --make-bed
  --out dataset2/dataset2_finalQC
  --set-hh-missing

16384 MB RAM detected; reserving 8192 MB for main workspace.
529238 variants loaded from .bim file.
900 people (605 males, 295 females) loaded from .fam.
900 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 0 founders and 900 nonfounders present.
Calculating allele frequencies... done.
Warning: 5721 het. haploid genotypes present (see dataset2/dataset2_finalQC.hh ); many commands treat these as missing.
Total genotyping rate is 0.995835.
529238 variants and 900 people pass filters and QC.
Among remaining phenotypes, 900 are cases and 0 are controls.
--make-bed to dataset2/dataset2_finalQC.bed + dataset2/dataset2_finalQC.bim + dataset2/dataset2_finalQC.fam ... done.

> set haploids to missing
PLINK v1.90b7 64-bit (16 Jan 2023)  www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang  GNU General Public License v3
Logging to dataset3/dataset3_finalQC.log.
Options in effect:
  --bfile dataset3/dataset3_finalQC_cleaned
  --make-bed
  --out dataset3/dataset3_finalQC
  --set-hh-missing

16384 MB RAM detected; reserving 8192 MB for main workspace.
661345 variants loaded from .bim file.
587 people (407 males, 180 females) loaded from .fam.
587 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 0 founders and 587 nonfounders present.
Calculating allele frequencies... done.
Warning: 53064 het. haploid genotypes present (see dataset3/dataset3_finalQC.hh ); many commands treat these as missing.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands treat these as missing.
Total genotyping rate is 0.999589.
661345 variants and 587 people pass filters and QC.
Among remaining phenotypes, 587 are cases and 0 are controls.
--make-bed to dataset3/dataset3_finalQC.bed + dataset3/dataset3_finalQC.bim + dataset3/dataset3_finalQC.fam ... done.

Chris Chang

Mar 19, 2024, 7:35:12 PM
to Sander W. van der Laan, plink2-users


On Tue, Mar 19, 2024 at 4:34 PM Chris Chang <chrch...@gmail.com> wrote:
A het-haploid warning is expected when plink 1.x --set-hh-missing is run, due to the order of operations: allele frequencies are computed before --make-bed, the het-haploid warning is raised during the allele frequency computation, and --set-hh-missing is applied as part of --make-bed. However, if you load the *output* of that run, you should not see another warning; did you try that? I don't see that in your logs.
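The suggested check (write the fileset with --set-hh-missing, then load that output in a second run and look for the warning) could be sketched as follows. This is a sketch under assumptions: the helper names and the "_check" output suffix are made up for illustration, --freq is just one convenient command that loads the data, and the warning is assumed to be written to the .log file as well as to the terminal.

```shell
# count_hh_warnings LOG
# Prints how many het. haploid warning lines a PLINK log contains.
count_hh_warnings() {
    grep -c 'het\. haploid genotypes present' "$1" || true   # grep exits 1 on zero matches
}

# verify_set_hh_missing IN_PREFIX OUT_PREFIX
# Run 1 writes a new fileset with --set-hh-missing (a warning is expected here).
# Run 2 loads the *output* of run 1; its log should contain no such warning.
verify_set_hh_missing() {
    plink --bfile "$1" --set-hh-missing --make-bed --out "$2" || return 1
    plink --bfile "$2" --freq --out "${2}_check" || return 1
    [ "$(count_hh_warnings "${2}_check.log")" -eq 0 ]
}

# Example call (paths taken from the logs above):
#   verify_set_hh_missing dataset1/dataset1_finalQC_cleaned dataset1/dataset1_finalQC
```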


Sander W. van der Laan

Mar 19, 2024, 10:18:13 PM
to Chris Chang, plink2-users
Hi,

Thanks for the quick response. I think I fixed this.

First I ran the --merge-x and then the --split-x commands, with plink19 and plink2 respectively. Then I checked the .hh file: it listed a bunch of X chromosome and Y chromosome variants. The latter is odd anyway in the case of dataset1 because, as far as I know, an Affymetrix SNP 5 chip doesn't have any Y variants. Haploid variants still remained, and since I am sure the data is good I used --set-hh-missing.
After that I checked the frequencies, and some variants were still flagged as haploid (many more, actually). As it turned out, these were all on chromosome X and were all low-call X variants, so I removed them. I think that because they were ignored before, they may have escaped my initial QC rounds. Now the imputation, with TOPMed, works.
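One way to list low-call X variants for removal is sketched below; the post does not say how the variants were actually identified, so this is only an illustration. It assumes a .lmiss report from plink --missing with the columns CHR, SNP, N_MISS, N_GENO, F_MISS. The function name low_call_x and the 0.05 default cutoff are hypothetical.

```shell
# low_call_x LMISS_FILE [MAX_F_MISS]
# Prints the IDs of chromosome X variants whose missing rate (F_MISS, column 5)
# exceeds MAX_F_MISS. The 0.05 default is an illustrative cutoff only.
low_call_x() {
    awk -v max="${2:-0.05}" '
        NR > 1 && ($1 == "23" || $1 == "X") && $5 + 0 > max + 0 { print $2 }
    ' "$1"
}

# Possible use (output names are placeholders):
#   plink --bfile dataset1/dataset1_finalQC --missing --out dataset1/miss
#   low_call_x dataset1/miss.lmiss > dataset1/lowcall_x.txt
#   plink --bfile dataset1/dataset1_finalQC --exclude dataset1/lowcall_x.txt \
#         --make-bed --out dataset1/dataset1_finalQC_noLowX
```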

Thanks!

Sander