Hi,
I have three datasets.
First some background.
I ran the QC and checked the sex. For the sex check, I first filtered my data to include only high-quality genotypes and then determined sex using --check-sex. I removed the erroneous samples. Also, the chromosomes are already split (chrX and chrXY are separate) in the original (raw) binary PLINK data, so --split-x b37 errors out.
Ultimately I want VCF files. So, per these descriptions, I ran --set-hh-missing to set all remaining heterozygous haploid calls to missing, since I am 100% sure that the males are males and the females are females.
After running this, the same number of het. haploid calls is reported. What am I misinterpreting? Shouldn't that warning message have disappeared after running --set-hh-missing?
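To compare the counts between runs without reading the logs by eye, something like the following could pull the number out of each log. It is only a sketch: the names `hh_count_from_log` and `hh_count_from_file` are made up for illustration, and the regular expression assumes the warning is worded exactly as in the PLINK 1.9 logs quoted further down.

```python
import re
import sys
from pathlib import Path

# Assumption: the warning is worded as in the PLINK 1.9 logs in this post,
# e.g. "Warning: 5721 het. haploid genotypes present (see ...".
HH_WARNING = re.compile(r"Warning: (\d+) het\. haploid genotypes present")


def hh_count_from_log(log_text: str) -> int:
    """Return the het. haploid count from the first matching warning, or 0 if none."""
    match = HH_WARNING.search(log_text)
    return int(match.group(1)) if match else 0


def hh_count_from_file(log_path: str) -> int:
    """Read a PLINK .log file and return the reported het. haploid count."""
    return hh_count_from_log(Path(log_path).read_text())


if __name__ == "__main__":
    # Usage: python hh_count.py run1.log run2.log ...
    for path in sys.argv[1:]:
        print(path, hh_count_from_file(path))
```

Running it on the log of the --set-hh-missing run and on the log of any later run that loads the resulting fileset would show whether the reported count differs between the two.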
Thanks
Sander
Below are some other details of the three cleaned datasets.
dataset1 - Affymetrix SNP 5
chrX - 9004
chrXY - 138
chrY - 10
chrMT - 0
2269 het. haploid genotypes present (see dataset1/dataset1_finalQC_cleaned.hh)
I checked the number of unique variants in that .hh file and where they are. There are 15 unique SNPs in total: 3 on chromosome Y and 12 on chromosome X.
dataset2 - Affymetrix Axiom CEU
chrX - 14437
chrXY - 506
chrY - 266
chrMT - 67
5721 het. haploid genotypes present (see dataset2/dataset2_finalQC_cleaned.hh)
I checked the number of unique variants in that .hh file and where they are. There are 2933 unique SNPs, all on chromosome X.
dataset3 - Illumina GSA MD1
chrX - 16869
chrXY - 535
chrY - 1337
chrMT - 134
53064 het. haploid genotypes present (see dataset3/dataset3_finalQC_cleaned.hh)
I checked the number of unique variants in that .hh file and where they are. There are 2565 unique SNPs: 2494 on chromosome X and 71 on chromosome Y.
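The per-dataset check described above (unique variants in the .hh file and the chromosome each one sits on) can be sketched roughly as follows. This is an illustration, not the exact commands I used: the function name `hh_variants_by_chrom` is made up, and it assumes the PLINK 1.9 layouts of a .hh file (family ID, within-family ID, variant ID) and a .bim file (chromosome code, variant ID, cM, bp, two alleles).

```python
from collections import Counter


def hh_variants_by_chrom(hh_lines, bim_lines):
    """Count unique variants listed in a .hh file, grouped by chromosome code.

    Both arguments are iterables of text lines (open file handles work).
    """
    # .bim columns: chromosome code, variant ID, cM, bp, allele 1, allele 2
    chrom_of = {}
    for line in bim_lines:
        fields = line.split()
        if len(fields) >= 2:
            chrom_of[fields[1]] = fields[0]

    # .hh columns: family ID, within-family ID, variant ID (one line per genotype)
    variants = set()
    for line in hh_lines:
        fields = line.split()
        if len(fields) >= 3:
            variants.add(fields[2])

    # Chromosome codes are reported exactly as written in the .bim file
    # (e.g. 23/24 or X/Y, depending on how the fileset was written).
    return Counter(chrom_of.get(v, "unknown") for v in variants)


if __name__ == "__main__":
    prefix = "dataset1/dataset1_finalQC_cleaned"
    with open(prefix + ".hh") as hh, open(prefix + ".bim") as bim:
        for chrom, n in sorted(hh_variants_by_chrom(hh, bim).items()):
            print(chrom, n)
```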
> set haploids to missing
PLINK v1.90b7 64-bit (16 Jan 2023) www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to dataset1/dataset1_finalQC.log.
Options in effect:
--bfile dataset1/dataset1_finalQC_cleaned
--make-bed
--out dataset1/dataset1_finalQC
--set-hh-missing
16384 MB RAM detected; reserving 8192 MB for main workspace.
409037 variants loaded from .bim file.
780 people (535 males, 245 females) loaded from .fam.
780 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 0 founders and 780 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.997156.
409037 variants and 780 people pass filters and QC.
Among remaining phenotypes, 780 are cases and 0 are controls.
--make-bed to dataset1/dataset1_finalQC.bed +
dataset1/dataset1_finalQC.bim + dataset1/dataset1_finalQC.fam ... done.
Warning: 2269 het. haploid genotypes present (see
aegs1/dataset1_finalQC.hh ); many commands treat these as missing.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands
treat these as missing.
> set haploids to missing
PLINK v1.90b7 64-bit (16 Jan 2023) www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to dataset2/dataset2_finalQC.log.
Options in effect:
--bfile dataset2/dataset2_finalQC_cleaned
--make-bed
--out dataset2/dataset2_finalQC
--set-hh-missing
16384 MB RAM detected; reserving 8192 MB for main workspace.
529238 variants loaded from .bim file.
900 people (605 males, 295 females) loaded from .fam.
900 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 0 founders and 900 nonfounders present.
Calculating allele frequencies... done.
Warning: 5721 het. haploid genotypes present (see
dataset2/dataset2_finalQC.hh ); many commands treat these as missing.
Total genotyping rate is 0.995835.
529238 variants and 900 people pass filters and QC.
Among remaining phenotypes, 900 are cases and 0 are controls.
--make-bed to dataset2/dataset2_finalQC.bed +
dataset2/dataset2_finalQC.bim + dataset2/dataset2_finalQC.fam ... done.
> set haploids to missing
PLINK v1.90b7 64-bit (16 Jan 2023) www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to dataset3/dataset3_finalQC.log.
Options in effect:
--bfile dataset3/dataset3_finalQC_cleaned
--make-bed
--out dataset3/dataset3_finalQC
--set-hh-missing
16384 MB RAM detected; reserving 8192 MB for main workspace.
661345 variants loaded from .bim file.
587 people (407 males, 180 females) loaded from .fam.
587 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 0 founders and 587 nonfounders present.
Calculating allele frequencies... done.
Warning: 53064 het. haploid genotypes present (see
dataset3/dataset3_finalQC.hh ); many commands treat these as missing.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands
treat these as missing.
Total genotyping rate is 0.999589.
661345 variants and 587 people pass filters and QC.
Among remaining phenotypes, 587 are cases and 0 are controls.
--make-bed to dataset3/dataset3_finalQC.bed +
dataset3/dataset3_finalQC.bim + dataset3/dataset3_finalQC.fam ... done.