--geno --maf --hwe work differently in hardcall and dosage input data (also difference in plink1.90 and plink2))

836 views
Skip to first unread message

harry zhang

unread,
Mar 24, 2021, 6:41:02 PM3/24/21
to plink2-users
Hi,
    You can find the PLINK command and association output file. How does plink2 filter snps on --geno --hwe --maf for dosage data? 
The imputation quality can be handled in info or Rsq. I assume --geno --hwe will treat dosage data as hardcall genotypes, but it seems I am wrong. 

Thanks!

### input file:
1. *.bed: converted from VCF by plink190
plink190 --vcf impute_miniMac4_R2_030_n3_2019Nov_QC_all_chr22.vcf.gz  --out impute030_n3_all_chr22_plink190

impute030_n3_all_chr22_plink190.bed

2. *.bed: converted from VCF by plink2
plink2 --vcf impute_miniMac4_R2_030_n3_2019Nov_QC_all_chr22.vcf.gz  --out impute030_n3_all_chr22_plink2
impute030_n3_all_chr22_plink2.bed

3. *.pfile: converted from VCF by plink2 (dossage ) 
plink2 --vcf impute_miniMac4_R2_030_n3_2019Nov_QC_all_chr22.vcf.gz dosage=DS --make-pgen --out  impute030_n3_all_chr22_plink2dosage
impute030_n3_all_chr22_plink2dosage



#########################################################################
########## association in plink190 and plink2 (typed or dosage data)
#########################################################
#######NO-dosage in PLINK190
plink190 --bfile   impute030_n3_all_chr22_plink190 --allow-no-sex  --pheno temp_n3_cau_phen_lt12_lifeaud5 --pheno-name lt12_lifeaud5 --geno 0.1 --maf 0.01 --hwe 1.0e-20 midp --logistic --out  impute030_n3_all_chr22_plink190_plink190losgistic
###log:
0 variants removed due to missing genotype data (--geno).
--hwe: 0 variants removed due to Hardy-Weinberg exact test.
44324 variants removed due to minor allele threshold(s)

#########################################################
###### NO-dosage in PLINK2, but input data is from PLINK190
plink2   --bfile   impute030_n3_all_chr22_plink190 --allow-no-sex  --pheno temp_n3_cau_phen_lt12_lifeaud5 --pheno-name lt12_lifeaud5 --geno 0.1 --maf 0.01 --hwe 1.0e-20 midp --glm allow-no-covars  --out  impute030_n3_all_chr22_plink190_plink2glm
###log:
--geno: 0 variants removed due to missing genotype data.
--hwe midp: 14120 variants removed due to Hardy-Weinberg exact test (founders
only).
44320 variants removed due to allele frequency threshold(s)

#########################################################
###### NO-dosage in PLINK2; *pgen file
plink2   --pfile  impute030_n3_all_chr22_plink2 --allow-no-sex  --pheno temp_n3_cau_phen_lt12_lifeaud5 --pheno-name lt12_lifeaud5 --geno 0.1 --maf 0.01 --hwe 1.0e-20 midp       --glm allow-no-covars --out  impute030_n3_all_chr22_plink2_glm
###log:
--geno: 0 variants removed due to missing genotype data.
--hwe midp: 14120 variants removed due to Hardy-Weinberg exact test (founders
only).
44320 variants removed due to allele frequency threshold(s)

#########################################################
###### dosage; *pgen file
plink2   --pfile  impute030_n3_all_chr22_plink2dosage --allow-no-sex  --pheno temp_n3_cau_phen_lt12_lifeaud5 --pheno-name lt12_lifeaud5 --geno 0.1 --maf 0.01 --hwe 1.0e-20 midp --glm allow-no-covars --out  impute030_n3_all_chr22_plink2dosage_glm
###log:
--geno: 33579 variants removed due to missing genotype data.
--hwe midp: 3710 variants removed due to Hardy-Weinberg exact test (founders
only).
40955 variants removed due to allele frequency threshold(s)

#########################################################
###### --require-pheno (no dosage); *pgen file
plink2   --pfile  impute030_n3_all_chr22_plink2 --allow-no-sex  --require-pheno --pheno temp_n3_cau_phen_lt12_lifeaud5 --pheno-name lt12_lifeaud5 --geno 0.1 --maf 0.01 --hwe 1.0e-20 midp --glm allow-no-covars --out  impute030_n3_all_chr22_plink2_glm_requirePheno
###log:
--geno: 0 variants removed due to missing genotype data.
--hwe midp: 0 variants removed due to Hardy-Weinberg exact test (founders
only).
55695 variants removed due to allele frequency threshold(s)

#########################################################
###### --require-pheno (dosage); *pgen file
plink2   --pfile  impute030_n3_all_chr22_plink2dosage --allow-no-sex  --require-pheno --pheno temp_n3_cau_phen_lt12_lifeaud5 --pheno-name lt12_lifeaud5 --geno 0.1 --maf 0.01 --hwe 1.0e-20 midp --glm allow-no-covars --out  impute030_n3_all_chr22_plink2dosage_glm__requirePheno
###log:
--geno: 30951 variants removed due to missing genotype data.
--hwe midp: 594 variants removed due to Hardy-Weinberg exact test (founders
only).
52549 variants removed due to allele frequency threshold(s)

Christopher Chang

unread,
Mar 24, 2021, 6:52:02 PM3/24/21
to plink2-users
1. Why would you expect --maf to work the same with dosages?  If --maf doesn't take dosages into account, why are you keeping dosages around at all?
2. You've omitted far too much information from the logs.

harry zhang

unread,
Mar 24, 2021, 8:23:14 PM3/24/21
to plink2-users
Hi,  
    Thanks for your quick response. I understand --maf and --glm should definitely consider dosage info. My major question is the criteria for --geno. It seems there are a lot of missing data for my dosage data. I am thinking what is the best way to save more data in relatively high-quality criteria. BTW, I have excluded the SNPs with Rsq (<0.3) in miniMac4 imputation data. I am also interested in the criteria for --hwe with dosage data. Here is the full log. 
Thanks again!

PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to impute030_n3_all_chr22_plink190_plink190losgistic.log.
Options in effect:
  --allow-no-sex
  --bfile impute030_n3_all_chr22_plink190
  --geno 0.1
  --hwe 1.0e-20 midp
  --logistic
  --maf 0.01
  --out impute030_n3_all_chr22_plink190_plink190losgistic
  --pheno temp_n3_cau_phen_lt12_lifeaud5
  --pheno-name lt12_lifeaud5

257652 MB RAM detected; reserving 128826 MB for main workspace.
100322 variants loaded from .bim file.
22848 people (0 males, 0 females, 22848 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
impute030_n3_all_chr22_plink190_plink190losgistic.nosex .
8446 phenotype values present after --pheno.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 22848 founders and 0 nonfounders present.
Calculating allele frequencies... 0% 1% 2% 3% 4% 5% 6% 7% 8% 9% 10% 11% 12% 13% 14% 15% 16% 17% 18% 19% 20% 21% 22% 23% 24% 25% 26% 27% 28% 29% 30% 31% 32% 33% 34% 35% 36% 37% 38% 39% 40% 41% 42% 43% 44% 45% 46% 47% 48% 49% 50% 51% 52% 53% 54% 55% 56% 57% 58% 59% 60% 61% 62% 63% 64% 65% 66% 67% 68% 69% 70% 71% 72% 73% 74% 75% 76% 77% 78% 79% 80% 81% 82% 83% 84% 85% 86% 87% 88% 89% 90% 91% 92% 93% 94% 95% 96% 97% 98% 99% done.
0 variants removed due to missing genotype data (--geno).
--hwe: 0 variants removed due to Hardy-Weinberg exact test.
44324 variants removed due to minor allele threshold(s)
(--maf/--max-maf/--mac/--max-mac).
55998 variants and 22848 people pass filters and QC.
Among remaining phenotypes, 4487 are cases and 3959 are controls.  (14402
phenotypes are missing.)
Writing logistic model association results to
impute030_n3_all_chr22_plink190_plink190losgistic.assoc.logistic ... 0% 1% 2% 3% 4% 5% 6% 7% 8% 9% 10% 11% 12% 13% 14% 15% 16% 17% 18% 19% 20% 21% 22% 23% 24% 25% 26% 27% 28% 29% 30% 31% 32% 33% 34% 35% 36% 37% 38% 39% 40% 41% 42% 43% 44% 45% 46% 47% 48% 49% 50% 51% 52% 53% 54% 55% 56% 57% 58% 59% 60% 61% 62% 63% 64% 65% 66% 67% 68% 69% 70% 71% 72% 73% 74% 75% 76% 77% 78% 79% 80% 81% 82% 83% 84% 85% 86% 87% 88% 89% 90% 91% 92% 93% 94% 95% 96% 97% 98% 99% done.
PLINK v2.00a3LM 64-bit Intel (2 Mar 2021)      www.cog-genomics.org/plink/2.0/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to impute030_n3_all_chr22_plink190_plink2glm.log.
Options in effect:
  --allow-no-sex
  --bfile impute030_n3_all_chr22_plink190
  --geno 0.1
  --glm allow-no-covars
  --hwe 1.0e-20 midp
  --maf 0.01
  --out impute030_n3_all_chr22_plink190_plink2glm
  --pheno temp_n3_cau_phen_lt12_lifeaud5
  --pheno-name lt12_lifeaud5

Start time: Wed Mar 24 18:07:20 2021
Note: --allow-no-sex no longer has any effect.  (Missing-sex samples are
automatically excluded from association analysis when sex is a covariate, and
treated normally otherwise.)
257652 MiB RAM detected; reserving 128826 MiB for main workspace.
Using up to 56 threads (change this with --threads).
22848 samples (0 females, 0 males, 22848 ambiguous; 22848 founders) loaded from
impute030_n3_all_chr22_plink190.fam.
100322 variants loaded from impute030_n3_all_chr22_plink190.bim.
1 binary phenotype loaded (4487 cases, 3959 controls).
Calculating allele frequencies... 0% 65% done.
--geno: 0 variants removed due to missing genotype data.
--hwe midp: 14120 variants removed due to Hardy-Weinberg exact test (founders
only).
44320 variants removed due to allele frequency threshold(s)
(--maf/--max-maf/--mac/--max-mac).
41882 variants remaining after main filters.
--glm logistic-Firth hybrid regression on phenotype 'lt12_lifeaud5': 0% 68% done.
Results written to impute030_n3_all_chr22_plink190_plink2glm.lt12_lifeaud5.glm.logistic.hybrid .
End time: Wed Mar 24 18:07:23 2021
PLINK v2.00a3LM 64-bit Intel (2 Mar 2021)      www.cog-genomics.org/plink/2.0/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to impute030_n3_all_chr22_plink2_glm.log.
Options in effect:
  --allow-no-sex
  --geno 0.1
  --glm allow-no-covars
  --hwe 1.0e-20 midp
  --maf 0.01
  --out impute030_n3_all_chr22_plink2_glm
  --pfile impute030_n3_all_chr22_plink2
  --pheno temp_n3_cau_phen_lt12_lifeaud5
  --pheno-name lt12_lifeaud5

Start time: Wed Mar 24 18:07:23 2021
Note: --allow-no-sex no longer has any effect.  (Missing-sex samples are
automatically excluded from association analysis when sex is a covariate, and
treated normally otherwise.)
257652 MiB RAM detected; reserving 128826 MiB for main workspace.
Using up to 56 threads (change this with --threads).
22848 samples (0 females, 0 males, 22848 ambiguous; 22848 founders) loaded from
impute030_n3_all_chr22_plink2.psam.
100322 variants loaded from impute030_n3_all_chr22_plink2.pvar.
1 binary phenotype loaded (4487 cases, 3959 controls).
Calculating allele frequencies... 0% 65% done.
--geno: 0 variants removed due to missing genotype data.
--hwe midp: 14120 variants removed due to Hardy-Weinberg exact test (founders
only).
44320 variants removed due to allele frequency threshold(s)
(--maf/--max-maf/--mac/--max-mac).
41882 variants remaining after main filters.
--glm logistic-Firth hybrid regression on phenotype 'lt12_lifeaud5': 0% 68% done.
Results written to impute030_n3_all_chr22_plink2_glm.lt12_lifeaud5.glm.logistic.hybrid .
End time: Wed Mar 24 18:07:26 2021
PLINK v2.00a3LM 64-bit Intel (2 Mar 2021)      www.cog-genomics.org/plink/2.0/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to impute030_n3_all_chr22_plink2dosage_glm.log.
Options in effect:
  --allow-no-sex
  --geno 0.1
  --glm allow-no-covars
  --hwe 1.0e-20 midp
  --maf 0.01
  --out impute030_n3_all_chr22_plink2dosage_glm
  --pfile impute030_n3_all_chr22_plink2dosage
  --pheno temp_n3_cau_phen_lt12_lifeaud5
  --pheno-name lt12_lifeaud5

Start time: Wed Mar 24 18:07:26 2021
Note: --allow-no-sex no longer has any effect.  (Missing-sex samples are
automatically excluded from association analysis when sex is a covariate, and
treated normally otherwise.)
257652 MiB RAM detected; reserving 128826 MiB for main workspace.
Using up to 56 threads (change this with --threads).
22848 samples (0 females, 0 males, 22848 ambiguous; 22848 founders) loaded from
impute030_n3_all_chr22_plink2dosage.psam.
100322 variants loaded from impute030_n3_all_chr22_plink2dosage.pvar.
1 binary phenotype loaded (4487 cases, 3959 controls).
Calculating allele frequencies... 0% 65% done.
--geno: 33579 variants removed due to missing genotype data.
--hwe midp: 3710 variants removed due to Hardy-Weinberg exact test (founders
only).
40955 variants removed due to allele frequency threshold(s)
(--maf/--max-maf/--mac/--max-mac).
22078 variants remaining after main filters.
--glm logistic-Firth hybrid regression on phenotype 'lt12_lifeaud5': 0% 67% done.
Results written to impute030_n3_all_chr22_plink2dosage_glm.lt12_lifeaud5.glm.logistic.hybrid .
End time: Wed Mar 24 18:07:32 2021
PLINK v2.00a3LM 64-bit Intel (2 Mar 2021)      www.cog-genomics.org/plink/2.0/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to impute030_n3_all_chr22_plink2_glm_requirePheno.log.
Options in effect:
  --allow-no-sex
  --geno 0.1
  --glm allow-no-covars
  --hwe 1.0e-20 midp
  --maf 0.01
  --out impute030_n3_all_chr22_plink2_glm_requirePheno
  --pfile impute030_n3_all_chr22_plink2
  --pheno temp_n3_cau_phen_lt12_lifeaud5
  --pheno-name lt12_lifeaud5
  --require-pheno

Start time: Wed Mar 24 18:07:32 2021
Note: --allow-no-sex no longer has any effect.  (Missing-sex samples are
automatically excluded from association analysis when sex is a covariate, and
treated normally otherwise.)
257652 MiB RAM detected; reserving 128826 MiB for main workspace.
Using up to 56 threads (change this with --threads).
22848 samples (0 females, 0 males, 22848 ambiguous; 22848 founders) loaded from
impute030_n3_all_chr22_plink2.psam.
100322 variants loaded from impute030_n3_all_chr22_plink2.pvar.
1 binary phenotype loaded (4487 cases, 3959 controls).
--require-pheno: 14402 samples removed.
8446 samples (0 females, 0 males, 8446 ambiguous; 8446 founders) remaining
after main filters.
4487 cases and 3959 controls remaining after main filters.
Calculating allele frequencies... 0% 65% done.
--geno: 0 variants removed due to missing genotype data.
--hwe midp: 0 variants removed due to Hardy-Weinberg exact test (founders
only).
55695 variants removed due to allele frequency threshold(s)
(--maf/--max-maf/--mac/--max-mac).
44627 variants remaining after main filters.
--glm logistic-Firth hybrid regression on phenotype 'lt12_lifeaud5': 0% 67% done.
Results written to impute030_n3_all_chr22_plink2_glm_requirePheno.lt12_lifeaud5.glm.logistic.hybrid .
End time: Wed Mar 24 18:07:34 2021
PLINK v2.00a3LM 64-bit Intel (2 Mar 2021)      www.cog-genomics.org/plink/2.0/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to impute030_n3_all_chr22_plink2dosage_glm__requirePheno.log.
Options in effect:
  --allow-no-sex
  --geno 0.1
  --glm allow-no-covars
  --hwe 1.0e-20 midp
  --maf 0.01
  --out impute030_n3_all_chr22_plink2dosage_glm__requirePheno
  --pfile impute030_n3_all_chr22_plink2dosage
  --pheno temp_n3_cau_phen_lt12_lifeaud5
  --pheno-name lt12_lifeaud5
  --require-pheno

Start time: Wed Mar 24 18:07:34 2021
Note: --allow-no-sex no longer has any effect.  (Missing-sex samples are
automatically excluded from association analysis when sex is a covariate, and
treated normally otherwise.)
257652 MiB RAM detected; reserving 128826 MiB for main workspace.
Using up to 56 threads (change this with --threads).
22848 samples (0 females, 0 males, 22848 ambiguous; 22848 founders) loaded from
impute030_n3_all_chr22_plink2dosage.psam.
100322 variants loaded from impute030_n3_all_chr22_plink2dosage.pvar.
1 binary phenotype loaded (4487 cases, 3959 controls).
--require-pheno: 14402 samples removed.
8446 samples (0 females, 0 males, 8446 ambiguous; 8446 founders) remaining
after main filters.
4487 cases and 3959 controls remaining after main filters.
Calculating allele frequencies... 0% 65% done.
--geno: 30951 variants removed due to missing genotype data.
--hwe midp: 594 variants removed due to Hardy-Weinberg exact test (founders
only).
52549 variants removed due to allele frequency threshold(s)
(--maf/--max-maf/--mac/--max-mac).
16228 variants remaining after main filters.
--glm logistic-Firth hybrid regression on phenotype 'lt12_lifeaud5': 0% 65% done.
Results written to impute030_n3_all_chr22_plink2dosage_glm__requirePheno.lt12_lifeaud5.glm.logistic.hybrid .
End time: Wed Mar 24 18:07:40 2021

Christopher Chang

unread,
Mar 24, 2021, 8:39:44 PM3/24/21
to plink2-users
Okay, thanks for providing the full logs.  There are several things going on here.

1. From the --vcf dosage=DS documentation: "hardcalls are now regenerated from scratch from the dosages, using the --hard-call-threshold logic described below."  That's why you have so many missing hardcalls in that file, even though your VCF doesn't have any...

2. ...which is alarming.  Your data is quite badly imputed if the default "--hard-call-threshold 0.1" setting results in >30% of the variants getting filtered out by "--geno 0.1".  (Incidentally, "--geno 0.1 dosage" filters on missing dosages instead of missing hardcalls.  But plain "--geno 0.1" is probably doing the right thing here: I wouldn't trust any of those 30951 variants.)

3. The plink2 --hwe documentation sets the --hwe flag on a yellow background.  This is an indicator that there is a relevant behavior change from plink 1.9 (from https://www.cog-genomics.org/plink/2.0/general_usage : "Background color summarizes degree of similarity to PLINK 1.9. Green signals maximal compatibility: there will usually be a minor difference in output file formats, but all information in the PLINK 1.9 output file will also be present in the PLINK 2.0 output file when the same flag and modifiers are used. (Note that green does not guarantee the absence of additional options.) Yellow signals slightly different functionality and/or command-line usage, and blue signals that the flag is new to PLINK 2.0.").  In this case, it doesn't relate to dosages (--hwe is based on an exact test on a table of hardcall counts, and there is no clean way to extend that to dosages), instead it relates to your phenotype.  The last part of the plink2 --hwe documentation spells out the difference:

"There is currently no special handling of case/control phenotypes;

--keep-if <phenotype name> == control

is frequently a good idea when using --hwe in a genome-wide association analysis (and matches PLINK 1.x's behavior)."

harry zhang

unread,
Mar 25, 2021, 5:59:45 PM3/25/21
to plink2-users
Thank you very much for your message. It helps a lot. Plink2 indeed help me filter out the bad imputed SNPs.
1. If I understand correctly, --hwe is based on all samples (both cases/controls). Can we add a modifier to calculate hwe on controls only?  "-keep-if <phenotype name> == control" can be used for control only, but it cannot be combined with --glm.
    Instead, --hwe in plink1.07 is based on controls by default. I think It is a better approach in case-control GWAS analysis.  
2. Plink flag index is very helpful. Is it possible to add "default value" in PLINK2 index?

Thanks again!
Harry

Christopher Chang

unread,
Mar 25, 2021, 6:03:42 PM3/25/21
to plink2-users
1. Use --write-snplist to save the result of your --hwe + --keep-if run, and then --extract to load that result in your --glm result.

plink 1.x's "controls by default" behavior is completely incompatible with plink 2.0, because plink 2.0 can have multiple phenotypes loaded at once.

2. I'll look into cases where that's worth including in the 1-line summaries.

Christopher Chang

unread,
Mar 25, 2021, 6:05:09 PM3/25/21
to plink2-users
(oops, meant to say "--glm run" instead of "--glm result")
Reply all
Reply to author
Forward
0 new messages