Replicating plink2 logistic regression results in R

94 views
Skip to first unread message

Lorenzo Capitani

unread,
Mar 18, 2025, 8:52:33 AM3/18/25
to plink2-users
Hello :)

I've been asked to replicate some analysis from plink2 in R and hoping for guidance to achieve the same results. I have a phenotype (1/2), gender (1/2), country (1,2,3,4) and various SNPs encoded additively (0,1,2).

In plink2 I run the following command

 --pfile hq_exsomes --glm --covar-name Sex,Country_of_origin --maf 0.01 -pheno-name Phenotype -out model_output 

I obtain good results which match the literature giving me confidence. Log is attached.

In R I've tried to replicate the analysis (similarly filtering with a --maf of 0.01) by performing either a logistic regression or firth regression like

glm(Phenotype ~ SNP + Gender + Country_of_origin, data = model_data, family=binomial(link="logit")) or Firth regression using the logistf() function. Whilst the p-values and effect sizes are similar, they are not the same. 

Does plink2 take any additional steps during modelling? For example, my phenotype variable is skewed 3:1 cases:controls, not sure if plink2 does anything to account for that perhaps? Any help appreciated, looked in documentation but couldn't find anything to help..

Thanks for your time and for this great software :)


model.log

Christopher Chang

unread,
Mar 23, 2025, 11:19:59 PM3/23/25
to plink2-users
plink2's Firth regression should be equivalent to logistf() with pl=FALSE.  (The default pl=TRUE setting doesn't play well with --glm's regular output columns.)

Please provide a replicable example of a plain-logistic-regression discrepancy you're seeing.

Lorenzo Capitani

unread,
Mar 24, 2025, 12:01:09 PM3/24/25
to plink2-users
Hello :)

Thanks for getting back to me. I provide a sample dataset (with anonymised IDs). This is the command is used to run the GLM in plink:

plink2 --pfile sample_data --glm --covar-name Sex,Country_of_origin_checked --mac 1 --pheno-name Phenotype -out sample_data

The model results (sample_data.Phenotype.glm.logistic.hybrid) are also attached. 

I then provide an R code as notebook or html file for viewing. Let me know if you'd like anything else or in a different format.

Thank you dearly for your time!
Lorenzo

sample_data.pvar
sample_data.pgen
sample_data.psam
plink2_vs_R.Rmd
sample_data.log
plink2_vs_R.html
sample_data.Phenotype.glm.logistic.hybrid

Christopher Chang

unread,
Mar 24, 2025, 3:48:17 PM3/24/25
to plink2-users
- I'm guessing that the sample order in your local /Users/c1530015/Documents/HLA_AD_WES_results/VCF_encoded_HLA_data_additively.csv file does not match that in the .psam file.  When I synthesize such a file with correct sample order, I get the expected 0.00285 p-value in R.
- In the meantime, there is a problem with how you preprocessed this data: the REF and ALT alleles for the first variant are identical, which should never happen.
Reply all
Reply to author
Forward
0 new messages