replicate --glm and --glm genotypic in R

32 views
Skip to first unread message

Luis

unread,
Sep 26, 2025, 7:24:50 PMSep 26
to plink2-users
Hi Chris,

I am trying to get the covariance matrix for the coefficients obtained from GWAS, but I couldn’t find any way to obtain it directly from PLINK (please correct me if I’m wrong).

So I tried extracting the genotype data from a PLINK bfile and running the regression in R. For the additive model, I obtained a similar but not identical result, while for the dominance deviation model, I couldn’t get a comparable result. 

Based on my understanding, --glm simply performs a linear regression, while --glm genotypic also performs a linear regression but adds another covariate that codes heterozygous sites as 1 and homozygous sites as 0. 

I’ve attached the code, results and the log file for the exported allele counts and GWAS for your reference. Could you please let me know where I did wrong? Thanks in advance for your help!


# additive for variant rs75267490 in R
Call: lm(formula = height ~ rs75267490_G..C. + PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10 + BIRTH_YEAR, data = m_df)
Coefficients: Estimate Std. Error t value Pr(>|t|) rs75267490_G..C. 0.13063 0.03879 3.368 0.000758 ***

# additive for variant from rs75267490 from --glm
#CHROM  POS     ID      REF     ALT     PROVISIONAL_REF?        A1      OMITTED A1_FREQ TEST    OBS_CT  BETA    SE      T_STAT  P       ERRCODE
1       954032  rs75267490      C       G       Y       G       C       0.340962        ADD     181912  0.0956024       0.0230831       4.14166 3.4495e-05      .


# domdev for variant rs1291348 in R
Call: lm(formula = height ~ rs1291348_G..A. + rs1291348_HET + PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10 + BIRTH_YEAR, data = m_df) Residuals: Min 1Q Median 3Q Max -188.029 -4.058 0.352 4.812 32.005 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 175.43819 0.02969 5909.009 < 2e-16 *** rs1291348_G..A. -0.21342 0.10288 -2.074 0.038038 * rs1291348_HET 0.28947 0.11642 2.486 0.012905 *

# domdev for variant rs1291348 from --glm genotypic
#CHROM  POS     ID      REF     ALT     PROVISIONAL_REF?        A1      OMITTED A1_FREQ TEST    OBS_CT  BETA    SE      T_OR_F_STAT     P       ERRCODE
12      13091942        rs1291348       A       G       Y       G       A       0.128724        ADD     184130  -0.310339       0.0612034       -5.07062        3.96902e-07     .
12      13091942        rs1291348       A       G       Y       G       A       0.128724        DOMDEV  184130  0.36828 0.0692663       5.31686 1.05694e-07     .
12      13091942        rs1291348       A       G       Y       G       A       0.128724        GENO_2DF        184130  NA      NA      14.706  4.10925e-07     .



20250926_height_sex_diff_dom_snps.log
height.dom.test.log
height.add.test.log
beta_cov.r

Chris Chang

unread,
Sep 26, 2025, 7:53:24 PMSep 26
to Luis, plink2-users
Please provide a toy dataset (could have just a few samples) which exhibit the discrepancy you're talking about, and which your commands/scripts can be executed on.

--
You received this message because you are subscribed to the Google Groups "plink2-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to plink2-users...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/plink2-users/5b8ef33c-77c4-446d-b61b-b9a06292d5c8n%40googlegroups.com.

Luis

unread,
Sep 26, 2025, 11:37:18 PMSep 26
to plink2-users

I figured out the problem -- I coded the missing phenotype as “-9” when running the GWAS but didn’t remove those individuals when running the regression in R...

Luis

unread,
Oct 24, 2025, 2:46:28 AMOct 24
to plink2-users
Hi Chris, 

Sorry to bother you again! I am currently trying to replicate Firth regression in R, but I noticed that the results are very close, though not exactly the same. I have attached the corresponding files and outputs below. Could you please let me know if this kind of difference is expected?

I also noticed that if I code the binary phenotype as 2 and 1 instead of 0 and 1, I get the error "Error in logistf.fit(x = x, y = y, weight = weight, offset = offset, firth, : LAPACK dpotrf failed: matrix is not positive definite (info = 3)" in logistf. Does this mean that, although the phenotype is coded as 1 and 2, PLINK still treats it as 0 and 1 when running Firth regression?

Thank you for your help!

# PLINK output
#CHROM  POS     ID      REF     ALT     PROVISIONAL_REF?        A1      OMITTED A1_FREQ FIRTH?  TEST    OBS_CT  OR      LOG(OR)_SE      Z_OR_F_STAT     P       ERRCODE
X       100333504       rs142164313     T       C       N       C       T       0.297424        N       ADD     427     3.54977 0.375535        3.37354 0.000742073     .
X       100333504       rs142164313     T       C       N       C       T       0.297424        N       DOMDEV  427     1.06478 0.414484        0.151436        0.879632        .
X       100333504       rs142164313     T       C       N       C       T       0.297424        N       GENO_2DF        427     NA      NA      20.7413 2.53099e-09     .

# R output
logistf(formula = G43 ~ rs142164313_T..C. + rs142164313_HET +
    scale(PC1) + scale(PC2) + scale(PC3) + scale(BIRTH_YEAR),
    data = t, pl = FALSE)

Model fitted by Penalized ML
Coefficients:
                          coef  se(coef)  lower 0.95 upper 0.95        Chisq            p method
(Intercept)        0.006564923 0.1419794 -0.27170956 0.28483941  0.002138005 0.9631201027      1
rs142164313_T..C.  1.154813176 0.3366572  0.49497712 1.81464924 11.766506790 0.0006030604      1
rs142164313_HET    0.152597356 0.3785970 -0.58943919 0.89463390  0.162457441 0.6869040148      1
scale(PC1)         0.197422870 0.1111732 -0.02047265 0.41531839  3.153511979 0.0757637365      1
scale(PC2)         0.051491040 0.1090592 -0.16226115 0.26524323  0.222914528 0.6368281111      1
scale(PC3)        -0.017572378 0.1112972 -0.23571091 0.20056615  0.024928289 0.8745458805      1
scale(BIRTH_YEAR) -0.148137541 0.1072885 -0.35841914 0.06214406  1.906443265 0.1673588824      1

Method: 1-Wald, 2-Profile penalized log-likelihood, 3-None

Likelihood ratio test=52.3058 on 6 df, p=1.618571e-09, n=427
Wald test = 73.11451 on 6 df, p = 9.381385e-14


firth_reg_replicate.zip

Chris Chang

unread,
Oct 24, 2025, 11:06:46 AMOct 24
to Luis, plink2-users
The "FIRTH?" column says "N" in your example, so logistic regression was used here, not Firth.

Reply all
Reply to author
Forward
0 new messages