An update of Plink2: --glm now errors out and recommends adding --covar-variance-standardize

re...@channing.harvard.edu

Feb 25, 2019, 4:09:58 PM

to plink2-users

Hi everyone,

There is a recent update to plink2: --glm now errors out and recommends adding --covar-variance-standardize when covariates vary enough in scale for numeric instability to be a major concern. I recently ran GWAS on 128 phenotypes separately using the plink2 alpha. The first 78 GWAS worked fine, and I got the corresponding Manhattan plots. However, the plot for the 79th phenotype is missing the SNPs on chr17-22. When I checked my log files, I found this warning message:

Warning: Skipping --glm regression on phenotype '79' since genotype/covariate
scales vary too widely for numerical stability of the current implementation.
Try rescaling your covariates with e.g. --covar-variance-standardize.

So plink2 skipped chr17-22 for that phenotype. For the remaining GWAS (phenotypes 80-128), plink2 skipped all chromosomes, and the logs show the same warning message as above. Is --covar-variance-standardize required whenever quantitative covariates are included and this warning appears?

Thanks,

Jiangyuan

Christopher Chang

Feb 25, 2019, 4:26:25 PM

to plink2-users

Yes, you should add --covar-variance-standardize for now. (This should not have a significant effect on your p-values; you can test this by e.g. adding --covar-variance-standardize for your 78th phenotype and comparing results.)

I'm planning an update to --glm which performs automatic rescaling when necessary, while keeping track of how scaling was performed so that regression coefficients can be reported for the original units. However, this is a relatively low priority.
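For readers who want to see what the rescaling amounts to, here is a minimal Python sketch of variance standardization: each covariate column is shifted to mean 0 and scaled to variance 1. It is an illustration only, not plink2's code. The function name `variance_standardize`, the toy values, and the choice of the sample (n - 1) standard deviation are all assumptions.

```python
import statistics

def variance_standardize(values):
    """Linearly rescale a list of numbers to mean 0 and variance 1.

    Assumption: sample standard deviation (n - 1 denominator); the exact
    convention plink2 uses is not stated in this thread.
    """
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("a constant covariate cannot be standardized")
    return [(v - mean) / sd for v in values]

# Toy covariates on very different scales (made-up values).
age = [45.0, 56.0, 62.0, 38.0, 71.0]
pc1 = [-0.0106, 0.0307, 0.0021, -0.0154, 0.0089]

age_z = variance_standardize(age)
pc1_z = variance_standardize(pc1)

# After rescaling, both columns have mean ~0 and variance ~1,
# so neither dominates the other numerically.
for name, col in (("age", age_z), ("pc1", pc1_z)):
    print(name, statistics.fmean(col), statistics.variance(col))
```

Because the transformation is linear, the rescaled columns carry the same information as the originals; only their units change.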

sy06...@gmail.com

Mar 25, 2019, 4:09:51 PM

to plink2-users

Hi Chris,

I have a question about the use of --covar-variance-standardize. Sorry if it is a trivial question. When would you recommend not using it? Or can we use it by default in any --glm run whose covariate table contains multiple discrete and continuous variables?

My second question: can we use "NA" for all missing values in a covariate table that contains both continuous and discrete values? For example: Age, Genetic_sex, and the first 10 PCs.

Thank you very much for your time,

Saliha

Are we allowed to use "NA" for missing continuous covariates?

Christopher Chang

Mar 25, 2019, 4:14:36 PM

to plink2-users

1. If you actually care about a covariate's regression coefficient, it may be more convenient to not use --covar-variance-standardize.

2. Yes, "NA" is normally interpreted as a missing covariate value. The exception is for *categorical* covariates, where "NONE" should be used instead.
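To illustrate point 1: once a covariate is standardized, its coefficient is expressed per standard deviation rather than per original unit, and dividing by the covariate's standard deviation converts it back. The sketch below shows this with a single-covariate least-squares fit on made-up data. The helper `ols_slope` and all values are hypothetical, and this is not how plink2 computes its output.

```python
import statistics

def ols_slope(x, y):
    """Least-squares slope of y on x, for a model with an intercept."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Made-up data: an outcome y against age in years.
age = [38.0, 45.0, 56.0, 62.0, 71.0]
y = [1.2, 1.9, 2.4, 3.1, 3.3]

mean = statistics.fmean(age)
sd = statistics.stdev(age)
age_z = [(a - mean) / sd for a in age]

beta_per_year = ols_slope(age, y)   # slope in original units
beta_per_sd = ols_slope(age_z, y)   # slope after standardization

# Dividing by the SD converts the standardized slope back to "per year".
print(beta_per_year, beta_per_sd / sd)
```

If you only need the association statistics for the variants, this bookkeeping is unnecessary, which is why standardizing is usually harmless in that case.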

sy06...@gmail.com

Mar 25, 2019, 4:29:06 PM

to plink2-users

Thank you very much for the clarifications.

Cheers,

Saliha

sy06...@gmail.com

Mar 27, 2019, 4:40:48 PM

to plink2-users

Hi Chris,

Just following up on your input about coding missing categorical covariates as "NONE" in plink2.

I understand that missing values in all case/control (2 and 1 respectively in PLINK2) and quantitative phenotypes have to be coded "NA".

For the covariates, the error listed below suggests that a categorical column cannot contain 1, 2, and "NONE" together. Is that correct?

Moreover, consider the example of gender, which is coded 2 for females and 1 for males in PLINK2.

- How will PLINK treat the input if I code gender as 1, 2, and "NA", compared to "female", "male", and "NONE"?

Thank you very much for your time,

Cheers,

Saliha

Logging to test16NONE.log.

Options in effect:

--covar /home/…/Covar.txt

--glm

--keep /home/…/Pheno2617PNs.txt

--memory 1600

--out test16NONE

--pfile /home/…/385K

--pheno /home/N…/Pheno2617PNs.txt

--threads 5

Start time: Wed Mar 27 20:15:33 2019

386743 MiB RAM detected; reserving 1600 MiB for main workspace.

Using up to 5 compute threads.

386114 samples (207945 females, 178169 males; 386114 founders) loaded from
/home/…/385K.psam.

142 variants loaded from /home/…/385K.pvar.

1 binary phenotype loaded (847 cases, 1670 controls).

--keep: 2517 samples remaining.

Error: 'HLA' entry on line 224 of /home/…/Covar.txt is categorical, while
earlier entries are not.
(Case/control and quantitative phenotypes must all be numeric/'NA'.
Categorical phenotypes cannot be 'NA'--use e.g. 'NONE' to represent missing
categorical values instead--or start with a number.)

End time: Wed Mar 27 20:15:34 2019

Lines 223 and 224 of the /home/…/Covar.txt file (the third column is HLA; its possible values are 1, 2, and "NONE"):

1003326 1003326 1 1 45 -0.010642 0.041682299 0.0308064 -0.00832141 0.00437607 -0.031305399 0.0211863 0.00226753 8.74E-04 0.00226753

1003334 1003334 NONE 2 56 0.030665601 0.00552055 -0.00170586 0.014714 -0.035934702 0.0199519 0.00696799 -0.00890832 -0.0269731 -0.00890832

Christopher Chang

Mar 27, 2019, 5:04:37 PM

to plink2-users

1. Correct, you cannot mix 1, 2, and NONE. Each column must be entirely numeric (except with NA/nan permitted as missing values), or entirely non-numeric.

2. For gender specifically, you should use plink's standard encoding. But for other binary covariates, there should be no difference in --glm results (except for the intercept) between 1/2/NA and category1/category2/NONE; you can use whatever representation makes more sense in context.
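The rule in point 1 can be checked before running --glm. Below is a rough Python sketch that classifies one covariate column as numeric, categorical, or mixed. It is a simplification, not plink2's actual parser: the helper names are hypothetical, and treating anything that Python's `float()` accepts as "numeric" is an assumption.

```python
MISSING_NUMERIC = {"NA", "nan"}  # missing-value tokens allowed in numeric columns

def is_numeric_token(token):
    """True if the token is a numeric missing code or parses as a number."""
    if token in MISSING_NUMERIC:
        return True
    try:
        float(token)
        return True
    except ValueError:
        return False

def classify_column(tokens):
    """Return 'numeric', 'categorical', or 'mixed' for one covariate column."""
    flags = [is_numeric_token(t) for t in tokens]
    if all(flags):
        return "numeric"
    if not any(flags):
        return "categorical"
    return "mixed"

# The HLA column from the thread mixes 1/2 with NONE, which is rejected.
print(classify_column(["1", "2", "NONE"]))          # mixed
print(classify_column(["1", "2", "NA"]))            # numeric
print(classify_column(["typeA", "typeB", "NONE"]))  # categorical
```

A column reported as "mixed" can be fixed in either direction: replace NONE with NA to keep it numeric, or rename 1/2 to non-numeric labels to make it categorical.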
