An update of Plink2: --glm now errors out and recommends adding --covar-variance-standardize

re...@channing.harvard.edu

Feb 25, 2019, 4:09:58 PM
to plink2-users
Hi everyone,

A recent plink2 update changed --glm: it now errors out and recommends adding --covar-variance-standardize when covariates vary enough in scale for numeric instability to be a major concern. I recently ran GWAS on 128 phenotypes separately with plink2 alpha. The first 78 ran fine, and I got the corresponding Manhattan plots. However, the plot for the 79th phenotype is missing the SNPs on Chr17-22. When I checked my log files, I found this warning message:

Warning: Skipping --glm regression on phenotype '79' since genotype/covariate
scales vary too widely for numerical stability of the current implementation.
Try rescaling your covariates with e.g. --covar-variance-standardize.

So plink2 skipped Chr17-22 for that phenotype. For the remaining GWAS (i.e. phenotypes 80-128), plink2 skipped all chromosomes, and I saw the same warning message as shown above. Is --covar-variance-standardize required whenever quantitative covariates are included and this warning appears?

Thanks,
Jiangyuan

Christopher Chang

Feb 25, 2019, 4:26:25 PM
to plink2-users
Yes, you should add --covar-variance-standardize for now.  (This should not have a significant effect on your p-values; you can test this by e.g. adding --covar-variance-standardize for your 78th phenotype and comparing results.)
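
To see why standardizing should leave the genotype result alone, here is a minimal sketch in Python with NumPy. It is not plink2's code: it uses ordinary linear regression on simulated data (all variable names and effect sizes are made up for illustration), and it assumes --covar-variance-standardize amounts to a per-column shift to mean zero and rescale to variance 1. Because the model has an intercept, that transform does not change the space spanned by the covariates, so the genotype t-statistic should agree up to floating-point error.

```python
import numpy as np

def genotype_t_stat(genotype, covariates, phenotype):
    """OLS t-statistic for the genotype term (model: intercept + genotype + covariates)."""
    n = len(phenotype)
    X = np.column_stack([np.ones(n), genotype, covariates])
    q, r = np.linalg.qr(X)                      # QR avoids forming X'X explicitly
    beta = np.linalg.solve(r, q.T @ phenotype)  # least-squares coefficients
    resid = phenotype - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])   # residual variance estimate
    r_inv = np.linalg.inv(r)
    # diag((X'X)^-1) equals the row sums of squares of R^-1
    se = np.sqrt(sigma2 * np.sum(r_inv ** 2, axis=1))
    return beta[1] / se[1]

# Simulated data (hypothetical): one small-scale and one large-scale covariate.
rng = np.random.default_rng(0)
n = 500
genotype = rng.binomial(2, 0.3, n).astype(float)
age = rng.normal(50, 10, n)
big = rng.normal(0, 1000, n)
covars = np.column_stack([age, big])
pheno = 0.2 * genotype + 0.03 * age + 0.001 * big + rng.normal(size=n)

# Assumed equivalent of variance standardization: mean 0, variance 1 per column.
standardized = (covars - covars.mean(axis=0)) / covars.std(axis=0, ddof=1)

t_raw = genotype_t_stat(genotype, covars, pheno)
t_std = genotype_t_stat(genotype, standardized, pheno)
print(t_raw, t_std)
```

The covariates' own coefficients do change with the rescaling, which is the trade-off to keep in mind if those coefficients matter to you.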

I'm planning an update to --glm which performs automatic rescaling when necessary, while keeping track of how scaling was performed so that regression coefficients can be reported for the original units.  However, this is a relatively low priority.

sy06...@gmail.com

Mar 25, 2019, 4:09:51 PM
to plink2-users
Hi Chris, 

I have a question about the use of --covar-variance-standardize. Sorry if it is a trivial question. When would you recommend not using it? Or can we use it by default in any --glm run that includes a covariate table containing multiple discrete and continuous variables?
My second question: can we use "NA" for all missing values in a covariate table that contains both continuous and discrete values? For example: Age, Genetic_sex and the first 10 principal components.

Thank you  very much for your time,

Saliha

Are we allowed to use "NA" for missing continuous covariates?

Christopher Chang

Mar 25, 2019, 4:14:36 PM
to plink2-users
1. If you actually care about a covariate's regression coefficient, it may be more convenient to not use --covar-variance-standardize.
2. Yes, "NA" is normally interpreted as a missing covariate value.  The exception is for *categorical* covariates, where "NONE" should be used instead.

sy06...@gmail.com

Mar 25, 2019, 4:29:06 PM
to plink2-users
Thank you very much for the clarifications.
Cheers,
Saliha

sy06...@gmail.com

Mar 27, 2019, 4:40:48 PM
to plink2-users
Hi Chris, 

Just following up on your input about missing categorical covariates being coded as "NONE" for plink2.
I understand that missing values for all case/control (2 and 1 respectively in PLINK2) and quantitative phenotypes have to be coded "NA".

For the covariates, from the error listed below, it seems we cannot have 1, 2 and "NONE" in the same column. Is that correct, since the column is then categorical?

Moreover, consider the example of gender, which is coded 2 for females and 1 for males in PLINK2.
- How will PLINK treat the input if I use 1, 2 and "NA" for gender, compared to "female", "male" and "NONE"?

Thank you very much for your time,

Cheers,

Saliha


(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to test16NONE.log.
Options in effect:
  --covar /home/…/Covar.txt
  --glm
  --keep /home/…/Pheno2617PNs.txt
  --memory 1600
  --out test16NONE
  --pfile /home/…/385K
  --pheno /home/N…/Pheno2617PNs.txt
  --threads 5

Start time: Wed Mar 27 20:15:33 2019
386743 MiB RAM detected; reserving 1600 MiB for main workspace.
Using up to 5 compute threads.
386114 samples (207945 females, 178169 males; 386114 founders) loaded from
/home/…/385K.psam.
142 variants loaded from
/home/…/385K.pvar.
1 binary phenotype loaded (847 cases, 1670 controls).
--keep: 2517 samples remaining.
Error: 'HLA' entry on line 224 of
/home/…/Covar.txt is categorical, while earlier entries are not.
(Case/control and quantitative phenotypes must all be numeric/'NA'.
Categorical phenotypes cannot be 'NA'--use e.g. 'NONE' to represent missing
categorical values instead--or start with a number.)
End time: Wed Mar 27 20:15:34 2019


Lines 223 and 224 of the /home/…/Covar.txt file are shown below. The third column is HLA (the possible values are 1, 2 and "NONE").

1003326 1003326 1 1 45 -0.010642 0.041682299 0.0308064 -0.00832141 0.00437607 -0.031305399 0.0211863 0.00226753 8.74E-04 0.00226753

1003334 1003334 NONE 2 56 0.030665601 0.00552055 -0.00170586 0.014714 -0.035934702 0.0199519 0.00696799 -0.00890832 -0.0269731 -0.00890832
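
A column like this, where numeric codes and "NONE" appear together, can be caught before running plink2. The sketch below is a rough pre-check in Python, not plink2's own parser: the helper names are hypothetical, and it assumes a token counts as numeric if it is "NA"/"nan" or parses as a float, and as categorical otherwise. The sample rows are the first five fields of the two lines above.

```python
def is_numeric_token(token):
    """Assumed rule: 'NA'/'nan' or anything float() accepts counts as numeric."""
    if token in ("NA", "nan"):
        return True
    try:
        float(token)
        return True
    except ValueError:
        return False

def classify_columns(rows):
    """Label each column 'numeric', 'categorical', or 'mixed' (the problem case)."""
    kinds = []
    for column in zip(*rows):
        flags = {is_numeric_token(tok) for tok in column}
        if flags == {True}:
            kinds.append("numeric")
        elif flags == {False}:
            kinds.append("categorical")
        else:
            kinds.append("mixed")
    return kinds

# First five whitespace-separated fields of lines 223 and 224 of the covariate file.
sample = """1003326 1003326 1 1 45
1003334 1003334 NONE 2 56"""
rows = [line.split() for line in sample.splitlines()]

for index, kind in enumerate(classify_columns(rows), start=1):
    print(f"column {index}: {kind}")
```

Any column reported as mixed would need to be recoded so it is entirely numeric (with NA for missing) or entirely non-numeric (with NONE for missing).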

Christopher Chang

Mar 27, 2019, 5:04:37 PM
to plink2-users
1. Correct, you cannot mix 1, 2, and NONE.  Each column must be entirely numeric (except with NA/nan permitted as missing values), or entirely non-numeric.
2. For gender specifically, you should use plink's standard encoding.  But for other binary covariates, there should be no difference in --glm results (except for the intercept) between 1/2/NA and category1/category2/NONE; you can use whatever representation makes more sense in context.
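
The second point can be illustrated with a small Python/NumPy sketch. It is an assumption-laden toy, not plink2 itself: it uses ordinary linear regression on simulated data, ignores missing values, and assumes a two-level categorical covariate ends up as a single 0/1 indicator column. Under those assumptions, coding the covariate as 1/2 versus 0/1 shifts only the intercept.

```python
import numpy as np

# Simulated data (hypothetical names and effect sizes).
rng = np.random.default_rng(1)
n = 400
genotype = rng.binomial(2, 0.3, n).astype(float)
group = rng.integers(0, 2, n).astype(float)   # binary covariate as a 0/1 indicator
pheno = 0.3 * genotype + 0.5 * group + rng.normal(size=n)

def fit(covariate):
    """Least-squares fit; returns [intercept, genotype effect, covariate effect]."""
    X = np.column_stack([np.ones(n), genotype, covariate])
    beta, *_ = np.linalg.lstsq(X, pheno, rcond=None)
    return beta

beta_12 = fit(group + 1.0)   # numeric 1/2 coding
beta_01 = fit(group)         # indicator coding, as assumed for category1/category2

print("1/2 coding:", beta_12)
print("0/1 coding:", beta_01)
```

The genotype and covariate effects match between the two fits, and the 0/1 intercept equals the 1/2 intercept plus the covariate effect. If the other category were taken as the reference level, the covariate effect would flip sign, but the genotype effect would still be unchanged.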