logistic regression run time


fasssster101

Apr 16, 2019, 9:01:24 AM
to plink2-users
Hello, 

I am running a logistic regression with the following flags: --logistic --adjust gc --covar --ci 0.95

The data set has ~50,000 individuals, and I am using 24 cores with ~19 covariates. The job has been running for over 24 hours and the calculations are still not complete. Is there a way to estimate what the run time should be? Are there ways to speed this calculation up?

Thank you,

Christopher Chang

Apr 16, 2019, 10:19:31 AM
to plink2-users
plink 1.9 --logistic is single-threaded. You should get a large speedup from plink 2.0 "--glm firth-fallback", since that'll actually take advantage of your 24 cores.

Re: estimating runtime, both versions should display what % of variants are complete.

fasssster101

Apr 16, 2019, 11:15:39 AM
to plink2-users
Thank you -- very helpful.

fasssster101

Apr 18, 2019, 9:13:48 AM
to plink2-users
Hi Chris, 

When I am using --glm in plink2 with --covar, I get the following warning: 

Warning: Skipping --glm regression on phenotype 'PHENO1' since covariate
correlation matrix could not be inverted. You may want to remove redundant
covariates and try again.

I have 19 covariates: year of birth, 5 PCs, and 13 one-hot encoded batch columns. I calculated the pairwise correlations between all of them, and they range from -0.2 to 0.5, so none of them are highly correlated.

If I remove the last batch column, then the glm runs without a problem. If I remove one of the PCs instead, the warning still occurs.

I am confused about where the high correlation is coming from.

Thanks! 


Christopher Chang

Apr 18, 2019, 1:14:52 PM
to plink2-users
The regression always has an "intercept" column with all 1s.  So, when using one-hot encoding, if you don't omit one of the categories, the sum of all the category columns will be equal to the all-1s column.  Any nontrivial linear combination equal to zero breaks the regression, not just pairwise combinations.
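
To make the point above concrete, here is a small sketch in plain Python. This is toy data, not the poster's dataset: the batch labels and the `matrix_rank` helper are made up for illustration. It builds a design matrix with an intercept plus one indicator column per batch and compares the number of columns with the matrix rank, computed exactly with fractions.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank via exact Gaussian elimination on Fractions (no rounding issues)."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Clear this column in every row below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

# Toy data: 6 samples spread over 3 batches (hypothetical labels).
batches = ["A", "A", "B", "B", "C", "C"]
levels = sorted(set(batches))

def design(levels_used):
    # Intercept column of 1s followed by one indicator column per level.
    return [[1] + [int(b == lv) for lv in levels_used] for b in batches]

full = design(levels)         # intercept + all 3 indicators: 4 columns
reduced = design(levels[1:])  # intercept + 2 indicators: 3 columns

full_rank = matrix_rank(full)
reduced_rank = matrix_rank(reduced)
print(len(full[0]), full_rank)        # 4 columns, rank 3: singular
print(len(reduced[0]), reduced_rank)  # 3 columns, rank 3: invertible
```

With all three indicators present, the indicator columns add up to the intercept column, so the rank is one less than the column count even though no single pair of columns is perfectly correlated. Dropping any one indicator restores full rank, which matches the observation that removing the last batch column made the warning go away while removing a PC did not.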

Incidentally, plink 2.0 directly supports categorical covariates; --glm automatically omits one category and one-hot encodes the rest.
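
If you want to try the categorical route, one option is to collapse the existing one-hot columns back into a single label column before writing the covariate file. The sketch below is only an illustration: the column names, sample IDs, and the `collapse_one_hot` helper are hypothetical, and it assumes that a column of non-numeric strings is what plink 2.0 reads as a categorical covariate, so check the plink 2.0 documentation for the exact file format.

```python
def collapse_one_hot(row, batch_columns):
    """Return the name of the single batch column that is set to 1 in `row`."""
    active = [name for name in batch_columns if row[name] == 1]
    if len(active) != 1:
        raise ValueError(f"expected exactly one active batch column, got {active}")
    return active[0]

# Hypothetical one-hot encoded covariate rows (3 batches instead of 13).
batch_columns = ["batch1", "batch2", "batch3"]
rows = [
    {"IID": "s1", "batch1": 1, "batch2": 0, "batch3": 0},
    {"IID": "s2", "batch1": 0, "batch2": 0, "batch3": 1},
    {"IID": "s3", "batch1": 0, "batch2": 1, "batch3": 0},
]

# One (sample ID, batch label) pair per sample, ready to be written out
# as a single string-valued covariate column.
labels = [(r["IID"], collapse_one_hot(r, batch_columns)) for r in rows]
for iid, label in labels:
    print(iid, label)
```

The helper raises an error when a row has zero or several active batch columns, since such a row cannot be mapped to one category.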