logistic regression run time

fasssster101

Apr 16, 2019, 9:01:24 AM

to plink2-users

Hello,

I am running a logistic regression with the following flags: --logistic --adjust gc --covar --ci 0.95

The data set has ~50,000 individuals and I am using 24 cores with ~19 covariates. I have run it for over 24 hours and the calculations are still not complete. Is there a way to estimate what the run time should be? Are there ways of speeding this calculation up?

Thank you,

Christopher Chang

Apr 16, 2019, 10:19:31 AM

to plink2-users

plink 1.9 --logistic is single-threaded. You should get a large speedup from plink 2.0 "--glm firth-fallback", since that'll actually take advantage of your 24 cores.

Re: estimating runtime, both versions should display what % of variants are complete.

fasssster101

Apr 16, 2019, 11:15:39 AM

to plink2-users

Thank you -- very helpful.

fasssster101

Apr 18, 2019, 9:13:48 AM

to plink2-users

Hi Chris,

When I am using --glm in plink2 with --covar, I get the following warning:

Warning: Skipping --glm regression on phenotype 'PHENO1' since covariate
correlation matrix could not be inverted. You may want to remove redundant
covariates and try again.

I have 19 covariates: year of birth, 5 PCs, and 13 one-hot encoded batch columns. I calculated the pairwise correlations between all of them, and they range from -0.2 to 0.5, so none are highly correlated.

If I remove the last batch column, --glm runs without a problem. If I remove one of the PCs instead, the warning still occurs.

I am confused about where the high correlation is coming from.

Thanks!

Christopher Chang

Apr 18, 2019, 1:14:52 PM

to plink2-users

The regression always has an "intercept" column with all 1s. So, when using one-hot encoding, if you don't omit one of the categories, the sum of all the category columns will be equal to the all-1s column. Any nontrivial linear combination equal to zero breaks the regression, not just pairwise combinations.
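To make the point above concrete, here is a minimal sketch in plain Python. It is not plink code; the toy data (6 samples, 3 batches) and the helper names `matrix_rank` and `design` are made up for illustration. It builds a design matrix with an intercept column plus one-hot batch columns and checks its rank with and without the last batch column.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix via Gaussian elimination over exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current pivot position with a nonzero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from all rows below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Hypothetical toy data: batch index for each of 6 samples.
batches = [0, 1, 2, 0, 1, 2]
n_batches = 3


def design(drop_last):
    """Intercept column of 1s followed by one-hot batch columns."""
    keep = n_batches - 1 if drop_last else n_batches
    return [[1] + [1 if b == j else 0 for j in range(keep)] for b in batches]


full = design(drop_last=False)    # intercept + 3 batch columns = 4 columns
reduced = design(drop_last=True)  # intercept + 2 batch columns = 3 columns

full_rank = matrix_rank(full)
reduced_rank = matrix_rank(reduced)

print(len(full[0]), full_rank)        # 4 columns, but rank 3: singular
print(len(reduced[0]), reduced_rank)  # 3 columns, rank 3: invertible
```

With all batch columns kept, the matrix has 4 columns but rank 3, because the batch columns sum to the intercept column. Dropping one batch column restores full column rank, which matches the behavior reported earlier in the thread.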

Incidentally, plink 2.0 directly supports categorical covariates; --glm automatically omits one category and one-hot encodes the rest.
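If you want to use that feature with an existing one-hot encoded covariate file, one option is to collapse the batch columns back into a single label column. The sketch below is only an illustration: the function name `one_hot_to_label` and the `batch` prefix are hypothetical, and it assumes plink 2.0 treats a covariate with non-numeric string values as categorical, which you should confirm against the plink 2.0 documentation for your build.

```python
def one_hot_to_label(row, prefix="batch"):
    """Collapse one sample's one-hot batch columns into a single label.

    `row` is a sequence of 0/1 values with exactly one 1.
    The label starts with a letter so it is not parsed as a number.
    """
    if any(v not in (0, 1) for v in row):
        raise ValueError("one-hot columns must contain only 0 and 1")
    hot = [i for i, v in enumerate(row) if v == 1]
    if len(hot) != 1:
        raise ValueError("expected exactly one 1 per sample")
    return f"{prefix}{hot[0] + 1}"


# Hypothetical example: three samples, three batch columns.
samples = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
labels = [one_hot_to_label(s) for s in samples]
print(labels)  # ['batch1', 'batch3', 'batch2']
```

The resulting label column would replace the 13 batch columns in the covariate file, leaving the choice of reference category to --glm.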