specialized linear regression for very large number of cases

josef...@gmail.com

Apr 25, 2016, 9:51:56 AM

to pystatsmodels

(A last paper in this series)

This looks like a good example of how to write a specialized
algorithm for a very large number of independent regressions.

Briefly: We have a large set of X variables and a large set of Y
variables. We want to test every combination of an x and a y series for
a significant relationship, and there are a few billion of these pairs.

If we do this using a general purpose implementation like OLS, then it
takes a very long time.
The paper shows how to strip away all unnecessary computations, and
preprocess X and Y so that all paired regressions are relatively
cheap.

Shabalin, Andrey A. 2012. “Matrix eQTL: Ultra Fast eQTL Analysis via
Large Matrix Operations.” Bioinformatics 28 (10): 1353–58.
doi:10.1093/bioinformatics/bts163.
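
To make the preprocessing idea concrete, here is a minimal sketch for the
simplest case, a regression of each y on a constant and a single x. It is
not the code from the paper and not a statsmodels API; the function name
all_pairs_tstats is made up for illustration. It assumes rows are
observations, no missing values, and no constant columns. After centering
and scaling every column once, a single matrix product gives all pairwise
correlations, and the slope t-statistic is a simple function of each
correlation.

```python
import numpy as np


def all_pairs_tstats(X, Y):
    """Slope t-statistics for every simple regression y_j ~ 1 + x_i.

    X : (n, p) array, Y : (n, q) array, rows are observations.
    Returns (r, t, df); r and t have shape (p, q).
    Hypothetical helper, assumes no constant columns and no NaNs.
    """
    n = X.shape[0]
    # Center and scale each column to unit length, once for all pairs.
    Xs = X - X.mean(axis=0)
    Xs = Xs / np.linalg.norm(Xs, axis=0)
    Ys = Y - Y.mean(axis=0)
    Ys = Ys / np.linalg.norm(Ys, axis=0)
    # One matrix product yields all p * q Pearson correlations.
    r = Xs.T @ Ys
    # Residual degrees of freedom: n minus intercept and slope.
    df = n - 2
    # Standard identity between the correlation and the slope t-statistic.
    t = r * np.sqrt(df / (1.0 - r**2))
    return r, t, df
```

Two-sided p-values would then follow from the t distribution with df
degrees of freedom, for example 2 * scipy.stats.t.sf(np.abs(t), df). With
billions of pairs the (p, q) result will not fit in memory, so in practice
one would loop over blocks of X columns and keep only the pairs that pass
a threshold.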

I don't know whether there is demand for this specific case in
statsmodels, but it would be good to cover some similar use cases:
https://github.com/statsmodels/statsmodels/issues/2203

These applications are different from BigModels, because each
estimation problem would easily fit in memory; the difficulty is that we
have a huge number of them.

(aside: multiple testing p-value correction currently requires the
full array of p-values. There is no option to look only at the
smallest p-values while taking the total number of hypothesis tests
into account.)
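
As an illustration of what such an option could look like, here is a
sketch of the Benjamini-Hochberg step-up rule applied to only the retained
small p-values, with the total number of tests passed in separately. This
is not an existing statsmodels function; the name bh_reject_from_smallest
and its arguments are hypothetical. It relies on the assumption that every
p-value at or below alpha was retained, since any p-value that BH rejects
is at most alpha and its rank among the retained values is then the same
as its rank in the full array.

```python
import numpy as np


def bh_reject_from_smallest(p_small, n_tests, alpha=0.05):
    """Benjamini-Hochberg rejections from the smallest p-values only.

    p_small : retained p-values; assumed to contain every p-value <= alpha.
    n_tests : total number of tests performed, including discarded ones.
    Returns a boolean mask aligned with p_small (hypothetical helper).
    """
    p_small = np.asarray(p_small, dtype=float)
    order = np.argsort(p_small)
    p_sorted = p_small[order]
    ranks = np.arange(1, p_sorted.size + 1)
    # Step-up rule: compare the i-th smallest p-value with i * alpha / m,
    # where m is the total number of tests, not the number retained.
    below = p_sorted <= ranks * alpha / n_tests
    reject = np.zeros(p_small.size, dtype=bool)
    if below.any():
        # Reject everything up to the largest rank that passes.
        k = np.nonzero(below)[0].max()
        reject[order[: k + 1]] = True
    return reject
```

The Bonferroni analogue is simpler still: compare each retained p-value
with alpha / n_tests.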

Josef

josef...@gmail.com

Apr 25, 2016, 10:05:35 AM

to pystatsmodels

The more general problem that this reduces to is finding the
non-zero (significant) correlation coefficients in a big
cross-correlation matrix, after partialling out some explanatory
variables.

(573337 * 22011 in the example of the article, IIUC)
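
A minimal sketch of that reduction, under my own assumptions rather than
taken from the article: residualize both X and Y on the common explanatory
variables Z (plus a constant) with one QR decomposition, then compute the
cross-correlation matrix of the residuals. By the Frisch-Waugh-Lovell
theorem, the resulting t-statistics equal those of the x coefficient in
the regressions y ~ 1 + Z + x. The name partial_cross_corr is
hypothetical, and Z together with the constant is assumed to have full
column rank.

```python
import numpy as np


def partial_cross_corr(X, Y, Z):
    """Partial correlations of all (x_i, y_j) pairs given Z and a constant.

    X : (n, p), Y : (n, q), Z : (n, k) arrays, rows are observations.
    Returns (r, t, df); r and t have shape (p, q). Hypothetical helper.
    """
    n = X.shape[0]
    # Common explanatory variables, always including an intercept.
    Zc = np.column_stack([np.ones(n), Z])
    # Orthonormal basis for the column space of Zc (reduced QR).
    Q, _ = np.linalg.qr(Zc)
    # Partial out Zc from X and Y once, for all pairs.
    Xr = X - Q @ (Q.T @ X)
    Yr = Y - Q @ (Q.T @ Y)
    Xr = Xr / np.linalg.norm(Xr, axis=0)
    Yr = Yr / np.linalg.norm(Yr, axis=0)
    # Cross-correlation matrix of the residuals = partial correlations.
    r = Xr.T @ Yr
    # Residual df of y ~ 1 + Z + x: n minus columns of Zc minus one for x.
    df = n - Zc.shape[1] - 1
    t = r * np.sqrt(df / (1.0 - r**2))
    return r, t, df
```

For a matrix of the size quoted above, r would again have to be computed
in blocks, keeping only the entries whose |t| exceeds a chosen threshold.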

Josef

josef...@gmail.com

Apr 25, 2016, 10:14:31 AM

to pystatsmodels

And my guess is that this can be generalized to GLM and other
nonlinear models by using a few billion score tests instead.
(I haven't yet seen an article for that.)
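
To illustrate why score tests would fit this pattern, here is a rough
sketch for one binary y and a logistic model. It is only my reading of
the idea, not a method from any article, and the function name
logistic_score_stats is hypothetical. The null model (intercept plus
optional covariates Z) is fit once; the score statistic for adding each
x column is then obtained from matrix products, without fitting any model
that includes x. The Newton iteration below has no step control or
convergence check, which is a simplification.

```python
import numpy as np


def logistic_score_stats(y, X, Z=None, n_iter=25):
    """Score (Rao) statistics for adding each column of X, one at a time,
    to a logistic regression of binary y on a constant and Z.

    y : (n,) array of 0/1, X : (n, p), Z : (n, k) or None.
    Returns a (p,) array; each entry is asymptotically chi2(1) under
    the null. Hypothetical helper, plain Newton without safeguards.
    """
    n = y.shape[0]
    Zc = np.ones((n, 1)) if Z is None else np.column_stack([np.ones(n), Z])
    # Fit the null model once by Newton-Raphson (IRLS).
    beta = np.zeros(Zc.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(Zc @ beta)))
        w = mu * (1.0 - mu)
        beta += np.linalg.solve(Zc.T @ (w[:, None] * Zc), Zc.T @ (y - mu))
    mu = 1.0 / (1.0 + np.exp(-(Zc @ beta)))
    w = mu * (1.0 - mu)
    # Score of each candidate x at the null fit: x' (y - mu).
    U = X.T @ (y - mu)
    # Its variance: x'Wx - x'WZ (Z'WZ)^{-1} Z'Wx, for all columns at once.
    WZ = w[:, None] * Zc
    XtWZ = X.T @ WZ
    xWx = (w[:, None] * X**2).sum(axis=0)
    adj = np.einsum("ij,ij->i", XtWZ, np.linalg.solve(Zc.T @ WZ, XtWZ.T).T)
    return U**2 / (xWx - adj)
```

In the intercept-only case this statistic reduces to n times the squared
Pearson correlation between x and y, so the cost per pair is the same as
in the linear case; with many y variables, one null fit per y is needed.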

Josef
