(The last paper in this series)
This looks like a good example of how to write a specialized
algorithm for a very large number of independent regressions.
Briefly: we have a large set of x variables and a large set of y
variables, and we want to test all combinations of x and y series for
a significant relationship. There are a few billion such pairs.
If we do each regression with a general-purpose implementation like
OLS, it takes a very long time.
The paper shows how to strip away all unnecessary computations, and
preprocess X and Y so that all paired regressions are relatively
cheap.
Shabalin, Andrey A. 2012. “Matrix eQTL: Ultra Fast eQTL Analysis via
Large Matrix Operations.” Bioinformatics 28 (10): 1353–58.
doi:10.1093/bioinformatics/bts163.
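The preprocessing idea can be sketched in a few lines (this is my own
minimal illustration of the trick, not code from the paper): standardize
every x and y series once, then a single matrix product yields all
pairwise correlations, and the slope t-statistic of each simple
regression is just a function of the correlation and the degrees of
freedom.

```python
import numpy as np
from scipy import stats

def all_pairs_slope_tests(x, y):
    """t-tests for the slope of every simple regression y[:, j] on x[:, i].

    x : (n, p) array, y : (n, q) array.
    Returns (p, q) arrays of correlations and two-sided p-values.
    Sketch of the Matrix eQTL idea: standardize once, then one
    matrix product covers all pairs.
    """
    n = x.shape[0]
    xs = (x - x.mean(0)) / x.std(0, ddof=1)
    ys = (y - y.mean(0)) / y.std(0, ddof=1)
    r = xs.T @ ys / (n - 1)           # all pairwise correlations at once
    df = n - 2
    t = r * np.sqrt(df / (1 - r**2))  # slope t-stat equals correlation t-stat
    pval = 2 * stats.t.sf(np.abs(t), df)
    return r, pval
```

For p = q = 50,000 this is p * q = 2.5 billion tests, but the dominant
cost is one (p, n) by (n, q) matrix product, which can additionally be
done in blocks to bound memory.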
I don't know whether there is demand for this specific case in
statsmodels, but it would be good to cover some similar use cases:
https://github.com/statsmodels/statsmodels/issues/2203
These applications are different from BigModels, because each
estimation problem easily fits in memory; the problem is that we have
a huge number of them.
(Aside: multiple testing p-value correction currently requires the
full array of p-values. There is no option to look at only the
smallest p-values while taking the total number of hypothesis tests
into account.)
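For some corrections the full array isn't actually needed. As a rough
sketch (a hypothetical helper, not an existing statsmodels function):
Bonferroni only needs each p-value and the total count, and
Benjamini-Hochberg can be applied to the k smallest values with the
total count, at the price of being conservative when the minimum in
the step-up rule would be attained beyond the kept values.

```python
import numpy as np

def correct_smallest(pvals_smallest, n_tests):
    """Correct only the k smallest of n_tests p-values.

    Hypothetical helper for illustration. Bonferroni is exact:
    p * n_tests, clipped at 1. For Benjamini-Hochberg, the adjusted
    value of the i-th smallest of n_tests p-values is
    min over j >= i of p_(j) * n_tests / j; restricting j to the
    kept k values gives a conservative (upper-bound) adjustment.
    """
    p = np.sort(np.asarray(pvals_smallest, dtype=float))
    k = p.size
    bonf = np.minimum(p * n_tests, 1.0)
    bh = p * n_tests / np.arange(1, k + 1)
    # enforce monotonicity: running minimum from the largest kept value down
    bh = np.minimum(np.minimum.accumulate(bh[::-1])[::-1], 1.0)
    return bonf, bh
```

When k equals n_tests this reproduces the usual full-array adjustments,
so the sketch degrades gracefully to the current behavior.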
Josef