I don't know about the date handling, and I'll leave the pandas question to Wes.
For a more efficient calculation, I would compute daily X'X and X'y
matrices and add them up over the moving window of days. At least for
the 15-day window this would save quite a bit of computation.
(Fancier would be to update inv(X'X), or even better to update the
Cholesky decomposition.) If not all days have the same number of
minutes, this could easily be handled by using weights when adding up
the daily X'X and X'y.
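A minimal sketch of that summation scheme, in case it helps (X_days and
y_days are hypothetical lists of per-day design matrices and responses,
made up for illustration; this is not John's code):

import numpy as np

def rolling_ols_by_day(X_days, y_days, window=15):
    # per-day cross products; unequal day lengths could be handled
    # by attaching weights to each day's contribution here
    XtX = [X.T @ X for X in X_days]
    Xty = [X.T @ y for X, y in zip(X_days, y_days)]
    betas = []
    for end in range(window - 1, len(X_days)):
        # sum the daily pieces over the trailing window and solve
        Sxx = sum(XtX[end - window + 1:end + 1])
        Sxy = sum(Xty[end - window + 1:end + 1])
        betas.append(np.linalg.solve(Sxx, Sxy))
    return np.array(betas)

An incremental version would keep running sums, adding the newest day's
cross products and subtracting the day that drops out of the window; the
inv(X'X) or Cholesky updating mentioned above would refine that further.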
Getting the parameter estimates will be easy. For the remaining
results that a Results instance provides, I'm not sure whether
everything is available to create a Results instance given only X'X and
X'y. (It should work, maybe with minor adjustments.)
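For instance, besides X'X and X'y you would also need y'y and the number
of observations to get the residual variance and standard errors. A
sketch of just the algebra (not statsmodels internals):

import numpy as np

def params_and_bse(XtX, Xty, yty, nobs):
    k = XtX.shape[0]
    XtX_inv = np.linalg.inv(XtX)
    beta = XtX_inv @ Xty
    # SSR = y'y - b'X'y, so sigma^2 and standard errors follow
    ssr = yty - beta @ Xty
    sigma2 = ssr / (nobs - k)
    bse = np.sqrt(sigma2 * np.diag(XtX_inv))
    return beta, bse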
How many regressors (columns of x) do you have?
Just curious, because the speed of the different ways of doing the
linear algebra also depends on the shape of x.
Josef
If you use correlations and standard deviations, or covariances, then
you also implicitly include a constant in the regression. I don't know
if that is what you want.
Since this is a very special case with simple calculations, I would go
for a custom implementation and test it against pandas and statsmodels
if speed is important.
If you have rolling X'X or rolling covariances, you could also easily
calculate the slope coefficients for all pairs of regressions in one
vectorized calculation.
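A sketch of that pairwise idea, assuming a rolling covariance matrix cov
(k x k) for one window; the demeaning implicit in the covariances is
what supplies the constant mentioned above:

import numpy as np

def pairwise_slopes(cov):
    # slope of regressing variable i on variable j (with intercept):
    # beta[i, j] = cov(i, j) / var(j)
    var = np.diag(cov)
    return cov / var[np.newaxis, :]

The intercepts would then follow as mean_i - beta[i, j] * mean_j from
the rolling means.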
Josef
FWIW, I think the 'truncation' ('resample'? truncation implies masking
values, not dates, to me) should stay at the data level to keep the
estimator code simpler, if it's not costly performance-wise, a la
EViews' and gretl's smpl commands. I don't think there's much lost in
your approach above.
Skipper
Right. Slice or resample, not truncate, as in
http://en.wikipedia.org/wiki/Truncation_%28statistics%29
Just a quibble, but I think intuitive names are important, especially
as I'm feeling my way around and trying to learn pandas.
Skipper
I don't really understand what you mean? 'truncate' here happens to
refer to the dates in the index rather than the values.
In this case I would probably just write this code:

window_offset = datetools.BDay(15)
coefs = {}
for day in day_rng:
    coefs[day] = ols(y=y.truncate(day - window_offset, day),
                     x=x.truncate(day - window_offset, day)).beta
coefs = DataFrame(coefs).T

or, as a built-in interface, something like:

ols(y=y, x=x, window=BDay(15), subset=day_rng)
I'll give this some more thought but the "roll your own" option above will work.
- Wes
I guess you still have quite a bit of overhead in the pandas loop and
some redundant calculations.
An internal function that removes some of that overhead and redundancy
could speed it up some more.
(There is quite a difference between using convenient general tools
and coding for a special case.)
> When Nobs = 116490 (about 16 months * 22 days/month * 5.5 hours/day * 60
> min/hour), the standard rolling OLS took 22.3 sec on my old machine, but RoY
> took 3.2 sec.
> When I shift to 5-min intervals (66 in a day, down from 330 -- much easier
> for rolling OLS to compete) Nobs = 23298, rolling OLS took 4.3 sec (a factor
> of 5, just as with Nobs) and RoY took 1.9 sec.
> Attached is a .png file (matplotlib!) whose bottom panel shows the rolling
> 1-min OLS beta and the daily RoY beta converging at the end of each day, as
> they should.
I think you mixed up the colors in the legend.
Do the values agree exactly at the end of each day, or are they slightly
off? It's difficult to tell in the plot, but there seem to be small
differences.
Josef
> -- John