pandas: periodic ols rolling regressions


jdmarino

Jun 30, 2011, 10:58:29 PM
to pystat...@googlegroups.com
I have 1-minute price returns (390 per stock per day) and want to compute rolling OLS regressions, but I don't need a set of coefficients every minute.  In fact, I'd like to compute the coefficients just once per day, at the end of the day, using my 1-minute data going back 5, 10, or 15 days.  The standard pandas rolling OLS regression will compute a set of coefficients every minute (and it's slow), and I will throw away 389 out of 390 results.  Is there a built-in way to run the regressions at a lower frequency than the data implies?

(BTW, I also plan to do rolling regressions with returns of longer horizons: 5-, 15-, and 30-minute returns.)

As a work-around, I'm thinking I should "roll my own" by extracting a multi-day window of data once per day, running a single regression on each window, and collecting the results.

josef...@gmail.com

Jun 30, 2011, 11:37:35 PM
to pystat...@googlegroups.com

I don't know about the date handling, and leave the pandas question to Wes.

For more efficient calculation I would compute daily X'X and X'y
matrices and add them up over the moving window of days. At least for
the 15-day window this would save quite a bit of calculation.
(Fancier would be to update inv(X'X), or even better, to update the
Cholesky decomposition.) If not all days have the same number of
minutes, this could easily be handled with weights when adding up
the X'X and X'y over days.
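
A minimal sketch of the accumulation idea (made-up names, untested;
just an illustration, not statsmodels code):

import numpy as np

def daily_window_betas(x_days, y_days, window=15):
    # x_days: list of (n_obs, k) regressor arrays, one per day
    # y_days: list of (n_obs,) response arrays, one per day
    xtx = [np.dot(xd.T, xd) for xd in x_days]
    xty = [np.dot(xd.T, yd) for xd, yd in zip(x_days, y_days)]
    betas = []
    for t in range(window - 1, len(x_days)):
        xtx_w = sum(xtx[t - window + 1:t + 1])  # add daily X'X over the window
        xty_w = sum(xty[t - window + 1:t + 1])  # add daily X'y over the window
        betas.append(np.linalg.solve(xtx_w, xty_w))  # solve the normal equations
    return np.asarray(betas)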

Getting the parameter estimates will be easy. For the remaining
results that a Results instance provides, I'm not sure whether
everything needed to create one is available given only X'X and
X'y. (It should work, maybe with minor adjustments.)

How many regressors (columns of X) do you have?
Just curious, because the speed of the different ways of doing the
linear algebra also depends on the shape of x.

Josef

John Marino

Jul 1, 2011, 9:31:02 AM
to pystat...@googlegroups.com
@Josef:

This is an interesting idea.  My first choice is to stick with the built-ins, because they have been well-tested.  (Although I can use them to validate any custom algorithm that I implement.)

I have the same number of observations each day. My pricing data has NaNs filled forward with the last known price (pandas' 'pad' option), so returns are zero for those periods.  I have preselected the stocks to study based on having at least 75% real data.

I am regressing one stock's log returns against another's, so my independent x is a single column.

Pandas also computes rolling correlations for me every minute, and this is blazing fast.  If rolling standard deviation is just as fast, perhaps I can compute std for my stocks and then use corr*std1/std2 as the regression coefficient.
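
A quick self-contained sanity check of that identity (the OLS slope of
y on x, with an intercept, equals corr(x, y) * std(y) / std(x)):

import numpy as np

np.random.seed(0)
x = np.random.randn(1000)                  # faux returns, independent stock
y = 0.5 * x + 0.1 * np.random.randn(1000)  # faux returns, dependent stock
slope = np.polyfit(x, y, 1)[0]             # OLS slope with an intercept
alt = np.corrcoef(x, y)[0, 1] * y.std() / x.std()
print(np.allclose(slope, alt))             # True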

-- John

----------------
John Marino
CFA


josef...@gmail.com

Jul 1, 2011, 9:52:02 AM
to pystat...@googlegroups.com

If you use correlations and standard deviations, or covariances, then
you are also implicitly including a constant in the regression (the
data are demeaned). I don't know if that is what you want.

Since this is a very special case with simple calculations, I would go
for a custom implementation and test it against pandas and statsmodels
if speed is important.

If you have rolling x'x or rolling covariances you could also easily
calculate the slope coefficient for all pairs of regressions in one
vectorized calculation.
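
For example, something like this rough sketch (made-up data; note that
covariances demean the data, i.e. each regression has a constant):

import numpy as np

R = np.random.randn(390, 10)  # faux minute returns for 10 stocks
C = np.cov(R, rowvar=False)   # 10 x 10 covariance matrix
betas = C / np.diag(C)        # betas[i, j] = slope from regressing stock i on stock j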

Josef

Skipper Seabold

Jul 1, 2011, 1:34:43 PM
to pystat...@googlegroups.com
On Fri, Jul 1, 2011 at 1:24 PM, Wes McKinney <wesm...@gmail.com> wrote:
> In this case I would probably just write this code:
>
> # imports for the 2011-era pandas API (DateRange, datetools, and the
> # moving-window ols in pandas.stats were removed in later versions)
> import numpy as np
> from pandas import DataFrame, DateRange, Series
> from pandas.core import datetools
> from pandas.stats.api import ols
>
> # generate some faux data
> minute_rng = DateRange('1/1/2000', '5/1/2000', offset=datetools.Minute())
> day_rng = DateRange('1/1/2000', '5/1/2000', offset=datetools.BDay())
>
> y = Series(np.random.randn(len(minute_rng)), minute_rng)
> x = Series(np.random.randn(len(minute_rng)), minute_rng)
>
> # try this out too with multiple regressors
> # x = DataFrame(np.random.randn(len(minute_rng), 4), index=minute_rng)
>
> # run the regressions with a 15-day trailing window
>
> window_offset = datetools.BDay(15)
>
> coefs = {}
> for date in day_rng:
>    prior_date = date - window_offset
>    window_x = x.truncate(prior_date, date)
>    window_y = y.truncate(prior_date, date)
>    model = ols(y=window_y, x=window_x)
>    coefs[date] = model.beta
>
> coefs = DataFrame(coefs).T
>
> note that x could also be the commented-out DataFrame above, and the
> code would be unchanged
>
> let me know if you have any trouble. it might be worth thinking about
> incorporating a "subset" argument in the moving window case, as it
> would be ideal to be able to write just:
>
> ols(y=y, x=x, window=BDay(15), subset=day_rng)
>
> or something like that. Several things:
>
> - You can't currently use DateOffsets for the window; supporting that would be cool
> - I looked through the code and adding the "subset" option would
> really complicate the code. Maybe there's a way to refactor it that
> isn't too horrible
>

FWIW, I think the 'truncation' ('resample'? truncation implies masking
values, not dates, to me) should stay at the data level to keep the
estimator code simpler, if it's not costly performance-wise, a la
EViews' and gretl's smpl commands. I don't think there's much lost in
your approach above.

Skipper

Skipper Seabold

Jul 1, 2011, 1:50:01 PM
to pystat...@googlegroups.com
On Fri, Jul 1, 2011 at 1:43 PM, Wes McKinney <wesm...@gmail.com> wrote:
> I don't really understand what you mean? 'truncate' here happens to
> mean "slice off the data before and after the input labels (assuming
> sortedness)"
>

Right. Slice or resample, not truncate as in
http://en.wikipedia.org/wiki/Truncation_%28statistics%29

Just a quibble. But I think intuitive names are important, especially
as I'm feeling my way around and trying to learn pandas.

Skipper

Wes McKinney

Jul 1, 2011, 1:43:17 PM
to pystat...@googlegroups.com

I don't really understand what you mean? 'truncate' here happens to
mean "slice off the data before and after the input labels (assuming
sortedness)"

Wes McKinney

Jul 1, 2011, 1:24:11 PM
to pystat...@googlegroups.com

I'll give this some more thought but the "roll your own" option above will work.

- Wes

John Marino

Jul 1, 2011, 10:27:31 PM
to pystat...@googlegroups.com
Skipper wrote:
> FWIW, I think the 'truncation' [...] should stay at the data level to
> keep the estimator code simpler

I disagree.  The definition of the problem is that I want the *regression* to run less often (with a lower periodicity than the data it is using), so to my mind, this should be a feature of ols, not of creatively arranging the data.

I'm trying out Wes's idea and will report some timings when finished.

-- John

John Marino

Jul 6, 2011, 2:59:58 PM
to pystat...@googlegroups.com
Just to circle back: I tried Wes's roll-your-own method (vs. the built-in rolling OLS), and it worked.

My code has a few more bells & whistles, but it follows his format.  I ran a few speed tests on a day that is 5.5 hours long (330 minutes, with an observation each minute -- I omit the first and last half hour of the US trading day).  Since the RoY (roll-your-own) runs once per day, I naively expected a speedup of two orders of magnitude.  What I got was a 7x speedup (about one order of magnitude), but this is still very good.  Clearly, though, the rolling OLS is highly optimized (Cython? Josef's suggested optimization?).

When Nobs = 116490 (about 16 months * 22 days/month * 5.5 hours/day * 60 min/hour), the standard rolling OLS took 22.3 sec on my old machine, but RoY took 3.2 sec.

When I shift to 5-min intervals (66 per day, down from 330 -- much easier for rolling OLS to compete), Nobs = 23298; rolling OLS took 4.3 sec (down by a factor of 5, matching the drop in Nobs) and RoY took 1.9 sec.

Attached is a .png file (matplotlib!) whose bottom panel shows the rolling 1-min OLS beta and the daily RoY beta converging at the end of each day, as they should.

-- John

RollingVsRollYourOwn.png

josef...@gmail.com

Jul 6, 2011, 3:45:06 PM
to pystat...@googlegroups.com
On Wed, Jul 6, 2011 at 2:59 PM, John Marino
<jdma...@alumni.princeton.edu> wrote:
> Just to circle back:  I tried and succeeded using Wes's roll-your-own (vs
> OLS's rolling regression) method.
>
> My code has a few more bells & whistles, but it follows his format.  I ran a
> few speed tests on a day that is 5.5 hours long (330 minutes, with an
> observation each minute -- I omit the first and last half hour of the US
> trading day).  Since the RoY (roll your own) is 1x/day, I naively expected a
> speedup of 2 orders of magnitude.  What I got was a 7x speedup (1 order of
> magnitude), but this is still very good.  Clearly, though, the rolling OLS
> is highly optimized (cython?  using Josef's suggested optimization?).

I guess you still have quite a bit of overhead in the pandas loop and
some redundant calculations.
Implemented as an internal function, removing some of the overhead and
redundant calculations could speed it up some more.

(There is quite a difference between using convenient general tools
and coding for a special case.)

> When Nobs = 116490 (about 16 months * 22 days/month * 5.5 hours/day * 60
> min/hour), the standard rolling OLS took 22.3 sec on my old machine, but RoY
> took 3.2 sec.
> When I shift to 5-min intervals (66 in a day, down from 330 -- much easier
> for rolling OLS to compete) Nobs = 23298, rolling OLS took 4.3 sec (a factor
> of 5, just as with Nobs) and RoY took 1.9 sec.
> Attached is a .png file (matplotlib!) whose bottom panel shows the rolling
> 1-min OLS beta and the daily RoY beta converging at the end of each day, as
> they should.

I think you mixed up the colors in the legend.

Do the values fully agree at the end of a day or are they slightly
off? It's difficult to tell in the plot, but there seem to be small
differences.

Josef


John Marino

Jul 6, 2011, 4:42:46 PM
to pystat...@googlegroups.com
> removing some of the overhead and some redundant calculations could
> speed it up some more.

There is definitely extra processing in the roll-your-own that makes the comparison imperfect (but I need those calculations); still, the comparison is illustrative.

> Do the values fully agree at the end of a day or are they slightly
> off? It's difficult to tell in the plot, but there seem to be small
> differences.

I see what you see on the graphs, but a programmatic (as opposed to visual) comparison of the end-of-day betas says they are the same.  (I checked.)

-- John
