Rolling OLS linear regression

Meegan Gower

Sep 13, 2019, 2:02:29 PM

to pystatsmodels

Hi, when is the rolling OLS function likely to be released? Looks like great work is happening behind the scenes. Will it include a grouping functionality?

Kevin Sheppard

Sep 13, 2019, 11:39:29 PM

to pystatsmodels

Should be this fall. What do you mean by grouping?

Meegan Gower

Sep 18, 2019, 5:05:05 AM

to pystatsmodels

Thanks for the response Kevin, looking forward to the release date!

In finance there is often a need to run a rolling multiple linear regression for each company separately, so we would need something like a groupby() to apply the regression per company code.

There are a few posts out there with similar issues:

https://stackoverflow.com/questions/56457085/rolling-regression-by-group-in-pandas-dataframe

https://groups.google.com/forum/#!topic/pydata/s30sBxTQRvU

I do have another issue of needing to get two values from the regression and add them into two new columns on a row by row basis. But that may be a different question altogether.

Kevin Sheppard

Sep 18, 2019, 5:29:38 AM

to pystatsmodels

I think that it is easy to write a thin wrapper around RollingOLS that would look something like:

x_cols = ['const', 'Mkt', 'SMB', 'HML']
def fn(df):
    return RollingOLS(df['y'], df[x_cols], window=60).fit().params

and then you could use

df.groupby('PERMNO').apply(fn)

Which should give you an output df with a MultiIndex by permno and date. (Untested)

I think adding explicit grouping is outside the scope for sm.

Kevin Sheppard

Sep 18, 2019, 5:30:56 AM

to pystatsmodels

> I do have another issue of needing to get two values from the regression and add them into two new columns on a row by row basis. But that may be a different question altogether.

This has the same solution, use a small wrapper function that does the insertion before returning a DataFrame (with additional columns).

Meegan Gower

Sep 18, 2019, 12:56:48 PM

to pystatsmodels

I am a bit new to the world of Python...

Would something like this work?

df.loc[(df.date.isin(pd.date_range(start='1/1/1979',
                                   end=dt.today(),
                                   freq='xxx'))), 'new_value'] = df.groupby('PERMNO').apply(fn)

Richard Rymer

Sep 18, 2019, 2:18:17 PM

to pystat...@googlegroups.com

I think this:

df.loc[(df.date.isin(pd.date_range(start='1/1/1979',
                                   end=dt.today(),
                                   freq='xxx'))), 'new_value'] = df.groupby('PERMNO').apply(fn)

would return an error, because the size of the result from df.groupby.apply would differ from the size of the selection made by df.loc. The assignment also depends on the index and columns of df, which the values returned by df.groupby.apply would need to match.

Here are a couple points related to generating models in bulk that might help:

Getting two values out of the same apply is relatively simple with the right data structures. I typically use this syntax to get new columns "in bulk" after applying a function and getting multiple values assembled into a list:

def fn(a):
    return [a+1, a+2]

df[['new1', 'new2']] = df[a].apply(fn, result_type='expand', axis=1)

I frequently use this method to get the lower CI, mean, and upper CI from a prediction. It also works for "auto expanding" from model.params, for example. This requires that the relevant model be available on a row by row basis or accessible via a dataframe (or whatever) that is available from within the namespace of the input dataframe.
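
A small runnable version of that pattern, as a sketch: the columns a and b and the arithmetic are placeholders. With axis=1 the function receives one row at a time, and result_type='expand' turns the returned list into separate columns.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})


def fn(row):
    # Return two values per row; each becomes its own column after expansion.
    return [row["a"] + 1, row["a"] + 2]


df[["new1", "new2"]] = df.apply(fn, axis=1, result_type="expand")
print(df)
```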

On the issue of actually training the models in bulk, one for each firm for example, I can attest that a groupby.apply(fn) works well, though I choose to parse out various data points into a Series that is automagically inserted into columns in a new dataframe, for example:

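from pandas import Series
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.tools.eval_measures import rmse
from scipy.stats import pearsonr, probplot
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def get_group_fit(group):
    # Hold out 10% of each group's rows for out-of-sample checks.
    train, test = train_test_split(group, test_size=0.1)
    try:
        rlm_model = sm.RLM(train[<endog>], train[<exog>]).fit()
    except ValueError:
        return None
    rlm_y_test_results = rlm_model.predict(test[<exog>])
    rlm_test_rsquared = r2_score(test[<endog>], rlm_y_test_results)
    rlm_test_rmse = rmse(test[<endog>], rlm_y_test_results)
    # probplot returns ((osm, osr), (slope, intercept, r)); fit[2] is r.
    resid, fit = probplot(rlm_model.wresid)
    rlm_normal_rsquared = fit[2]**2
    yresid_corr, ycorr_p = pearsonr(rlm_model.fittedvalues, rlm_model.wresid)
    dw = durbin_watson(rlm_model.wresid)
    return Series({'test rsquared': rlm_test_rsquared,
                   'test rmse': rlm_test_rmse,
                   'resid dist rsquared': rlm_normal_rsquared,
                   'resid-yfit corr': yresid_corr,
                   'durbin-watson': dw,
                   'significance': rlm_model.f_pvalue,
                   'model type': 'rlm',
                   'model instance': rlm_model})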

which can be applied simply through df.groupby(<cols>).apply(get_group_fit). The input dataframe contains your actual data. The resulting dataframe has the grouping cols (firm name, for example) as the index and columns labeled according to the returned Series. This even works on Spark DataFrames with one additional wrapper.
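
To show the shape of the result without the placeholders, here is a reduced sketch of the same Series-per-group pattern. It swaps the RLM fit for a simple np.polyfit line as a stand-in model, and the firm labels, column names, and summary fields are invented for the example.

```python
import numpy as np
import pandas as pd

# Synthetic data: three hypothetical firms, 20 rows each, true slope 2.0.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "firm": np.repeat(["A", "B", "C"], 20),
    "x": rng.normal(size=60),
})
df["y"] = 2.0 * df["x"] + rng.normal(scale=0.1, size=60)


def get_group_fit(group):
    # Stand-in model: a straight-line least-squares fit per group.
    slope, intercept = np.polyfit(group["x"], group["y"], deg=1)
    resid = group["y"] - (slope * group["x"] + intercept)
    # Each key of the returned Series becomes a column of the result.
    return pd.Series({"slope": slope,
                      "intercept": intercept,
                      "resid std": resid.std(),
                      "n obs": len(group)})


summary = df.groupby("firm")[["x", "y"]].apply(get_group_fit)
print(summary)
```

Each group contributes one row, indexed by the grouping key, which is why this pattern suits per-firm summaries but not per-date rolling output.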

--
You received this message because you are subscribed to the Google Groups "pystatsmodels" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pystatsmodel...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pystatsmodels/7dc7b4c8-9584-4adf-8598-112a2ec6859c%40googlegroups.com.

--

Richard Rymer, PhD

Principal Data Scientist

Verizon Digital Media Services

Traffic and Performance Management Team

P: 720 509 9685

Meegan Gower

Sep 23, 2019, 2:34:14 PM

to pystatsmodels

Thanks for the responses. I have got the multiple return values for a function now.

However, I'm still having an issue with groupby. I have been trying a couple of things, but groupby always returns a single row for each grouped set, and the calculations do not overlap (clearly what you want in most settings). I have also tried 'grouper' as an option with a frequency, but this limits each calculation to the dates within that frequency bucket. I need the rolling regression to cross over the date groups but not the PERMNO groups. So if I am calculating a rolling regression for each month/quarter that is based on a full year's worth of data, grouping by PERMNO and date doesn't work.

rolling.apply(fn) limits to one output from the function and doesn't seem to solve my problem.

Am I missing something?

Richard Rymer

Sep 23, 2019, 3:49:57 PM

to pystat...@googlegroups.com

Correct me if I am wrong, but it sounds like the groupby is causing the rolling OLS to take place within the individual windows, which obviously isn't what you want. I think you want to let the rolling OLS function take care of the grouping by date, per its design. To do this, I would simplify the groupby and return a Series containing a rolling ols model for each PERMNO. That would look something like:

def get_rolling_ols(group):
    endog = group[<y-column>]
    exog = group[<features>]
    rols = RollingOLS(endog, exog, window=60).fit()
    return Series({'model': rols})

models = data.groupby('permno').apply(get_rolling_ols)

The above will return a DataFrame with index permno and a single column: 'model'. To get predictions, you would join this dataframe back to your data and then access the appropriate coefficients from the params of the model. It would have been nice, though, if RollingRegressionResults had implemented a get_prediction method similar to the one on RegressionResultsWrapper for other model types.
