groupby().apply() vs rolling().apply() and rolling regression


Spencer Ogden

Oct 20, 2016, 10:28:24 AM
to PyData
Based on a few blog posts, it seems the community has yet to come up with a canonical way to do rolling regression now that pandas.ols() is deprecated. The functionality that seems to be missing is the ability to perform a rolling apply on multiple columns at once.

It would seem that rolling().apply() gets you close, and allows the user to call statsmodels or scipy in a wrapper function to run the regression on each rolling chunk. However, DataFrame.rolling() only operates on one column at a time, so not enough information is passed with each chunk to run any sort of multivariate function. The rolling().corr() and rolling().cov() functions appear to be very specialised, but I confess I haven't dug too far into the code.
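For the single-regressor case, rolling().cov() is already enough, because the least-squares slope of y on x is cov(x, y) / var(x). The sketch below assumes made-up data and hypothetical names (x, y, window); it is one possible workaround, not a replacement for the old pandas.ols() interface, and it does not extend to several regressors.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration: y is roughly 2 * x plus a little noise.
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=50))
y = 2.0 * x + pd.Series(rng.normal(scale=0.1, size=50))

window = 10

# Slope of y on x within each window: cov(x, y) / var(x).
# Both default to ddof=1, so the normalisation cancels.
beta = x.rolling(window).cov(y) / x.rolling(window).var()

# Intercept follows from the window means.
alpha = y.rolling(window).mean() - beta * x.rolling(window).mean()

print(pd.DataFrame({"alpha": alpha, "beta": beta}).tail())
```

The first window - 1 rows are NaN because the window is not yet full.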

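The behavior of rolling().apply() differs from groupby().apply(). With groupby, the function receives each group as a whole DataFrame and can return a variety of structures depending on your intention. With rolling, the function receives a single column's window as an ndarray and has to return a single value.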
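One workaround that gets a DataFrame chunk into the function is to roll over row positions rather than values, and use those positions to slice the original frame. This is a minimal sketch: rolling_frame_apply and window_stat are hypothetical names, and the raw=True argument assumes a pandas version that has it (older versions always passed an ndarray). The function can still only return one scalar per window.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": range(10), "b": range(10, 20)})

def window_stat(frame):
    # Any function of a whole DataFrame chunk that returns one scalar.
    return (frame["a"] * frame["b"]).sum()

def rolling_frame_apply(frame, window, func):
    # Roll over integer row positions instead of column values, then use
    # the positions in each window to slice the full DataFrame.
    positions = pd.Series(np.arange(len(frame)), index=frame.index)
    return positions.rolling(window).apply(
        lambda pos: func(frame.iloc[pos.astype(int)]), raw=True
    )

result = rolling_frame_apply(df, 2, window_stat)
print(result)
```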

Am I missing a way to do rolling calculations in a way similar to groupby? This Stack Overflow answer provides a very clear way to do this using groupby, but with the new rolling() structure, one would think there should be a more straightforward solution.

http://stackoverflow.com/questions/39501277/efficient-python-pandas-stock-beta-calculation-on-many-dataframes
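For comparison, a plain loop over window slices already gives groupby-like chunks (each one a whole DataFrame) and can return several coefficients per window. The sketch below is not the code from the linked answer; rolling_ols and the column names are hypothetical, and it uses numpy.linalg.lstsq rather than statsmodels to stay self-contained.

```python
import numpy as np
import pandas as pd

def rolling_ols(frame, y_col, x_cols, window):
    """Fit y ~ const + x_cols by least squares on each full window."""
    labels, coefs = [], []
    for end in range(window, len(frame) + 1):
        chunk = frame.iloc[end - window:end]  # a whole DataFrame, as with groupby
        X = np.column_stack(
            [np.ones(window), chunk[x_cols].to_numpy(dtype=float)]
        )
        y = chunk[y_col].to_numpy(dtype=float)
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        labels.append(chunk.index[-1])  # label each fit by the window's last row
        coefs.append(coef)
    return pd.DataFrame(coefs, index=labels, columns=["const"] + list(x_cols))

# Made-up, noise-free data so the recovered coefficients are easy to check.
rng = np.random.default_rng(1)
data = pd.DataFrame({"x1": rng.normal(size=30), "x2": rng.normal(size=30)})
data["y"] = 1.0 + 2.0 * data["x1"] - 3.0 * data["x2"]

params = rolling_ols(data, "y", ["x1", "x2"], window=10)
print(params.head())
```

This is a Python-level loop, so it will be slower than the built-in rolling aggregations on long series.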

To illustrate the differences in how rolling() and groupby() chunk their data, I've included some example code below. You'll see that the rolling() chunks only contain one column as an ndarray, which loses a lot of context.

Perhaps there is something I'm missing, or this is already on the roadmap.

Regards,
Spencer


import pandas

df = pandas.DataFrame({'a': range(10), 'b': range(10, 20), 'c': sorted(list(range(5)) + list(range(5)))})
df


   a   b  c
0  0  10  0
1  1  11  0
2  2  12  1
3  3  13  1
4  4  14  2
5  5  15  2
6  6  16  3
7  7  17  3
8  8  18  4
9  9  19  4

chunk = 1
def mysum(a):
    global chunk
    print("chunk",chunk)
    chunk += 1
    print(a)
    return a.sum()

df.groupby('c').apply(mysum)
chunk 1
   a   b  c
0  0  10  0
1  1  11  0
chunk 2
   a   b  c
0  0  10  0
1  1  11  0
chunk 3
   a   b  c
2  2  12  1
3  3  13  1
chunk 4
   a   b  c
4  4  14  2
5  5  15  2
chunk 5
   a   b  c
6  6  16  3
7  7  17  3
chunk 6
   a   b  c
8  8  18  4
9  9  19  4
Out[149]:
    a   b  c
c
0   1  21  0
1   5  25  2
2   9  29  4
3  13  33  6
4  17  37  8

chunk = 1
df.rolling(window=2).apply(mysum)

chunk 1
[ 0.  1.]
chunk 2
[ 1.  2.]
chunk 3
[ 2.  3.]
chunk 4
[ 3.  4.]
chunk 5
[ 4.  5.]
chunk 6
[ 5.  6.]
chunk 7
[ 6.  7.]
chunk 8
[ 7.  8.]
chunk 9
[ 8.  9.]
chunk 10
[ 10.  11.]
chunk 11
[ 11.  12.]
chunk 12
[ 12.  13.]
chunk 13
[ 13.  14.]
chunk 14
[ 14.  15.]
chunk 15
[ 15.  16.]
chunk 16
[ 16.  17.]
chunk 17
[ 17.  18.]
chunk 18
[ 18.  19.]
chunk 19
[ 0.  0.]
chunk 20
[ 0.  1.]
chunk 21
[ 1.  1.]
chunk 22
[ 1.  2.]
chunk 23
[ 2.  2.]
chunk 24
[ 2.  3.]
chunk 25
[ 3.  3.]
chunk 26
[ 3.  4.]
chunk 27
[ 4.  4.]
Out[148]:
      a     b    c
0   NaN   NaN  NaN
1   1.0  21.0  0.0
2   3.0  23.0  1.0
3   5.0  25.0  2.0
4   7.0  27.0  3.0
5   9.0  29.0  4.0
6  11.0  31.0  5.0
7  13.0  33.0  6.0
8  15.0  35.0  7.0
9  17.0  37.0  8.0



