Based on a few blog posts, it seems like the community is yet to
come up with a canonical way to do rolling regression now that
pandas.ols() is deprecated. The functionality which seems to be
missing is the ability to perform a rolling apply on multiple
columns at once.
It would seem that rolling().apply() would get you close, letting
the user call statsmodels or scipy from a wrapper function to run
the regression on each rolling chunk. However, DataFrame.rolling()
only operates on one column at a time, so each chunk does not carry
enough information to run any sort of multivariate function.
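To make the gap concrete, here is a minimal sketch of what a multi-column rolling regression has to do when rolling().apply() cannot see more than one column: slice the windows by hand and fit each one with numpy.linalg.lstsq. The helper name rolling_ols, its arguments, and the demo data are hypothetical and chosen only for illustration; this is one possible workaround, not an established pandas idiom.

```python
import numpy as np
import pandas as pd


def rolling_ols(df, y_col, x_cols, window):
    """Fit y ~ intercept + x_cols on each trailing window of `window` rows.

    Hypothetical helper: returns one row of coefficients per input row,
    with NaN until a full window is available.
    """
    names = ["intercept"] + list(x_cols)
    out = pd.DataFrame(np.nan, index=df.index, columns=names)
    y = df[y_col].to_numpy(dtype=float)
    # Design matrix with a leading column of ones for the intercept.
    X = np.column_stack(
        [np.ones(len(df)), df[list(x_cols)].to_numpy(dtype=float)]
    )
    for end in range(window, len(df) + 1):
        rows = slice(end - window, end)
        coef, *_ = np.linalg.lstsq(X[rows], y[rows], rcond=None)
        out.iloc[end - 1] = coef
    return out


# Made-up, noiseless demo data so the fitted coefficients are easy to check.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x1": rng.normal(size=30), "x2": rng.normal(size=30)})
demo["y"] = 1.0 + 2.0 * demo["x1"] - 3.0 * demo["x2"]

betas = rolling_ols(demo, "y", ["x1", "x2"], window=10)
print(betas.tail(3))
```

The explicit Python loop is slow on long series, which is part of why a built-in multi-column rolling apply would be welcome.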
The rolling().corr() and rolling().cov() functions appear to be very
specialised, but I confess I haven't dug too far into the code.
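For the special case of a single regressor, those two functions may already be enough, since the OLS slope is cov(x, y) / var(x). The sketch below assumes that case only; the series x and y and the window length are made up for illustration, and it does not extend to several regressors.

```python
import numpy as np
import pandas as pd

# Made-up data: y is an exact line in x, so every full window
# should recover slope 1.5 and intercept 0.5.
rng = np.random.default_rng(1)
x = pd.Series(rng.normal(size=50))
y = 0.5 + 1.5 * x

window = 10
# Rolling slope of y on x: cov(x, y) / var(x), both with the default ddof=1.
slope = x.rolling(window).cov(y) / x.rolling(window).var()
# Rolling intercept from the window means.
intercept = y.rolling(window).mean() - slope * x.rolling(window).mean()
print(slope.dropna().head())
```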
The behavior of rolling().apply() differs from groupby().apply().
With groupby, you get a whole dataframe and can return a variety of
structures based on your intention.
Am I missing a way to do rolling calculations in a way similar to
groupby? This Stack Overflow answer provides a very clear way to do
this using groupby, but with the new rolling() structure, one would
think there should be a more straightforward solution.
http://stackoverflow.com/questions/39501277/efficient-python-pandas-stock-beta-calculation-on-many-dataframes
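One workaround that gets closer to the groupby behaviour is to roll over the row positions rather than the data, and use each window of positions to slice the full DataFrame. This is only a sketch: it assumes a pandas version where rolling().apply() accepts the raw argument, the function window_stat and the demo frame are hypothetical, and the wrapped function still has to return a single scalar per window.

```python
import numpy as np
import pandas as pd

# Made-up frame for illustration.
demo = pd.DataFrame({"a": range(10), "b": range(10, 20)})


def window_stat(frame):
    """Hypothetical multi-column statistic: sum of a*b over the window."""
    return float((frame["a"] * frame["b"]).sum())


# Roll over integer row positions; each window arrives as a float ndarray,
# so cast back to int before using it to slice the original frame.
positions = pd.Series(np.arange(len(demo)), index=demo.index)
result = positions.rolling(window=2).apply(
    lambda idx: window_stat(demo.iloc[idx.astype(int)]), raw=True
)
print(result)
```

Here window_stat receives a two-row DataFrame with both columns, much like a groupby chunk, but the one-scalar-per-window restriction means a full set of regression coefficients would need one pass per coefficient.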
As a way to illustrate the differences in how rolling() and
groupby() chunk their data, I've included some example code below.
You'll see that the rolling() chunks contain only one column as an
ndarray, which loses a lot of context.
Perhaps there is something I'm missing, or this is something in the
Roadmap.
Regards,
Spencer
import pandas
df = pandas.DataFrame({'a': range(10), 'b': range(10, 20),
                       'c': sorted(list(range(5)) + list(range(5)))})
df
   a   b  c
0  0  10  0
1  1  11  0
2  2  12  1
3  3  13  1
4  4  14  2
5  5  15  2
6  6  16  3
7  7  17  3
8  8  18  4
9  9  19  4
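chunk = 1  # counter for how many times mysum has been called
def mysum(a):
    # Print each chunk as it arrives, then return its sum.
    global chunk
    print("chunk", chunk)
    chunk += 1
    print(a)
    return a.sum()
df.groupby('c').apply(mysum)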
chunk 1
a b c
0 0 10 0
1 1 11 0
chunk 2
a b c
0 0 10 0
1 1 11 0
chunk 3
a b c
2 2 12 1
3 3 13 1
chunk 4
a b c
4 4 14 2
5 5 15 2
chunk 5
a b c
6 6 16 3
7 7 17 3
chunk 6
a b c
8 8 18 4
9 9 19 4
Out[149]:
    a   b  c
c
0   1  21  0
1   5  25  2
2   9  29  4
3  13  33  6
4  17  37  8
chunk = 1
df.rolling(window=2).apply(mysum)
chunk 1
[ 0. 1.]
chunk 2
[ 1. 2.]
chunk 3
[ 2. 3.]
chunk 4
[ 3. 4.]
chunk 5
[ 4. 5.]
chunk 6
[ 5. 6.]
chunk 7
[ 6. 7.]
chunk 8
[ 7. 8.]
chunk 9
[ 8. 9.]
chunk 10
[ 10. 11.]
chunk 11
[ 11. 12.]
chunk 12
[ 12. 13.]
chunk 13
[ 13. 14.]
chunk 14
[ 14. 15.]
chunk 15
[ 15. 16.]
chunk 16
[ 16. 17.]
chunk 17
[ 17. 18.]
chunk 18
[ 18. 19.]
chunk 19
[ 0. 0.]
chunk 20
[ 0. 1.]
chunk 21
[ 1. 1.]
chunk 22
[ 1. 2.]
chunk 23
[ 2. 2.]
chunk 24
[ 2. 3.]
chunk 25
[ 3. 3.]
chunk 26
[ 3. 4.]
chunk 27
[ 4. 4.]
Out[148]:
      a     b    c
0   NaN   NaN  NaN
1   1.0  21.0  0.0
2   3.0  23.0  1.0
3   5.0  25.0  2.0
4   7.0  27.0  3.0
5   9.0  29.0  4.0
6  11.0  31.0  5.0
7  13.0  33.0  6.0
8  15.0  35.0  7.0
9  17.0  37.0  8.0