Problems with zeros in pandas rolling window computations?


Harm Schütt

May 21, 2016, 1:39:39 PM
to PyData
Dear all, 

I have a quick question about whether I am doing something wrong when using the rolling methods in pandas; I couldn't find much on this anywhere.
The issue arises when computing rolling sums, standard deviations, etc., and the whole window contains only zeros.
For example: 

import numpy as np
import pandas as pd

sequence = [3.6630, 2.1860, -1.6470, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
df = pd.DataFrame(sequence, columns=["x"])
df["x_sum"] = df["x"].rolling(10).sum()
df["x_npsum"] = df["x"].rolling(10).agg({"x": np.sum})
df["x_npvol"] = df["x"].rolling(10).agg({"x": np.std})
df["x_vol"] = df["x"].rolling(10).std()
df["something_divided_by_vol"] = 1/df["x_vol"]
print(df)

on my machine yields

        x         x_sum       x_npsum       x_npvol         x_vol  \
0   3.663           NaN           NaN           NaN           NaN   
1   2.186           NaN           NaN           NaN           NaN   
2  -1.647           NaN           NaN           NaN           NaN   
3   0.000           NaN           NaN           NaN           NaN   
4   0.000           NaN           NaN           NaN           NaN   
5   0.000           NaN           NaN           NaN           NaN   
6   0.000           NaN           NaN           NaN           NaN   
7   0.000           NaN           NaN           NaN           NaN   
8   0.000           NaN           NaN           NaN           NaN   
9   0.000  4.202000e+00  4.202000e+00  1.458427e+00  1.458427e+00   
10  0.000  5.390000e-01  5.390000e-01  9.105647e-01  9.105647e-01   
11  0.000 -1.647000e+00 -1.647000e+00  5.208271e-01  5.208271e-01   
12  0.000  2.220446e-16  2.220446e-16  1.216675e-08  1.216675e-08   
13  0.000  2.220446e-16  2.220446e-16  1.216675e-08  1.216675e-08   
14  0.000  2.220446e-16  2.220446e-16  1.216675e-08  1.216675e-08   

    something_divided_by_vol  
0                        NaN  
1                        NaN  
2                        NaN  
3                        NaN  
4                        NaN  
5                        NaN  
6                        NaN  
7                        NaN  
8                        NaN  
9               6.856701e-01  
10              1.098220e+00  
11              1.920023e+00  
12              8.219124e+07  
13              8.219124e+07  
14              8.219124e+07

Now, rows 12, 13, and 14 should produce 0.0 for the sum and standard deviation computations; instead they return very small numbers. And when I scale something by its standard deviation, rows 12, 13, and 14 obviously blow up. Btw, if I simply print("{0:.20f}".format(np.sum(np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])))), I get 0.00000000000000000000.

Does anyone know what I am doing wrong here? I'm on a Windows 10 machine with essentially the latest Anaconda distribution (Python 3.5.1, pandas 0.18.1, numpy 1.10.4, numexpr 2.5.2, mkl 11.3.3).

best regards
Harm

Goyo

May 21, 2016, 6:42:57 PM
to PyData
I do not think you are doing anything wrong; it looks like a bug to me.

Goyo

John E

May 21, 2016, 7:25:18 PM
to PyData
I believe it's because rolling statistics are calculated in an incremental way. For example, the row-12 sum is not calculated as the sum of rows 3 to 12; rather, it takes the existing rolling sum from row 11, subtracts row 2, and adds row 12. I assume it's done like that for speed, but when you do things like that, seemingly weird precision issues can sneak in.
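
Here's a toy version of that scheme (my own sketch, not the actual pandas code) that reproduces the effect:

import numpy as np

def rolling_sum_incremental(x, window):
    # Add-one/drop-one rolling sum: reuse the previous window's total
    # instead of re-summing the whole window at every step.
    out = np.full(len(x), np.nan)
    acc = 0.0
    for i, v in enumerate(x):
        acc += v                    # value entering the window
        if i >= window:
            acc -= x[i - window]    # value leaving the window
        if i >= window - 1:
            out[i] = acc
    return out

x = np.array([3.663, 2.186, -1.647] + [0.0] * 12)
print(rolling_sum_incremental(x, 10)[-1])  # a tiny residue like 2.22e-16, not 0.0
print(np.sum(x[-10:]))                     # recomputed from scratch: 0.0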

Btw, I don't know that from reading the numpy/pandas code; I've convinced myself of it through some testing. See here: http://stackoverflow.com/questions/28952118/pandas-rolling-sum-weird-inaccuracy/28956610#28956610

Anyway, I don't know that this is really a bug; with floating-point numbers, the burden is sometimes on the user to decide when 0.0 and 2.220446e-16 are the same thing. I would suggest just rounding the results here.
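
For instance (an illustrative workaround; the number of decimals is an arbitrary threshold, just keep it well below your data's scale):

# Round away the ~1e-16 residue before using the results downstream.
df["x_vol"] = df["x"].rolling(10).std().round(10)
df["something_divided_by_vol"] = 1 / df["x_vol"]  # inf where the std is truly 0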

Jaime Fernández del Río

May 21, 2016, 11:09:03 PM
to pyd...@googlegroups.com
John is mostly right in his diagnosis. For those who care, the magic happens in a Cython rolling-sum routine inside pandas.

NumPy does some tricks to prevent this type of roundoff error from growing out of control; in particular, it uses pairwise summation (a templated C function), which is not really well suited to the rolling-sum problem.
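
The idea in toy Python form (NumPy's real version is a blocked, partially unrolled C loop, so this is only a sketch):

def pairwise_sum(x):
    # Split the input, sum each half recursively, then combine.
    # Roundoff grows roughly O(log n) instead of the O(n) of a
    # naive left-to-right running sum.
    n = len(x)
    if n <= 8:  # small base case: plain left-to-right sum
        s = 0.0
        for v in x:
            s += v
        return s
    mid = n // 2
    return pairwise_sum(x[:mid]) + pairwise_sum(x[mid:])

print(pairwise_sum([0.1] * 1000))  # close to 100.0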


Pandas could use Kahan summation, but it would likely make things substantially slower (I'm guessing 2-3x slower for rolling sum), which may not be a desirable tradeoff.
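
A minimal sketch of what that could look like (hypothetical code, not anything pandas ships):

import numpy as np

def kahan_rolling_sum(x, window):
    out = np.full(len(x), np.nan)
    s = 0.0  # running sum
    c = 0.0  # compensation: the low-order bits lost by the last update
    def update(value):
        nonlocal s, c
        y = value - c      # re-inject the previously lost bits
        t = s + y
        c = (t - s) - y    # capture what just got lost in s + y
        s = t
    for i, v in enumerate(x):
        update(v)                   # value entering the window
        if i >= window:
            update(-x[i - window])  # value leaving the window
        if i >= window - 1:
            out[i] = s
    return out

x = np.array([3.663, 2.186, -1.647] + [0.0] * 12)
print(kahan_rolling_sum(x, 10))  # trailing windows now at (or very near) 0.0

The extra arithmetic per element is where the 2-3x slowdown would come from.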


Harm Schütt

May 26, 2016, 9:32:53 AM
to PyData
Thank you all, 

that makes things clear. I stumbled upon this while porting some old analysis code from another language to Python. I'll think about just rounding the rolling output then, or maybe I'll keep that part of the analysis out of pandas; I don't know yet. It just seems like a place where mistakes can easily creep into an analysis if one isn't careful. Of course, one should always be careful; I know.

Thanks again for the quick and helpful answers!
Harm

Goyo

May 27, 2016, 2:32:27 AM
to PyData


On Thursday, May 26, 2016, at 15:32:53 (UTC+2), Harm Schütt wrote:
[...] I'll think about just rounding the rolling output then, or maybe I'll keep that part of the analysis out of pandas.

It seems like you can improve the accuracy by using .apply() instead of .agg(). I guess it will be slower, though. (As an aside, the standard deviations below differ from your earlier .std() output because .apply(np.std) really calls np.std, which defaults to the population std with ddof=0, while .agg({"x": np.std}) was mapped to pandas' own sample std with ddof=1; that is why your x_npvol and x_vol columns were identical.)

sequence = [3.6630, 2.1860, -1.6470, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
df = pd.DataFrame(sequence, columns=["x"])

df["x_sum"] = df["x"].rolling(10).apply(np.sum)
df["x_vol"] = df["x"].rolling(10).apply(np.std)

df["something_divided_by_vol"] = 1/df["x_vol"]
print(df)

        x  x_sum     x_vol  something_divided_by_vol
0   3.663    NaN       NaN                       NaN
1   2.186    NaN       NaN                       NaN
2  -1.647    NaN       NaN                       NaN
3   0.000    NaN       NaN                       NaN
4   0.000    NaN       NaN                       NaN
5   0.000    NaN       NaN                       NaN
6   0.000    NaN       NaN                       NaN
7   0.000    NaN       NaN                       NaN
8   0.000    NaN       NaN                       NaN
9   0.000  4.202  1.383586                  0.722760
10  0.000  0.539  0.863838                  1.157625
11  0.000 -1.647  0.494100                  2.023882
12  0.000  0.000  0.000000                       inf
13  0.000  0.000  0.000000                       inf
14  0.000  0.000  0.000000                       inf

Harm Schütt

May 27, 2016, 9:48:50 AM
to PyData
Nice! Execution speed is not a major concern for me right now, so this works. I was trying to combine multiple rolling computations for several variables into one method chain (group, roll, agg, etc.), simply for the sake of readability, so I was set on .agg(). It never occurred to me to check whether .apply() would behave differently.
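
Something like this, just to illustrate the kind of chain I mean ("firm" and "ret" are made-up names, and if I remember right, .groupby().rolling() chains only work as of pandas 0.18.1):

import numpy as np
import pandas as pd

# Made-up example data; the column names are hypothetical.
df = pd.DataFrame({
    "firm": ["A"] * 15 + ["B"] * 15,
    "ret": np.random.randn(30),
})

# Group, roll, aggregate in one chain, then round away float residue.
rolled = (
    df.groupby("firm")["ret"]
      .rolling(10)
      .agg([np.sum, np.std])
      .round(12)
)
print(rolled.head(12))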

Thank you for the help!

Jaime Fernández del Río

May 27, 2016, 1:09:25 PM
to pyd...@googlegroups.com
Jeff opened an issue to have this fixed using Kahan summation. I am on a business trip this week and the next, so I don't think I will get to it until I return home. But this instability should be fixed shortly.

Jaime
