A big change to moving window functions


Keith Goodman

Jan 5, 2015, 4:06:06 PM
to bottl...@googlegroups.com
I am currently rewriting bottleneck from the ground up. At the moment I'm taking a look at the moving window functions.

The moving window functions currently come in pairs (except move_median): for example, move_sum and move_nansum. The former requires that there are no NaNs in the window (else it will return NaN for that particular moving window); the latter requires at least one non-NaN.

But what if you want to require 3 or more non-NaNs? Or any other number in [1, window width]? Pandas handles this with a parameter called minp (min periods). Well, I've copied that (calling it minc for min count) in this branch:

https://github.com/kwgoodman/bottleneck/tree/rewrite_move_nmin

In other words I've replaced the old move_sum and move_nansum with a new move_sum that takes an extra parameter. Fewer functions, more functionality.
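
To make the proposed semantics concrete, here is a slow pure-Python sketch (the helper name is hypothetical, and this is not bottleneck's Cython implementation; the parameter is called minc in this message and renamed min_count later in the thread):

```python
import numpy as np

def move_sum_ref(a, window, min_count=None):
    """Reference (slow) version of the proposed move_sum.

    out[i] is the sum of the non-NaN values in a[i-window+1:i+1],
    or NaN if fewer than min_count of them are non-NaN.
    min_count=None means min_count=window, which reproduces the
    old move_sum; min_count=1 reproduces the old move_nansum.
    """
    if min_count is None:
        min_count = window
    a = np.asarray(a, dtype=np.float64)
    out = np.full(a.shape, np.nan)
    for i in range(window - 1, a.size):
        win = a[i - window + 1:i + 1]
        ok = ~np.isnan(win)
        if ok.sum() >= min_count:
            out[i] = win[ok].sum()
    return out
```

For example, with a = [1, 2, nan, 4] and window=2, the default (full window) gives [nan, 3, nan, nan], while min_count=1 gives [nan, 3, 2, 4].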

I'd like to merge this very soon, so complain (or just comment) now.

I also rewrote the slow versions of the moving window functions, dropping support for the filter and strides methods and only providing the python loop method.


Stephan Hoyer

Jan 5, 2015, 4:43:19 PM
to bottl...@googlegroups.com, bottl...@googlegroups.com
On Monday, Jan 5, 2015 at 10:06 PM, Keith Goodman <kwgo...@gmail.com> wrote:
In other words I've replaced the old move_sum and move_nansum with a new move_sum that takes an extra parameter. Fewer functions, more functionality.
I think this is a good idea. The current implementation of move_nansum is particularly confusing now that nansum (in numpy 1.9) returns 0 for all nan arrays. I suppose I could get that new version with minc=0? It's better to require users to specify what they want. 
I would, however, suggest that min_count would be a better name, one that is actually self-descriptive. Both are equally fast to type in a world of autocomplete but I would struggle to guess the meaning of minc.

I also rewrote the slow versions of the moving window functions, dropping support for the filter and strides methods and only providing the python loop method.
I'm a little sad to see the strides method go -- I thought it was nice to have that method around, at least as a reference point. But I suppose there's not really any need for it now that the moving window functions are n-dimensional.

Keith Goodman

Jan 5, 2015, 5:49:00 PM
to bottl...@googlegroups.com
On Mon, Jan 5, 2015 at 1:43 PM, Stephan Hoyer <sho...@gmail.com> wrote:

On Monday, Jan 5, 2015 at 10:06 PM, Keith Goodman <kwgo...@gmail.com> wrote:
In other words I've replaced the old move_sum and move_nansum with a new move_sum that takes an extra parameter. Fewer functions, more functionality.
I think this is a good idea. The current implementation of move_nansum is particularly confusing now that nansum (in numpy 1.9) returns 0 for all nan arrays. I suppose I could get that new version with minc=0? It's better to require users to specify what they want. 


Hm. I didn't allow minc to be zero. I guess move_sum is the only one that has a default value even if there are no data. Should I try to add minc=0 capability to all the moving window functions?
 
I would, however, suggest that min_count would be a better name, one that is actually self-descriptive. Both are equally fast to type in a world of autocomplete but I would struggle to guess the meaning of minc.

Great idea. I pushed the change to github.
 
I also rewrote the slow versions of the moving window functions, dropping support for the filter and strides methods and only providing the python loop method.
I'm a little sad to see the strides method go -- I thought it was nice to have that method around, at least as a reference point. But I suppose there's not really any need for it now that the moving window functions are n-dimensional.


On the bright side, slow.move.py went from 1225 lines to 94 lines. And scipy is no longer an (optional) dependency.

Keith Goodman

Jan 6, 2015, 12:12:12 PM
to bottl...@googlegroups.com
On Mon, Jan 5, 2015 at 2:49 PM, Keith Goodman <kwgo...@gmail.com> wrote:
On Mon, Jan 5, 2015 at 1:43 PM, Stephan Hoyer <sho...@gmail.com> wrote:


On Monday, Jan 5, 2015 at 10:06 PM, Keith Goodman <kwgo...@gmail.com> wrote:
In other words I've replaced the old move_sum and move_nansum with a new move_sum that takes an extra parameter. Fewer functions, more functionality.
I think this is a good idea. The current implementation of move_nansum is particularly confusing now that nansum (in numpy 1.9) returns 0 for all nan arrays. I suppose I could get that new version with minc=0? It's better to require users to specify what they want. 


Hm. I didn't allow minc to be zero. I guess move_sum is the only one that has a default value even if there are no data. Should I try to add minc=0 capability to all the moving window functions?

Unit tests fail for move_min and move_max. I could fix it by adding an if statement but that would slow things down when min_count != 0. After I speed up move_{min, max} it should be easy to fix. So I'll hold off for now.

Allowing min_count=0 adds complexity to the code. And I don't think it would be used much. Most often I bet min_count=0 would be just a user error (maybe meant axis=0). Is it worth it?

Stephan Hoyer

Jan 6, 2015, 12:16:10 PM
to bottl...@googlegroups.com
On Tue, Jan 6, 2015 at 9:12 AM, Keith Goodman <kwgo...@gmail.com> wrote:
Allowing min_count=0 adds complexity to the code. And I don't think it would be used much. Most often I bet min_count=0 would be just a user error (maybe meant axis=0). Is it worth it?

Probably not :). I agree that it seems unlikely to be used much.

Keith Goodman

Jan 6, 2015, 12:28:00 PM
to bottl...@googlegroups.com
Ok, good. I'll move on and change the default value of min_count from the current -1 (a negative value means that min_count is set equal to window) to None.
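
For concreteness, the new default handling might look something like this (a hypothetical helper, not bottleneck's actual internals):

```python
def normalize_min_count(min_count, window):
    # None, the new default, means "require a full window",
    # replacing the old sentinel where any negative value meant
    # min_count = window.
    if min_count is None:
        return window
    if not 1 <= min_count <= window:
        raise ValueError("min_count must be in the interval [1, window]")
    return min_count
```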

Keith Goodman

Jan 6, 2015, 3:14:49 PM
to bottl...@googlegroups.com
I merged the rewrite_move_nmin branch into rewrite. Thanks for the comments.

Stephan Hoyer

Jan 8, 2015, 1:32:28 AM
to bottl...@googlegroups.com
Hi Keith,

A related thought occurred to me: what about similarly consolidating sum and nansum into sum with the parameter min_count? This would require some additional logic, but it would not be too bad, I think.

Stephan


Keith Goodman

Jan 8, 2015, 12:30:28 PM
to bottl...@googlegroups.com
On Wed, Jan 7, 2015 at 10:32 PM, Stephan Hoyer <sho...@gmail.com> wrote:
A related thought occurred to me: what about similarly consolidating sum and nansum into sum with the parameter min_count? This would require some additional logic, but it would not be too bad, I think.

Yeah, I thought of that too. It seems like a bigger change than changing the moving window functions. I guess that is because numpy has func and nanfunc (so that is what people are used to) whereas numpy doesn't have moving window functions.

I like the consistency of having min_count in both func and move_func. And I like having fewer functions (by collapsing func and nanfunc into func) but more functionality. (Actually, bottleneck has very few funcs, mostly just nanfuncs.)

Originally I thought of bottleneck functions as replacements for numpy functions. But maybe a better goal is to make the bottleneck package as good as it can be in its own right.

The big downside, of course, is speed. Moving window functions have to keep track of NaNs (even with min_count=window) but functions like sum (which doesn't exist in bottleneck) do not need to check if each element is NaN. Adding min_count would mean checking each element, which would be slower than numpy.sum for large float input arrays.

Yeah, so I don't see it as a clear win. Maybe I could keep the name nansum but add min_count. Then at least it would not be compared speed-wise with np.sum.

Stephan Hoyer

Jan 8, 2015, 9:55:18 PM
to bottl...@googlegroups.com
On Thu, Jan 8, 2015 at 9:30 AM, Keith Goodman <kwgo...@gmail.com> wrote:
The big downside, of course, is speed. Moving window functions have to keep track of NaNs (even with min_count=window) but functions like sum (which doesn't exist in bottleneck) do not need to check if each element is NaN. Adding min_count would mean checking each element, which would be slower than numpy.sum for large float input arrays.

There's also keeping track of the non-NaN count, which is not necessary with the current nansum (although we already have that for mean, std and var).

I suspect that you could deal with the slowdown for particular values of min_count by writing fast-path versions, or even deferring directly to np.sum in some cases.

I do think it's reasonable to keep the name nansum but add the min_count parameter. The confusing bit is that to be consistent with numpy you'll need different default values of min_count for nansum (0) and nanmean (1), though I suppose even with min_count=0 the result for nanmean is still nan, because an empty slice triggers a division by zero.
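
The inconsistency described above can be seen directly in numpy (1.9 or later):

```python
import warnings
import numpy as np

a = np.array([np.nan, np.nan])

# nansum of an all-NaN slice is 0, i.e. it behaves like min_count=0
print(np.nansum(a))  # 0.0 in numpy >= 1.9

# nanmean of an all-NaN slice is nan (sum 0 divided by count 0),
# i.e. it behaves like min_count=1; numpy also emits a
# "Mean of empty slice" RuntimeWarning here
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    print(np.nanmean(a))  # nan
```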

Keith Goodman

Jan 9, 2015, 11:43:40 AM
to bottl...@googlegroups.com
So nansum, for example, would be one function for the user but three functions for the developer:

if min_count == 0 (default):
    current bn.nansum code
elif min_count is None:  # which means min_count = a.shape[axis]
    write new code for sum, which wouldn't have to check for NaNs or count them, so fast
elif 1 <= min_count <= a.shape[axis]:
    write code that checks for NaNs and counts them

The reducing functions are already the most complicated since I separately handle (1) reducing along all axes, (2) reducing along a single axis, (3) 0d input array.

Hmmm........
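
A runnable sketch of that three-way dispatch, in plain numpy (illustrative only; the function name is hypothetical and this is not the Cython internals):

```python
import numpy as np

def nansum_sketch(a, axis=None, min_count=0):
    # Three developer cases behind one user-facing function.
    a = np.asarray(a, dtype=np.float64)
    n = a.size if axis is None else a.shape[axis]
    if min_count == 0:
        # case 1 (default): current bn.nansum behavior,
        # all-NaN input sums to 0
        return np.nansum(a, axis=axis)
    if min_count is None:
        # case 2: equivalent to a plain sum over the full slice;
        # any NaN makes the count fall short, giving NaN
        min_count = n
    # case 3: count non-NaNs and mask results where the count
    # falls below min_count
    count = np.sum(~np.isnan(a), axis=axis)
    result = np.nansum(a, axis=axis)
    return np.where(count >= min_count, result, np.nan)
```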

Stephan Hoyer

Jan 9, 2015, 11:56:04 AM
to bottl...@googlegroups.com, bottl...@googlegroups.com


On Friday, Jan 9, 2015 at 8:43 AM, Keith Goodman <kwgo...@gmail.com> wrote:
On Thu, Jan 8, 2015 at 6:55 PM, Stephan Hoyer <sho...@gmail.com> wrote:
So nansum, for example, would be one function for the user but three functions for the developer:

if min_count == 0 (default):
    current bn.nansum code
elif min_count is None:  # which means min_count = a.shape[axis]
    write new code for sum, which wouldn't have to check for NaNs or count them, so fast
elif 1 <= min_count <= a.shape[axis]:
    write code that checks for NaNs and counts them
Yes, but case 2 is just np.sum, and you'll only need the separate case 3 for sum -- all the other nan aggregation functions default to min_count=1 (or rather, min_count=0 is equivalent to min_count=1). And note that min_count=1 for sum is in fact what pandas uses. Maybe median would need one more special case? I'm not quite sure without looking at the source.

Keith Goodman

unread,
Jan 9, 2015, 3:40:48 PM1/9/15
to bottl...@googlegroups.com
 I hacked together a bn.nansum that uses min_count: https://github.com/kwgoodman/bottleneck/tree/rewrite_min_count

I made the function signature nansum(a, min_count, axis). But since bn.nansum(a, 0) should probably give the same output as np.nansum(a, 0) maybe nansum(a, axis, min_count) is better (which would mean changing the moving window signatures as well).

nansum was probably a poor choice to use as an example as it is coded differently from the other reduce functions. Oh, well.

Anyway, give it a try and let me know what you think. Do you think min_count would get used?