roly: double heap implementation tweaks

15 views
Skip to first unread message

Keith Goodman

unread,
May 3, 2011, 12:13:35 PM5/3/11
to bottl...@googlegroups.com, J. David Lee, Richard Harter, Wes McKinney
There are two changes that would be nice to make to the current double
heap implementation:

1. As suggested by David, don't calculate whether the window is odd or
even at each step through the data. That's a waste. An int can be
added to the struct, perhaps called odd whose value can be set at
initialization. Same can be done for the linked list. It will be
interesting to see how much difference it makes in speed.

2. To initialize the data in the struct I currently pass in a pointer
to the first element of the numpy array. The init function (C code)
then grabs as many elements as the window is wide. That's not
convenient when the input array is more than 1d. So I'd like to copy
the interface that David uses. Initialize the structure but then make
a function that fills the structure with one element at a time. We can
then fill the initial window by looping through the data on the cython
side instead of the C side.

If anyone would like to work on any of this either alone or together
or if you have ideas, let me know. It may take me a while to get to
it.

Or is it too early to optimize? Is there another algorithm that would be faster?

Keith Goodman

unread,
May 8, 2011, 4:28:40 PM5/8/11
to bottl...@googlegroups.com, J. David Lee, Richard Harter, Wes McKinney
On Tue, May 3, 2011 at 9:13 AM, Keith Goodman <kwgo...@gmail.com> wrote:
> There are two changes that would be nice to make to the current double
> heap implementation:
>
> 1. As suggested by David, don't calculate whether the window is odd or
> even at each step through the data. That's a waste. An int can be
> added to the struct, perhaps called odd whose value can be set at
> initialization. Same can be done for the linked list. It will be
> interesting to see how much difference it makes in speed.
>
> 2. To initialize the data in the struct I currently pass in a pointer
> to the first element of the numpy array. The init function (C code)
> then grabs as many elements as the window is wide. That's not
> convenient when the input array is more than 1d. So I'd like to copy
> the interface that David uses. Initialize the structure but then make
> a function that fills the structure with one element at a time. We can
> then fill the initial window by looping through the data on the cython
> side instead of the C side.

I made both of the changes above and pushed to the repo.

I think I'm pretty much done.

Wes, do you have plans to work on a skiplist-based moving median? How
do you think its performance will compare to a double heap?

Wes McKinney

unread,
Dec 28, 2011, 5:16:19 PM12/28/11
to Keith Goodman, bottl...@googlegroups.com, J. David Lee, Richard Harter, pystat...@googlegroups.com
On Mon, May 9, 2011 at 7:30 AM, Wes McKinney <wesm...@gmail.com> wrote:
> It appears the double-heap implementation has the same asymptotic
> runtime, so there may not be much point for the moment-- I might get
> to it sometime this summer but unlikely in the coming weeks.
>
> - Wes

Hey all,

Wow, it's been a long time. I had a tinker with the skiplist-based
moving median in pandas today and wrote a C implementation of the
skiplist data structure:

https://github.com/wesm/pandas/blob/master/pandas/src/skiplist.h

There is a also a Cython header (skiplist.pxd) in there so it can be
used in Cython

There's probably a better way to implement this since I'm no C expert,
but I was curious about the performance. It appears that the double
heap implementation in bottleneck is about 6 times or so faster.

arr is a length-10000 float64 array

In [5]: timeit bn.move_median(arr, 1000)
1000 loops, best of 3: 781 us per loop

In [6]: import pandas._sandbox as sbx

In [8]: timeit sbx.roll_median(arr, 100, 100)
100 loops, best of 3: 5.1 ms per loop

The C version is in turn about 7-8x faster than the current pure
Cython one (which has a "too much Python" problem).

- Wes

Wes McKinney

unread,
Dec 28, 2011, 5:23:10 PM12/28/11
to Keith Goodman, bottl...@googlegroups.com, J. David Lee, Richard Harter, pystat...@googlegroups.com

(just sent this to keith-only by accident)

Pardon me, the double-heap is about *10x* faster:

In [12]: timeit sbx.roll_median(arr, 1000, 1000)
100 loops, best of 3: 7.83 ms per loop

Keith Goodman

unread,
Jan 6, 2012, 4:48:10 PM1/6/12
to bottl...@googlegroups.com
On Wed, Dec 28, 2011 at 2:16 PM, Wes McKinney <wesm...@gmail.com> wrote:

> Wow, it's been a long time. I had a tinker with the skiplist-based
> moving median in pandas today and wrote a C implementation of the
> skiplist data structure:
>
> https://github.com/wesm/pandas/blob/master/pandas/src/skiplist.h

Nice. Now we have 4 different moving window medians: skip list, linked
list, double heap, python for loop.

Reply all
Reply to author
Forward
0 new messages