roly: updates?

J. David Lee

unread,

May 29, 2011, 7:54:09 PM5/29/11

to bottl...@googlegroups.com

I was just wondering if there has been any more work done on the moving
median codes, or if there is anything I can do to help out.

Thanks,

David

Keith Goodman

unread,

May 29, 2011, 8:00:42 PM5/29/11

to bottl...@googlegroups.com

On Sun, May 29, 2011 at 4:54 PM, J. David Lee <jo...@cs.wisc.edu> wrote:
> I was just wondering if there has been any more work done on the moving
> median codes, or if there is anything I can do to help out.

The double heap implementation still has the bug that you found. I
haven't been able to fix it. It would make me really HAPPY if someone
could figure out what is going on.

Here's some debug info copy and pasted from another thread:

I added some debugging print statements. In the following example the
buffer that holds the small values is not always sorted:

>> a = np.arange(10).astype('float64')
>> window = 7
>> import roly
>> roly.doubleheap.move_median(a, window)

# This one is good
Small Heap (length=4)
3.000000 2.000000 1.000000 0.000000
Big Heap (length=3)
4.000000 5.000000 6.000000
Median: 3.000000

# Note that 4.0 should be the first value of the small heap; output is wrong
Small Heap (length=4)
3.000000 2.000000 1.000000 4.000000
Big Heap (length=3)
5.000000 7.000000 6.000000
Median: 3.000000

# This one is good
Small Heap (length=4)
5.000000 2.000000 3.000000 4.000000
Big Heap (length=3)
6.000000 7.000000 8.000000
Median: 5.000000

# This one is good
Small Heap (length=4)
6.000000 5.000000 3.000000 4.000000
Big Heap (length=3)
7.000000 9.000000 8.000000
Median: 6.000000

array([ nan, nan, nan, nan, nan, nan, 3., 3., 5., 6.])

The problem seems to occur when the old, outgoing value is the last
element in the heap.

Keith Goodman

unread,

May 30, 2011, 6:46:26 PM5/30/11

to bottl...@googlegroups.com

I found another double heap implementation of a moving window median:
http://ideone.com/XPbl6

It is slower than Richard's but it gives the right output:

>> a = np.random.randn(100000)
>> timeit roly.doubleheap.move_median(a, 10)
100 loops, best of 3: 4.77 ms per loop
>> timeit roly.doubleheap2.move_median(a, 10)
100 loops, best of 3: 6.45 ms per loop
>>
>> timeit roly.doubleheap.move_median(a, 100)
100 loops, best of 3: 6.33 ms per loop
>> timeit roly.doubleheap2.move_median(a, 100)
100 loops, best of 3: 11.9 ms per loop
>>
>> timeit roly.doubleheap.move_median(a, 1000)
100 loops, best of 3: 7.33 ms per loop
>> timeit roly.doubleheap2.move_median(a, 1000)
100 loops, best of 3: 16.3 ms per loop
>>
>> timeit roly.doubleheap.move_median(a, 10000)
100 loops, best of 3: 10.3 ms per loop
>> timeit roly.doubleheap2.move_median(a, 10000)
10 loops, best of 3: 21.7 ms per loop

It gives the correct output:

>> m0 = roly.slow.move_median(a, 23)
>> m2 = roly.doubleheap2.move_median(a, 23)
>> (m0[23:] == m2[23:]).all()
True

It would be nice to have the speed of Richard's implementation. But I
can't figure out the cause of the bug :(

I added the second double heap implementation to the roly repo. I also
added a Makefile. To build roly, just do:

$ make all

in the roly/roly directory.

J. David Lee

unread,

May 31, 2011, 4:02:26 PM5/31/11

to bottl...@googlegroups.com

Here is another double heap implementation I wrote this afternoon in an
effort to understand how it works. If anyone is interested in wrapping
it with Cython and benchmarking, I'd appreciate it!

Also, I don't have a function to free memory in this code, so it
currently leaks memory.

Thanks,

David

new_double_heap.c

Keith Goodman

unread,

May 31, 2011, 7:40:41 PM5/31/11

to bottl...@googlegroups.com

On Tue, May 31, 2011 at 1:02 PM, J. David Lee <jo...@cs.wisc.edu> wrote:

> Here is another double heap implementation I wrote this afternoon in an
> effort to understand how it works. If anyone is interested in wrapping it
> with Cython and benchmarking, I'd appreciate it!
>
> Also, I don't have a function to free memory in this code, so it currently
> leaks memory.

Nice work!

I cython wrapped it and committed it to the repo. I took the liberty
of copy and pasting the copyright from your clinkedlist.c into your
double heap code.

It is faster than doubleheap2! (Yours is doubleheap3; Richard's is doubleheap):

>> a = np.random.randn(10000000)
>> timeit roly.doubleheap.move_median(a, 10)
1 loops, best of 3: 513 ms per loop
>> timeit roly.doubleheap2.move_median(a, 10)
1 loops, best of 3: 676 ms per loop
>> timeit roly.doubleheap3.move_median(a, 10)
1 loops, best of 3: 513 ms per loop
>>
>> timeit roly.doubleheap.move_median(a, 100)
1 loops, best of 3: 663 ms per loop
>> timeit roly.doubleheap2.move_median(a, 100)
1 loops, best of 3: 1.15 s per loop
>> timeit roly.doubleheap3.move_median(a, 100)
1 loops, best of 3: 733 ms per loop
>>
>> timeit roly.doubleheap.move_median(a, 1000)
1 loops, best of 3: 764 ms per loop
>> timeit roly.doubleheap2.move_median(a, 1000)
1 loops, best of 3: 1.66 s per loop
>> timeit roly.doubleheap3.move_median(a, 1000)
1 loops, best of 3: 989 ms per loop
>>
>> timeit roly.doubleheap.move_median(a, 10000)
1 loops, best of 3: 936 ms per loop
>> timeit roly.doubleheap2.move_median(a, 10000)
1 loops, best of 3: 2.3 s per loop
>> timeit roly.doubleheap3.move_median(a, 10000)
1 loops, best of 3: 1.54 s per loop

>> a = np.random.randn(1000)
>> m0 = roly.slow.move_median(a, 23)
>> m3 = roly.doubleheap3.move_median(a, 23)
>> (m0[23:] == m3[23:]).all()
True

I think we should use your implementation. Does that sound like a good
plan? Maybe clean it up a bit by grouping the public functions
together, add a function to free memory, and perhaps change
dh_create(window) to return a pointer so that I can remove the
cython.address(dh) from the cython wrapping (I don't know enough to
know if that would speed things up but there's always hope):

for i in range(window):
y[i] = np.nan
cdef double_heap dh = dh_create(window)
for i in range(window):
dh_insert_init(cython.address(dh), i, a[i])
dh_init_median(cython.address(dh))
y[window-1] = dh_median(cython.address(dh))
for i in range(window, n):
dh_update(cython.address(dh), a[i])
y[i] = dh_median(cython.address(dh))
return y

Keith Goodman

unread,

May 31, 2011, 7:47:20 PM5/31/11

to bottl...@googlegroups.com

Oh, and return the mean when the window is even. I know this is
prototype code; I'm just trying to make a list of what would need to
be done if we decide to use your implementation. If I add it to
bottleneck then it will get unit testing, doc, maintenance, will work
on 1d, 2d, 3d input, etc.

J. David Lee

unread,

Jun 1, 2011, 12:23:23 AM6/1/11

to bottl...@googlegroups.com

I think the organization and some things like naming could be improved.
I'd also be interested in hearing any suggestions for making the code
faster. I've tried a few things, such as inlining, and had some success,
but I'd like to get it closer to the original doubleheap in speed.

David

Keith Goodman

unread,

Jun 1, 2011, 1:02:53 PM6/1/11

to bottl...@googlegroups.com

On Tue, May 31, 2011 at 9:23 PM, J. David Lee <jo...@cs.wisc.edu> wrote:

> I think the organization and some things like naming could be improved. I'd
> also be interested in hearing any suggestions for making the code faster.
> I've tried a few things, such as inlining, and had some success, but I'd
> like to get it closer to the original doubleheap in speed.

I don't really understand the algorithm or implementation. So just
making some wild guesses here. It looks like when the new value enters
the wrong heap (e.g. enters the small heap but should be in big heap)
Richard does the swap (rebalance) first then sorts the heaps. You sort
the heap it enters first then sort it again after the swap.

J. David Lee

unread,

Jun 1, 2011, 5:59:53 PM6/1/11

to bottl...@googlegroups.com

I've checked in some revisions to the double heap code, and I've
attached a plot of the timings of the different algorithms.

David

heaps.png

Keith Goodman

unread,

Jun 1, 2011, 6:08:53 PM6/1/11

to bottl...@googlegroups.com

It could be that the bug in DH1 is giving it an unfair advantage.

What problem size did you benchmark? If the problem is quick to solve
then timeit will do many loops and the memory leak in DH3 will slow
things down.

J. David Lee

unread,

Jun 1, 2011, 6:56:08 PM6/1/11

to bottl...@googlegroups.com

I think that may be right. I looked at random arrays of 1,000,000
elements. I also have the code to fix the memory leak, I'll put that in
tonight.

David

Keith Goodman

unread,

Jun 1, 2011, 8:57:17 PM6/1/11

to bottl...@googlegroups.com

I hooked up your code on the cython side to free memory. That allowed
me to modify your nice benchmark to look at the time to initialize the
heaps. Plot attached.

I assume the extra overhead is because DH3 uses a for loop in the
initialization but DH1 doesn't? Is it possible to use a single array
to store everything? Perhaps that would make indexing faster too since
the memory for one element would be in one place? I bet the speed of
an implementation of this algorithm depends heavily on memory access.
But I'm taking wild guesses again.

overhead.png

Keith Goodman

unread,

Jun 1, 2011, 9:00:59 PM6/1/11

to bottl...@googlegroups.com

I meant two arrays, not one.

DH1:

w = malloc(sizeof *w);
w->nodes = malloc(nw * sizeof *(w->nodes));

DH3:

mm->nodes = malloc(len * sizeof(struct mm_node*));
for(i = 0; i < len; ++i) {
mm->nodes[i] = malloc(sizeof(struct mm_node));
}

J. David Lee

unread,

Jun 2, 2011, 1:18:45 AM6/2/11

to bottl...@googlegroups.com

I'm don't know what code you were using for benchmarking, but try the
latest commit and see if that's any better. I changed how the
initialization was done, changed the heap branching factor from 2 to 8,
and did some manual loop unrolling.

Regarding the initialization, one issue with keeping an array of nodes
is that swapping the nodes involves copying all of their members, while
swapping node pointers involves less work. In the end the only way to
tell is to do the test, but I don't think I'll have the time to do that
any time soon. If you'd like to give it a try, I'd be interested in
seeing the results.

Thanks,

David

J. David Lee

unread,

Jun 2, 2011, 9:45:14 AM6/2/11

to bottl...@googlegroups.com

I made a few more changes, and was able to eliminate the malloc call in
the loop during init. It's quite a bit faster - thanks for the suggestion.

David

heaps.png

Keith Goodman

unread,

Jun 2, 2011, 12:02:45 PM6/2/11

to bottl...@googlegroups.com

On Thu, Jun 2, 2011 at 6:45 AM, J. David Lee <jo...@cs.wisc.edu> wrote:

> I made a few more changes, and was able to eliminate the malloc call in the
> loop during init. It's quite a bit faster - thanks for the suggestion.

Wow! It is wicked fast. I extended the test (but did not commit) to
the point where DH3 is faster than DH1!

I made a change so that now the mean is returned if the window is
even. I didn't want to get in the way of your optimizations so I made
separate mm_get_median_odd and mm_get_median_even functions and used a
function pointer on the cython side. Problem is that it makes the code
on the cython side too messy. So I'll try to move it to the C side
when I get a chance and if I can figure out how.

Seems like the big speed up came from increasing NUM_CHILDREN. I don't
know what it does but I tried increasing it by adding more cases to
the switch statements. I got a segfault. Oh well.

J. David Lee

unread,

Jun 2, 2011, 1:34:08 PM6/2/11

to bottl...@googlegroups.com

I had that problem as well. I think it has to do with the compiler
screwing up the jump table when the number of cases is beyond 8. I'm
sure there is a way around it, but I haven't figured it out.

The double heap algorithm works by keeping two heaps, one for the values
smaller than the median, and one for the values larger than the median.
A heap is a partially sorted data structure: in a min-heap, all of a
node's children are larger than the node. When we insert a new value, we
have to move it up or down the heap, but never across. This allow us to
ignore the vast majority of values when we're moving a new node to its
final position.

With fewer children, it's faster to move a node down the tree because we
only have to search through a few children to find the one largest one
(to swap places with it), but the tree is deeper. With more children per
node, the tree is not as deep, but it's more expensive to find the
largest child of a node because they all need to be looped through.

There are a lot of variables that will determine what the best branching
factor should be, including loop bounds checking overhead, function call
overhead, etc.

I've been trying to replace the switch statement with a Duff's device
macro which would allow more children, but I haven't yet had any success.

David

Keith Goodman

unread,

Jun 2, 2011, 2:17:28 PM6/2/11

to bottl...@googlegroups.com

On Thu, Jun 2, 2011 at 10:34 AM, J. David Lee <jo...@cs.wisc.edu> wrote:
> On 06/02/2011 11:02 AM, Keith Goodman wrote:

>> Seems like the big speed up came from increasing NUM_CHILDREN. I don't
>> know what it does but I tried increasing it by adding more cases to
>> the switch statements. I got a segfault. Oh well.
>>
>
> I had that problem as well. I think it has to do with the compiler screwing
> up the jump table when the number of cases is beyond 8. I'm sure there is a
> way around it, but I haven't figured it out.

As a test I replaced the switch statements with

for (i = num_children; i > 0; i--) {
child = heap[i_first + i - 1];
if(child->val < node->val)
node = child;
}

and then increased NUM_CHILDREN. Still got a segfault. So the switch
statement might not be the problem.

J. David Lee

unread,

Jun 2, 2011, 7:31:22 PM6/2/11

to bottl...@googlegroups.com

That should all be fixed now. There was a problem with a pointer, and I
still don't know why it worked when the number of children was 2, 4, or
8, but not otherwise. In any event, it's fixed.

I've made a few more speed improvements - all that I can think of, so I
think this code will have to do, so long as no bugs are found.

David

Keith Goodman

unread,

Jun 2, 2011, 7:59:54 PM6/2/11

to bottl...@googlegroups.com

Excellent! The code is in very good shape. And fast!

I uncommented one line in doubleheap3.pyx:

#y[window-1] = mm_get_median(mm)

Removing the function pointer nonsense I added will make it much
easier to use. Thanks.

Reply all

Reply to author

Forward