Thanks,
David
The double heap implementation still has the bug that you found. I
haven't been able to fix it. It would make me really HAPPY if someone
could figure out what is going on.
Here's some debug info copy and pasted from another thread:
I added some debugging print statements. In the following example the
buffer that holds the small values is not always sorted:
>> a = np.arange(10).astype('float64')
>> window = 7
>> import roly
>> roly.doubleheap.move_median(a, window)
# This one is good
Small Heap (length=4)
3.000000 2.000000 1.000000 0.000000
Big Heap (length=3)
4.000000 5.000000 6.000000
Median: 3.000000
# Note that 4.0 should be the first value of the small heap; output is wrong
Small Heap (length=4)
3.000000 2.000000 1.000000 4.000000
Big Heap (length=3)
5.000000 7.000000 6.000000
Median: 3.000000
# This one is good
Small Heap (length=4)
5.000000 2.000000 3.000000 4.000000
Big Heap (length=3)
6.000000 7.000000 8.000000
Median: 5.000000
# This one is good
Small Heap (length=4)
6.000000 5.000000 3.000000 4.000000
Big Heap (length=3)
7.000000 9.000000 8.000000
Median: 6.000000
array([ nan, nan, nan, nan, nan, nan, 3., 3., 5., 6.])
The problem seems to occur when the old, outgoing value is the last
element in the heap.
I found another double heap implementation of a moving window median:
http://ideone.com/XPbl6
It is slower than Richard's but it gives the right output:
>> a = np.random.randn(100000)
>> timeit roly.doubleheap.move_median(a, 10)
100 loops, best of 3: 4.77 ms per loop
>> timeit roly.doubleheap2.move_median(a, 10)
100 loops, best of 3: 6.45 ms per loop
>>
>> timeit roly.doubleheap.move_median(a, 100)
100 loops, best of 3: 6.33 ms per loop
>> timeit roly.doubleheap2.move_median(a, 100)
100 loops, best of 3: 11.9 ms per loop
>>
>> timeit roly.doubleheap.move_median(a, 1000)
100 loops, best of 3: 7.33 ms per loop
>> timeit roly.doubleheap2.move_median(a, 1000)
100 loops, best of 3: 16.3 ms per loop
>>
>> timeit roly.doubleheap.move_median(a, 10000)
100 loops, best of 3: 10.3 ms per loop
>> timeit roly.doubleheap2.move_median(a, 10000)
10 loops, best of 3: 21.7 ms per loop
It gives the correct output:
>> m0 = roly.slow.move_median(a, 23)
>> m2 = roly.doubleheap2.move_median(a, 23)
>> (m0[23:] == m2[23:]).all()
True
It would be nice to have the speed of Richard's implementation. But I
can't figure out the cause of the bug :(
I added the second double heap implementation to the roly repo. I also
added a Makefile. To build roly, just do:
$ make all
in the roly/roly directory.
Also, I don't have a function to free memory in this code, so it
currently leaks memory.
Thanks,
David
> Here is another double heap implementation I wrote this afternoon in an
> effort to understand how it works. If anyone is interested in wrapping it
> with Cython and benchmarking, I'd appreciate it!
>
> Also, I don't have a function to free memory in this code, so it currently
> leaks memory.
Nice work!
I cython wrapped it and committed it to the repo. I took the liberty
of copy and pasting the copyright from your clinkedlist.c into your
double heap code.
It is faster than doubleheap2! (Yours is doubleheap3; Richard's is doubleheap):
>> a = np.random.randn(10000000)
>> timeit roly.doubleheap.move_median(a, 10)
1 loops, best of 3: 513 ms per loop
>> timeit roly.doubleheap2.move_median(a, 10)
1 loops, best of 3: 676 ms per loop
>> timeit roly.doubleheap3.move_median(a, 10)
1 loops, best of 3: 513 ms per loop
>>
>> timeit roly.doubleheap.move_median(a, 100)
1 loops, best of 3: 663 ms per loop
>> timeit roly.doubleheap2.move_median(a, 100)
1 loops, best of 3: 1.15 s per loop
>> timeit roly.doubleheap3.move_median(a, 100)
1 loops, best of 3: 733 ms per loop
>>
>> timeit roly.doubleheap.move_median(a, 1000)
1 loops, best of 3: 764 ms per loop
>> timeit roly.doubleheap2.move_median(a, 1000)
1 loops, best of 3: 1.66 s per loop
>> timeit roly.doubleheap3.move_median(a, 1000)
1 loops, best of 3: 989 ms per loop
>>
>> timeit roly.doubleheap.move_median(a, 10000)
1 loops, best of 3: 936 ms per loop
>> timeit roly.doubleheap2.move_median(a, 10000)
1 loops, best of 3: 2.3 s per loop
>> timeit roly.doubleheap3.move_median(a, 10000)
1 loops, best of 3: 1.54 s per loop
>> a = np.random.randn(1000)
>> m0 = roly.slow.move_median(a, 23)
>> m3 = roly.doubleheap3.move_median(a, 23)
>> (m0[23:] == m3[23:]).all()
True
I think we should use your implementation. Does that sound like a good
plan? Maybe clean it up a bit by grouping the public functions
together, add a function to free memory, and perhaps change
dh_create(window) to return a pointer so that I can remove the
cython.address(dh) from the cython wrapping (I don't know enough to
know if that would speed things up but there's always hope):
for i in range(window):
y[i] = np.nan
cdef double_heap dh = dh_create(window)
for i in range(window):
dh_insert_init(cython.address(dh), i, a[i])
dh_init_median(cython.address(dh))
y[window-1] = dh_median(cython.address(dh))
for i in range(window, n):
dh_update(cython.address(dh), a[i])
y[i] = dh_median(cython.address(dh))
return y
Oh, and return the mean when the window is even. I know this is
prototype code; I'm just trying to make a list of what would need to
be done if we decide to use your implementation. If I add it to
bottleneck then it will get unit testing, doc, maintenance, will work
on 1d, 2d, 3d input, etc.
I think the organization and some things like naming could be improved.
I'd also be interested in hearing any suggestions for making the code
faster. I've tried a few things, such as inlining, and had some success,
but I'd like to get it closer to the original doubleheap in speed.
David
> I think the organization and some things like naming could be improved. I'd
> also be interested in hearing any suggestions for making the code faster.
> I've tried a few things, such as inlining, and had some success, but I'd
> like to get it closer to the original doubleheap in speed.
I don't really understand the algorithm or implementation. So just
making some wild guesses here. It looks like when the new value enters
the wrong heap (e.g. enters the small heap but should be in big heap)
Richard does the swap (rebalance) first then sorts the heaps. You sort
the heap it enters first then sort it again after the swap.
It could be that the bug in DH1 is giving it an unfair advantage.
What problem size did you benchmark? If the problem is quick to solve
then timeit will do many loops and the memory leak in DH3 will slow
things down.
I think that may be right. I looked at random arrays of 1,000,000
elements. I also have the code to fix the memory leak, I'll put that in
tonight.
David
I hooked up your code on the cython side to free memory. That allowed
me to modify your nice benchmark to look at the time to initialize the
heaps. Plot attached.
I assume the extra overhead is because DH3 uses a for loop in the
initialization but DH1 doesn't? Is it possible to use a single array
to store everything? Perhaps that would make indexing faster too since
the memory for one element would be in one place? I bet the speed of
an implementation of this algorithm depends heavily on memory access.
But I'm taking wild guesses again.
I meant two arrays, not one.
DH1:
w = malloc(sizeof *w);
w->nodes = malloc(nw * sizeof *(w->nodes));
DH3:
mm->nodes = malloc(len * sizeof(struct mm_node*));
for(i = 0; i < len; ++i) {
mm->nodes[i] = malloc(sizeof(struct mm_node));
}
Regarding the initialization, one issue with keeping an array of nodes
is that swapping the nodes involves copying all of their members, while
swapping node pointers involves less work. In the end the only way to
tell is to do the test, but I don't think I'll have the time to do that
any time soon. If you'd like to give it a try, I'd be interested in
seeing the results.
Thanks,
David
> I made a few more changes, and was able to eliminate the malloc call in the
> loop during init. It's quite a bit faster - thanks for the suggestion.
Wow! It is wicked fast. I extended the test (but did not commit) to
the point where DH3 is faster than DH1!
I made a change so that now the mean is returned if the window is
even. I didn't want to get in the way of your optimizations so I made
separate mm_get_median_odd and mm_get_median_even functions and used a
function pointer on the cython side. Problem is that it makes the code
on the cython side too messy. So I'll try to move it to the C side
when I get a chance and if I can figure out how.
Seems like the big speed up came from increasing NUM_CHILDREN. I don't
know what it does but I tried increasing it by adding more cases to
the switch statements. I got a segfault. Oh well.
The double heap algorithm works by keeping two heaps, one for the values
smaller than the median, and one for the values larger than the median.
A heap is a partially sorted data structure: in a min-heap, all of a
node's children are larger than the node. When we insert a new value, we
have to move it up or down the heap, but never across. This allow us to
ignore the vast majority of values when we're moving a new node to its
final position.
With fewer children, it's faster to move a node down the tree because we
only have to search through a few children to find the one largest one
(to swap places with it), but the tree is deeper. With more children per
node, the tree is not as deep, but it's more expensive to find the
largest child of a node because they all need to be looped through.
There are a lot of variables that will determine what the best branching
factor should be, including loop bounds checking overhead, function call
overhead, etc.
I've been trying to replace the switch statement with a Duff's device
macro which would allow more children, but I haven't yet had any success.
David
>> Seems like the big speed up came from increasing NUM_CHILDREN. I don't
>> know what it does but I tried increasing it by adding more cases to
>> the switch statements. I got a segfault. Oh well.
>>
>
> I had that problem as well. I think it has to do with the compiler screwing
> up the jump table when the number of cases is beyond 8. I'm sure there is a
> way around it, but I haven't figured it out.
As a test I replaced the switch statements with
for (i = num_children; i > 0; i--) {
child = heap[i_first + i - 1];
if(child->val < node->val)
node = child;
}
and then increased NUM_CHILDREN. Still got a segfault. So the switch
statement might not be the problem.
I've made a few more speed improvements - all that I can think of, so I
think this code will have to do, so long as no bugs are found.
David
Excellent! The code is in very good shape. And fast!
I uncommented one line in doubleheap3.pyx:
#y[window-1] = mm_get_median(mm)
Removing the function pointer nonsense I added will make it much
easier to use. Thanks.