http://darcs.monoid.at/kdtree/
http://darcs.monoid.at/kdtree/dist/kdtree-0.0.0.tar.gz
(The code still needs thorough testing and the API is unstable.)
The module
http://darcs.monoid.at/kdtree/test.hs
shows the basic usage, which is quite similar to IxSet. I'm planning on
rewriting the rebalancing code as it is currently not very efficient in the
worst case (albeit, for large kd-trees the complete tree seldom needs
rebalancing). The alternative method that I have in mind should produce a
kd-tree in O(n) time that is balanced with high probability.
Feedback and suggestions highly appreciated of course! :)
Peter
On 29 August 2010 19:48, Jeremy Shaw <jer...@n-heptane.com> wrote:
> Awesome! I won't have time to look at this in detail until next weekend. But
> I am very excited about the work anyway :)
>
> - jeremy
>
> On Aug 28, 2010, at 3:20 PM, Peter Robinson wrote:
>
>> The kd-tree implementation posted by Jeremy inspired me to start
>> implementing a
>> similar data structure, an adaptive kd-tree. The main difference
>> between them is that
>> the adaptive kd-tree stores the actual data only in its leafs, which
>> allows more
>> efficient rebalancing, while still giving us O(log n) lookups and updates.
>>
>> inserts, lookups, deletes and some range queries are already working
>> --- implementing
>> tree traversal functions in Haskell is real fun. :) I'm currently adding
>> the
>> balancing part, which is a bit more involved as we have to subsequently
>> choose median points.
>>
>> I'll put the repository on the web if anyone's interested.
>>
>> Peter
>>
>> On 26 August 2010 20:52, Peter Robinson <thal...@gmail.com> wrote:
>>>>>
>>>>> Nice work on the kd-tree. Do you have any thoughts on rebalancing? As
>>>>> far
>>>>> as
>>>>> I know there's no efficient way to rebalance a kd-tree: essentially
>>>>> you're stuck
>>>>> with rebuilding the entire tree by choosing a median element as the
>>>>> root.
>>>>
>>>> No idea. I was first going to read the papers linked to from the
>>>> Data.Map
>>>> docs. But it sounds like there may be some additional issues here.
>>>
>>> Well I guess it depends on the typical use case. I'm satisfied with a
>>> data
>>> structure that supports fast O(log n) lookups, since they occur far more
>>> often than updates, so even if the occasional "rebalancing" (=rebuilding)
>>> takes O(n*loglog n) (according to Wikipedia), it might not be too bad.
>>>
>>>> I see there is something called a bkd-tree which aims to address the
>>>> rebalancing issue, but I have not looked at the paper.
>>>
>>> Thanks for the hint, I'll take a look at it.
>>>
>>> Peter
>>>
>>>> You received this message because you are subscribed to the Google
>>>> Groups
>>>> "HAppS" group.
>>>> To post to this group, send email to ha...@googlegroups.com.
>>>> To unsubscribe from this group, send email to
>>>> happs+un...@googlegroups.com.
>>>> For more options, visit this group at
>>>> http://groups.google.com/group/happs?hl=en.
>>>>
>>>>
>>>
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "HAppS" group.
>> To post to this group, send email to ha...@googlegroups.com.
>> To unsubscribe from this group, send email to
>> happs+un...@googlegroups.com.
>> For more options, visit this group at
>> http://groups.google.com/group/happs?hl=en.
>>
>
> --
> You received this message because you are subscribed to the Google Groups
> "HAppS" group.
> To post to this group, send email to ha...@googlegroups.com.
> To unsubscribe from this group, send email to
> happs+un...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/happs?hl=en.
>
>
Thanks for the hint about the test suite of ixset, I'll definitely
take a look at it.
Yes, I think kdtree could eventually be used as a drop-in replacement. Since the
query operators of ixset are quite intuitive I see no need in defining something
new.
I don't have any conclusive results about the performance yet, however, to get a
rough picture of the memory consumption and runtime of kdtree vs ixset I've ran
the following sample program twice (compiled with -O2), once for testkdtree and
once for testixset:
--------------------------------------------------------------------------------
... <snip>
testkdtree :: [Foo] -> KdTree Foo
testkdtree fs =
let tree = fromList fs 0.5 in
tree @= 'a' @< (200000::Int)
-- print $ size tree
testixset :: [Foo] -> I.IxSet Foo
testixset fs =
let ixset = I.fromList fs in
ixset I.@= 'a' I.@< (200000::Int)
main = do
r <- getStdGen
let fs = L.nub $ take 100000 (randoms r) :: [Foo]
let q = I.toList (testixset fs)
-- let q = toList (testkdtree fs)
print $ length q
--------------------------------------------------------------------------------
As you can see below:
kdtree: 97MB and ~7.4min
ixset: 170MB and ~9.2min
(the garbage colector needs to do more work for ixset.)
Considering that I've made no optimization whatsoever to the code of kdtree
yet, I think these preliminary results show that kdtree has potential. :)
Peter
********************* Using Kd-Trees:
./test2 +RTS -K500M -s
real 7m40.179s
user 7m29.081s
sys 0m0.143s
1,006,477,936 bytes allocated in the heap
748,153,712 bytes copied during GC
34,420,776 bytes maximum residency (18 sample(s))
648,440 bytes maximum slop
97 MB total memory in use (2 MB lost due to fragmentation)
Generation 0: 1902 collections, 0 parallel, 1.23s, 1.18s elapsed
Generation 1: 18 collections, 0 parallel, 1.23s, 1.28s elapsed
INIT time 0.00s ( 2.34s elapsed)
MUT time 446.62s (450.57s elapsed)
GC time 2.46s ( 2.46s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 449.08s (455.37s elapsed)
%GC time 0.5% (0.5% elapsed)
Alloc rate 2,253,556 bytes per MUT second
Productivity 99.5% of total user, 98.1% of total elapsed
********************* Using ixset:
./test2 +RTS -K500M -s
real 9m22.023s
user 9m11.681s
sys 0m0.140s
949,278,720 bytes allocated in the heap
613,518,848 bytes copied during GC
66,232,160 bytes maximum residency (14 sample(s))
23,623,664 bytes maximum slop
170 MB total memory in use (3 MB lost due to fragmentation)
Generation 0: 1562 collections, 0 parallel, 7.94s, 7.99s elapsed
Generation 1: 14 collections, 0 parallel, 0.94s, 1.22s elapsed
INIT time 0.00s ( 2.09s elapsed)
MUT time 542.80s (547.03s elapsed)
GC time 8.88s ( 9.21s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 551.68s (558.33s elapsed)
%GC time 1.6% (1.6% elapsed)
Alloc rate 1,748,840 bytes per MUT second
Productivity 98.4% of total user, 97.2% of total elapsed
Thanks for your suggestions, I've updated test2.hs accordingly and included
the results below. You'll notice that I didn't remove 'nub', as
'KdTree.fromList' doesn't handle duplicates yet (but 'KdTree.insert' does).
However, 'nub' can only dominate time complexity but not space complexity.
That said, kdtree still has many rough edges so it won't be ready for
production use any time soon.
Peter
------------------------------------------
testkdtree2 fs elems =
let tree = fromList fs 0.5 in
let results = map (\r -> tree @= r) elems in
L.length $ L.filter null results
testixset2 fs elems =
let ixset = I.fromList fs in
let results = map (\r -> ixset I.@= r) elems in
L.length $ L.filter I.null results
main = do
r <- getStdGen
let fs = L.nub $ take 100000 (randoms r) :: [Foo]
let elems = take 1000000 (randoms r) :: [Int]
print $ testixset2 fs elems
------------------------------------------
********* kdtree:
55,999,235,024 bytes allocated in the heap
650,993,792 bytes copied during GC
38,261,668 bytes maximum residency (21 sample(s))
877,248 bytes maximum slop
98 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 106838 collections, 0 parallel, 0.25s, 10.62s elapsed
Generation 1: 21 collections, 0 parallel, 0.18s, 3.75s elapsed
INIT time 0.00s ( 2.12s elapsed)
MUT time 217.29s (853.05s elapsed)
GC time 0.43s ( 14.37s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 217.72s (869.54s elapsed)
%GC time 0.2% (1.7% elapsed)
Alloc rate 257,713,610 bytes per MUT second
Productivity 99.8% of total user, 25.0% of total elapsed
real 14m33.728s
user 3m37.726s
sys 10m45.195s
********* ixset:
3,829,672,000 bytes allocated in the heap
489,897,320 bytes copied during GC
54,820,932 bytes maximum residency (13 sample(s))
10,609,220 bytes maximum slop
156 MB total memory in use (2 MB lost due to fragmentation)
Generation 0: 7260 collections, 0 parallel, 0.28s, 11.82s elapsed
Generation 1: 13 collections, 0 parallel, 0.16s, 4.63s elapsed
INIT time 0.00s ( 1.47s elapsed)
MUT time 216.63s (1218.39s elapsed)
GC time 0.45s ( 16.45s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 217.08s (1236.31s elapsed)
%GC time 0.2% (1.3% elapsed)
Alloc rate 17,678,465 bytes per MUT second
Productivity 99.8% of total user, 17.5% of total elapsed
real 20m39.725s
user 3m37.076s
sys 16m49.158s
Contributions are welcome of course. :) I've meanwhile started to work on the
new construction/rebalancing algorithm but it will take a few days until it's
done. I'm also looking into parallelizing the tree traversal, because when you
lookup a search key of a specific dimension (=index type), all the children of
nodes with a different dimension must be followed, so doing this in parallel
should give us a significant performance boost.
>> ********* kdtree:
>> 38,261,668 bytes maximum residency (21 sample(s))
>> 98 MB total memory in use (0 MB lost due to fragmentation)
>>
>> ********* ixset:
>> 54,820,932 bytes maximum residency (13 sample(s))
>> 156 MB total memory in use (2 MB lost due to fragmentation)
>
>
> I'd like to really see how it behaves when there are 10 or 20 indexes
> involved.
One thing to keep in mind is that if you have k indices and n elements, the
kd-tree only makes sense w.r.t. time complexity if n >> 2^k (not taking
parallel traversal into account).
Peter
> I have run out of ideas how to improve IxSet anyway :)
>
> --
> Gracjan
>
I also remembered a feature that would be nice to have: auto-increment.
I often have a types like this in my IxSets:
newtype FooId = FooId { unFooId :: Integer }
data Foo = Foo { fooId :: FooId, ... }
Where FooId is supposed to be a unique index.
Right now to deal with that unique index I have to have two types in my state:
data FooState = FooState { nextId :: FooId, foos :: IxSet Foo }
When I want to insert a new Foo in the set, I have to:
1. get the nextId from FooState
2. update the fooId field in the foo value
3. increment nextId
4. insert the foo value into ixset
5. save the updated ixset and nextId back in FooState
It we be a lot nicer if I could just do:
insertAuto :: (AutoIncr a) => a -> IxSet a -> (a, IxSet a)
The first elementd of the return tuple is the input 'a' with the fooId
field updated. The second element is the IxSet with the new value
inserted.
I am not really sure how easy this is to implement. But it would be a
killer feature I think..
Anyway, just something to keep in the back of our heads.
- jeremy
That auto increment feature looks interesting. I think it should be doable -
not sure yet how efficient it could be.
Meanwhile I had a go at optimizing the tree size a bit and adding parallel
tree traversal. So now running the query (see test2.hs in the repo)
tree @= r @< (r+1)
ixset I.@= r I.@< (r+1)
on a tree/ixset of 50000 'Foo' elements, 50 times with some random integer r,
yields when running on a quad core with -N4:
ixset:
2.712576s elapsed.
13,533,208 bytes maximum slop
77 MB total memory in use (1 MB lost due to fragmentation)
kdtree:
1.195291s elapsed.
204,576 bytes maximum slop
26 MB total memory in use (0 MB lost due to fragmentation)
SPARKS: 350 (344 converted, 0 pruned)
So this seems to indicate that we're going in the right direction. :)
Peter
- jeremy