Sorry about the slow reply.
Making IxSet speedier would be an excellent task for happstack 0.7,
which will focus heavily on the persistence layer.
As you have noted, the current IxSet implementation basically rebuilds
the IxSet after each operation -- which is not great for efficiency.
It seems to me that the reason we have to do that is that all the
indices are stored in separate maps. We basically have a separate
data structure for each index, so when we do a query on one of
those indices, we have no way of knowing what is valid or invalid in
the other maps, and we have to recreate them from scratch. Under the
hood, Data.Map is basically just a balanced tree. (See the comments at
the top of the docs for links to the relevant papers:
http://hackage.haskell.org/packages/archive/containers/0.3.0.0/doc/html/Data-Map.html)
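For concreteness, here is a toy sketch of the one-map-per-index design (illustrative names only, not the real happstack API), showing why a query has to rebuild every index map from scratch:

```haskell
-- A toy version of the one-map-per-index design (illustrative names,
-- not the real happstack API). Data.Map comes from the containers
-- package that ships with GHC.
import qualified Data.Map as Map
import           Data.Map (Map)

data Pair = Pair { pChar :: Char, pInt :: Int } deriving (Eq, Ord, Show)

-- One separate Map per index: every update has to touch both.
data ToyIxSet = ToyIxSet
  { byChar :: Map Char [Pair]
  , byInt  :: Map Int  [Pair]
  } deriving Show

emptyToy :: ToyIxSet
emptyToy = ToyIxSet Map.empty Map.empty

insertToy :: Pair -> ToyIxSet -> ToyIxSet
insertToy p (ToyIxSet mc mi) =
  ToyIxSet (Map.insertWith (++) (pChar p) [p] mc)
           (Map.insertWith (++) (pInt  p) [p] mi)

-- Querying one index gives a plain list of matches; to get a ToyIxSet
-- back we rebuild *both* maps from scratch -- the inefficiency above.
queryChar :: Char -> ToyIxSet -> ToyIxSet
queryChar c ix =
  foldr insertToy emptyToy (Map.findWithDefault [] c (byChar ix))
```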
I am wondering if the best solution is to construct IxSet so that it
only has one tree under the hood. For example, something like the
kd-tree data structure,
http://en.wikipedia.org/wiki/Kd-tree
In a Map k v, at each node of the tree the 'k' is examined to
determine whether values go to the current node, or to the left or
right.
But what if we extended it so that instead of examining just a single
index 'k', we examined all the indices? Well, one problem with
examining all the indices at every node is that we have many possible
directions to go. If we have two indices, for example (Char, Int),
then we no longer have a clear idea of ordering. For example, if we
have the three elements:
('b', 2) ('a', 3) ('c', 1)
If our root node is ('b', 2), should we insert ('a', 3) to the left
or the right? If we look at just the Char part and insert to the left,
then we can efficiently look up values by the Char index, but we would
have to do an exhaustive search for the Int values, since they were
not considered when building the tree. Or perhaps we need four
directions instead of two: instead of left vs. right, we have ll, lg,
gl, gg -- short for (less, less), (less, greater), (greater, less),
(greater, greater).
Then if we had ('b', 2) and we insert ('a', 3), it would be added at
lg, since 'a' < 'b' and 3 > 2.
Now if we are searching on just the Char index, we find values that
are less than a given Char by taking the union of ll and lg, and
values that are greater by taking the union of gl and gg.
Likewise, if we are searching on just the Int index, we take the union
of ll and gl, or of lg and gg.
It also works well if we are searching for an index that is a (Char,
Int): to determine which way to go next, we compare the Char and the
Int to the current node, and that tells us whether to go ll, lg, gl,
or gg.
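A minimal sketch of this four-way node for exactly two indices (all names made up; by the convention here, "equal" keys fall on the "greater" side):

```haskell
-- A four-way node for exactly two indices (Char, Int). The subtrees
-- correspond to ll, lg, gl, gg; "equal" keys go to the "greater" side.
data Tree4 = Leaf4
           | Node4 (Char, Int) Tree4 Tree4 Tree4 Tree4  -- ll lg gl gg
           deriving Show

insert4 :: (Char, Int) -> Tree4 -> Tree4
insert4 x Leaf4 = Node4 x Leaf4 Leaf4 Leaf4 Leaf4
insert4 x@(c, i) (Node4 y@(c', i') ll lg gl gg)
  | c < c'    = if i < i' then Node4 y (insert4 x ll) lg gl gg
                          else Node4 y ll (insert4 x lg) gl gg
  | otherwise = if i < i' then Node4 y ll lg (insert4 x gl) gg
                          else Node4 y ll lg gl (insert4 x gg)

-- Searching on just the Char index: keys below c' live in ll and lg,
-- keys at or above c' live in gl and gg, so half the subtrees can be
-- dropped at every node.
lookupChar4 :: Char -> Tree4 -> [(Char, Int)]
lookupChar4 _ Leaf4 = []
lookupChar4 c (Node4 y@(c', _) ll lg gl gg)
  | c < c'    = lookupChar4 c ll ++ lookupChar4 c lg
  | otherwise = [y | c == c'] ++ lookupChar4 c gl ++ lookupChar4 c gg
```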
The downside of this method is that the data structure depends on the
number of indices we have. If we want to have 3 indices, then we have
8 possible directions. For n indices we have 2^n possible directions.
If your data structure has 16 indices, you are going to have 65536
directions. GHC is probably not designed to cope with a data type with
that many constructors. Plus, it means we have to have unique
datatypes and query functions for every different size 'n'.
But maybe we can do something similar. We stick with a binary tree
(with values stored in the nodes and the leaves), but at each node we
change which index we are looking at, cycling through all the indices
as we descend.
So let's say we have a tree with the root element (b, 2).
Next we want to insert (a, 3). Since we are at the first level of the
tree, we look at the first index. Since a < b, we insert it to the
left.
Then we want to insert (a, 4). We start again at the root of the tree,
so we look at the first index a < b, and insert to the left. But since
there is already a child at the second level we must do another check.
Since we are at the second level, we look at the second index. 4 > 3,
so we insert to the right.
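The walkthrough above could be sketched like this (illustrative names, hard-coded to a (Char, Int) element type; even depths compare the Char, odd depths the Int, and "equal" goes right):

```haskell
-- Level-cycling insert hard-coded to (Char, Int) elements. Even depths
-- compare the Char index, odd depths the Int index; "equal" goes right.
data KdTree = KdLeaf | KdNode (Char, Int) KdTree KdTree deriving Show

kdInsert :: (Char, Int) -> KdTree -> KdTree
kdInsert = go 0
  where
    go _ v KdLeaf = KdNode v KdLeaf KdLeaf
    go d v@(c, i) (KdNode y@(c', i') l r)
      | smaller   = KdNode y (go (d + 1) v l) r
      | otherwise = KdNode y l (go (d + 1) v r)
      where smaller = if even d then c < c' else i < i'
```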
Now, let's say we want to find all the values with index 'b'. We start
at the root node. Since it is level 1 of the tree, we are looking at
the first index. Every value whose Char index is less than 'b' went to
the left of the root, so all the 'b' values must be at the root or on
the right-hand side of the tree.
Here is the useful part, though: as we attempt to build a new tree
that contains only 'b' values, we simply need to prune the existing
tree. For example, we know that none of the values to the left of the
root node (in this case) can appear in the resulting IxSet, because
none of them have index 'b'. Also, the ordering of the Int indices in
the pruned tree will still be correct. If we just did pruning, the
resulting IxSet might be out of balance, but the ordering information
it contains would still be correct. And if we want to rebalance it,
that would still be faster than rebuilding from scratch, since we
already have correct ordering information.
Let's say we want to search for the index '4' instead.
At level 1, only the Char index was considered, so values with a 4
could be on the right or the left. What we need to do is search both
the right and left subtrees, and then concat the results of those
searches.
At the second level, we do compare the Int indexes. Here we do know
whether values lie to the left or the right, because this level
discriminates on the Int index, so we can prune a whole branch of the
tree.
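Putting the query side together, here is a self-contained sketch of searching on a single index (type and insert repeated for completeness; names illustrative): prune where the level discriminates on that index, search both subtrees where it does not.

```haskell
-- Level-cycling tree over (Char, Int): even depths split on the Char,
-- odd depths on the Int; "equal" goes right.
data KdTree = KdLeaf | KdNode (Char, Int) KdTree KdTree deriving Show

kdInsert :: (Char, Int) -> KdTree -> KdTree
kdInsert = go 0
  where
    go _ v KdLeaf = KdNode v KdLeaf KdLeaf
    go d v@(c, i) (KdNode y@(c', i') l r)
      | smaller   = KdNode y (go (d + 1) v l) r
      | otherwise = KdNode y l (go (d + 1) v r)
      where smaller = if even d then c < c' else i < i'

-- Search on the Char index: at even depths the Char was the
-- discriminator, so one subtree can be pruned; at odd depths it was
-- not, so both subtrees are searched and the results concatenated.
lookupChar :: Char -> KdTree -> [(Char, Int)]
lookupChar c = go 0
  where
    go _ KdLeaf = []
    go d (KdNode y@(c', _) l r)
      | even d    = [y | c == c']
                    ++ (if c < c' then go (d + 1) l else go (d + 1) r)
      | otherwise = [y | c == c'] ++ go (d + 1) l ++ go (d + 1) r

-- Symmetric search on the Int index, pruning at odd depths instead.
lookupInt :: Int -> KdTree -> [(Char, Int)]
lookupInt i = go 0
  where
    go _ KdLeaf = []
    go d (KdNode y@(_, i') l r)
      | odd d     = [y | snd y == i]
                    ++ (if i < i' then go (d + 1) l else go (d + 1) r)
      | otherwise = [y | snd y == i] ++ go (d + 1) l ++ go (d + 1) r
```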
This data structure also works nicely for deletions: we only have to
prune (and possibly rebalance) a single tree instead of rebuilding
everything from scratch. And insertions are easy too.
In a dynamically typed language, this tree would be easy to implement,
since there is no problem with having each node contain an index of a
different type.
The problem is a bit trickier in Haskell, because we do have to tell
the type system what is going on. The solution is probably to use
existentials or dynamics.
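A minimal sketch of the existential route (SomeKey and compareKey are hypothetical names): wrap each stored key in an existential, and use Data.Typeable's cast to recover the concrete type at query time, where Nothing means the query key has the wrong type.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
-- Hypothetical names: a key of some unknown (but Ord-erable, Typeable)
-- type, plus a comparison against a query key of a known type.
import Data.Typeable (Typeable, cast)

data SomeKey = forall k. (Typeable k, Ord k) => SomeKey k

-- Nothing means the query key's type does not match the stored key.
compareKey :: (Typeable k, Ord k) => k -> SomeKey -> Maybe Ordering
compareKey q (SomeKey k) = fmap (compare q) (cast k)
```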
Anyway, I have definitely fudged some details here, but hopefully I
got the idea across. A simple first step would be to try to create an
implementation of IxSet which just has:
data IxSet = ..
insert :: a -> IxSet a -> IxSet a
fromList :: IxSet a -> [a]
(@=) :: IxSet a -> k -> IxSet a
That should be enough to get the basics of the type worked out.
What do you think?
- jeremy
> --
> You received this message because you are subscribed to the Google Groups "HAppS" group.
> To post to this group, send email to ha...@googlegroups.com.
> To unsubscribe from this group, send email to happs+un...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/happs?hl=en.
>
>
toList :: IxSet a -> [a], not fromList. A fromList would be nice as
well, but it is just a fold using insert.
- jeremy
-- a sample ixset
ix1 :: IxSet (Char, Int)
ix1 = fromList [ ('b',2), ('a',3), ('a',4), ('c',3), ('b',1), ('c',2)
               , ('b',3), ('a',2), ('c',1), ('a',1) ]
-- some queries
-- yields: fromList [('b',1),('b',3),('b',2)]
q1 = ix1 @= 'b'
-- yields: fromList [('a',2),('b',2),('c',2)]
q2 = ix1 @= (2 :: Int)
-- yields: fromList [('b',2)]
q3 = ix1 @= 'b' @= (2 :: Int)
-- won't type check because there is no instance, Index String (Char, Int)
-- q3 = ix1 @= "foo"
Okay, so it actually looks a lot like what we have now. Which is good!
One interesting addition is that you actually get a compile time error
if you try to query an invalid index.
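One way such a compile-time check could fall out is a multi-parameter Index class; this is only a sketch over a plain list standing in for the real internals (class and method names hypothetical):

```haskell
{-# LANGUAGE MultiParamTypeClasses, FlexibleInstances #-}
-- Hypothetical Index class: "type a can be queried by key type k".
class Index k a where
  project :: a -> k

instance Index Char (Char, Int) where project = fst
instance Index Int  (Char, Int) where project = snd

-- Toy (@=) over a plain list standing in for the IxSet internals.
(@=) :: (Eq k, Index k a) => [a] -> k -> [a]
xs @= k = filter (\x -> project x == k) xs
```

With these instances, a query like ix1 @= "foo" fails to type-check, since there is no Index String (Char, Int) instance.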
The interesting changes are, of course, the internals. In q3, we have
an intermediate ixset, but that ixset is not completely rebuilt from
scratch; it is built by pruning the old ixset. So it should be pretty
efficient.
It would be cool if someone could do it right :)
- jeremy
Nice work on the kd-tree. Do you have any thoughts on rebalancing? As
far as I know there's no efficient way to rebalance a kd-tree:
essentially you're stuck with rebuilding the entire tree by choosing a
median element as the root.
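That full rebuild could be sketched like this (illustrative, hard-coded to two dimensions): sort on the current axis, take the median as the root, recurse on the halves.

```haskell
import Data.List (sortBy)
import Data.Ord  (comparing)

-- Median rebuild for two dimensions: a full rebuild, not an
-- incremental rebalance. The axis cycles with the depth.
data Kd = KdLeaf | KdNode (Char, Int) Kd Kd deriving Show

buildKd :: [(Char, Int)] -> Kd
buildKd = go 0
  where
    go _ [] = KdLeaf
    go d xs = KdNode m (go (d + 1) ls) (go (d + 1) rs)
      where
        axis         = if even d then comparing fst else comparing snd
        (ls, m : rs) = splitAt (length xs `div` 2) (sortBy axis xs)
```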
Peter
> Hi, Jeremy.
>
> Nice work on the kd-tree. Do you have any thoughts on rebalancing?
> As far as
> I know there's no efficient way to rebalance a kd-tree: essentially
> you're stuck
> with rebuilding the entire tree by choosing a median element as the
> root.
No idea. I was first going to read the papers linked to from the
Data.Map docs. But it sounds like there may be some additional issues
here.
I see there is something called a bkd-tree which aims to address the
rebalancing issue, but I have not looked at the paper.
- jeremy
Well I guess it depends on the typical use case. I'm satisfied with a
data structure that supports fast O(log n) lookups, since they occur
far more often than updates, so even if the occasional "rebalancing"
(= rebuilding) takes O(n log log n) (according to Wikipedia), it might
not be too bad.
> I see there is something called a bkd-tree which aims to address the
> rebalancing issue, but I have not looked at the paper.
Thanks for the hint, I'll take a look at it.
Peter
Inserts, lookups, deletes and some range queries are already working
--- implementing tree traversal functions in Haskell is real fun. :)
I'm currently adding the balancing part, which is a bit more involved,
as we have to repeatedly choose median points.
I'll put the repository on the web if anyone's interested.
Peter
Sorry I have not responded, I am probably not going to be able to look
at this until next week due to other commitments.
I think we are going to be able to make a 0.6 release in September. So
we should focus on what changes we want to have for that release.
- jeremy