IxSet query of query result


Gracjan Polak

Aug 13, 2010, 12:45:00 PM
to HAppS

Hi all Happstackers,

This post talks about the complexity of queries in IxSet, possible
workarounds, and a proposal for a redesign. To make things concrete,
let us consider an example:

data Foo = Foo Int String

$(inferIxSet "Foos" ''Foo 'noCalcs [''String, ''Int])

The declaration above creates a set of Foo with two indexes: one over
the Int field and one over the String field. Let us consider a query:

myset = fromList [Foo 1 "a", Foo 1 "b"]

print $ size (myset @= (1 :: Int) @= ("a" :: String))

When this is evaluated, the following happens:
1. The index over Int is used to look up 1, giving a Set of results. O(log n).
2. From that Set an IxSet is generated. O(k).
3. The index over String is evaluated. O(n log n).
4. The resulting Map is looked up, giving a Set of results. O(log n).
5. From that Set an IxSet is generated. O(k).
6. The first index (Int in this case) is evaluated. O(n log n).
7. From that index a Set is generated. O(n log n).
8. The size of that Set is taken. O(1).
9. The value is printed. Done.

There are a couple of structures involved here. Set = Data.Set. Map =
Data.Map. IxSet = [Index]. Index = Map key (Set value). n - number of
elements in set. k - number of indexes in set.
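
To make the cost model concrete, here is a minimal sketch of those
shapes in plain containers types (FooIxSet is a hypothetical
monomorphic stand-in; the real IxSet hides the key types behind
existentials):

import           Data.Map (Map)
import qualified Data.Map as Map
import           Data.Set (Set)
import qualified Data.Set as Set

-- Foo as above, plus the instances Data.Set needs:
data Foo = Foo Int String deriving (Eq, Ord, Show)

-- One index: a key projected from the value, mapped to every value
-- sharing that key.
type Ix key value = Map key (Set value)

-- Conceptually, an IxSet of Foo is one index per indexed field (k = 2):
data FooIxSet = FooIxSet (Ix Int Foo) (Ix String Foo)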

There is clearly some room for improvement here.

Option 1: Add a Data.Set as a field of IxSet.
data IxSet a = IxSet (Set a) [Map key (Set a)]

In steps 5 through 7 we see a needless and costly conversion. Just
adding a field would carry the Set from step 5 to step 7.

Option 2: Add a Data.Set as a field of IxSet and use linear search
from then on.
data IxSet a = IxSet Bool (Set a) [Map key (Set a)]

This gives the same benefit as option 1, and additionally addresses
steps 3 and 4. There we first build a whole Map just to look up one
key in it, which is O(n log n + log n). Just going over the Set would
be O(n), and that is better. The rule: only the first query uses
indexes; the ones that follow use linear search.

Option 3: Make query functions return just a Set.
(@=) :: IxSet a -> index -> Set a

This makes it explicit in the types that chaining indexed queries
does not do what you would like it to: recreating the indexes is more
costly than just walking all the elements of the set.
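
To sketch what chaining reads like under this rule (function names
here are mine, not a proposed API; Foo and the imports are reused from
the sketch above):

-- First query: O(log n) lookup in the Int index.
lookupInt :: Map Int (Set Foo) -> Int -> Set Foo
lookupInt ix k = Map.findWithDefault Set.empty k ix

-- Follow-up query: an explicit O(n) Set.filter, never an index rebuild.
bothKeys :: Map Int (Set Foo) -> Int -> String -> Set Foo
bothKeys ix n s = Set.filter (\(Foo _ s') -> s' == s) (lookupInt ix n)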

The above analysis is correct (I hope) for equality searches and for
the case where we search on indexes of different types. If you want
to take @< and @> into account, everything becomes a bit muddier.

What do you think we should do with IxSet?

--
Gracjan

Jeremy Shaw

Aug 18, 2010, 3:09:10 PM
to ha...@googlegroups.com
Hey!

Sorry about the slow reply.

Making IxSet speedier would be an excellent task for happstack 0.7,
which will focus heavily on the persistence layer.

As you have noted, the current IxSet implementation basically rebuilds
the IxSet after each operation -- which is not great for efficiency.

It seems to me that the reason we have to do that is that all the
indices are stored in separate maps. We basically have a separate
data structure for each index, so when we do a query on one of those
indices, we have no way of knowing what is valid/invalid in the other
maps, and we have to recreate them from scratch. Under the hood,
Data.Map is basically just a balanced tree. (See the comments at the
top of the docs for links to relevant papers:
http://hackage.haskell.org/packages/archive/containers/0.3.0.0/doc/html/Data-Map.html)
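
To illustrate with a simplified sketch (my own illustration, not the
actual IxSet code): once a query has produced a plain Set, every other
index has to be reconstructed by an O(n log n) pass like this one.

import           Data.Map (Map)
import qualified Data.Map as Map
import           Data.Set (Set)
import qualified Data.Set as Set

-- Rebuild one index from scratch: project the key out of every element
-- and regroup. This is the cost that chained queries keep paying.
rebuildIndex :: (Ord k, Ord a) => (a -> k) -> Set a -> Map k (Set a)
rebuildIndex key = Map.fromListWith Set.union
                 . map (\v -> (key v, Set.singleton v))
                 . Set.toList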

I am wondering if the best solution is to construct IxSet so that it
only has one tree under the hood. For example, something like the
kd-tree data structure,

http://en.wikipedia.org/wiki/Kd-tree

In a Map k v, at each node of the tree the 'k' is examined to
determine whether values belong at the current node, or to the left or
right.

But what if we extended it so that instead of examining just a single
index 'k', we examined all the indices? Well, one problem with
examining all the indices at every node is that we have many possible
directions to go. If we have two indices, for example (Char, Int),
then we no longer have a clear idea of ordering. For example, if we
have the three elements:

 ('b', 2) ('a', 3) ('c', 1)

If our root node is ('b', 2), should we insert ('a', 3) to the left or
the right? If we look at just the Char part and insert to the left,
then we can efficiently look up values by the Char index, but we would
have to do an exhaustive search for the Int values, since they were
not considered when building the tree. Or perhaps we need four
directions instead of two: instead of left vs. right, we have ll, lg,
gl, gg -- short for (less, less), (less, greater), (greater, less),
(greater, greater).

Then if we had ('b', 2) and we insert ('a', 3), it would be added at
lg, since 'a' < 'b' and 3 > 2.

Now if we are searching on just the Char index, we find the values
less than a given Char by taking the union of ll and lg, and those
greater by the union of gl and gg.

Likewise, if we are searching on just the Int index, we take the union
of ll and gl, or of lg and gg.

It also works well if we are searching on the combined (Char, Int)
index. To determine which way to go next, we compare the Char and Int
to the current node, and that tells us whether to go ll, lg, gl, or
gg.
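
A sketch of what such a four-direction node could look like for two
indices (purely illustrative, ignoring ties):

data Tree2 a b
  = Tip2
  | Node2 (a, b) (Tree2 a b) -- ll: both indices less
                 (Tree2 a b) -- lg: first less, second greater
                 (Tree2 a b) -- gl: first greater, second less
                 (Tree2 a b) -- gg: both greater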

The downside of this method is that the data structure depends on the
number of indices we have. If we want to have 3 indices, then we have
8 possible directions. For n indices we have 2^n possible directions.
If your data structure has 16 indices, you are going to have 65536
directions. GHC is probably not designed to cope with a data type
with that many constructors. Plus, it means we have to have unique
datatypes and query functions for every different size 'n'.

But maybe we can do something similar. We stick with a binary tree
(with values stored in the nodes and the leaves), but at each node we
change which index we are looking at, cycling through all the indices
over time.

So let's say we have a tree with the root element (b, 2).

Next we want to insert (a, 3). Since we are at the first level of the
tree, we look at the first index. Since a < b, we insert it to the
left.

Then we want to insert (a, 4). We start again at the root of the tree,
so we look at the first index: a < b, so we go left. But since there
is already a child at the second level, we must do another check. At
the second level we look at the second index: 4 > 3, so we insert to
the right.
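
Here is a minimal sketch of that insertion scheme for two indices (my
reading of the description above, not the attached code; ties go to
the right):

data KdTree a b = Leaf | Branch (a, b) (KdTree a b) (KdTree a b)

insertKd :: (Ord a, Ord b) => (a, b) -> KdTree a b -> KdTree a b
insertKd p = go (0 :: Int)
  where
    go _ Leaf = Branch p Leaf Leaf
    go d (Branch q l r)
      | lessAt d p q = Branch q (go (d + 1) l) r
      | otherwise    = Branch q l (go (d + 1) r)
    -- even levels compare the first index, odd levels the second
    lessAt d (a, b) (a', b')
      | even d    = a < a'
      | otherwise = b < b'

With this, insertKd ('a',4) (insertKd ('a',3) (insertKd ('b',2) Leaf))
builds exactly the tree walked through above.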

Now, let's say we want to find all the values with index 'b'. We start
at the root node. Since it is level 1 of the tree, we are looking at
the first index. Values less than 'b' (such as 'a') went left, so we
know that all the 'b' values are at the root or on the right-hand side
of the tree.

Here is the useful part though: as we attempt to build a new tree that
contains only 'b' values, we simply need to prune the existing tree.
For example, we know that none of the values to the left of the root
node (in this case) can appear in the resulting IxSet, because none of
them have 'b'. Also, the ordering of the Int indexes in the pruned
tree will still be correct. If we just did pruning, then that IxSet
might be out of balance, but the ordering information it contained
would still be correct. And if we want to rebalance it, that would
still be faster than rebuilding from scratch, since we already have
correct ordering information.

Let's say we want to search for the index '4' instead.

At level 1, only the Char index was considered, so '4' values could be
on the right or the left. What we need to do is search both the right
and the left subtrees, and then concat the results of those searches.

At the second level, we do compare the Int indexes. Here we do know
whether values lie to the left or right, because we have an Int index,
so we can easily prune a whole branch of the tree.
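
Continuing the sketch from above: a search on just the second index
descends both children at levels that compare the first index, and
prunes at levels that compare the second (again, illustrative only):

lookupSnd :: Ord b => b -> KdTree a b -> [(a, b)]
lookupSnd x = go (0 :: Int)
  where
    go _ Leaf = []
    go d (Branch q@(_, b) l r)
      | odd d     = here ++ if x < b then go (d + 1) l else go (d + 1) r
      | otherwise = here ++ go (d + 1) l ++ go (d + 1) r -- both sides
      where here = if b == x then [q] else []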

This data-structure also works nicely for deletions. We only have to
prune (and possibly rebalance) a single tree instead of rebuilding
everything from scratch. And, insertions are easy too.

In a dynamically typed language, this tree would be easy to implement,
since there is no problem with having each node contain an index of a
different type.

The problem is a bit trickier in Haskell because we do have to tell
the type system what is going on. The solution is probably to use
existentials or dynamics?
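
For example, one possible encoding (just a guess at how the
existential route might look) hides each index behind a projection
onto some orderable key, so cycling through indices becomes cycling
through a list of projections:

{-# LANGUAGE ExistentialQuantification #-}

-- An index is "some" way of projecting an orderable key out of a value.
data Proj a = forall k. Ord k => Proj (a -> k)

compareAt :: Proj a -> a -> a -> Ordering
compareAt (Proj f) x y = compare (f x) (f y)

-- The two indices of a (Char, Int), examined in alternation, level by level.
projs :: [Proj (Char, Int)]
projs = cycle [Proj fst, Proj snd]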

Anyway, I have definitely fudged some details here. But hopefully I
got the idea across. A simple first step would be to try to create an
implementation of IxSet which just has:

data IxSet a = ..
insert :: a -> IxSet a -> IxSet a
fromList :: IxSet a -> [a]
(@=) :: IxSet a -> k -> IxSet a

That should be enough to get the basics of the type worked out.

What do you think?

- jeremy


Jeremy Shaw

Aug 18, 2010, 3:10:15 PM
to ha...@googlegroups.com
Oops, that should be:

toList :: IxSet a -> [a], not fromList. A fromList would be nice as
well, but it is just a fold using insert.

- jeremy

Jeremy Shaw

Aug 18, 2010, 3:50:23 PM
to ha...@googlegroups.com
I have attached a hacked-up implementation which appears to work, but
it does not do any tree balancing. Here is a demo of what it can do:

-- a sample ixset

ix1 :: IxSet (Char, Int)
ix1 = fromList [ ('b',2), ('a',3), ('a',4), ('c',3), ('b',1), ('c',2)
               , ('b',3), ('a',2), ('c',1), ('a',1) ]

-- some queries

-- yields: fromList [('b',1),('b',3),('b',2)]
q1 = ix1 @= 'b'

-- yields: fromList [('a',2),('b',2),('c',2)]
q2 = ix1 @= (2 :: Int)

-- yields: fromList [('b',2)]
q3 = ix1 @= 'b' @= (2 :: Int)

-- won't type check because there is no instance, Index String (Char, Int)
-- q3 = ix1 @= "foo"

Okay, so it actually looks a lot like what we have now. Which is good!
One interesting addition is that you actually get a compile time error
if you try to query an invalid index.

The interesting changes are, of course, the internals. In q3, we have
an intermediate ixset, but that ixset is not completely rebuilt from
scratch: it is built by pruning the old ixset. So it should be pretty
efficient.

It would be cool if someone could do it right :)

- jeremy

ixset.hs

Gracjan Polak

Aug 19, 2010, 8:39:05 AM
to HAppS

Seems like we have a perfect long-term attack plan! That is fantastic!

I think we should first cover some obvious holes in the current IxSet
design. We will then have a test suite and a baseline to compare
against.

Then the kd-tree-based IxSet will follow. It will need some time to
mature, as always, so there will be a transition phase.

--
Gracjan

stepcut

Aug 21, 2010, 3:52:53 PM
to HAppS
On Aug 19, 7:39 am, Gracjan Polak <gracjanpo...@gmail.com> wrote:
> Seems like we have a perfect long term attack plan! That is fantastic!
>
> I think we should cover some obvious holes in current IxSet design. We
> will have a test suite and a base line to compare too.

Right. We should not skip easy improvements to the current code base,
just because we hope to replace it with something better some day.

> Then kd-map based IxSet will follow. It will need some time to mature,
> as always, so there will be transition phase.

Right. I think that will affect the choices we make now as to what
path to take.

So, here are some of my thoughts.

In your analysis, you are considering the case where you only do one
query on the intermediate IxSet. That is certainly the most common
usage pattern. But I am a bit concerned about what happens in the
other cases.

Imagine you have a cache table of some sort and you want to expire all
the entries that are older than a certain time. You might do a query
like:

let new = ixset @> cutoffTime

That is, you just do a query that returns all the values which are
newer than the cutoff time. Then you store that ixset back in your
state. It would be a real shame if all further queries against that
set were silently done in a linear manner.

Option 3 gets around that issue by making the switch from IxSet to Set
explicit. That seems better because you won't accidentally switch
states. But that would also be a pretty significant API change which
would break existing code. And if we end up with a reimplemented IxSet
type like the alternative I described, then the API change wouldn't
really be needed -- in fact, I think it would make things worse. So it
would be annoying to change it, and then change it back later.

Option 1 seems like it would be API compatible with the current
codebase. But, I wonder what the memory usage would be like? The #1
complaint about happstack-state right now is that it uses too much
RAM. I have not heard any complaints about it being too slow. So, I
think we would want to know more about how much memory overhead option
1 would add, and how much performance it would actually gain us.

I would say that in my own code, a majority of my queries are a single
query. So, the chaining issue is not a huge deal. Though, there is
still some overhead, because I ultimately need to convert the results
of the query to a list. In the cases where I do chain queries, the
first query tends to eliminate most of the values, so that the 'n' and
'k' values for the remaining queries are fairly small. Certainly,
given the current implementation, that is the most efficient way to do
it (aka, order the chain from most specific to least specific).

Here is another option we have not talked about at all, but could be
relevant to either implementation.

Right now the query operators (@=, @<, etc) are functions which do the
work immediately. But what if they were just DSL combinators that
build a query value. Then you could optimize the query before you run
it. The optimizer could note that the query contains two equals and do
something smarter. For example, it would do the map lookup for the
first key, then use the second key to filter the result set before
turning it back into an ixset. (Or maybe one of the query params would
specify if it should return a Set or an IxSet).

Here is a little, non-optimizing prototype which uses lists instead of
IxSet:

{-# LANGUAGE ExistentialQuantification, MultiParamTypeClasses, FlexibleInstances #-}

class IsKey k a where
  isKey :: k -> a -> Bool

data Query a
  = I [a]
  | forall k. (IsKey k a) => Equals k (Query a)

(@=) :: (IsKey k a) => Query a -> k -> Query a
query @= k = Equals k query

i :: [a] -> Query a
i = I

runQuery :: Query a -> [a]
runQuery (I l) = l
runQuery (Equals k l) = filter (isKey k) $ runQuery l

instance IsKey Char (Char, Int) where
  isKey k (c, _) = k == c

instance IsKey Int (Char, Int) where
  isKey k (_, i) = k == i

test :: Query (Char, Int)
test = (i [('m',1), ('m',2), ('r',2), ('e',3)]) @= 'm' @= (1 :: Int)
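
Chasing the definitions by hand: the 'm' filter keeps ('m',1) and
('m',2), and the Int filter then keeps ('m',1). So, assuming the
prototype compiles as pasted, a quick check would be:

main :: IO ()
main = print (runQuery test) -- prints [('m',1)]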


Obviously, this change breaks the API a little bit as well. However,
it might be a change we would keep with the alternative IxSet
implementation as well.

Any comments on any of my thoughts?

- jeremy

Gracjan Polak

Aug 22, 2010, 4:04:03 AM
to HAppS

Hi Jeremy,

I agree with everything in your analysis, so I'll skip commenting on
that.

I'll say a word about the DSL, which I have thought about some more.

The current code tries to optimize sequential queries on the same
index. For example, if you have code like:

let selection = myset @> cutoffTimeLow @< cutoffTimeHigh

it will use the same index twice without recreating it from scratch.
This is how I made @<>, @<=>, @<>= and @<=>= work reasonably fast. It
also applies to @+ and @* when you query on the same index.
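
The essence of such a same-index range query is two O(log n) splits of
the one Map, never touching the other indexes. A simplified sketch of
the idea (not the actual code):

import           Data.Map (Map)
import qualified Data.Map as Map
import           Data.Set (Set)

-- All values whose keys lie strictly between lo and hi, from one index.
rangeOpen :: Ord k => k -> k -> Map k (Set a) -> [Set a]
rangeOpen lo hi ix = Map.elems middle
  where
    (_, aboveLo) = Map.split lo ix      -- keep keys > lo
    (middle, _)  = Map.split hi aboveLo -- of those, keep keys < hi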

So I consider same-index queries mostly covered. Then the question
arises: what else can a DSL optimization pre-pass buy us? If you mix
index types, you need to recreate tables. The only thing I see a DSL
being good at is sorting queries by type. So instead of:

let selection = myset @> cutoffTimeLow @= myIDOfSomething @< cutoffTimeHigh

it would magically transform this into:

let selection = myset @= myIDOfSomething @> cutoffTimeLow @< cutoffTimeHigh

guessing that @= produces smaller sets and that @> and @< here use the
same index. How often does such a need arise? And are we sure the
optimizer will really make things better?

A second comment, about memory usage:

Right now I ensure that both keys and values are represented in the
set in a single memory location (unless the compiler does something
strange to my data). Therefore the memory cost of an index is just the
size of the Map key (Set value) spine (structure). So if we add
another Set to the mix, it will cost us only the bytes occupied by the
Set structure (the values already live in the IxSet at other points).

The above applies only to fully evaluated sets. Most (all?)
intermediate sets have only thunks representing the indexes -- and, if
we go for my change, the Set as well. Only one of them will ever be
forced.

Peter Robinson

Aug 26, 2010, 10:54:09 AM
to ha...@googlegroups.com
Hi, Jeremy.

Nice work on the kd-tree. Do you have any thoughts on rebalancing? As
far as I know there's no efficient way to rebalance a kd-tree:
essentially you're stuck with rebuilding the entire tree by choosing a
median element as the root.

Peter

Jeremy Shaw

Aug 26, 2010, 11:42:02 AM
to ha...@googlegroups.com

On Aug 26, 2010, at 9:54 AM, Peter Robinson wrote:

> Hi, Jeremy.
>
> Nice work on the kd-tree. Do you have any thoughts on rebalancing?
> As far as
> I know there's no efficient way to rebalance a kd-tree: essentially
> you're stuck
> with rebuilding the entire tree by choosing a median element as the
> root.

No idea. I was first going to read the papers linked to from the
Data.Map docs. But it sounds like there may be some additional issues
here.

I see there is something called a bkd-tree which aims to address the
rebalancing issue, but I have not looked at the paper.

- jeremy

Peter Robinson

Aug 26, 2010, 2:52:27 PM
to ha...@googlegroups.com
>> Nice work on the kd-tree. Do you have any thoughts on rebalancing?  As far
>> as
>> I know there's no efficient way to rebalance a kd-tree: essentially
>> you're stuck
>> with rebuilding the entire tree by choosing a median element as the root.
>
> No idea. I was first going to read the papers linked to from the Data.Map
> docs. But it sounds like there may be some additional issues here.

Well, I guess it depends on the typical use case. I'm satisfied with a
data structure that supports fast O(log n) lookups, since they occur
far more often than updates; so even if the occasional "rebalancing"
(= rebuilding) takes O(n log log n) (according to Wikipedia), it might
not be too bad.

> I see there is something called a bkd-tree which aims to address the
> rebalancing issue, but I have not looked at the paper.

Thanks for the hint, I'll take a look at it.

Peter

Peter Robinson

Aug 28, 2010, 4:20:08 PM
to ha...@googlegroups.com
The kd-tree implementation posted by Jeremy inspired me to start
implementing a similar data structure, an adaptive kd-tree. The main
difference between them is that the adaptive kd-tree stores the actual
data only in its leaves, which allows more efficient rebalancing while
still giving us O(log n) lookups and updates.
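
Schematically, the "data only in the leaves" shape for two indices
looks something like this (a simplified sketch, not the actual
representation):

data AKd a b
  = ATip
  | ALeaf (a, b)                   -- all real data lives at the leaves
  | SplitFst a (AKd a b) (AKd a b) -- interior: a bound on the first index
  | SplitSnd b (AKd a b) (AKd a b) -- interior: a bound on the second index
  -- left subtree: index < bound; right subtree: index >= bound

Because interior nodes carry only bounds, rebalancing never has to
move element data into or out of interior positions.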

Inserts, lookups, deletes and some range queries are already working
--- implementing tree traversal functions in Haskell is real fun. :)
I'm currently adding the balancing part, which is a bit more involved,
as we have to successively choose median points.

I'll put the repository on the web if anyone's interested.

Peter

Jeremy Shaw

Aug 29, 2010, 1:48:07 PM
to ha...@googlegroups.com
Awesome! I won't have time to look at this in detail until next
weekend. But I am very excited about the work anyway :)

- jeremy


Jeremy Shaw

Aug 30, 2010, 4:48:16 PM
to ha...@googlegroups.com
Hello,

Sorry I have not responded, I am probably not going to be able to look
at this until next week due to other commitments.

I think we are going to be able to make a 0.6 release in September. So
we should focus on what changes we want to have for that release.

- jeremy
