H2 Multidimensional Index

109 views
Skip to first unread message

Martin Davis

unread,
Apr 18, 2008, 7:43:04 PM4/18/08
to H2 Database
Can anyone comment on whether the H2 MultiDimensional index can be
used to store and query objects with an extent (i.e. of dimension
higher than 0 - in other words, not just points).

AFAIK this is not possible with an indexing scheme based on a
linearization of space. But I'd be happy to be shown that I'm wrong.



Thomas Mueller

unread,
Apr 22, 2008, 1:47:32 PM4/22/08
to h2-da...@googlegroups.com
Hi,

> Can anyone comment on whether the H2 MultiDimensional index can be
> used to store and query objects with an extent (i.e. of dimension
> higher than 0 - in other words, not just points).

No, just points.

> AFAIK this is not possible with an indexing scheme based on a
> linearization of space. But I'd be happy to be shown that I'm wrong.

You are right.

Regards,
Thomas

Martin Davis

unread,
Apr 23, 2008, 12:12:17 PM4/23/08
to H2 Database
Thanks for the confirmation, Thomas.

So I guess to offer a true spatial index there are two choices:
1. implement a native R-tree
2. implement some sort of grid index over top of the existing B-tree
(this is the approach that ArcSDE and SQLServer use, so it's not a
total hack)

It seems like your FullText indexing might provide a nice pattern for
doing #2. The query re-writing is a bit hideous, but presumably will
work, given the prior examples.

Martin

Thomas Mueller

unread,
Apr 23, 2008, 5:12:45 PM4/23/08
to h2-da...@googlegroups.com
Hi,

> So I guess to offer a true spatial index there are two choices:
> 1. implement a native R-tree
> 2. implement some sort of grid index over top of the existing B-tree
> (this is the approach that ArcSDE and SQLServer use, so it's not a
> total hack)

Yes. The advantage of 2. is that it works for all database, and can be
written independent of the database used. I would prefer this
solution. Optimizations are still possible if the database supports
more features.

> It seems like your FullText indexing might provide a nice pattern for
> doing #2. The query re-writing is a bit hideous, but presumably will
> work, given the prior examples.

That's a good idea. I wouldn't directly re-use fulltext index, but
create a new table with the relation between object and point. The
bounding box doesn't necessarily need to be rectangular. One idea is:

class GeoPoint { number[] location; number getLocation1D(); }
class GeoObject { number uniqueId; GeoPoint[] getPoints(); }

in the database, you would have a table that is the relation between
GeoObject and GeoPoint.

Maybe it would be better to have not just one, but a number of indexes
(let's say a coarse grained index, and a fine grained).

Regards,
Thomas

Martin Davis

unread,
Apr 25, 2008, 2:02:56 PM4/25/08
to H2 Database


On Apr 23, 2:12 pm, "Thomas Mueller" <thomas.tom.muel...@gmail.com>
wrote:
>
> Yes. The advantage of 2. is that it works for all database, and can be
> written independent of the database used. I would prefer this
> solution. Optimizations are still possible if the database supports
> more features.

True. There's plenty of precedent for this approach (SDE, SQLServer).

However, there are some downsides to grid-like indexing:
1. they requires some ugly parameters. All the ones I've seen require
you to specify the extent of your data, and also some parameters to
define the density of the grid at each level. These are pretty opaque
to tune
2. the query structure is more complex. It pretty much requires a
query-rewriting step (which may be more or less hidden by the SQL
interface)
3. I expect that performance is less (partly as a result of 2)

I would much prefer to see an R-tree implemented. Most "quality"
spatial DBs have gone this route (Oracle, mySQL, PostGIS). I expect
to see SQLServer cave in and do this at some point as well.

> That's a good idea. I wouldn't directly re-use fulltext index, but
> create a new table with the relation between object and point. The
> bounding box doesn't necessarily need to be rectangular.

Not sure I follow. Every other spatial DB I'm aware of uses bounding
boxes. I would be cautious about doing something radically different,
at least for the first version.

> Maybe it would be better to have not just one, but a number of indexes
> (let's say a coarse grained index, and a fine grained).

Yes, most grid index implementations use several levels of index
(which is yet another parameter that the user has to specify...)

Thomas Mueller

unread,
Apr 27, 2008, 5:50:59 AM4/27/08
to h2-da...@googlegroups.com
Hi,

> there are some downsides to grid-like indexing:
> 1. they requires some ugly parameters. All the ones I've seen require
> you to specify the extent of your data, and also some parameters to
> define the density of the grid at each level. These are pretty opaque
> to tune

Could it be self-tuning? The index should be re-created after
inserting data (or after you know what the distribution is), but that
sounds like updating index statistics - I think it's not that bad.

> 2. the query structure is more complex. It pretty much requires a
> query-rewriting step (which may be more or less hidden by the SQL
> interface)

I agree. But this logic would be on a higher level, and can be easily
and independently debugged.

> 3. I expect that performance is less (partly as a result of 2)

I'm not sure - why don't we test it :-) My feeling is that Z-order as
described in http://en.wikipedia.org/wiki/Z-order_%28curve%29 is fast
if implemented correctly. The H2 MultiDimensional index is based on
that.

> I would much prefer to see an R-tree implemented. Most "quality"
> spatial DBs have gone this route (Oracle, mySQL, PostGIS).

OK, it's your project ;-) I will help you as much as I can, but I will
not be able to implement the R-tree index.

> Every other spatial DB I'm aware of uses bounding
> boxes. I would be cautious about doing something radically different,
> at least for the first version.

Do you have any links, I don't know too much about spatial indexing.

My idea is: For each object, store the location and bounding box. But
for the index, use the grid approach. I think we need to agree on use
cases. One use case is: store and index buildings and streets to
create a map. We should then agree on some demo data and an
application. I would store things like this:

Object 1 { type: street, name: "A1", level: 5, bounds: (0, 10)/(100,
60), polygon: ..., segments: 1001 - 1100 }
Object 1001 { level: 8, bounds: (0, 59)/(1, 60), polygon: ..., parent: 1 }

So each larger street contains street segments, knows the bounding box
and the polygon. Level is the zoom level (1: world, 10: house). There
are multiple indexes, one for each zoom level. A zoom level could
contain 4 times the space than the previous zoom level, or 16 times.
Each object is indexed as follows: calculate the "squares" (each
square has a number) it occupies in the given zoom level; add a link
to this object to the index. With the data above it could be:

index_level_5: square 500: street 1; square 501: street 1, square 505: street 1

When you want to look up what to draw for the given zoom level, you
search in the right index all data for the given squares you need to
draw. So that's basically like a fulltext search index. This is how I
would do it using the current H2 MultiDimensional index. I guess my
description is not that clear...

Regards,
Thomas

Martin Davis

unread,
Apr 28, 2008, 4:48:24 PM4/28/08
to H2 Database


>
> > there are some downsides to grid-like indexing:
> > 1. they requires some ugly parameters. All the ones I've seen require
> > you to specify the extent of your data, and also some parameters to
> > define the density of the grid at each level. These are pretty opaque
> > to tune
>
> Could it be self-tuning? The index should be re-created after
> inserting data (or after you know what the distribution is), but that
> sounds like updating index statistics - I think it's not that bad.

You might be able to make a tool that would advise on index parameter
settings. (SDE has this - I can only presume SQLServer does as well,
since they have quite a complex set of magic numbers). I don't see
that this could be run automatically after every insert, however,
since changing params would basically require a full reindex.

>
> > 3. I expect that performance is less (partly as a result of 2)
>
> I'm not sure - why don't we test it :-) My feeling is that Z-order as
> described inhttp://en.wikipedia.org/wiki/Z-order_%28curve%29is fast
> if implemented correctly. The H2 MultiDimensional index is based on
> that.

Real testing would require a pretty much full implementation of both
paradigms, I suspect. In which case I would just use the R-tree 8^)
But you are right, I have no evidence about relative performance, just
gut feel (which I happily admit could be wrong)
>
> > I would much prefer to see an R-tree implemented. Most "quality"
> > spatial DBs have gone this route (Oracle, mySQL, PostGIS).
>
> OK, it's your project ;-) I will help you as much as I can, but I will
> not be able to implement the R-tree index.

Sure. Neither will I, probably 8^) Hopefully some clever SOC
grantee...

>
> > Every other spatial DB I'm aware of uses bounding
> > boxes. I would be cautious about doing something radically different,
> > at least for the first version.
>
> Do you have any links, I don't know too much about spatial indexing.

Here's a post on the SQLServer index:
http://johanneskebeck.spaces.live.com/blog/cns!42E1F70205EC8A96!3589.entry

Hanan Samet's books are the bible of spatial indexing.

There's hundreds of papers on R-trees, of course.

I don't have a good link for SDE indexing, but it's pretty simple - a
multi-level grid.


>
> My idea is: For each object, store the location and bounding box. But
> for the index, use the grid approach. I think we need to agree on use
> cases.

My use case is: provide fast range query access to any table
containing spatial data 8^)

All spatial DBs that I'm aware of implement a single kind of spatial
index that meets all use cases. I don't think you have to think any
harder than this. There's plenty of precedence out there - this is a
very well understood problem.

Martin

Thomas Mueller

unread,
May 5, 2008, 1:52:36 PM5/5/08
to h2-da...@googlegroups.com
Hi,

> index parameter settings.


> changing params would basically require a full reindex.

Yes, that's true. However I don't think that this will be happening a lot.

> Real testing would require a pretty much full implementation of both
> paradigms, I suspect.

I'm not sure, but if yes then let's do that :-) What about a simple
in-memory R-Tree implementation first? We can then better decide if
it's easy to add this to the database or not.

Regards,
Thomas

Martin Davis

unread,
May 5, 2008, 7:09:32 PM5/5/08
to H2 Database
(Splitting this out into a new thread)

This sounds like a great way to proceed.

I'm not familiar with how H2 implements in-memory indexes. Are they
created and maintained for the duration of a user connection, or for
the lifetime of the DB process? (I presume the latter...) If so,
that could be a very useful feature in its own right - it should be a
lot faster than a disk-based index, right?

As I mentioned in my other post, there seems to be lots of existing
Java R-tree implementations. Should just be a matter of finding one
with a good license and good functionality, and then cutting it in...

I guess this will require lots of little changes to the SQL DML syntax
& language processor, right? (And I guess I should just look at the
code instead of asking all these questions... 8^)

On May 5, 10:52 am, "Thomas Mueller" <thomas.tom.muel...@gmail.com>
wrote:

Thomas Mueller

unread,
May 7, 2008, 2:39:22 PM5/7/08
to h2-da...@googlegroups.com
Hi,

> I'm not familiar with how H2 implements in-memory indexes. Are they
> created and maintained for the duration of a user connection, or for
> the lifetime of the DB process? (I presume the latter...)

Yes, for the lifetime of the database.

> If so, that could be a very useful feature in its own right - it should be a
> lot faster than a disk-based index, right?

Yes.

> As I mentioned in my other post, there seems to be lots of existing
> Java R-tree implementations.

But I guess many are complex and / or don't have the right license.
And my experience is that to understand it fully you need to implement
at least a part of it yourself.

> I guess this will require lots of little changes to the SQL DML syntax
> & language processor, right?

Yes, at some point.

What I meant is: I would first implement the R tree outside of the
database, as a standalone package. At that time the API would be Java.
Adding the R tree to the parser and adding the features to the parser
does not have priority in my view. I would first try to solve the
hardest problem.

Also I would implement a test case first.

> (And I guess I should just look at the
> code instead of asking all these questions... 8^)

If you can't find things in H2 let me know. I need to improve the
source code documentation anyway. The parser is Parser.java, and the
index is Index.java, but you could extend BaseIndex.

Regards,
Thomas

Reply all
Reply to author
Forward
0 new messages