
Jul 12, 2013, 5:43:34 PM

to uzay...@googlegroups.com

Good Afternoon,

I stumbled upon your code while searching the internet for a hashing solution that would allow an efficient bounding-box search using HBase or Accumulo. Prior to this, most of my testing was done with quick snippets of code from Stack Overflow, so I was very excited to find an actual full implementation. I also jumped on this implementation after learning that Hilbert curves have better locality-preservation properties than other space-filling curves (SFCs).

So after some testing and wrangling of the data, I set up a loop to dump some data into my test database (pseudocode follows):

for (int x = 100; x <= 200; x++)
    for (int y = 100; y <= 200; y++) {
        writeToTable(getBinIndex(x, y), "test", "test", x + "," + y);
    }

This generated a dataset with the RowID set to the Hilbert index you provide (essentially the return value of chi.toBigEndianByteArray() from your sample code).

I then set up a search to find the points between (150, 150) and (170, 170). The pseudocode looks something like this:

scanFromTable(getBinIndex(150, 150), getBinIndex(170, 170));

The results surprised me:

Total Rows: 10000

Total Points within the bounding box: 400

Total Rows returned from Hilbert Range: 3266

Admittedly, 3,266 is a much smaller set than the full table, but is it really "locality preserving" at that point? Did I do something wrong?

During that iteration I had specified the integer sizes for the Hilbert array as 62, 62. I tried setting it lower in case that had altered the results, but with 8, 8 I received exactly the same results. Any thoughts here, or was I just expecting a magic bullet where there was none?
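To see why a single corner-to-corner scan overselects (and can even miss points), here is a small self-contained sketch of the experiment above. It uses a textbook bitwise 2D Hilbert index, not the library's implementation, and the 256x256 grid size is an illustrative assumption:

```java
// Toy reproduction of the experiment above with a textbook 2D Hilbert index.
public class CornerScan {

    // Hilbert index of (x, y) on an n x n grid, n a power of two.
    static long hilbertIndex(int n, int x, int y) {
        long d = 0;
        for (int s = n / 2; s > 0; s /= 2) {
            int rx = ((x & s) != 0) ? 1 : 0;
            int ry = ((y & s) != 0) ? 1 : 0;
            d += (long) s * s * ((3 * rx) ^ ry);
            if (ry == 0) { // rotate the quadrant so the recursion stays consistent
                if (rx == 1) { x = n - 1 - x; y = n - 1 - y; }
                int t = x; x = y; y = t;
            }
        }
        return d;
    }

    // Scan the index range between the two box corners, like the pseudocode above.
    // Returns {rows scanned by the corner-to-corner range, rows inside the box}.
    static long[] scanCounts() {
        long a = hilbertIndex(256, 150, 150);
        long b = hilbertIndex(256, 170, 170);
        long lo = Math.min(a, b), hi = Math.max(a, b);
        long scanned = 0, inBox = 0;
        for (int x = 100; x <= 200; x++) {
            for (int y = 100; y <= 200; y++) {
                long d = hilbertIndex(256, x, y);
                if (d >= lo && d <= hi) scanned++;
                if (x >= 150 && x <= 170 && y >= 150 && y <= 170) inBox++;
            }
        }
        return new long[] {scanned, inBox};
    }

    public static void main(String[] args) {
        long[] c = scanCounts();
        System.out.println("rows scanned by corner-to-corner range: " + c[0]);
        System.out.println("rows actually inside the box: " + c[1]);
    }
}
```

The corner indices are generally neither the minimum nor the maximum index inside the box, so a single corner-to-corner range both overselects and can miss box points entirely; that is why the library computes a set of covering index ranges instead.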

Jul 13, 2013, 6:35:14 AM

to uzay...@googlegroups.com

Hi,

Ideally you should need only a small number of range queries to cover the query region fully, selecting only small sections outside the region of interest. I created a small test for the query you mentioned, which you can find attached. Here's a summary of what I get:

filteredIndexRanges.size()=1 selected=BigIntegerContent[value=2992]

...

filteredIndexRanges.size()=3 selected=BigIntegerContent[value=824]

filteredIndexRanges.size()=4 selected=BigIntegerContent[value=744]

...

filteredIndexRanges.size()=15 selected=BigIntegerContent[value=400]

I'm not sure how you're getting the number 3266, since with just one range you should get exactly 2992 points without restricting them to the range [100,200)x[100,200). If you can make a self-contained code sample I could have a look at it.

So if you can perform 3 or 4 range queries, then you'll be selecting about twice the ideal number of points. I think it's mostly about finding the tradeoff between the number of ranges and the number of spurious points you may be getting. Also, if you have more information about the density of the data, then you should get better range queries by implementing RegionInspector yourself.
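The range/spurious-row tradeoff can be illustrated with a toy experiment (again using a textbook 2D Hilbert index, not the library's BacktrackingQueryBuilder): collect the sorted indices of the 400 box cells, then cover them with k ranges by cutting at the k-1 largest gaps, and count how many rows each k would scan. The box and grid size below are illustrative assumptions:

```java
// Illustration of the tradeoff: more index ranges => fewer spurious rows scanned.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class RangeTradeoff {

    // Hilbert index of (x, y) on an n x n grid, n a power of two.
    static long hilbertIndex(int n, int x, int y) {
        long d = 0;
        for (int s = n / 2; s > 0; s /= 2) {
            int rx = ((x & s) != 0) ? 1 : 0;
            int ry = ((y & s) != 0) ? 1 : 0;
            d += (long) s * s * ((3 * rx) ^ ry);
            if (ry == 0) { // rotate the quadrant so the recursion stays consistent
                if (rx == 1) { x = n - 1 - x; y = n - 1 - y; }
                int t = x; x = y; y = t;
            }
        }
        return d;
    }

    // Rows scanned when the 400 box-cell indices are covered by k ranges,
    // cutting the single covering range at the k-1 largest gaps.
    static long cellsScanned(int k) {
        List<Long> idx = new ArrayList<>();
        for (int x = 150; x < 170; x++)      // exclusive upper bound: 400 cells
            for (int y = 150; y < 170; y++)
                idx.add(hilbertIndex(256, x, y));
        Collections.sort(idx);
        List<Long> gaps = new ArrayList<>();
        for (int i = 1; i < idx.size(); i++)
            gaps.add(idx.get(i) - idx.get(i - 1) - 1); // skippable cells between neighbours
        gaps.sort(Collections.reverseOrder());
        long scanned = idx.get(idx.size() - 1) - idx.get(0) + 1;
        for (int i = 0; i < k - 1 && i < gaps.size(); i++)
            scanned -= gaps.get(i); // each extra range skips the largest remaining gap
        return scanned;
    }

    public static void main(String[] args) {
        for (int k : new int[] {1, 3, 4, 15})
            System.out.println("ranges=" + k + " cells scanned=" + cellsScanned(k));
    }
}
```

The exact numbers differ from the library's (which walks the curve rather than cutting at gaps), but the shape of the curve is the same: the scanned-row count falls quickly with the first few extra ranges and bottoms out at the ideal 400.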

Best,

Daniel

Jul 16, 2013, 6:21:44 PM

to uzay...@googlegroups.com

Daniel,

Hello and thanks again for the help. I think there was a misunderstanding on my side about how to use your code, and your post was both timely and helped me quite a bit to wrap my head around how this works.

When it comes time to query those results back, based on the code above, are we calculating the entire space of that curve and then limiting the result set based on an optimized set of byte ranges that best fit the results we need to return? That is, are we building a new curve linearly and then iterating through it to return the FilteredIndexRanges?

If that is indeed the case, then that should lead me to my next question...

I was able to get your snippet to work quite well... not as well as you did, and I'm not sure where my discrepancy is. I got it down to 415 results based on your code. Not perfect like yours, but much improved. (I didn't bother with further testing, such as checking for spurious points with the built-in flag, so the discrepancy is most definitely on my end; I was just excited to see it work!)

Based on my testing, it worked very well on a two-dimensional array of points, and the results came back very quickly and precisely using FilteredIndexRanges. So I decided to push my luck and add a third dimension for time. After playing with the decimal precision for storing my X and Y (longitude and latitude) points, I added a third dimension (Z) for time. A project requirement was to build this off of a millisecond timestamp range, which required me to use a BigInteger.

Inserting into the database was quick and snappy, but querying back against it using X (16 bit), Y (16 bit) and Z (62 bit) seems to be extremely computationally intensive, especially if the range is very broad. Is this a side effect of building the curve linearly, with the space being exponentially larger due to the third dimension? The test database is built using this:

for (int i = 0; i < 1000; i++) {
    x = generateRandomX();  // returns a long between 0 and 3600
    y = generateRandomY();  // returns a long between 0 and 1800
    z = generateRandomTS(); // returns a BigInteger between 1373932800000 and 137393640000
    generateHilbertIndex(x, y, z);
}

At query time, I'm using essentially your code pasted above, but with the addition of another range and the extra bits for the third dimension. The query ranges are X = (2000, 2100), Y = (1000, 1100), Z = (1373932800000, 1373936400000).

As of now it has been running for the last 389 minutes, still trying to build the index ranges. Is there a better way to speed this up, or is this an inherent limitation of datasets with multiple dimensions? I'm probably doing something wrong again!

Jul 17, 2013, 10:28:38 AM

to uzay...@googlegroups.com

Hi,

Please note that the ranges have an exclusive upper bound. If you treat it as inclusive that may explain why you get 400 + 15 (one spurious point for each range) points instead of 400.

As for the query building time for large ranges, let me explain first how the range queries are built.

The library implements a very simple algorithm that does not impose any restrictions on the form of the query region, i.e., it can be spherical, rectangular, star-shaped, anything you want. The ranges are built by basically walking around the surface of the query region. While a specialised routine for (hyper)rectangular query regions would probably speed things up a lot, the speedup may be limited to a constant factor, but I would be very glad to find out otherwise. In the general case I don't think that walking the query surface can be entirely avoided, but again, I hope to be proved wrong.

The main assumption of the algorithm is that fetching from disk/network a number of rows equal to the volume of the query region (divided by a constant factor; see "minOverlappingContent" below) should be much slower than visiting the full surface of the query region in memory. This assumption by itself does not help much if the data are very sparse.

In your case, you have an average distance of about 1.2 billion on the time axis. The library doesn't know that, and by default it assumes you're planning to fetch some (2100-2000)*(1100-1000)*(1373932800000-137393640000)=12e15 entries from the database, so it takes its time building the range queries.

What can be done about that?

If the query range length on every axis is more or less the same, then you can use the parameter minOverlappingContent of SimpleRegionInspector.create() to instruct the query builder not to look inside hypercubes with a volume smaller than that; instead, if the query region overlaps such a hypercube, the full hypercube is included in the range queries. What this means is that the last log2(minOverlappingContent) bits of the index, i.e., the last log2(minOverlappingContent) / dimensionality bits of each dimension, are simply ignored, and you will get spurious data points (Query.isPotentialOverSelectivity will return true). So you may want to set the parameter to, for example, queryVolume / expectedNumberOfRows * C, where C is a constant. For perfectly uniform data a good value would be C=1.
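As a back-of-envelope illustration of that formula (the query volume, expected row count and C below are made-up numbers, not taken from this thread):

```java
// Rough sizing of minOverlappingContent from query volume and expected rows.
public class MinOverlapEstimate {

    // minOverlappingContent ~= queryVolume / expectedNumberOfRows * C
    static long minOverlappingContent(long queryVolume, long expectedRows, long c) {
        return Math.max(1, queryVolume / expectedRows * c);
    }

    // Index bits effectively ignored for a given minOverlappingContent (floor of log2).
    static int ignoredIndexBits(long minOverlappingContent) {
        return 63 - Long.numberOfLeadingZeros(minOverlappingContent);
    }

    public static void main(String[] args) {
        long queryVolume = 100L * 100L * 3_600_000L; // e.g. 100 x 100 x 1 hour in ms
        long expectedRows = 10_000L;                 // assumed, for illustration only
        long m = minOverlappingContent(queryVolume, expectedRows, 1);
        System.out.println("minOverlappingContent ~= " + m);
        System.out.println("ignored index bits ~= " + ignoredIndexBits(m)
                + " (~" + ignoredIndexBits(m) / 3 + " per dimension in 3D)");
    }
}
```

With these assumed numbers, about 21 index bits (roughly 7 per dimension in 3D) would be ignored, which is exactly where the spurious points come from.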

Note that the parameter minOverlappingContent is unrelated to the parameter maxFilteredRanges of BacktrackingQueryBuilder.create(), which also needs to be set so as to limit the maximum number of range queries that you are willing to perform.

In your example the query range lengths of the 3 axes differ by 10 orders of magnitude, so the surface area of the query region is much higher than necessary. In order to optimise the index for queries where the time-axis range is 10 orders of magnitude higher than the query range length on the X and Y axes, you will need to drop the lowest math.ceil(math.log((1373932800000 - 137393640000) / 100, 2)) = 34 bits from the time axis when building the index. In your case the remaining time accuracy will be about 199 days for a query range of 39 years. After this you can use the parameter minOverlappingContent to further speed up query building if necessary, and then limit the number of range queries into the database with maxFilteredRanges.
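A minimal sketch of the bit-dropping idea (the helper below is hypothetical, not part of the library; only the quantised value would be fed into the index):

```java
// Coarsening the time axis before indexing: drop the lowest 34 bits
// of a millisecond timestamp, as suggested above.
public class TimeCoarsening {

    static final int DROPPED_BITS = 34;

    // The value actually used as the time coordinate of the index.
    static long coarsen(long millis) {
        return millis >>> DROPPED_BITS;
    }

    public static void main(String[] args) {
        double bucketDays = (double) (1L << DROPPED_BITS) / 86_400_000L;
        System.out.println("time bucket ~= " + bucketDays + " days"); // ~198.8
        System.out.println("coarse(1373932800000) = " + coarsen(1373932800000L));
    }
}
```

One bucket is 2^34 ms, i.e. about 198.8 days, which matches the "about 199 days" figure above; range scans then operate on whole buckets, and the exact timestamps are filtered from the fetched rows.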

In the end you may want to have one index for each kind of query shape in terms of orders of magnitude difference between the query range lengths of the axes. For example, queries of size 1x10x100 and 100x10000x10000 can use the same index, only that if you want the same query building time, minOverlappingContent needs to be about 100^3 times larger in the latter case. This does not mean that all axes must have the same precision (number of bits). It is the range length of the query on each dimension, not the precision of each dimension, that matters.

For example, if you limit your range query to 100 milliseconds, i.e. from 1373932800000 to 1373932800100, then it works just fine, even with the default value minOverlappingContent=1:

filteredIndexRanges.size()=1 selected=BigIntegerContent[value=48693508096]

filteredIndexRanges.size()=10 selected=BigIntegerContent[value=2068352]

filteredIndexRanges.size()=100 selected=BigIntegerContent[value=1178880]

filteredIndexRanges.size()=599 selected=BigIntegerContent[value=1000000]

The idea is that the more regular the query shape is, the smaller its surface area becomes (in fact, what matters is surfaceArea / (minOverlappingContent ^ (number of dimensions - 1))), and so the processing time to build the range queries drops as well.

Moreover, if your data are mostly static, then you can compile an in-memory aggregated view of the data using BoundedRollup and use that to further speed up query building and create better range queries.

Best,

Daniel

On Friday, July 12, 2013 at 11:43:34 PM UTC+2, zakrat wrote:
