Range query with 64-bit precision BigInteger hangs


Tilmann Z

Jul 29, 2014, 1:00:06 PM
to uzay...@googlegroups.com
I think I have a different problem than https://groups.google.com/forum/#!topic/uzaygezen/1NpA9lVZXS0

I also need 64-bit precision (I'm encoding 'double' floating-point values), so I started using BigInteger.
But when I execute a 2-dimensional query, the code hangs in CompactHilbertCurve.accept(), which gets stuck in an endless loop: 'i' iterates between a few values in the for-loop at the end of the method.
The attached code demonstrates the problem. It works if I use only one dimension instead of two. Btw, I'm using version 0.2 via Maven.

// Imports assume the uzaygezen 0.2 package layout, plus Guava and JUnit.
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

import org.junit.Assert;

import com.google.common.base.Functions;
import com.google.common.collect.ImmutableList;
import com.google.uzaygezen.core.*;
import com.google.uzaygezen.core.ranges.*;

public class TestCompactHilbert {

    private CompactHilbertCurve sfc; // space-filling curve

    private final int DIM;
    private final int DEPTH; // = 64;

    private TestCompactHilbert(int DIM, int DEPTH) {
        this.DIM = DIM;
        this.DEPTH = DEPTH;
    }

    private long toSortableLong(double value) {
        // To create a sortable long, we convert the double to a long using the
        // IEEE-754 standard, which stores floats in the form
        // <sign><exponent-127><mantissa>. The result is a properly ordered long
        // for all positive doubles. Negative values have inverse ordering, so we
        // invert their non-sign bits to make them sortable; the sign bit stays
        // set so they remain negative.
        long r = Double.doubleToRawLongBits(value);
        return (r >= 0) ? r : r ^ 0x7FFFFFFFFFFFFFFFL;
    }

    private BigInteger[][] generateRanges(double[] min, double[] max) {
        BigInteger[][] ranges = new BigInteger[DIM][2];
        for (int j = 0; j < DIM; ++j) {
            ranges[j][0] = BigInteger.valueOf(toSortableLong(min[j]));
            ranges[j][1] = BigInteger.valueOf(toSortableLong(max[j]));
        }
        return ranges;
    }

    private void order2Query(double[] minD, double[] maxD) {
        BigInteger[][] ranges = generateRanges(minD, maxD);
        int maxRanges = 1;
        queryAndFilterBI(sfc, ranges, maxRanges, null);
    }

    public static void main(String[] args) {
        int DIM = 2;
        TestCompactHilbert x = new TestCompactHilbert(DIM, 64);
        x.run();
    }

    private void run() {
        int[] dd = new int[DIM];
        Arrays.fill(dd, DEPTH);
        sfc = new CompactHilbertCurve(dd); // DEPTH bits per dimension

        // range queries
        double[] qDLow = new double[]{0.01, 0.01};
        double[] qDUpp = new double[]{0.011, 0.011};
        System.out.println("Waiting...");
        order2Query(qDLow, qDUpp);
        System.out.println("Done!");
    }

    private static final Logger logger =
            Logger.getLogger(TestCompactHilbert.class.getSimpleName());

    private void queryAndFilterBI(
            SpaceFillingCurve sfc, BigInteger[][] ranges,
            int maxRanges, Map<Pow2LengthBitSetRange, NodeValue<BigIntegerContent>> rolledupMap) {
        List<BigIntegerRange> region = rangesToQueryRegionBI(ranges);
        List<FilteredIndexRange<Object, BigIntegerRange>> indexRanges = queryBI(
                region, sfc, maxRanges, rolledupMap);
        Assert.assertTrue(indexRanges.size() <= maxRanges);
        logger.log(Level.INFO, "indexRanges={0}", indexRanges);
    }

    private List<FilteredIndexRange<Object, BigIntegerRange>> queryBI(
            List<BigIntegerRange> region, SpaceFillingCurve sfc, int maxRanges,
            Map<Pow2LengthBitSetRange, NodeValue<BigIntegerContent>> rolledupMap) {
        List<? extends List<BigIntegerRange>> x = ImmutableList.of(region);
        BigIntegerContent zero = new BigIntegerContent(BigInteger.ZERO);
        Object filter = "";
        BigIntegerContent one = new BigIntegerContent(BigInteger.ONE);
        RegionInspector<Object, BigIntegerContent> simpleRegionInspector =
                SimpleRegionInspector.create(
                        x, one, Functions.constant(filter), BigIntegerRangeHome.INSTANCE, zero);
        final RegionInspector<Object, BigIntegerContent> regionInspector;
        if (rolledupMap == null) {
            regionInspector = simpleRegionInspector;
        } else {
            regionInspector = MapRegionInspector.create(
                    rolledupMap, simpleRegionInspector, false, zero, one);
        }
        // Not using sub-ranges here.
        PlainFilterCombiner<Object, BigInteger, BigIntegerContent, BigIntegerRange> combiner =
                new PlainFilterCombiner<>(filter);
        QueryBuilder<Object, BigIntegerRange> queryBuilder = BacktrackingQueryBuilder.create(
                regionInspector, combiner, maxRanges, true, BigIntegerRangeHome.INSTANCE, zero);
        sfc.accept(new ZoomingSpaceVisitorAdapter(sfc, queryBuilder));
        Query<Object, BigIntegerRange> query = queryBuilder.get();
        return query.getFilteredIndexRanges();
    }

    private static List<BigIntegerRange> rangesToQueryRegionBI(BigInteger[][] ranges) {
        List<BigIntegerRange> region = new ArrayList<>();
        for (int j = 0; j < ranges.length; ++j) {
            region.add(BigIntegerRange.of(ranges[j][0], ranges[j][1]));
        }
        return region;
    }
}


The code is partially copied/adapted from your HBaseQueryTest.

Am I doing anything wrong?

Another question: would you recommend LongArrayBitVector instead of BigInteger? I couldn't see a way of specifying queries with LongArrayBitVectors.

Daniel

Jul 29, 2014, 3:13:12 PM
to uzay...@googlegroups.com
In the double-precision floating-point representation, 0.01 maps to the long 4576918229304087675 and 0.011 to 4577494690056391098, so there are 576460752303423 other representable values in between.
The query-building algorithm essentially traverses the perimeter of the query region to build the query, and here that perimeter is on the order of 10^15 cells.
If you switch to float, and perhaps support only non-negative values so that you need only 31+31 bits, which fit in a long, then building the query takes a couple of seconds. In this case the input ranges are LongRange[start=1008981770,end=1010055512], and the result is indexRanges=[FilteredIndexRange[indexRange=LongRange[start=765620778416341128,end=765624810245652408],filter=,potentialOverSelectivity=true]].
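
For reference, here is a minimal self-contained sketch of that float encoding (the class and method names are made up; the printed values match the LongRange endpoints quoted above):

public class SortableFloat {

    // For non-negative floats the sign bit is 0, so the raw IEEE-754 bit
    // pattern is itself an order-preserving 31-bit key, and two such keys
    // fit comfortably into a single long Hilbert index.
    static int toSortableInt(float value) {
        if (value < 0) {
            throw new IllegalArgumentException("non-negative values only");
        }
        return Float.floatToRawIntBits(value); // order-preserving for value >= 0
    }

    public static void main(String[] args) {
        System.out.println(toSortableInt(0.01f));  // 1008981770
        System.out.println(toSortableInt(0.011f)); // 1010055512
    }
}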

It's also better to have the values somewhat uniformly distributed. Floating point is very dense around zero, so unless your data is also like that, it may be better to use a fixed-point representation.
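
As an illustration of the fixed-point idea (a hedged sketch; the helper name and the [lo, hi] domain are hypothetical and should come from your data):

public class FixedPoint {

    // Uniform quantization of a double known to lie in [lo, hi] to a
    // bits-wide integer key. Unlike IEEE-754 bit patterns, these keys are
    // evenly spaced, so a query box covers a cell count proportional to its
    // actual side lengths.
    static long toFixedPoint(double value, double lo, double hi, int bits) {
        double normalized = (value - lo) / (hi - lo); // map to [0, 1]
        long maxKey = (1L << bits) - 1;               // largest representable key
        return Math.round(normalized * maxKey);       // nearest grid cell
    }

    public static void main(String[] args) {
        // Example: values known to lie in [0, 1], quantized to 31 bits.
        System.out.println(toFixedPoint(0.01, 0.0, 1.0, 31));  // 21474836
        System.out.println(toFixedPoint(0.011, 0.0, 1.0, 31)); // 23622320
    }
}

With this encoding the same 0.01 to 0.011 query spans only about two million grid cells per dimension, instead of the roughly 5.8 * 10^14 representable doubles above.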

       

Tilmann Z

Jul 30, 2014, 9:58:21 AM
to uzay...@googlegroups.com
Thanks, I will try that.
In the course of my research, I'm actually comparing different indices with respect to their multi-dimensional behaviour. So using 32-bit floats for one index and 64-bit values for the others is a bit unfair, but it should still give me a general idea.

I'm no Hilbert expert, but would you say that the performance problem with large ranges (10^15 intermediate values) is a general problem with the Hilbert approach, or more a problem with the current implementation?

Daniel

Jul 30, 2014, 10:20:57 AM
to uzay...@googlegroups.com
I think the problem is inherent to the Hilbert curve, and probably to other space-filling curves as well, as long as you need maxRanges >= 2 optimal ranges. For maxRanges=1 it's rather easy to make an implementation that simply finds the minimum and maximum Hilbert index of the query region. Are you only interested in maxRanges=1?

Tilmann Z

Jul 30, 2014, 2:50:07 PM
to uzay...@googlegroups.com
I'm working on the PH-tree (relatively new: PDF), which uses a Z-curve (Morton curve). It scales linearly with the number of bits (precision); performance depends mostly on the number of entries actually stored in the queried range. The PH-tree does not really have the concept of splitting a range query into multiple ranges.


> For maxRanges=1 it's rather easy to make an implementation that simply finds the minimum and maximum Hilbert index of the query region. Are you only interested in maxRanges=1?
Well, as I understand it, having only a single range means that I need to traverse all intermediate values and check them for validity, which can be quite expensive. I only chose maxRanges=1 in my example as a value that should return at some point; I didn't understand the implications. But higher values don't seem to work either.
Would you suggest a specific value for maxRanges?

I don't really care, I just want it to be fast :-)
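
To make that trade-off concrete, here is a small self-contained demo, independent of uzaygezen (xy2d is the classic 2-D Hilbert conversion, e.g. from the Wikipedia article on Hilbert curves; the grid size and query box are made-up example values). It brute-forces the single covering index range [minIndex, maxIndex] for a query box and counts how many cells in that range lie outside the box:

public class SingleRangeOverhead {

    // Hilbert index of cell (x, y) on an n x n grid, n a power of two.
    static long xy2d(long n, long x, long y) {
        long d = 0;
        for (long s = n / 2; s > 0; s /= 2) {
            long rx = ((x & s) > 0) ? 1 : 0;
            long ry = ((y & s) > 0) ? 1 : 0;
            d += s * s * ((3 * rx) ^ ry);
            // Rotate/flip the quadrant so the base pattern repeats.
            if (ry == 0) {
                if (rx == 1) {
                    x = n - 1 - x;
                    y = n - 1 - y;
                }
                long t = x;
                x = y;
                y = t;
            }
        }
        return d;
    }

    public static void main(String[] args) {
        long n = 64;                                // 64 x 64 grid
        long loX = 8, hiX = 11, loY = 40, hiY = 47; // 4 x 8 query box
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE, inBox = 0;
        for (long x = loX; x <= hiX; x++) {
            for (long y = loY; y <= hiY; y++) {
                long d = xy2d(n, x, y);
                min = Math.min(min, d);
                max = Math.max(max, d);
                inBox++;
            }
        }
        long rangeSize = max - min + 1;
        System.out.println("box cells=" + inBox
                + ", single covering range size=" + rangeSize
                + ", invalid intermediate cells=" + (rangeSize - inBox));
    }
}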

Daniel

Jul 31, 2014, 6:14:02 AM
to uzay...@googlegroups.com
Nice work on the PH-tree.

I think I'm starting to understand your requirements. I assume you're working with a PH-tree stored on disk?

I have not implemented something like the UB-Tree's getNextZValue (http://www.vldb.org/conf/2000/P263.pdf) for the Hilbert curve. If you have any useful references on that, I would be pleased to know. I would also like to know if there is something like http://mistral.informatik.tu-muenchen.de/results/publications/dwacos99.pdf for non-rectangular volumes with the Hilbert curve.

So the most efficient thing supported by uzaygezen at the moment boils down to a depth-first search, for which you'll need to implement a RegionInspector.

Rather than taking you directly to the next Hilbert value covered by the query region, it will ask you to determine if there is an intersection between the query region and each subregion, in the Hilbert order. Let me know if that's useful to you or if you need any help.
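
Concretely, the query-specific part of that contract is just an intersection test between the query box and an axis-aligned subregion. A generic sketch, with an illustrative array layout:

public class BoxIntersection {

    // Do two axis-aligned orthotopes intersect? Both are given as inclusive
    // per-dimension bounds; this is the question the depth-first search asks
    // for each subregion it visits, in Hilbert order.
    static boolean intersects(long[] queryLo, long[] queryHi,
                              long[] cellLo, long[] cellHi) {
        for (int d = 0; d < queryLo.length; d++) {
            // Disjoint as soon as the boxes miss each other in one dimension.
            if (cellHi[d] < queryLo[d] || cellLo[d] > queryHi[d]) {
                return false;
            }
        }
        return true;
    }
}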

Tilmann Z

Jul 31, 2014, 12:02:58 PM
to uzay...@googlegroups.com
Thanks.

I'm currently using the PH-Tree only in memory (64GB max).

I don't really know the UB-Tree, and unfortunately I don't have any references other than the ones you posted...
Do you know the XXL project? They have implemented a range of index structures, including a Hilbert R-Tree. Maybe they are using getNextZValue?

I haven't actually implemented your suggestion above. The problem is that my example above really is a simplification; my tests run from 2 to 20 dimensions. That means I have to use BigInteger, not 'long'. I tried converting to 32-bit floats before converting to BigIntegers, but the queries (for example in 3 dimensions) still don't seem to finish within a few minutes. I tried maxRanges = {1, 10000, Integer.MAX_VALUE} with only 10000 entries in the tree.

Tilmann Z

Jul 31, 2014, 12:04:45 PM
to uzay...@googlegroups.com
I think I forgot the link to the XXL library:
https://code.google.com/p/xxl/source/browse/#svn%2Ftrunk%2Fxxlcore%2Fsrc%2Fxxl%2Fcore%2FindexStructures

--> Not sure how I can review my posts here...





Daniel

Jul 31, 2014, 5:20:29 PM
to uzay...@googlegroups.com
Thanks for the link, I'll check it out.

Let me explain a bit better what I meant. From a quick look at the PH-tree paper, my understanding is that a region query is performed by what is basically a depth-first traversal of the PH-tree. Moreover, since the PH-tree encodes the keys by bit-interleaving, such a traversal in fact produces the leaves in Z-order. I think that you are exploring the possibility of using space-filling curves other than Z-order.
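
For concreteness, a hedged sketch of that bit-interleaving for two 32-bit keys (the dimension count and bit-slot convention are illustrative):

public class Morton2D {

    // 2-D Morton (Z-order) interleaving: each level of a trie over the
    // interleaved key consumes one bit per dimension, which is why a
    // depth-first traversal of such a trie yields the leaves in Z-order.
    static long interleave(int x, int y) {
        long z = 0;
        for (int i = 0; i < 32; i++) {
            z |= ((long) ((x >>> i) & 1)) << (2 * i + 1); // x bits: odd slots
            z |= ((long) ((y >>> i) & 1)) << (2 * i);     // y bits: even slots
        }
        return z;
    }

    public static void main(String[] args) {
        // Nearby cells share long common prefixes of the interleaved key.
        System.out.println(Long.toBinaryString(interleave(3, 5))); // 11011
    }
}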

If you use the Hilbert encoding of the keys in the PH-tree rather than Z-order, you can implement the region query by simultaneously traversing both the Hilbert space and the PH-tree. You could do that by calling CompactHilbertCurve.accept(visitor), but that would still be too slow for a sparse tree. I think it would be much better to traverse the PH-tree and adapt some of the code in CompactHilbertCurve.accept() to compute and maintain Cartesian coordinates during the PH-tree traversal. Does that make sense?

To see how CompactHilbertCurve.accept can be used, you may want to check HBaseQueryTest.createRolledupCache and HBaseQueryTest.query - but that has very little in common with your use case.

Tilmann Z

Aug 4, 2014, 11:39:58 AM
to uzay...@googlegroups.com
> Let me explain a bit better what I meant. From a quick look at the PH-tree paper, my understanding is that a region query is performed by what is basically a depth-first traversal of the PH-tree. Moreover, since the PH-tree encodes the keys by bit-interleaving, such a traversal produces in fact the leaves in Z-order.
Yes, correct.

Using space-filling curves other than Z-order would be an option, but that's not my immediate goal. The PH-tree has a 'natural' Z-ordering, and applying any other ordering would add an immediate performance cost. It is true that Hilbert curves should have better properties than Z-curves, but I currently think that the benefit of Hilbert curves is too small to justify the additional processing cost. But indeed, that would have to be evaluated properly.

For now I was just trying to store a Hilbert curve in a binary Patricia tree (I use this one). I thought that would give me a basic idea of how Hilbert curves perform, without putting too much effort into it. If they performed comparably to the PH-tree for low dimensionality (2 to 5 dimensions or so), then I would look into adapting the PH-tree to use Hilbert curves.
But it currently seems as if Hilbert trees are not really an option when using standard binary trees? Maybe I have to look a bit more into the accept() method to understand the concept and see how it could be better used or even adapted for my CritBit or PH-tree.

Anyway, thanks for your help.