Hi,
Please note that the ranges have an exclusive upper bound. If you treat it as inclusive, that may explain why you get 415 points (400 + 15, one spurious point for each range) instead of 400.
As for the query building time for large ranges, let me explain first how the range queries are built.
The library implements a very simple algorithm that does not impose any restrictions on the shape of the query region, i.e., it can be spherical, rectangular, star-shaped, anything you want. The ranges are built by essentially walking along the surface of the query region. A specialised routine for (hyper)rectangular query regions would probably speed things up a lot, although the speedup may be limited to a constant factor; I would be very glad to find out otherwise. In the general case I don't think that walking the query surface can be entirely avoided, but again, I hope to be proved wrong.
The main assumption of the algorithm is that fetching from disk or network a number of rows equal to the volume of the query region (divided by a constant factor; see "minOverlappingContent" below) should be much slower than visiting the full surface of the query region in memory. This assumption by itself does not help much if the data are very sparse.
In your case, you have an average distance of about 1.2 billion between points on the time axis. The library doesn't know that, and by default it assumes that you're planning to fetch about (2100-2000) * (1100-1000) * (1373932800000-137393640000) = 1.2e16 entries from the database, so it takes its time building the range queries.
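To make the numbers concrete, here is the volume the library assumes it must plan for, using the query ranges from this thread (the variable names are mine, not part of the library):

```python
x_len = 2100 - 2000                     # X axis query range length
y_len = 1100 - 1000                     # Y axis query range length
t_len = 1373932800000 - 137393640000    # time axis query range length (ms)

# The volume of the query region, i.e. the number of entries the library
# assumes, by default, that you may want to fetch.
volume = x_len * y_len * t_len
print(volume)  # 12365391600000000 (~1.2e16)
```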
What can be done about that?
If the query range length is more or less the same on every axis, then you can use the parameter minOverlappingContent of SimpleRegionInspector.create() to instruct the query builder not to look inside hypercubes with a volume smaller than that value; instead, if the query region overlaps such a hypercube at all, the full hypercube is included in the range queries. What this means is that the last log2(minOverlappingContent) bits of the index, i.e., the last log2(minOverlappingContent) / dimensionality bits from each dimension, are simply ignored, and you will get spurious data points (Query.isPotentialOverSelectivity will return true). So you may want to set the parameter to, for example, queryVolume / expectedNumberOfRows * C, where C is a constant. For perfectly uniform data a good value would be C=1.
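As a sketch of the rule of thumb above, here is a hypothetical helper (not part of the library) that computes queryVolume / expectedNumberOfRows * C; the expected row count of 1000 below is just an illustrative assumption:

```python
def suggested_min_overlapping_content(query_volume, expected_rows, c=1):
    """Rule of thumb from the discussion: queryVolume / expectedNumberOfRows * C.

    Clamped to at least 1, which is the library's default value.
    """
    return max(1, query_volume * c // expected_rows)

# E.g. a ~1.2e16-entry query volume with ~1000 expected matching rows and C=1:
print(suggested_min_overlapping_content(12365391600000000, 1000))  # 12365391600000
```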
Note that the parameter minOverlappingContent is unrelated to the parameter maxFilteredRanges of BacktrackingQueryBuilder.create(), which also needs to be set so as to limit the maximum number of range queries that you are willing to perform.
In your example the query range lengths of the 3 axes differ by 10 orders of magnitude, so the surface area of the query region is much larger than necessary. To optimise the index for queries where the time axis range length is 10 orders of magnitude larger than the query range length on the X and Y axes, you will need to drop the lowest math.ceil(math.log((1373932800000 - 137393640000) / 100, 2)) = 34 bits from the time axis when building the index. In your case the remaining time accuracy will be about 199 days, for a query range of 39 years. After this you can use the parameter minOverlappingContent to further speed up query building if necessary, and then limit the number of range queries sent to the database with maxFilteredRanges.
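The bit count and the resulting time accuracy can be checked directly (variable names are mine):

```python
import math

t_len = 1373932800000 - 137393640000  # time axis query range length in ms (~39 years)
xy_len = 100                          # X/Y query range length

# Bits to drop from the time axis so its effective query range length
# matches that of the X and Y axes.
bits_to_drop = math.ceil(math.log(t_len / xy_len, 2))
print(bits_to_drop)  # 34

# Time resolution that remains after dropping those bits, in days.
remaining_accuracy_days = 2**bits_to_drop / (1000 * 60 * 60 * 24)
print(round(remaining_accuracy_days))  # 199
```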
In the end you may want one index for each kind of query shape, in terms of the orders-of-magnitude differences between the query range lengths of the axes. For example, queries of size 1x10x100 and 100x1000x10000 can use the same index, except that if you want the same query building time, minOverlappingContent needs to be about 100^3 times larger in the latter case. This does not mean that all axes must have the same precision (number of bits): it is the range length of the query on each dimension, not the precision of each dimension, that matters.
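Quick arithmetic check of the 100^3 factor, taking the larger query as a uniform 100x per-axis scale-up of the 1x10x100 one, i.e. 100x1000x10000, so that both shapes keep the same 1:10:100 aspect ratio:

```python
small = 1 * 10 * 100        # volume of a 1x10x100 query
large = 100 * 1000 * 10000  # volume of the same shape scaled 100x per axis

# The volume, and hence the suggested minOverlappingContent, grows by 100^3.
print(large // small)  # 1000000
```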
For example, if you limit the time axis of your query to a range of 100 milliseconds, i.e., from 1373932800000 to 1373932800100, then it works just fine, even with the default value minOverlappingContent=1:
filteredIndexRanges.size()=1 selected=BigIntegerContent[value=48693508096]
filteredIndexRanges.size()=10 selected=BigIntegerContent[value=2068352]
filteredIndexRanges.size()=100 selected=BigIntegerContent[value=1178880]
filteredIndexRanges.size()=599 selected=BigIntegerContent[value=1000000]
The idea is that the more regular the query shape is, the smaller its surface area becomes (in fact, what matters is surfaceArea / (minOverlappingContent ^ (number of dimensions - 1))), and so the processing time to build the range queries drops as well.
Moreover, if your data are mostly static, then you can compile an in-memory aggregated view of the data using BoundedRollup and use it to further speed up query building and produce better range queries.
Best,
Daniel
On Friday, 12 July 2013 23:43:34 UTC+2, zakrat wrote: