Intersect query with slice larger than page size

Murray

unread,

May 6, 2012, 5:26:27 PM5/6/12

to rav...@googlegroups.com

When using the Intersect() query, if the paging size is less than the total documents in a slice (one of the clauses joined by the Intersect()), the query may not return any rows (seemingly dependent on position in the index). When no rows are returned, resubmitting the query with RavenQueryStatistics return values to .Skip(pagenum * pagesize + skippedresults) repeatedly also continues to return zero rows (and the skippedresults just continues to increase).

I’ve attached a simple test case with 10000 documents to demonstrate the potential issue.

Using build 923 plus the patch Matt Warren gave (var document = currentReader.Document(doc); //Note don't need to add currentBase to doc) in https://groups.google.com/forum/?fromgroups#!topic/ravendb/dGeFF1Unawk

TestIntersectSliceLargerThanPageSize.txt

Matt Warren

unread,

May 6, 2012, 8:39:50 PM5/6/12

to rav...@googlegroups.com

Thanks for the detailed test case and thanks for finding these bugs in the Intersect() search, it's nice to have someone testing it out.

Your analysis is pretty spot on, it's not handling paged searched properly and it's it linked to the position in the index. If the "set" of docs matching each sub-query are at the start and the end of the index, then we never find them. At the moment it doesn't try enough times to find a "page" of intersecting docs, it in effect gives up after 2 failures.

I'll have to think about a better way of handling it though, because we need an efficient way of doing the intersection. On one hand, we need to be keep going until we're sure we can't get a complete page of intersecting docs. On the other hand we don't want to key trying in the scenario where there are actually less than a page size worth of intersecting docs. I'll try and come up with something in the next couple of days.

Murray

unread,

May 7, 2012, 5:59:42 PM5/7/12

to rav...@googlegroups.com

Thanks Matt.

I think this capability will ultimately be a game changer, with application in ecommerce, BI, target marketing and more.

So, I'm thinking, it seems like in some ways the intersect is similar to what happens in an index on an analyzed string field. Each token must be indexed. If the query is .Where(o => o.stringfield == "foo" && o.stringfield == "bar") then the Lucene query is (stringfield:foo) AND (stringfield:bar). Seems like an intersection has to be done to return that.

Itamar Syn-Hershko

unread,

May 7, 2012, 6:03:19 PM5/7/12

to rav...@googlegroups.com

Murray, not exactly. this happens in a much lower level API and basically is just a matter of comparing bitsets, whereas we have to resort to querying the index twice

Matt Warren

unread,

May 8, 2012, 6:02:07 AM5/8/12

to rav...@googlegroups.com

Yeah, although over the weekend I was actually wondering if we could use bit-sets to do the intersecting queries for us?

I.e. rather than using the IntersectionCollector, instead get the bit set for each sub-query and then combine them? The downside is that ordering becomes harder (or impossible?), but it might be more efficient that the current approach. I'll try running a few tests and post something on this thread when I'm done.

Oren Eini (Ayende Rahien)

unread,

May 8, 2012, 6:03:09 AM5/8/12

to rav...@googlegroups.com

The problem is that you can't just use bit sets, you have to use document ids.

Itamar Syn-Hershko

unread,

May 8, 2012, 6:06:10 AM5/8/12

to rav...@googlegroups.com

It will require a lot of low level dev to implement sorting etc, and as long as you intersect a few queries it is efficient enough this way

Might be a nice Lucene contrib tho :)

On Tue, May 8, 2012 at 1:02 PM, Matt Warren <matt...@gmail.com> wrote:

Matt Warren

unread,

May 8, 2012, 6:19:54 AM5/8/12

to rav...@googlegroups.com

Yeah that is an issue, with regards to the whole LuceneID v. Raven DocID issue, does anything in this post help? See http://invertedindex.blogspot.co.uk/2009/04/lucene-dociduid-mapping-and-payload.html.

From how I read it, is seems that there's a more efficient way of maintaining the mappings between the 2 types of Ids. So you populate a "lookup" array when you open the IndexReader, instead of having to load the lucene doc and pull out the "__document_id" field each time. Although it seems that there is some complexity around when you do this, to make sure it doesn't get out of sync.

Oren Eini (Ayende Rahien)

unread,

May 8, 2012, 6:44:16 AM5/8/12

to rav...@googlegroups.com

Interesting, and certainly possible, although that would mean it would only work for new indexes.

I am not sure how valid that would be, we would have to get the bitset for each query, map this to the doc id, then do the interesection.

Matt Warren

unread,

May 8, 2012, 9:44:44 AM5/8/12

to rav...@googlegroups.com

Murray,

I've just made some modifications that fix this issue, are you able to try it out on your side as well?

If you're happy to modify your copy of the source again, you can get the changes from here https://github.com/mattwarren/ravendb/blob/master/Raven.Database/Indexing/Index.cs#L688. All the fixes you need are in the IntersectionQuery() function, so you can just copy the entire function over the top of your local copy.

I've sent a pull request for this issue and your other one, so if they're approved they be in the next unstable build.

Itamar Syn-Hershko

unread,

May 8, 2012, 10:31:21 AM5/8/12

to rav...@googlegroups.com

You will have to refresh that lookup table on IndexSearcher.Reopen() - and that can be quite costly since you have one reader + one table PER INDEX

And what happens when you have multiple records per __document_id (e.g. SelectMany scenarios) - I think the added complexity of those scenarios calls for a better method than arrays within arrays?

Murray

unread,

May 8, 2012, 10:57:16 PM5/8/12

to rav...@googlegroups.com

Matt,

I'll pull it down and try it out now. Thanks.

Murray

unread,

May 8, 2012, 11:24:35 PM5/8/12

to rav...@googlegroups.com

OK that works, but I'm not sure about the RavenQueryStatistics returned. The stats seem to indicate a larger number of documents than are actually there. Can the result be paged?

Murray

unread,

May 8, 2012, 11:36:31 PM5/8/12

to rav...@googlegroups.com

Um, I guess it would help if I read the comments in the code.

This will definitely work for the applications I have in mind.

Thanks!

Matt Warren

unread,

May 9, 2012, 3:10:43 AM5/9/12

to rav...@googlegroups.com

Glad it works for you as well, let me know if you have any more issues.

With regards to the query stats, are you talking about the comments here https://github.com/mattwarren/ravendb/blob/master/Raven.Database/Indexing/Index.cs#L745? If so, the issue is that we have no cheap/easy way of getting the total hits for an intersection search without collecting all the docs in the index. So we don't bother, although when I get round to writing the docs, I'll make sure that it's documented. The idea is that we collect the minumum amount of docs needed to satisfy the requested page size.

BTW the code is now in the latest build #926, see https://groups.google.com/forum/?fromgroups#!topic/ravendb/8zhlCk1gDKU

Murray

unread,

May 9, 2012, 8:45:18 AM5/9/12

to rav...@googlegroups.com

Great, we will update the development code today.

Thanks again.

Reply all

Reply to author

Forward