Gimd and Lucene

10 views
Skip to first unread message

Grzegorz Kossakowski

unread,
Dec 24, 2009, 8:39:36 AM12/24/09
to gimd-d...@googlegroups.com
Hi Shawn,

I remember you saying that you have done some research on how to use
Lucene and you had some code prototypes showing API fragments usage
that Gimd is likely to use. Could you share it somehow?

I remember you saying that Lucene is mostly able to quickly return
Ranges. This is fine as seeking for ranges covers many cases of Gimd
queries. I only wonder if Lucene is capable of performing union and
intersection operations (`or` and `and` operations) on set of ranges.
I'm mentioning set of ranges because union and intersection are closed
operations on set of ranges.

My idea would be that query evaluation consists of three steps of filtering:
1. First of all we extract as many fragments of AST as possible and
pass them to Lucene. As result we get set of entries which is superset
of what query should return.
2. That superset is further filtered by applying rest of AST fragments
that are not Predicate instances.
3. The remaining step would be apply all Predicate instances which
involves evaluation of some arbitrary function.

The second step might not be necessary if it turns out that all AST
fragments that are not Predicate can be translated to Lucene queries.
Of course, above description is simplified because ordinary operator
and Predicates can be freely mixed.

Another observation one could make is that NOT operation is
troublesome when it comes to query optimizations. The problem with not
is that it corresponds to operation of taking complement set. This
operation is non-local in a sense that in order to perform it one has
to consider all elements of the universe which is all elements stored
in Gimd in our case. The only strategy one can deal with it is trying
to eliminate NOT by rewriting query using De Morgan's laws and similar
laws. If it's not possible to eliminate all NOTs in a query we are
essentially screwed and we have to fall-back to full db scan, right?

I haven't considered if it's possible to come up with such AST that
cannot be rewritten to be free of NOT operations. As soon as we have
all nodes for AST defined I'll try to come up with some kind of proof.

That's it for now. I look forward for some hints how to use Lucene and
to what extend we can use it to speed up queries evaluation.

--
Best regards and happy Christmas!
Grzegorz Kossakowski

Shawn Pearce

unread,
Dec 24, 2009, 3:59:49 PM12/24/09
to gimd-d...@googlegroups.com
Grzegorz Kossakowski <grzegorz.k...@gmail.com> wrote:
> I remember you saying that you have done some research on how to use
> Lucene and you had some code prototypes showing API fragments usage
> that Gimd is likely to use. Could you share it somehow?

Yup, I did. I dusted it off a little by adding a simple Maven
build and posted it here:

http://git.spearce.org/?p=lucene-jgit.git;a=summary

Its not something I would want to use for production use, we should
rewrite it completely, but its at least a pointer into how to use
Lucene and JGit together a bit.



> I remember you saying that Lucene is mostly able to quickly return
> Ranges. This is fine as seeking for ranges covers many cases of Gimd
> queries. I only wonder if Lucene is capable of performing union and
> intersection operations (`or` and `and` operations) on set of ranges.
> I'm mentioning set of ranges because union and intersection are closed
> operations on set of ranges.

Yes, it also does union/intersection quite well. Intersection
especially as this is necessary to get efficient full text results
(take all documents with "jgit", and that set against the set of
all documents with "lucene" to answer "jgit lucene").



> My idea would be that query evaluation consists of three steps of filtering:
> 1. First of all we extract as many fragments of AST as possible and
> pass them to Lucene. As result we get set of entries which is superset
> of what query should return.

Yes.

> 2. That superset is further filtered by applying rest of AST fragments
> that are not Predicate instances.

Yes.

> 3. The remaining step would be apply all Predicate instances which
> involves evaluation of some arbitrary function.

Yes.

> The second step might not be necessary if it turns out that all AST
> fragments that are not Predicate can be translated to Lucene queries.
> Of course, above description is simplified because ordinary operator
> and Predicates can be freely mixed.

Yup.

> Another observation one could make is that NOT operation is
> troublesome when it comes to query optimizations. The problem with not
> is that it corresponds to operation of taking complement set. This
> operation is non-local in a sense that in order to perform it one has
> to consider all elements of the universe which is all elements stored
> in Gimd in our case. The only strategy one can deal with it is trying
> to eliminate NOT by rewriting query using De Morgan's laws and similar
> laws. If it's not possible to eliminate all NOTs in a query we are
> essentially screwed and we have to fall-back to full db scan, right?

Correct.

My answer here is, Gerrit Code Review does not currently use
NOT operators in its query processing. Therefore we can simply
punt in gimd. If a query contains NOT, don't bother with Lucene
optimization, just table scan the database.

Well behaved applications that want efficient query processing
should avoid NOT, for exactly the reasons you mention above.

Only ad-hoc, interactive shell operations should expect to be using
NOT, and in many cases, these queries wouldn't be able to be answered
by an index in Lucene so will have to table scan. Since they are
ad-hoc and not part of the normal application transaction processing,
we can make the user wait longer to get an answer.

> I haven't considered if it's possible to come up with such AST that
> cannot be rewritten to be free of NOT operations. As soon as we have
> all nodes for AST defined I'll try to come up with some kind of proof.

Its difficult. If you know all possible values for the fields
which are being negated, you can clearly rewrite them in terms of
the allowed values.

For a boolean field, if the predicate is "NOT (a === false)" then
clearly you can rewrite it as "a === true".

For an enumeration field that has 3 values, if the predicate is
"a != ONE" than you can rewrite it as "a === TWO || a === THREE".

But if the field is free-form, or the range of values cannot
be efficiently enumerated, NOT can't be rewritten in terms of a
non-NOT operator.

Grzegorz Kossakowski

unread,
Apr 5, 2010, 1:40:15 PM4/5/10
to gimd-d...@googlegroups.com
(reposting my response as there seems to be some problem with Google groups)

On Dec 24 2009, 10:59 pm, Shawn Pearce <s...@google.com> wrote:


> Grzegorz Kossakowski <grzegorz.kossakow...@gmail.com> wrote:
> > I remember you saying that you have done some research on how to use
> > Lucene and you had some code prototypes showing API fragments usage
> > that Gimd is likely to use. Could you share it somehow?
>
> Yup, I did. I dusted it off a little by adding a simple Maven
> build and posted it here:
>
> http://git.spearce.org/?p=lucene-jgit.git;a=summary
>
> Its not something I would want to use for production use, we should
> rewrite it completely, but its at least a pointer into how to use
> Lucene and JGit together a bit.

Thanks. I had a quick look into your code.

AFAIU, that code takes every file found in a given commit and stores
it's whole content as one field processed by Lucene's analyzer.

So we have 1-1 relation between Lucene's Document and file stored in a
given commit from JGit.

I guess this approach cannot be used in Gimd as we want to query on
single field (in Gimd's sense) of particular UserType. It looks like a
single document should correspond to one instance of data defined by
UserType. Thus we would have relation 1-to many between files stored
in Gimd and Documents stored in Lucene. Single Document would store:
* filePath where the contents that Document represents comes from
* UserType identifier (possibly just class name)
* list of fields defined by UserType

This way it would be possible to use Lucene to search for all files
that store data that possibly will satisfy query conditions.

We need to store UserType identifier because otherwise we could get
lots of false hits from Lucene for some field that uses some common
name (like "name").

How does it sound?

Agreed with all you said.

--
Best regards,
Grzegorz Kossakowski

Reply all
Reply to author
Forward
0 new messages