"Unstable" scores

Achim Domma

Oct 29, 2014, 6:43:28 PM
to helio...@googlegroups.com
I have some strange behavior in our app, which is a pain to debug, but I think I have nailed it down to the following problem:

I have a fixed query that includes the following filter:

fq: {!frange l=0.1 }query($q)

I'm aware that filtering on a score like this is considered a bad idea, but it's a reasonable heuristic which has worked fine so far. I understand that the score might change depending on the texts indexed in the core. But when I update just some metadata on some documents and do a commit, I get a different number of documents back afterwards, without changing any fields related to the query!

As far as I can reproduce it, the result depends on the first queries executed after the commit. Sometimes I get, for example, 11333 documents (out of about 640k) and then 12004. I was able to see that the score for some docs varies in its last digits, so the difference might be caused by scores moving across the 0.1 threshold.

But could somebody explain to me, why the score would change after a commit, even if the related fields have not changed?

kind regards,
Achim

Yonik Seeley

Oct 29, 2014, 7:51:44 PM
to helio...@googlegroups.com
> But could somebody explain to me, why the score would change after a commit,
> even if the related fields have not changed?

This is a Lucene issue (not a bug... a design trade-off).

1) When you delete a document, it is only marked as deleted so that it
is not returned with search results. It is really deleted when the
segment it is in undergoes a merge.

2) The inverted index is completely unaffected by deleted documents.
It would be extraordinarily expensive to reflect deletions in these
structures.

3) Part of full-text scoring includes "idf" (inverse document
frequency)... this depends on both the number of documents in the
index and the number of documents containing the term. The latter is
part of the inverted index structure and does not reflect deletions.

Example: I have an index with 1M documents. The term "text:foobar"
occurs in 500 documents. The IDF function is passed (500,1000000).

Now I update (re-index) one of the documents, changing one of the
other fields. The overwritten document hasn't been merged away yet,
and its statistics still appear in the inverted index. Now the
number of documents containing the term "text:foobar" is 501, and the
result of the IDF calculation has changed.
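
To put a rough number on the example above, here is a minimal sketch. It
assumes Lucene's classic TF-IDF similarity, where idf is
1 + ln(numDocs / (docFreq + 1)); other similarities use different
formulas. The function name classic_idf is made up for illustration, and
the document count is held at 1000000 as in the example.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """idf as defined by Lucene's classic TF-IDF similarity (assumed here)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Before the update: 500 documents contain the term.
before = classic_idf(500, 1_000_000)
# After the update: the overwritten copy is still counted, so 501.
after = classic_idf(501, 1_000_000)

print(f"idf before update: {before:.6f}")
print(f"idf after update:  {after:.6f}")
print(f"difference:        {before - after:.6f}")
```

The shift is tiny, but it feeds into every score for that term, which is
enough to move documents sitting near a hard cutoff such as l=0.1 to the
other side of it.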

Fixes: really remove the deleted documents, which will cause their
inverted index statistics to be updated:
- call optimize... this merges all segments and hence removes all
deleted docs. This is extremely expensive as it re-writes the entire
index.
- use "expungeDeletes=true" on your commit command. This is also very
expensive, rewriting the complete segment if it has even a single
deletion.
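
As a rough illustration of the two options, the sketch below only builds
the request URLs for Solr's update handler and does not send anything.
The host, port, and core name ("mycore") are placeholders, and update_url
is a hypothetical helper; commit, expungeDeletes, and optimize are the
request parameters mentioned above.

```python
from urllib.parse import urlencode

# Placeholder location of the core; adjust to your own installation.
SOLR_CORE_URL = "http://localhost:8983/solr/mycore"


def update_url(base: str, **params: str) -> str:
    """Build an update-handler URL carrying the given request parameters."""
    return f"{base}/update?{urlencode(params)}"


# Commit and rewrite the segments that contain deletions.
expunge_url = update_url(SOLR_CORE_URL, commit="true", expungeDeletes="true")
# Merge all segments, which removes every deleted document.
optimize_url = update_url(SOLR_CORE_URL, optimize="true")

print(expunge_url)
print(optimize_url)
```

Both requests are expensive, so they are better suited to an occasional
maintenance step than to every commit.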

-Yonik