Custom scoring or extract TF-IDF values

Mikkel Denker

Oct 19, 2020, 9:44:36 AM
to bleve
Hi,

First of all thanks for all the work that has been put into this library. It looks great.

I see that there have been talks about supporting custom scoring functions for documents for some time, and was wondering if this has been implemented. I can't seem to find it in the documentation, but I might have missed it. If not, do you reckon it would require a huge refactor of the scorer/searcher package if I tried to implement it and put in a pull request?

It would also be useful to be able to get the term frequencies and IDF values from the search result for further re-ranking of the documents, but I don't seem to be able to get the values from the search request even though I set the IncludeTermVectors to true. Is this correct or am I missing something?

Best regards.

Marty Schoch

Oct 19, 2020, 9:59:54 AM
to bl...@googlegroups.com
No,

Currently Bleve does not support custom scoring.  And you are correct, it will take a significant change to do it correctly.  I would caution against creating a big PR and hoping it gets merged.  As a breaking change, it will only get merged as we prepare a new major release, and it's something many people will want to discuss the API and implementation of.

I have experimented with custom scoring in the Bluge library I'm also working on.  (I mention this not to plug Bluge, but because there you can see some of my thinking about how customizable scoring should work).

There are 3 primary interfaces:

type Similarity interface {
   ComputeNorm(numTerms int) float32
   Scorer(boost float64, collectionStats segment.CollectionStats, termStats segment.TermStats) Scorer
}

type Scorer interface {
   Score(freq int, norm float64) float64
   Explain(freq int, norm float64) *Explanation
}

type CompositeScorer interface {
   ScoreComposite(constituents []*DocumentMatch) float64
   ExplainComposite(constituents []*DocumentMatch) *Explanation
}

The first is Similarity, which goes beyond just scoring at query time; it also provides a function to compute norm values (which are recorded at indexing time).  In Bluge, our BM25 implementation satisfies this Similarity interface.  Note that Similarity also returns a Scorer, which is the second interface.  The Scorer interface is responsible for "leaf queries", meaning queries that produce results directly from finding a document using a term in the index.  Score is used if you only need the score, Explain is used if you need the textual description as well.  You see that the frequency is passed in to these functions.  Now, you might think that the IDF and other scoring details are missing, but they are actually passed in through the constructor, as that part does not change when scoring each match.
Finally, there is a composite scorer.  These scorers are used when we need to combine scores from matches in multiple query clauses (boolean, and/or, etc).

So, in Bluge these interfaces describe how one can compute scores.  The final piece is that when building a query, you can customize the scorer/composite-scorer used.  This gives you full control of how each query clause produces its score.

This by itself is a big step forward, but still doesn't yet allow for plugging in some other behavior.  For example, in Bluge a Fuzzy query now weights its matches by the edit distance required to match the term (using the same formula as Lucene).  But, ideally this functionality should be configurable as well, so there are probably yet more interfaces to think through.
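
As a rough illustration of edit-distance weighting, the sketch below assumes the weight is 1 minus the edit distance divided by the shorter of the two term lengths. That formula is an assumption about what "the same formula as Lucene" refers to, not something quoted from the Bluge source, and fuzzyWeight is a hypothetical helper name.

```go
package main

import "fmt"

// fuzzyWeight is a hypothetical helper. It assumes the weight is
// 1 - editDistance / min(queryLen, matchLen), with lengths counted in
// characters. An exact match (distance 0) keeps its full weight of 1.
func fuzzyWeight(editDistance, queryLen, matchLen int) float64 {
	minLen := queryLen
	if matchLen < minLen {
		minLen = matchLen
	}
	if minLen == 0 {
		return 0 // avoid dividing by zero for empty terms
	}
	return 1.0 - float64(editDistance)/float64(minLen)
}

func main() {
	// Query term of 4 characters against matches at distance 0, 1 and 2.
	fmt.Println(fuzzyWeight(0, 4, 4), fuzzyWeight(1, 4, 5), fuzzyWeight(2, 4, 4))
}
```

A configurable version would let the caller supply a function with this signature instead of hard-coding the formula.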


marty
