Text search score formula


Przemysław Kalita

May 26, 2014, 1:50:57 PM
to mongod...@googlegroups.com
Hello,

Could someone provide me the text search score formula?

Thanks

Asya Kamsky

May 28, 2014, 8:57:16 PM
to mongodb-user
I don't think it's documented anywhere, so I guess the remaining alternative is to check the source code.  All text search code is here:


Asya




onego...@gmail.com

Jun 13, 2014, 8:55:54 PM
to mongod...@googlegroups.com
Could someone shed some light on the ScoreHelperStruct from fts_spec.cpp? I'm sensing a sum of a geometric series, but I don't understand how that contributes to the scoring.

William Berkeley

Jun 26, 2014, 2:58:35 PM
to mongod...@googlegroups.com
Looking at the code in the master branch (https://github.com/mongodb/mongo/blob/master/src/mongo/db/fts/fts_spec.h), ScoreHelperStruct is defined on line 55. It's a per-term struct storing the count, freq (which is more like a score than a frequency), and a factor exp that is used to weight the frequencies when calculating the score of the string.

This struct is used in _scoreStringV2 at line 174 in fts_spec.cpp (https://github.com/mongodb/mongo/blob/master/src/mongo/db/fts/fts_spec.cpp), where one is created for each text token in the input string. The struct stores a count of the number of times the term appears (data.count) and a computed value freq. The freq is a count damped by data.exp in the following manner:

    start with exp = 0
    each time the term occurs:
        if exp = 0, set exp = 1, else set exp = 2 * exp
        increment the frequency by 1/exp

So, in fact, you are right that there is a sum of a geometric series here. If a term occurs k times, then the freq of the term (which is more like a score than a frequency, but it's called freq in the struct) will be
1 + 1/2 + ... + (1/2)^(k - 1) = (1 - (1/2)^k)/(1 - 1/2) = 2(1 - 1/2^k)
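As a minimal sketch of the steps above (not the MongoDB source), the accumulation and its closed form can be written as follows. The names TermStats, recordOccurrence, freqByLoop and freqClosedForm are hypothetical; only count, freq and exp come from the struct described in this thread.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the per-term struct described above. The field
// names follow the thread, but this is not the MongoDB source.
struct TermStats {
    int count = 0;      // number of occurrences seen so far
    double freq = 0.0;  // damped, score-like "frequency"
    double exp = 0.0;   // divisor: 0 before the first occurrence, then 1, 2, 4, ...
};

// Record one more occurrence of the term, following the steps listed above.
inline void recordOccurrence(TermStats& data) {
    data.count += 1;
    data.exp = (data.exp == 0.0) ? 1.0 : 2.0 * data.exp;
    data.freq += 1.0 / data.exp;
}

// freq after k occurrences, computed by replaying the loop k times.
inline double freqByLoop(int k) {
    assert(k >= 0);
    TermStats data;
    for (int i = 0; i < k; ++i) {
        recordOccurrence(data);
    }
    return data.freq;
}

// Closed form of the same geometric series: 2 * (1 - 1/2^k).
inline double freqClosedForm(int k) {
    assert(k >= 0);
    return 2.0 * (1.0 - std::pow(0.5, k));
}
```

Under these assumptions, one occurrence gives a freq of 1, two give 1.5, three give 1.75, and the value stays below 2 no matter how often the term repeats.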

This has the effect of treating subsequent occurrences of the same term in the string as less important, with freq approaching a maximum of 2 as the number of occurrences of the term in the string gets large. You can see how this affects the score of the entire string at line 228, where the contribution of the term to the score of the entire string is weighted by the term's freq.
Please follow up if you have any further questions about scoring in full text search.

-Will

Paweł Poręba

Apr 28, 2016, 8:52:30 AM
to mongodb-user, william....@mongodb.com
I'm not sure that's the whole story. Looking at the example from the tutorial, we get two groups of results: one with a score of 0.75 (one result) and a second with scores of 1 (two results). First of all, the searched term occurs exactly once in every document, so following your equation they should all get a score of 1. Yet they don't, so it seems there are other factors involved. What's more, your equation can never produce 0.75 for any integer k.

Stephen Steneker

Apr 28, 2016, 11:00:31 AM
to mongodb-user

Hi Paweł,

In future, please start a new discussion that includes the specific MongoDB server version you are using, rather than resurrecting an old post. Implementation details and features are subject to change, so a two-year-old discussion may not apply to your version of MongoDB or be relevant to your question. A post specific to your environment and question will help us help you, and you can reference prior discussions by including the links in your new post :).

The description Will posted earlier in this thread focused on how repeated occurrences of a term are scored.

There are indeed some nuances which will help you understand scoring for the examples you cited. 

As of MongoDB 3.2:

  • The word count does not include stopwords. Stopwords are extremely common words that are filtered out prior to tokenizing as they generally aren’t relevant for matching.

  • There is a weight coefficient which adjusts the score based on the matching term versus the total field length.

  • There is a small boost applied if a term is identical to the raw (untokenized) form of the field.

Prior versions of MongoDB should be similar, but you’d have to check the code to be sure. Aside from the above points, other changes are likely between releases. For example, MongoDB 3.2 added a number of text search enhancements including better handling of Unicode diacritics and tokenization.

The example you cited in the documentation shows sample output but doesn’t go into depth on how the score is calculated. The current sample documents happen to have stopwords in every title field (i.e. “and”, “more”, and “some”), and could be better crafted/explained. The scoring doesn’t match your expectations because of the first two nuances I mentioned above (stopwords and weight coefficient).

The C++ code snippet for the weight coefficient in MongoDB 3.2 is:

    // in order to adjust weights as a function of term count as it
    // relates to total field length. ie. is this the only word or
    // a frequently occuring term? or does it only show up once in
    // a long block of text?

    double coeff = (0.5 * data.count / numTokens) + 0.5;
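To get a feel for the range of this coefficient, here is a small sketch that wraps the same expression in a function. The function name weightCoefficient and its parameter names are hypothetical; numTokens is taken to be the token count after stopword removal, as described in the points above.

```cpp
#include <cassert>

// Same expression as the quoted snippet, wrapped in a hypothetical helper.
// `count` is how many times the term occurs in the field and `numTokens`
// is the number of tokens left after stopword removal.
inline double weightCoefficient(double count, double numTokens) {
    assert(numTokens > 0 && count >= 0 && count <= numTokens);
    return (0.5 * count / numTokens) + 0.5;
}
```

A term that is the only token in the field gets a coefficient of 1.0, a term that is one of two tokens gets 0.75, and a term that shows up once in a long block of text gets a value just above 0.5.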

So, revisiting the titles in the $meta score example you cited and working out how those documents would be indexed and scored against a text search for “cake”:

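raw title     | stopwords removed | stemmed   | data.count (matches) | numTokens | coeff             | score
--------------+-------------------+-----------+----------------------+-----------+-------------------+------
cakes and ale | cakes, ale        | cake, ale | 1                    | 2         | (0.5 * 1/2) + 0.5 | 0.75
more cakes    | cakes             | cake      | 1                    | 1         | (0.5 * 1/1) + 0.5 | 1.00
some cakes    | cakes             | cake      | 1                    | 1         | (0.5 * 1/1) + 0.5 | 1.00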
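The rows above can be reproduced with a rough sketch like the one below. It is a toy model, not the server's implementation: the stopword list, the "strip a trailing s" stemmer, and the helper names (isStopword, toyStem, tokenize, scoreTitle) are all assumptions chosen to fit these three titles, and the score is assumed to be freq * coeff with a field weight of 1 and no exact-match boost.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Assumed stopword list: just the three stopwords that occur in the sample titles.
inline bool isStopword(const std::string& word) {
    static const std::set<std::string> stopwords = {"and", "more", "some"};
    return stopwords.count(word) > 0;
}

// Toy stemmer (an assumption): strips one trailing "s", which is enough
// to turn "cakes" into "cake" and leave "ale" untouched.
inline std::string toyStem(const std::string& word) {
    if (word.size() > 1 && word.back() == 's') {
        return word.substr(0, word.size() - 1);
    }
    return word;
}

// Split on whitespace, drop stopwords, stem what remains.
// Input is assumed to be lowercase already.
inline std::vector<std::string> tokenize(const std::string& title) {
    std::vector<std::string> tokens;
    std::istringstream in(title);
    std::string word;
    while (in >> word) {
        if (!isStopword(word)) {
            tokens.push_back(toyStem(word));
        }
    }
    return tokens;
}

// Score of one already-stemmed term against one title, assuming
// score = freq * coeff (field weight 1, no exact-match boost).
inline double scoreTitle(const std::string& title, const std::string& term) {
    assert(!term.empty());
    const std::vector<std::string> tokens = tokenize(title);
    if (tokens.empty()) {
        return 0.0;
    }
    double count = 0.0, freq = 0.0, exp = 0.0;
    for (const std::string& token : tokens) {
        if (token != term) {
            continue;
        }
        count += 1.0;
        exp = (exp == 0.0) ? 1.0 : 2.0 * exp;  // divisor doubles per occurrence
        freq += 1.0 / exp;
    }
    const double coeff = (0.5 * count / tokens.size()) + 0.5;
    return freq * coeff;
}
```

With these assumptions, scoreTitle("cakes and ale", "cake") evaluates to 0.75 and both "more cakes" and "some cakes" evaluate to 1.0, matching the score column.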

This happens to be a fairly straightforward example, but the overall calculated score is also influenced by other factors such as: options set for the text index & the query (e.g. language for stemming), whether a term matches multiple times in a single field, whether a term matches across multiple fields in a text index, whether field weights are used to influence scoring, etc.

I’ll raise a DOCS ticket in the MongoDB Jira suggesting some clearer example documents and further details for text score calculation.

Regards,
Stephen
