--
You received this message because you are subscribed to the Google Groups "mongodb-user"
group.
For other MongoDB technical support options, see: http://www.mongodb.org/about/support/.
On Thursday, 28 April 2016 22:52:30 UTC+10, Paweł Poręba wrote:
I'm not sure that's the whole story. Looking at the example from the tutorial, we get two groups of results: one with a score of 0.75 (1 occurrence) and a second with scores of 1 (2 occurrences). First of all, the searched term occurs exactly once in every document, so following your equation they should all get a score of 1. Yet they don't. It seems there are other factors involved. What's more, your equation will never yield a result of 0.7 for any integer k.
Hi Paweł,
In future, please start a new discussion that includes details of the specific MongoDB server version you are using, rather than resurrecting an old post. Implementation details and features are subject to change, so a two-year-old discussion may not apply to your version of MongoDB or be relevant to your question. A post specific to your environment and question will help us help you, and you can reference prior discussions by including the links in your new post :).
The description Will posted earlier in this thread focused on scoring for multiple terms.
There are some additional nuances that explain the scores in the example you cited.
As of MongoDB 3.2:
- The word count does not include stopwords. Stopwords are extremely common words that are filtered out prior to tokenizing as they generally aren’t relevant for matching.
- There is a weight coefficient which adjusts the score based on how often the matching term occurs relative to the total number of tokens in the field.
- There is a small boost applied if a term is identical to the raw (untokenized) form of the field.
Prior versions of MongoDB should be similar, but you’d have to check the code to be sure. Aside from the above points, other changes are likely between releases. For example, MongoDB 3.2 added a number of text search enhancements including better handling of Unicode diacritics and tokenization.
The example you cited in the documentation shows sample output but doesn’t go into depth on how the score is calculated. The current sample documents happen to have stopwords in every title field (i.e. “and”, “more”, and “some”), and could be better crafted/explained. The scoring doesn’t match your expectations because of the first two nuances I mentioned above (stopwords and weight coefficient).
The C++ code snippet for the weight coefficient in MongoDB 3.2 is:
// in order to adjust weights as a function of term count as it
// relates to total field length. ie. is this the only word or
// a frequently occuring term? or does it only show up once in
// a long block of text?
double coeff = (0.5 * data.count / numTokens) + 0.5;
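To get a feel for how that coefficient behaves, here is a minimal Python sketch of the same arithmetic. The function name `weight_coefficient` and its parameter names are my own for illustration; only the formula comes from the C++ line above.

```python
def weight_coefficient(count: int, num_tokens: int) -> float:
    """Same arithmetic as the C++ line: (0.5 * data.count / numTokens) + 0.5.

    count      -- how many times the term matched in the field
    num_tokens -- number of tokens in the field after stopword removal
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return (0.5 * count / num_tokens) + 0.5


# Term is the only token in the field: coefficient is at its maximum.
print(weight_coefficient(1, 1))    # 1.0

# Term is one of two tokens.
print(weight_coefficient(1, 2))    # 0.75

# Term shows up once in a long block of text: just above the 0.5 floor.
print(weight_coefficient(1, 100))
```

So for a term that matches once, the coefficient ranges from just above 0.5 (long field) up to 1.0 (the term is the whole field).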
So, revisiting the titles in the $meta score example you cited and working out how those documents would be indexed and scored against a text search for “cake”:
raw title | stopwords removed | stemmed | data.count (matches) | numTokens | coeff | score |
---|---|---|---|---|---|---|
cakes and ale | cakes, ale | cake, ale | 1 | 2 | (0.5 * 1/2) + 0.5 | 0.75 |
more cakes | cakes | cake | 1 | 1 | (0.5 * 1/1) + 0.5 | 1.00 |
some cakes | cakes | cake | 1 | 1 | (0.5 * 1/1) + 0.5 | 1.00 |
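The columns in the table can be reproduced with a rough Python sketch. This is not MongoDB's implementation: the stopword set and `toy_stem` are hypothetical stand-ins that only cover the words in this example (the server uses per-language stopword lists and a proper stemmer), and `score_single_term` returns only the weight coefficient, which equals the score here because each title matches once in a single field with the default weight.

```python
# Assumption: a tiny stopword set covering only the words in this example.
STOPWORDS = {"and", "more", "some"}


def toy_stem(token: str) -> str:
    """Hypothetical stemmer: strips a trailing 's' (enough for cakes -> cake)."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def score_single_term(title: str, term: str) -> float:
    """Weight coefficient for one search term against one title field."""
    # Drop stopwords, then stem what is left.
    tokens = [toy_stem(t) for t in title.lower().split() if t not in STOPWORDS]
    count = tokens.count(toy_stem(term.lower()))
    if count == 0:
        return 0.0  # no match, so nothing to score
    return (0.5 * count / len(tokens)) + 0.5


for title in ("cakes and ale", "more cakes", "some cakes"):
    print(f"{title!r}: {score_single_term(title, 'cake')}")
# 'cakes and ale': 0.75
# 'more cakes': 1.0
# 'some cakes': 1.0
```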
This happens to be a fairly straightforward example, but the overall calculated score is also influenced by other factors such as: options set for the text index & the query (e.g. language for stemming), whether a term matches multiple times in a single field, whether a term matches across multiple fields in a text index, whether field weights are used to influence scoring, etc.
I’ll raise a DOCS ticket in the MongoDB Jira suggesting some clearer example documents and further details for text score calculation.
Regards,
Stephen