Boost on Numbers


Op Sharma

Aug 28, 2021, 1:06:37 PM
to bleve
Hello,

Could you please help me figure out an issue I am struggling with?

I have a field whose value is a number, and I want to order the search results based on that number. The values range from 1 to 10, and I want a custom order: documents with the value 3 should rank first, then those with 6, followed by 2, 4, 7, and so on.

But when I run <field>:3^100 <field>:6^90 <field>:2^80 <field>:4^70

it does not give me results in that order. Does boosting work with numbers like this?

Also, is there any other way to solve this problem? Please suggest; I look forward to hearing from you.

Thanks

Marty Schoch

Aug 28, 2021, 1:20:34 PM
to bl...@googlegroups.com
Unfortunately I don't think this use case is well supported right now.  I'm aware of at least 2 problems that relate to it:

1.  Document matches found during a numeric range search are scored using TF/IDF scoring.  This means that the original score (prior to any boosting) is already some arbitrary meaningless number (the tf/idf of whichever specially encoded numeric term was found).  This means that even if boosting worked correctly, you're multiplying the boost factor against some meaningless number, so you're unlikely to be happy with the results.

2.  Your query is of the form "A or B or C or D", and again our scoring algorithm defaults to behavior that makes sense for a different use case.  Specifically, each of the clauses has its score "normalized".  This makes sense for some types of text search, but again does not make sense here.  The effect is that much of the boosting ends up being "undone" by the normalization.

Switching the numeric queries to have a constant score would make sense, but I haven't reviewed the code to see how much work is involved.  Fixing the normalization is probably off the table for now, at least until the team is ready to upgrade to BM25, where the Lucene implementation has removed the normalization factor.

NOTE:  If you really only need to index the numbers 1-10, you should not index them as numbers or use the numeric range query.  Instead you should index the values as text terms, and use regular term matching.  This won't fix your specific scoring issue, but will ultimately perform significantly better (smaller index and faster searches).

marty


Op Sharma

Aug 28, 2021, 2:18:45 PM
to bleve
Thanks, Marty, for the quick response and for explaining the issue in detail; very helpful.

Sorry, I am not able to follow your suggestion completely. Could you please explain it a bit more?

In my example, I have multiple queues in the system (each with a fixed maximum size of 10 entries), and I want to place the next entry. I want to look first for a queue that has 3 entries (the defined priority); if none is found, I want to place the entry in a queue that has 6 entries, in preference to queues that have 2, 4, 5 entries.

As per your suggestion, I should use text terms instead of numbers; to simplify, I can use terms such as "one", "two", "three" for the current size of each queue. Then I would do an exact term match with
+<field>:3 and get results; if nothing is found, I would make another call with +<field>:6, and so on until I get a result? So rather than calling once, in the worst case I would be calling 10 times?

Apologies if I misunderstood. 

Thanks 

Marty Schoch

Aug 28, 2021, 3:24:30 PM
to bl...@googlegroups.com
Sorry for adding to the confusion.

My recommendation to use text fields instead of numeric for your use case has to do with the internal implementation.  The numeric field type is designed to support numeric range queries in a generic way (not knowing anything about the cardinality of your data).  To keep searches performant even when the range is large and matches lots of documents, it indexes multiple terms for each value.  I won't explain the whole process, because it's not really necessary to understand the problem.  In short, if you index the numeric value 1.0, we generate 16 terms for the index.  If instead you index the text "1" or "one", we index just 1 term.
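
For readers wondering where a figure like 16 could come from, here is a small arithmetic sketch. It assumes (this is not stated in the thread) that each numeric value is held in 64 bits and that one term is emitted per 4-bit shift of that value, as in typical prefix-coded numeric indexing. The function name and parameters are made up for illustration.

```go
package main

import "fmt"

// termsPerValue counts how many index terms a single numeric value would
// produce if one term is emitted for every shift from 0 up to (but not
// including) valueBits, stepping by precisionStep bits.
// Both parameters are assumptions for illustration, not values confirmed
// by the thread.
func termsPerValue(valueBits, precisionStep uint) int {
	if precisionStep == 0 {
		return 0 // avoid an endless loop on a meaningless step size
	}
	count := 0
	for shift := uint(0); shift < valueBits; shift += precisionStep {
		count++
	}
	return count
}

func main() {
	// Assumed: 64-bit values, 4-bit precision step.
	fmt.Println(termsPerValue(64, 4)) // prints 16
}
```

Under those assumptions a numeric field costs 16 terms per value, against a single term for a text value such as "1".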

On the query side, in both cases you issue just a single query.  Either NumericRangeQuery from 1-10 or a DisjunctionQuery containing 10 term queries, one each for "1", "2", etc.  Internally, the execution of both of these is quite similar (disjunction of a hopefully small number of term queries).  But if your index is 16x smaller, everything will be faster.

The NumericRangeQuery is most useful if you have lots of discrete numeric values, and need to support arbitrary ranges across the entire domain.  In your case, you're paying extra indexing cost to support features you're not using.

However, as I said, this still does not address your scoring issue.  I brought it up for completeness, because I often see the NumericRangeQuery used in this way, but it does not directly relate to your original question.

Unfortunately scoring is one of Bleve's weakest areas, and other than rescoring all the matches on the client side (which also means retrieving all hits, not just the top N), I don't have a good recommendation right now.

marty
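
A minimal sketch of the client-side rescoring mentioned above, assuming all matching hits have already been retrieved along with the stored value of the queue-size field. The Hit type and the rankBySize function are hypothetical stand-ins, not part of Bleve's API; the search score is ignored and the hits are reordered purely by the caller's priority list.

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is a hypothetical stand-in for one search result: a document ID plus
// the stored value of the queue-size field.
type Hit struct {
	ID   string
	Size int
}

// rankBySize returns a copy of hits ordered so that sizes appear in the
// order given by priority. Sizes missing from priority sort last. The sort
// is stable, so hits with equal rank keep their original relative order.
func rankBySize(hits []Hit, priority []int) []Hit {
	rank := make(map[int]int, len(priority))
	for i, size := range priority {
		rank[size] = i
	}
	rankOf := func(size int) int {
		if r, ok := rank[size]; ok {
			return r
		}
		return len(priority) // unlisted sizes go to the end
	}
	out := append([]Hit(nil), hits...)
	sort.SliceStable(out, func(i, j int) bool {
		return rankOf(out[i].Size) < rankOf(out[j].Size)
	})
	return out
}

func main() {
	hits := []Hit{{"queue-a", 2}, {"queue-b", 6}, {"queue-c", 9}, {"queue-d", 3}}
	// Priority order taken from the original question: 3, then 6, 2, 4, 7.
	for _, h := range rankBySize(hits, []int{3, 6, 2, 4, 7}) {
		fmt.Println(h.ID, h.Size)
	}
}
```

With this approach the top entry of the reordered slice is the queue to use, so a single search is enough. The trade-off is the one noted above: every matching hit has to be fetched, not just the top N.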
