On 2013-08-18, at 4:20 AM, Philippe Ombredanne <pombr...@nexb.com> wrote:
> On Sat, Aug 17, 2013 at 5:40 PM, Michal Trunečka <michal....@gmail.com> wrote:
>> I have phrase search enabled in my index, which works as expected, but when
>> getting results from hit.highlights, I get all occurrences of all the terms
>> separately, not in the phrase.
>>
>> I want to get only the matched whole phrase, is there a way to do that?
OK, I've looked into this, and unfortunately I don't think this is easy to do with the current high-level highlighting code, which only knows about individual terms.
If your use case is simple, e.g. you are only searching for a single phrase, *and* you know the phrase will be contained in a single fragment (e.g. because you use the SentenceFragmenter), you could write a custom scorer that gives a score of 0 to any fragment that doesn't contain the terms in the correct order (see an example of this approach below).
However, what happens if you have a query with a free term and a phrase ANDed together? Or two phrases? I think to solve the problem in a general way you'd end up needing to teach the highlighting system about queries, instead of just dealing with terms. That could potentially be a major project.
Cheers,
Matt
=
Here's an example of a custom scorer, if you're interested. First you must apply the following patch, since the current code doesn't record the positions of matched terms (this is included in the repo now):
--- a/src/whoosh/highlight.py Mon Aug 19 16:41:58 2013 -0400
+++ b/src/whoosh/highlight.py Mon Aug 19 16:58:25 2013 -0400
@@ -885,7 +885,7 @@
         else:
             # Retokenize the text
             analyzer = results.searcher.schema[fieldname].analyzer
-            tokens = analyzer(text, chars=True, mode="query",
+            tokens = analyzer(text, positions=True, chars=True, mode="query",
                               removestops=False)
             # Set Token.matched attribute for tokens that match a query term
             tokens = set_matched_filter(tokens, words)

Then you can write a custom scorer class:

from collections import defaultdict

from whoosh import highlight


class CustomScorer(highlight.FragmentScorer):
    def __init__(self, phrase):
        # Get the list of words from the phrase query
        self.words = phrase.words

    def __call__(self, f):
        # Create a dictionary mapping words to the positions the word
        # occurs at, e.g. "foo" -> [1, 5, 10]
        d = defaultdict(list)
        for token in f.matches:
            d[token.text].append(token.pos)

        # For each position the first word appears at, check to see if the
        # rest of the words appear in order at the subsequent positions
        firstword = self.words[0]
        for pos in d[firstword]:
            found = False
            for word in self.words[1:]:
                pos += 1
                if pos not in d[word]:
                    break
            else:
                # The inner loop finished without a break, so every word
                # was found at its expected position
                found = True

            if found:
                return 100
        return 0

Then you can use it like this:
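
from whoosh import highlight, query

# "searcher" is an already-open Searcher for your index
phrase_query = query.Phrase("text", "foo fee foh fum".split())
results = searcher.search(phrase_query)
results.fragmenter = highlight.SentenceFragmenter()
results.scorer = CustomScorer(phrase_query)
for hit in results:
    print(hit.highlights("text"))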
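
If you want to try the position check on its own, without an index, here is a minimal sketch in plain Python. Note that Token below is a hypothetical stand-in with only the "text" and "pos" attributes, not the real Whoosh token class, and contains_phrase is a name I made up for illustration. It applies the same idea as the scorer above: take each occurrence of the first word and check that the remaining words sit at the following positions.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a matched token; only the two attributes
# the scorer reads are modelled here.
Token = namedtuple("Token", ["text", "pos"])


def contains_phrase(words, matches):
    """Return True if `words` occur at consecutive positions in `matches`."""
    if not words:
        return False

    # Map each word to the set of positions it occurs at
    positions = defaultdict(set)
    for token in matches:
        positions[token.text].add(token.pos)

    # Try every occurrence of the first word as the start of the phrase
    for start in positions[words[0]]:
        if all(start + offset in positions[word]
               for offset, word in enumerate(words)):
            return True
    return False


phrase = "foo fee foh fum".split()
in_order = [Token(w, i) for i, w in enumerate("x foo fee foh fum y".split())]
scrambled = [Token(w, i) for i, w in enumerate("fum foh fee foo".split())]

print(contains_phrase(phrase, in_order))   # True
print(contains_phrase(phrase, scrambled))  # False
```

A fragment scorer built on this would return a non-zero score only when contains_phrase is true for the fragment's matches, which is what the CustomScorer above does with 100 and 0.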