How to highlight only searched phrase, not single terms?


Michal Trunečka

Aug 17, 2013, 11:40:03 AM
to who...@googlegroups.com
Hi,

I have phrase search enabled in my index, which works as expected, but when getting results from hit.highlights, I get all occurrences of all the terms separately, not in the phrase.

I want to get only the matched whole phrase, is there a way to do that?

Thanks, Michal Trunecka

Michal Trunečka

Aug 17, 2013, 3:14:04 PM
to who...@googlegroups.com
Or at least, how can I get the position of the found phrase?

Philippe Ombredanne

Aug 18, 2013, 4:20:38 AM
to who...@googlegroups.com
I use matchers to get the matched spans, but I do not know how to get
them from a searcher directly (without going through a matcher).
Given an index and a query, you can play with a matcher this way:

with myindex.searcher() as searcher:
    matcher = myquery.matcher(searcher)
    while matcher.is_active():
        docnum = matcher.id()
        path = searcher.stored_fields(docnum)['path']
        # startchar/endchar are only set if the field stores character offsets
        for span in matcher.spans():
            print("%s %s %s" % (path, span.startchar, span.endchar))
        # Advance to the next matching document; without this the loop never ends
        matcher.next()

--
Philippe Ombredanne

+1 650 799 0949 | pombr...@nexB.com
DejaCode Enterprise at http://www.dejacode.com
nexB Inc. at http://www.nexb.com

Matt Chaput

Aug 18, 2013, 12:28:56 PM
to who...@googlegroups.com
> I have phrase search enabled in my index, which works as expected, but when getting results from hit.highlights, I get all occurrences of all the terms separately, not in the phrase.
>
> I want to get only the matched whole phrase, is there a way to do that?

There might be a way in the current architecture using spans. I don't have time this weekend to look into it but I'll try to reply tomorrow.

Matt

Matt Chaput

Aug 19, 2013, 5:16:46 PM
to who...@googlegroups.com

On 2013-08-18, at 4:20 AM, Philippe Ombredanne <pombr...@nexb.com> wrote:

> On Sat, Aug 17, 2013 at 5:40 PM, Michal Trunečka
> <michal....@gmail.com> wrote:
>> I have phrase search enabled in my index, which works as expected, but when
>> getting results from hit.highlights, I get all occurrences of all the terms
>> separately, not in the phrase.
>>
>> I want to get only the matched whole phrase, is there a way to do that?

OK, I've looked into this, and unfortunately I don't think this is easy to do with the current high-level highlighting code, which only knows about individual terms.

If your use case is simple, e.g. you are only searching for a single phrase, *and* you know the phrase will be contained in a single fragment (e.g. by using the SentenceFragmenter), you could write a custom scorer that gives a score of 0 to any fragment that doesn't contain the terms in the correct order (see an example of this approach below).

However, what happens if you have a query with a free term and a phrase ANDed together? Or two phrases? I think to solve the problem in a general way you'd end up needing to teach the highlighting system about queries, instead of just dealing with terms. That could potentially be a major project.

Cheers,

Matt

=

Here's an example of a custom scorer, if you're interested. First you must apply the following patch, since the current code doesn't record the positions of matched terms (this is included in the repo now):


--- a/src/whoosh/highlight.py Mon Aug 19 16:41:58 2013 -0400
+++ b/src/whoosh/highlight.py Mon Aug 19 16:58:25 2013 -0400
@@ -885,7 +885,7 @@
         else:
             # Retokenize the text
             analyzer = results.searcher.schema[fieldname].analyzer
-            tokens = analyzer(text, chars=True, mode="query",
+            tokens = analyzer(text, positions=True, chars=True, mode="query",
                               removestops=False)
             # Set Token.matched attribute for tokens that match a query term
             tokens = set_matched_filter(tokens, words)


Then you can write a custom scorer class:


from collections import defaultdict

from whoosh import highlight


class CustomScorer(highlight.FragmentScorer):
    def __init__(self, phrase):
        # Get the list of words from the phrase query
        self.words = phrase.words

    def __call__(self, f):
        # Create a dictionary mapping words to the positions the word
        # occurs at, e.g. "foo" -> [1, 5, 10]
        d = defaultdict(list)
        for token in f.matches:
            d[token.text].append(token.pos)

        # For each position the first word appears at, check to see if the
        # rest of the words appear in order at the subsequent positions
        firstword = self.words[0]
        for pos in d[firstword]:
            found = False
            for word in self.words[1:]:
                pos += 1
                if pos not in d[word]:
                    break
            else:
                # Runs only if the inner loop finished without a break,
                # i.e. every remaining word was at its expected position
                found = True

            if found:
                return 100
        return 0


Then you can use it like this:


phrase_query = query.Phrase("text", "foo fee foh fum".split())
results = searcher.search(phrase_query)
results.fragmenter = highlight.SentenceFragmenter()
results.scorer = CustomScorer(phrase_query)
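
The position check at the heart of the scorer can also be tried without Whoosh. The following is a minimal sketch, not part of Whoosh: `phrase_in_positions` and `index_positions` are hypothetical helper names, and the positions are plain list indexes rather than token positions produced by an analyzer.

```python
from collections import defaultdict


def index_positions(tokens):
    """Map each token to the list of positions it occurs at (hypothetical helper)."""
    d = defaultdict(list)
    for pos, tok in enumerate(tokens):
        d[tok].append(pos)
    return d


def phrase_in_positions(words, positions_by_word):
    """Return True if `words` occur at consecutive positions, in order."""
    if not words:
        return False
    for start in positions_by_word.get(words[0], ()):
        # Every later word must sit at start+1, start+2, ...
        if all((start + offset) in positions_by_word.get(word, ())
               for offset, word in enumerate(words[1:], 1)):
            return True
    return False


positions = index_positions("fum foo fee foh fum foo".split())
print(phrase_in_positions("foo fee foh fum".split(), positions))  # True
print(phrase_in_positions("fee foo".split(), positions))  # False
```

Using `.get()` rather than indexing avoids adding empty entries to the defaultdict while checking words that never occurred.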

Douglas Duhaime

Aug 2, 2014, 12:01:19 PM
to who...@googlegroups.com
Thanks so much for this custom scorer, Matt, and for Whoosh. I'm really liking it so far, and will love it once I can highlight phrases properly. Unfortunately, I can't seem to get the custom scorer you provide above to work. Here's my toy example: I read in Project Gutenberg's edition of Gulliver's Travels and search for the exact phrase "government of reason" in the index. The phrase appears once in the text, but the search yields no results. Do you know how I can print the hit properly? I would be grateful for any advice you can offer on this question.

from whoosh.index import create_in
from whoosh.fields import *
from whoosh.qparser import QueryParser
from whoosh.query import Phrase
from whoosh.query import And, Or, Term
from whoosh import highlight
from whoosh import query
from collections import defaultdict

import os, codecs, nltk

#for printing utf-8 to console
def remove_non_ascii(s):
    return "".join(x for x in s if ord(x) < 128)

#create index dir
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")

#create index
schema = Schema(content=TEXT(stored=True, phrase=True, analyzer=analysis.StandardAnalyzer(stoplist=None)))
ix = create_in("indexdir", schema)
writer = ix.writer()
gulliver = codecs.open("gulliver.txt","r","utf-8")
gulliver = gulliver.read().replace("_","")
writer.add_document(content=gulliver)
writer.commit()

#create custom scorer

class CustomScorer(highlight.FragmentScorer):
    def __init__(self, phrase):
        # Get the list of words from the phrase query
        self.words = phrase.words

    def __call__(self, f):
        # Create a dictionary mapping words to the positions the word
        # occurs at, e.g. "foo" -> [1, 5, 10]
        d = defaultdict(list)
        for token in f.matches:
            d[token.text].append(token.pos)

        # For each position the first word appears at, check to see if the
        # rest of the words appear in order at the subsequent positions
        firstword = self.words[0]
        for pos in d[firstword]:
            found = False
            for word in self.words[1:]:
                pos += 1
                if pos not in d[word]:
                    break
            else:
                found = True

            if found:
                return 100
        return 0

searcher = ix.searcher()
phrase_query = query.Phrase("text", "government of reason".split())
results = searcher.search(phrase_query)
results.fragmenter.charlimit = None

results.fragmenter = highlight.SentenceFragmenter()
results.scorer = CustomScorer(phrase_query)
for hit in results:
    print hit

Matt Chaput

Feb 17, 2015, 6:10:34 PM
to who...@googlegroups.com

> On Aug 2, 2014, at 12:01 PM, Douglas Duhaime <douglas...@gmail.com> wrote:
>
> Thanks so much for this custom scorer, Matt, and for Whoosh. I'm really liking it so far, and will love it once I can highlight phrases properly. Unfortunately, I can't seem to get the custom scorer you provide above to work. Here's my toy example: I read in Project Gutenberg's edition of Gulliver's Travels and search for the exact phrase "government of reason" in the index. The phrase appears once in the text, but the search yields no results. Do you know how I can print the hit properly? I would be grateful for any advice you can offer on this question.

Hi, sorry for the incredibly late reply. If the answer still matters, the problem is a simple bug in your example code: you indexed the string in the "content" field, but searched in the "text" field.

Cheers,

Matt


Douglas Duhaime

Mar 15, 2015, 10:33:27 AM
to who...@googlegroups.com
Phenomenal! I needed to change the fragmenter line to `results.fragmenter.charlimit = None` because I'm indexing big files, but this is perfect otherwise! I wrote up a similar function to extract proximity hits, but the expression is nowhere near as tight as this class-based method. Thank you again for all of your work with Whoosh--it's a phenomenal resource!
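
A proximity variant of the same position check might look like the sketch below. It is not the function mentioned in the message above (which is not shown in the thread) and not Whoosh's own implementation. `phrase_within_slop` is a hypothetical name, and `slop` is assumed here to mean the maximum distance between successive words, so 1 means adjacent.

```python
def phrase_within_slop(words, positions_by_word, slop=1):
    """Return True if `words` occur in order, each at most `slop`
    positions after the previous one (assumed meaning of slop)."""
    if not words:
        return False

    def extend(index, prev_pos):
        # All words placed: a full in-order match was found
        if index == len(words):
            return True
        for pos in positions_by_word.get(words[index], ()):
            # The next word must come after the previous one, within slop
            if 0 < pos - prev_pos <= slop and extend(index + 1, pos):
                return True
        return False

    return any(extend(1, start)
               for start in positions_by_word.get(words[0], ()))


# Hypothetical positions: "government" at 3, "of" at 4 and 9, "reason" at 7
positions = {"government": [3], "of": [4, 9], "reason": [7]}
print(phrase_within_slop(["government", "of", "reason"], positions, slop=1))  # False
print(phrase_within_slop(["government", "of", "reason"], positions, slop=3))  # True
```

The recursive search tries every candidate position for each word, so it does not miss a match that a greedy left-to-right scan would skip.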