Occurrence position


Aleksandr Plavin

Mar 2, 2012, 2:15:09 PM
to who...@googlegroups.com
How can I get the position of a search term occurrence in a document? Highlighting does something similar, I think, but not the same.

Matt Chaput

Mar 2, 2012, 6:00:08 PM
to who...@googlegroups.com
On 02/03/2012 2:15 PM, Aleksandr Plavin wrote:
> How to get position of search term occurrence in a document?
> Highlighting makes something similar, I think, but not the same.

The information is there, but it's not really exposed right now.
If you want to know because you want to score based on position, see
this recipe:

http://packages.python.org/Whoosh/recipes.html#score-results-based-on-the-position-of-the-matched-term

Currently the match positions are only available in the low-level
Matcher interface. A Matcher has a spans() method to return a list of
whoosh.spans.Span objects representing the positions in the document
where the query matched. The span has "start" and "end" properties
containing the start and end word numbers (starting from 0) of the match
position. If you set up the search field with chars=True, you can also
use Span.startchar and Span.endchar.


with myindex.searcher() as s:
    matcher = myquery.matcher(s)

    # See
    # packages.python.org/Whoosh/api/matching.html#whoosh.matching.Matcher

    # For each document matching the query...
    while matcher.is_active():
        print("Docnum:", matcher.id())
        print("Score:", matcher.score())

        # spans() is only meaningful for fields with position info
        # (i.e. TEXT or a custom field type)
        print("List of occurrences:")
        for span in matcher.spans():
            print("  Start word #", span.start, "End word #", span.end)
            # This prints "None" unless you used chars=True in the field
            print("  Start char #", span.startchar, "End char #", span.endchar)

        # Move to the next match
        matcher.next()


The biggest problem with this functionality right now is that the Span
object can't tell you *which part of the query* matched at that
position. If you have a query across several fields (e.g. "title:aaa AND
body:bbb"), you will get a single list of Span objects using
different numberings from multiple fields and not know how to separate them.

Matt

Aleksandr Plavin

Mar 5, 2012, 5:35:40 AM
to who...@googlegroups.com
I have a large text file with many lines (about 10**6). A search query should return one or more lines of this file (only whole lines are referenced). At first I tried to build the index by adding each line as a separate document, but the index became very large after only a small part of the file had been indexed. Now I have the file indexed as a single document and want to search through it using Whoosh. So I need positions to determine which lines contain the query results, and I'll try the method you suggested. Or is there a better approach?
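
To illustrate the idea of turning a match position into a line number, here is a minimal sketch in plain Python. It assumes the field was indexed with chars=True, so that Span.startchar gives a character offset into the original text. The helper names line_start_offsets and line_of_offset are hypothetical and not part of Whoosh.

```python
from bisect import bisect_right


def line_start_offsets(text):
    """Return the character offset at which each line of `text` begins."""
    starts = [0]
    for i, ch in enumerate(text):
        if ch == "\n":
            # The next line begins right after the newline character.
            starts.append(i + 1)
    return starts


def line_of_offset(starts, offset):
    """Return the 0-based index of the line containing character `offset`."""
    # bisect_right counts how many line starts are <= offset.
    return bisect_right(starts, offset) - 1


text = "alpha beta\ngamma delta\nepsilon\n"
starts = line_start_offsets(text)

# Pretend these offsets came from Span.startchar (an assumption for the demo).
matched_lines = [line_of_offset(starts, off) for off in (0, 17, 23)]
```

The list of line starts is built once per file, after which each lookup is a binary search, so mapping many spans to lines should stay cheap even for a file with about 10**6 lines.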

On Saturday, March 3, 2012 at 3:00:08 UTC+4, Matt Chaput wrote:

Matt Chaput

Mar 5, 2012, 4:57:09 PM
to who...@googlegroups.com
On 05/03/2012 5:35 AM, Aleksandr Plavin wrote:
> I have a large text file with many lines (about 10**6 lines). A search
> query should result in one or many lines in this file (only whole lines
> are referenced). At first I've tried to build index by adding lines as
> separate documents, but it became very large after some small part of
> this file. Now I have this file indexed as a single document and want to
> search through it using whoosh. So I need positions to determine which
> lines contain query result and I'll try the method you suggested. Or is
> there a better approach?

I think indexing each line as a separate document was the right
approach. I'd store the byte offsets of the beginning and end of the
line as stored fields on each line-document. I don't know why the index
would get very large... I'll try recreating this setup myself.
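
The byte offsets mentioned above could be computed along these lines. This is only a sketch under stated assumptions: the file is read in binary mode, the text is assumed to be UTF-8, and the helper name iter_line_offsets is hypothetical. The actual Whoosh indexing step is left out.

```python
import io


def iter_line_offsets(fileobj):
    """Yield (start, end, text) for each line of a binary file object.

    start and end are byte offsets into the file; end is exclusive and
    includes the line's trailing newline.
    """
    pos = 0
    for raw in fileobj:
        end = pos + len(raw)
        # Assumption: the file is UTF-8 encoded.
        yield pos, end, raw.decode("utf-8").rstrip("\r\n")
        pos = end


# An in-memory stand-in for the large text file.
sample = io.BytesIO(b"first line\nsecond line\nthird\n")
records = list(iter_line_offsets(sample))
```

Each record could then become one line-document, with the text going into the searchable field and the two offsets into stored fields, so that a hit can be traced back to its line with a seek and a read on the original file.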

Matt

Aleksandr Plavin

Mar 6, 2012, 11:42:46 AM
to who...@googlegroups.com
The index size for 10**5 lines is almost 30 MB. The whole file is ten times larger, and an index of several hundred MB is too large for my needs.
P.S.: The file itself is about 50 to 100 MB.

On Tuesday, March 6, 2012 at 1:57:09 UTC+4, Matt Chaput wrote: