Search results and matched term highlighting


Walter Lee Davis

Jul 8, 2019, 6:46:16 PM
to thinkin...@googlegroups.com
I wonder if anyone knows how Sphinx goes about constructing the snippets that are returned along with the matches to a search term. This page illustrates a wild variety of examples of how one search term can be interpreted:

https://oll.libertyfund.org/search/results?q=power+corrupts

Note the first hit, from Alvis on Shakespeare. The exact phrase appears in the third line of the snippet (on a desktop screen, YMMV), but it is not highlighted. The third result comes from deep in the weeds of the footnotes, matches on the word power, and actually highlights it. The fifth hit highlights both power (twice) and corrupts, but misses corrupt, which shares a stem with corrupts. The second-to-last hit on that page, in Liberty, Order, and Justice, goes on for several screens (208,135 words), with a single snippet that has grown to encompass 725 individual keyword hits in one "paragraph" of source text.

I'm using Thinking Sphinx 3.1.2, and Sphinx is version 2.2.9.

Here's the controller method that constructed this page:

@results = ThinkingSphinx.search "\"#{ThinkingSphinx::Query.escape(params[:q].to_s)}\"",
  :page => params[:page],
  :star => true,
  :excerpts => {
    :limit => 1000,
    :around => 40,
    :force_all_words => true,
    :chunk_separator => '</li><li>'
  }.reject{ |r| r.class.to_s == 'NilClass' } rescue Kaminari::paginate_array []
@results.context[:panes] << ThinkingSphinx::Panes::ExcerptsPane
@hits = @results.total_entries rescue 0
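
One detail of the call above that may be worth checking: in Ruby, the trailing .reject binds to the :excerpts hash literal, not to the value returned by the search. The sketch below shows this with a stand-in method. fake_search is hypothetical and only echoes its arguments; it is not the Thinking Sphinx API, and the option values are placeholders.

```ruby
# Hypothetical stand-in for ThinkingSphinx.search: it only echoes its arguments
# so we can see what the options look like once Ruby has parsed the call.
def fake_search(query, options = {})
  { :query => query, :options => options }
end

call = fake_search "\"power corrupts\"",
  :page => nil,
  :excerpts => {
    :limit => 1000,
    :around => nil
  }.reject { |r| r.class.to_s == 'NilClass' }

# The reject ran over the entries of the excerpts hash. None of the values it
# was handed is nil in the way the block tests for, so the hash is unchanged,
# including the key whose value is nil.
excerpt_options = call[:options][:excerpts]
```

If the intent was to drop nil entries from the search results (an assumption; the post does not say), the reject would need to be applied to the result set in a separate step.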

These results come mostly from titles, with some from pages. Here are the index definitions for both:

# titles_index.rb
ThinkingSphinx::Index.define :title, :with => :active_record do
  set_property :group_concat_max_len => 10.megabytes

  indexes :title, :sortable => true
  indexes teaser
  indexes content.plain, :as => :plain_text
  indexes author_name, :sortable => true
  has roles(:person_id), :as => :people_ids
  has :id, :as => :title_id
  has author_id, created_at, updated_at
  has set, :as => :title_set
  where sanitize_sql(["publish", true])
end

# pages_index.rb
ThinkingSphinx::Index.define :page, :with => :active_record do
  indexes :title, :sortable => true
  indexes teaser
  indexes body
  has created_at, updated_at
end

In the view, I'm using this tortured bit of ERB:

<%= content_tag( :ol, "<li>#{result.excerpts.plain_contents}</li>".gsub(/<li>\s*<\/li>/,'').html_safe ) if result.respond_to?(:plain_contents) %>

I can't explain why some results are wrapped in <span class="match"> in the output from Sphinx, while others (nearby, in the same set of results) are not.
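
To narrow this down, it might help to count, for each excerpt, how many occurrences of each search term are wrapped in the match span and how many are left plain. The following is a minimal sketch in plain Ruby. audit_excerpt is a hypothetical helper, not part of Sphinx or Thinking Sphinx, and the sample excerpt is made up. It assumes the <span class="match"> markup quoted above and does only exact, case-insensitive word matching, so it knows nothing about stemming or :star wildcards.

```ruby
# Hypothetical diagnostic helper: for each term, count occurrences that are
# inside <span class="match">...</span> and occurrences that are not.
def audit_excerpt(excerpt_html, terms)
  span = %r{<span class="match">(.*?)</span>}m

  # Text of every highlighted span, lowercased for comparison.
  highlighted = excerpt_html.scan(span).flatten.map(&:downcase)

  # The excerpt with highlighted spans removed and any other tags stripped.
  remainder = excerpt_html.gsub(span, ' ').gsub(/<[^>]+>/, ' ')

  terms.each_with_object({}) do |term, report|
    pattern = /\b#{Regexp.escape(term)}\b/i
    report[term] = {
      :highlighted => highlighted.count { |text| pattern.match?(text) },
      :plain       => remainder.scan(pattern).length
    }
  end
end

# Made-up sample excerpt, for illustration only.
sample = 'Absolute <span class="match">power</span> corrupts; power tends to corrupt.'
report = audit_excerpt(sample, %w[power corrupts])
# report["power"]    is { :highlighted => 1, :plain => 1 }
# report["corrupts"] is { :highlighted => 0, :plain => 1 }
```

Running something like this over result.excerpts.plain_contents for each hit would at least show which records have terms present but unhighlighted, which could make a smaller test case easier to isolate.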

Thanks in advance if anyone can enlighten me or point me toward documentation of this feature. This is all very old code, maybe 6 or 8 years since I last touched it. I've moved it to a newer server since I wrote all this, but nothing much changed when I did that. My client would like to know, and I don't have any good answers.

Walter

Pat Allan

Jul 21, 2019, 2:36:28 AM
to thinkin...@googlegroups.com
Hi Walter,

Sorry for the slow response… and to be honest, I don’t have a good answer for this behaviour. I’m really not sure what’s going on.

I did look over the available settings for excerpts:
… and anything that I feel would influence what you’re seeing (e.g. exact_phrase) defaults to what would be ideal in your site anyway.

I’m not sure if upgrading Sphinx would have any impact, but it may be worthwhile - at least to 2.2.11. That said, there’s nothing in the release notes for 2.2.10/11 that I can spot that suggests any change in behaviour.

If you really want to dig into it, I’d suggest building a test app that can reproduce the problem with a smaller dataset, and potentially sharing that here so I can have a look as well. That said, it very much sounds like a Sphinx issue rather than anything to do with Thinking Sphinx, so it’s not very likely that I can actually fix things.

Wish I could be more helpful!

— 
Pat
