Indexing tag names


Walter Lee Davis

Feb 22, 2021, 10:41:41 PM
to thinkin...@googlegroups.com
I'm using Gutentag to apply tags to individual pages in a CMS. The Document model uses TS5 with Real-Time Indexing. I've set up my index thusly:

# in the model
def tags_for_indexing
  tag_names.join ' '
end

# in the index
ThinkingSphinx::Index.define :document, :with => :real_time do
  scope { Document.where(id: Document.publicly.map{ |d| [d.id].concat(d.descendants.published.map(&:id)) }.flatten) }

  indexes title
  indexes teaser
  indexes body_html
  indexes author_display
  indexes tags_for_indexing

  has created_at, type: :timestamp
  has updated_at, type: :timestamp
end

I've tested the method, and confirm that it outputs a space-delimited string of words for the tags.

I run rake ts:rt:rebuild and everything seems to go fine. But trying to search on some of these tag names is not returning the results I am imagining. The client has insisted on making some of these tags start with an octothorp, because she is writing about "hashtags" on Twitter. Most tags do not have punctuation in them. I am able to find other terms, even very obscure ones, when I don't use punctuation in the tag names.

Does this sound like something that I can fix, or should I advise the client to lay off the octothorps?

Walter

Pat Allan

Feb 22, 2021, 10:51:34 PM
to 'jer...@shopittome.com' via Thinking Sphinx
Hi Walter,

I’m pretty sure Sphinx doesn’t index punctuation by default. If you want octothorps included, you’ll need to define a custom charset_table value (per environment in `config/thinking_sphinx.yml`) which includes that character. The Sphinx docs outline the default, so best to take that and then add in the octothorp (U+23).

Keep in mind that this will impact all uses of that character in all fields - there’s no way to have it apply to just some fields (or, in this case, words that only start with that character).

Once you’ve added this configuration, a full rebuild will be required.

Cheers,

— 
Pat

--
You received this message because you are subscribed to the Google Groups "Thinking Sphinx" group.
To unsubscribe from this group and stop receiving emails from it, send an email to thinking-sphi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/thinking-sphinx/EA71574B-9EBF-484E-A5FA-BF7CD53A10BC%40wdstudio.com.

Walter Lee Davis

Feb 22, 2021, 11:10:53 PM
to thinkin...@googlegroups.com
Thanks for the speedy reply. I tried adding the charset table as recommended, but I am not seeing any difference in my search results. I did differ from the directions slightly, in that I put the character set in the default block at the top of my YAML file, since it's then included in all of the environments. I figured that should work, but in case it doesn't, can you explain why?

default: &default
  morphology: stem_en
  html_strip: true
  batch_size: 300
  charset_table: "0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+23"

development:
  <<: *default

test:
  <<: *default

production:
  <<: *default

staging:
  <<: *default
  mysql41: 9320

I forced a full rebuild/reindex with rake ts:rt:rebuild. When that didn't seem to change things, I also ran rake ts:rebuild. My understanding is that the first of these should be done when you use the Real Time index. If I'm mistaken, please let me know.

Thanks again!

Walter

Pat Allan

Feb 23, 2021, 12:02:37 AM
to thinkin...@googlegroups.com
Having the setting in the default block should be fine - you should be able to see the charset_table setting in the generated Sphinx configuration files.
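One quick way to confirm that the setting reached the generated configuration is to scan the file for charset_table lines. The following is a minimal sketch, not part of Thinking Sphinx: the helper name charset_table_lines is made up for illustration, the sample text only stands in for a generated file, and the path in the final comment assumes the default location of config/<environment>.sphinx.conf.

```ruby
# Hypothetical helper: return every charset_table line found in the text of
# a Sphinx configuration file.
def charset_table_lines(config_text)
  config_text.each_line.map(&:strip).select { |line| line.start_with?("charset_table") }
end

# Inline sample standing in for a generated config file (contents are
# illustrative, not copied from a real application).
sample = <<~CONF
  index document_core
  {
    type = rt
    path = /tmp/document_core
    charset_table = 0..9, A..Z->a..z, _, a..z, U+23
  }
CONF

puts charset_table_lines(sample)

# Against a real app, assuming the default generated path for the environment:
#   puts charset_table_lines(File.read("config/development.sphinx.conf"))
```

If nothing is printed for the real file, the setting never made it into the generated configuration, which would point at the YAML rather than at Sphinx itself.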

Also: I generally recommend just using ts:rebuild, as that handles both real-time indices and SQL-backed indices (i.e. it’s running the same things as ts:rt:rebuild) - if you’re finding ts:rebuild is not working well for you, I’m keen to hear why!

All that said, doesn’t sound like you’re doing anything wrong. I wonder if html_strip is somehow filtering out the octothorps? Though I’m pretty sure it’s looking just for HTML tags… still, may be worth turning that off to double-check.

And I’ve just run some quick tests locally - without the custom charset_table value, I find the string “#test” is found by Sphinx when searching by “#test” or “test” (because # is ignored, given it’s not an indexable character - so the two searches are actually identical). Adding in the charset_table setting, rebuilding - searching for #test returns a result, but test doesn’t (as that now doesn’t exist as a standalone word in what’s indexed).
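The behaviour described above can be illustrated with a toy tokenizer. This is only a rough model written for this thread, assuming that characters outside the charset act as word separators; it is not how Sphinx is implemented, and it ignores case folding, stemming and html_strip. The name toy_tokens and both character lists are invented for the example.

```ruby
# Toy model only: split text into tokens, treating any character that is not
# in `indexable` as a separator.
def toy_tokens(text, indexable)
  tokens = []
  current = +""
  text.each_char do |char|
    if indexable.include?(char)
      current << char
    elsif !current.empty?
      tokens << current
      current = +""
    end
  end
  tokens << current unless current.empty?
  tokens
end

# Simplified stand-ins for a charset without and with the octothorp.
default_chars = [*"a".."z", *"0".."9", "_"]
with_hash = default_chars + ["#"]

p toy_tokens("#test tags", default_chars)  # => ["test", "tags"]
p toy_tokens("#test tags", with_hash)      # => ["#test", "tags"]
```

Because the query goes through the same charset, "#test" and "test" collapse to the same token in the first case, while in the second case only "#test" exists in the index and a search for "test" finds nothing.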

I doubt it matters, but: which version of Sphinx are you using?


Pat

Walter Lee Davis

Feb 23, 2021, 9:48:31 AM
to thinkin...@googlegroups.com


> On Feb 23, 2021, at 12:02 AM, Pat Allan <p...@freelancing-gods.com> wrote:
>
> Having the setting in the default block should be fine - you should be able to see the charset_table setting in the generated Sphinx configuration files.
>
> Also: I generally recommend just using ts:rebuild, as that handles both real-time indices and SQL-backed indices (i.e. it’s running the same things as ts:rt:rebuild) - if you’re finding ts:rebuild is not working well for you, I’m keen to hear why!

While I was fighting with this, and fiddling with the configuration to use has instead of indexes, I got myself into a state where ts:rebuild would blow up with a SQL error (I think it was a Sphinx SQL error) and ts:rt:rebuild would work fine. But with the current configuration that I shared with you, both work.

>
> All that said, doesn’t sound like you’re doing anything wrong. I wonder if html_strip is somehow filtering out the octothorps? Though I’m pretty sure it’s looking just for HTML tags… still, may be worth turning that off to double-check.
>
> And I’ve just run some quick tests locally - without the custom charset_table value, I find the string “#test” is found by Sphinx when searching by “#test” or “test” (because # is ignored, given it’s not an indexable character - so the two searches are actually identical). Adding in the charset_table setting, rebuilding - searching for #test returns a result, but test doesn’t (as that now doesn’t exist as a standalone word in what’s indexed).
>
> I doubt it matters, but: which version of Sphinx are you using?

Sphinx 2.2.11-id64-release (95ae9a6), TS 5.0.0.

It's definitely odd. I'm not sure whether re-indexing picks up the tag names when it runs en masse, and it seems to be something to do with Gutentag. If I find a document in console, the object that I get back shows tag_names set to nil, but if I then call tag_names on that object, I get back the array of strings I am expecting. The nil is only the value that I see inside the <> brackets when irb first prints the found object (irb uses inspect for that, not to_s), so I don't know if that's significant at all, or is getting in the way of Sphinx extracting the values. Again, when I test in console by calling my tags_for_indexing method on a found object, I get back the expected string value.

I've told the client that she may need to get rid of her beloved hashtags in the tagging interface, or use Gutentag in place of Sphinx to get "everything tagged with this tag". I'm not convinced that's a bad idea, either.

Walter

Pat Allan

Feb 26, 2021, 7:45:22 PM
to thinkin...@googlegroups.com
If you’ve got tag names and their corresponding ids, I think it’d be better (and more accurate) to query Sphinx by the ids:

  # in the index (attributes in real-time indices need an explicit type):
  has tag_ids, :type => :integer, :multi => true

  # when searching, maybe something like:
  tag = Tag.find_by(name: params[:tag_name])
  Document.search params[:query], :with => {:tag_ids => tag.id}

It doesn’t answer the question why octothorps aren’t being indexed/searched correctly, but this should mean better search results generally.
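One gap in the snippet above is the case where no tag matches the given name, since tag.id would then be called on nil. A minimal sketch of guarding against that follows. The Tag struct is only a stand-in so the example runs outside Rails, and tag_search_options is a hypothetical helper, not part of Thinking Sphinx or Gutentag.

```ruby
# Stand-in for the app's tag model, so the sketch runs outside Rails.
Tag = Struct.new(:id, :name)

# Hypothetical helper: build the options hash for Document.search from a tag
# lookup result. Returns nil when the tag does not exist, so the caller can
# skip the search rather than call .id on nil.
def tag_search_options(tag)
  return nil if tag.nil?

  { :with => { :tag_ids => tag.id } }
end

p tag_search_options(Tag.new(42, "#ruby"))
p tag_search_options(nil)

# In the Rails app this might be used along these lines:
#   options = tag_search_options(Tag.find_by(name: params[:tag_name]))
#   results = options ? Document.search(params[:query], options) : []
```

Returning an empty result for an unknown tag is one possible choice; falling back to an unfiltered search would be another, depending on what the interface should show.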

Cheers,

— 
Pat
