Searching multiple languages

Satya Gautham

Jun 14, 2013, 1:25:09 PM
to thinkin...@googlegroups.com
Hi,

Is there a way to support search across multiple languages?
For example, comments can be entered in various languages. Is it possible to make those comments searchable?

If yes, what should the config be?
How do we configure stopwords, stemmers, charsets, etc.?

Thanks a lot.

Regards,
Gautham

Pat Allan

Jul 6, 2013, 8:44:23 AM
to thinkin...@googlegroups.com
Hi Gautham

Sorry for the slow response.

It depends on how you have your database set up - do you have a column that indicates the language, or is it up to the readers of your site to figure that out?

If there's a language column, you could have specific indices for each language, and have different stemmers, charsets and stopwords. I can provide more detail for this if you let me know which version of Thinking Sphinx you're using.
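
As a rough sketch of the per-language approach, assuming a hypothetical `language` column holding codes such as `en` and `ru`: the plain-Ruby snippet below maps each language to the Sphinx index settings it might use and renders them as `sphinx.conf`-style lines. The setting names (`morphology`, `stopwords`, `charset_table`) are real Sphinx index options; the language codes, file paths, index names and helper methods are made up for illustration. How these settings get attached to each index depends on the Thinking Sphinx version in use.

```ruby
# Hypothetical per-language Sphinx settings, keyed by the value of an
# assumed `language` column. Paths and language codes are illustrative.
LANGUAGE_SETTINGS = {
  'en' => {
    morphology: 'stem_en',
    stopwords: '/path/to/stopwords_en.txt',
    charset_table: '0..9, A..Z->a..z, _, a..z'
  },
  'ru' => {
    morphology: 'stem_ru',
    stopwords: '/path/to/stopwords_ru.txt',
    charset_table: '0..9, A..Z->a..z, _, a..z, ' \
                   'U+410..U+42F->U+430..U+44F, U+430..U+44F'
  }
}.freeze

# Name of the (hypothetical) index that would hold one language's rows.
def index_name_for(language)
  "comment_#{language}"
end

# Render one language's settings as "key = value" lines, the way they
# would appear inside an index block in sphinx.conf.
def render_settings(language)
  settings = LANGUAGE_SETTINGS.fetch(language)
  settings.map { |key, value| "#{key} = #{value}" }.join("\n")
end

LANGUAGE_SETTINGS.each_key do |language|
  puts "# #{index_name_for(language)}"
  puts render_settings(language)
  puts
end
```

Each index would then be restricted to the rows whose `language` matches, so a search for English text only hits the index with English stemming and stopwords.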

But if there's no way to determine the language in each row, then you won't be able to customise any of those settings at that level - but certainly you could have global settings that cover common options across languages (providing they don't conflict).

Cheers,

--
Pat

> --
> You received this message because you are subscribed to the Google Groups "Thinking Sphinx" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to thinking-sphi...@googlegroups.com.
> To post to this group, send email to thinkin...@googlegroups.com.
> Visit this group at http://groups.google.com/group/thinking-sphinx.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>



Satya Gautham

Jul 7, 2013, 11:35:57 AM
to thinkin...@googlegroups.com
Thank you for your reply.

We don't have a language column.
Users can post their comments in any language.

How do we configure the global settings?

Regards,
Gautham

Pat Allan

Jul 10, 2013, 3:41:23 AM
to thinkin...@googlegroups.com
Hi Gautham

Since every setting applies to every record, not just those in a particular language, you'll need to be careful about what each one covers. That said, I'd imagine the following would work pretty well:

* Whatever stopwords you want to be ignored.
* A charset_table value that covers a broad range of Unicode characters. A good starting point is here: http://yob.id.au/2008/05/08/thinking-sphinx-and-unicode.html
* If you're getting a bunch of Chinese/Japanese/Korean/etc. comments, then perhaps you want to read up on the ngram settings that Sphinx offers.
* As for stemmers, you can apply more than one. Not sure if that would make search results better or worse though - best to experiment :)
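
For illustration, the points above might translate into global settings roughly like the following. This is a sketch under assumptions, not a tested configuration: it assumes Thinking Sphinx 3, where such settings live per environment in `config/thinking_sphinx.yml` (earlier versions use `config/sphinx.yml`), and the stopwords path, the deliberately short `charset_table`, the CJK range for `ngram_chars`, and the choice of stemmers are all placeholders. The snippet just builds the settings in plain Ruby and prints the YAML that would go into that file.

```ruby
require 'yaml'

# Assumed global Sphinx settings, applied to every index regardless of
# the language of each row. All values are placeholders to experiment with.
GLOBAL_SETTINGS = {
  # One combined stopwords file (hypothetical path).
  'stopwords' => '/path/to/stopwords.txt',
  # Short example covering Latin and Cyrillic only; a real table
  # would cover many more Unicode ranges.
  'charset_table' => '0..9, A..Z->a..z, _, a..z, ' \
                     'U+410..U+42F->U+430..U+44F, U+430..U+44F',
  # Index CJK text as single-character n-grams.
  'ngram_len' => 1,
  'ngram_chars' => 'U+3000..U+2FA1F',
  # More than one stemmer can be listed.
  'morphology' => 'stem_en, stem_ru'
}.freeze

# Wrap the settings under an environment name, as the YAML file expects.
def settings_yaml(environment, settings)
  { environment => settings }.to_yaml
end

puts settings_yaml('production', GLOBAL_SETTINGS)
```

After changing any of these, the Sphinx configuration has to be regenerated and the indices rebuilt before the new settings affect search results.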

Good luck!

--
Pat

Satya Gautham

Jul 10, 2013, 2:20:43 PM
to thinkin...@googlegroups.com
Thanks a lot, Pat.

Will try it out.