Empty search results with Chinese characters


Édouard Brière

Nov 17, 2009, 10:08:28 AM
to Thinking Sphinx
Hi,

Is it possible to query Sphinx with any kind of characters?

I currently have the following indexed:

English: Pick a category
Russian: Выберите категорию
Chinese: 选择分类

ThinkingSphinx.search("Выберите категорию") finds and returns the
Russian entry.

But ThinkingSphinx.search("选择分类") doesn’t find the Chinese entry.

Is this a Sphinx/Thinking Sphinx problem or am I doing something
wrong?

Thanks,
Édouard

Édouard Brière

Nov 18, 2009, 1:20:33 PM
to Thinking Sphinx
I figured it out!

My charset_table was not configured in sphinx.yml. It now looks like this:

development:
  charset_table: "U+00C0->a, U+00C1->a, U+00C2->a, ... U+01DE->a, U+01DF->a, \\\n \
  U+01E0->a, U+01E1->a, U+01FA->a, U+01FB->a, U+0200->a ... \\\n \
  ..."
production:
  charset_table: "..."

There is a full charset list available here: http://pastie.org/204316.txt
I tested it with my app and it works fine for Chinese, Russian and
Latin characters. I haven't tested other characters.

Then run rake ts:config and rake ts:index, and that should be it!
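
For anyone debugging a similar case, here is a minimal Ruby sketch of the idea behind the fix: a character can only be searched if its code point is listed in charset_table. The helper names and the ranges below are assumptions made for illustration; they are not part of Sphinx or Thinking Sphinx, and they are not the contents of the pastie list.

```ruby
# Hypothetical helper for illustration; not part of Sphinx or Thinking Sphinx.
# Assumed ranges of code points that a charset_table would need to list.
ASSUMED_RANGES = [
  (0x0041..0x005A), # A..Z
  (0x0061..0x007A), # a..z
  (0x0410..0x044F), # Cyrillic А..я
  (0x4E00..0x9FFF)  # CJK Unified Ideographs
].freeze

# Returns the non-whitespace characters of text that fall outside every range.
def uncovered_chars(text, ranges = ASSUMED_RANGES)
  text.each_char.reject do |ch|
    ch.match?(/\s/) || ranges.any? { |range| range.cover?(ch.ord) }
  end
end

# Formats a range in the "U+XXXX..U+XXXX" notation that charset_table accepts.
def charset_entry(range)
  format("U+%04X..U+%04X", range.first, range.last)
end

# Without the CJK range, every character of the Chinese query is uncovered.
puts uncovered_chars("选择分类", ASSUMED_RANGES.first(3)).inspect
puts ASSUMED_RANGES.map { |range| charset_entry(range) }.join(", ")
```

Running a query string through something like uncovered_chars is a quick way to see whether an empty result might come from the character set rather than from the index itself.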

Édouard

Pat Allan

Nov 18, 2009, 7:07:54 PM
to thinkin...@googlegroups.com
Ah, great to know you figured it out - I've not had to deal with
Chinese characters before.

--
Pat

Édouard Brière

Jan 15, 2010, 6:07:13 AM
to Thinking Sphinx
Sorry to bump this old topic, but I still had one small problem with
searching in Chinese, and I just found a solution. Thought I would share.

The problem is that if the string “选择分类” was indexed, searching for
“选” didn’t yield any results.

I fixed this by adding this in sphinx.conf:

ngram_len: 1
ngram_chars: "U+00C6->U+00E6, U+01E2->U+00E6, U+01E3->U+00E6 ...

`ngram_chars` is basically the same as `charset_table`, without the
Latin characters. We only want CJK characters in there.
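
To illustrate why this helps, here is a rough Ruby model of indexing with and without single-character n-grams. It is only a sketch of the idea, not how Sphinx is implemented; the tokenize helper is hypothetical, and the CJK range (U+4E00..U+9FFF) is an assumption standing in for whatever ngram_chars lists.

```ruby
# Assumption: these code points stand in for the characters listed in ngram_chars.
NGRAM_RANGE = (0x4E00..0x9FFF).freeze

# Hypothetical tokenizer: with ngram enabled, each CJK character becomes its own
# token (as with an n-gram length of 1); other characters are grouped into
# whitespace-separated words.
def tokenize(text, ngram: true)
  tokens = []
  buffer = +""
  text.each_char do |ch|
    if ch.match?(/\s/)
      tokens << buffer unless buffer.empty?
      buffer = +""
    elsif ngram && NGRAM_RANGE.cover?(ch.ord)
      tokens << buffer unless buffer.empty?
      buffer = +""
      tokens << ch
    else
      buffer << ch
    end
  end
  tokens << buffer unless buffer.empty?
  tokens
end

without_ngrams = tokenize("选择分类", ngram: false) # one long keyword
with_ngrams    = tokenize("选择分类")               # one keyword per character

puts without_ngrams.include?("选")
puts with_ngrams.include?("选")
```

In this model the whole string is a single keyword when n-grams are off, so a one-character query has nothing to match; with n-grams on, each character is indexed separately.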

Hope this will help,
Édouard


Sting Tao

Jan 15, 2010, 8:39:40 AM
to thinkin...@googlegroups.com
Hi,
You will run into more problems with Chinese search, especially with "phrase stemming", that is, word segmentation (I'm not very sure about the translation; it's 斷詞斷字).
When people search for "选择" as a phrase, your search might return "个口不言的人送过去吧" as a match, which makes no sense.
Those problems can't be solved by the solution you are currently using.
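
A simplified Ruby sketch of one way such irrelevant matches could arise when every character is indexed on its own: a document can contain all the characters of the query without containing the word. Both helpers and the sample document are made up for illustration; whether Sphinx actually returns such a document depends on the match mode in use, which this sketch does not model.

```ruby
# Hypothetical: true when every character of the query occurs somewhere in the
# document, in any position (an "all keywords" match over single characters).
def all_chars_match?(document, query)
  query.each_char.all? { |ch| document.include?(ch) }
end

# Hypothetical: true only when the query characters are adjacent and in order.
def phrase_match?(document, query)
  document.include?(query)
end

# Made-up sample: contains 选 and 择, but never the word 选择.
document = "挑选礼物,择日送出"
query = "选择"

puts all_chars_match?(document, query)
puts phrase_match?(document, query)
```

Proper word segmentation would treat "选择" as one unit, which is the part that character-level n-grams alone do not provide.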


Sting Tao
