charset_type = utf-8
chinese_dictionary = /path/to/xdict
ngram_len = 1
ngram_chars = U+3000..2FA1F
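For example, my documents have already been segmented by a word-segmentation program, with spaces separating the words. So when I use sphinx, wherever tokenization is involved, splitting on spaces is enough and there is no need for 1-gram (i.e. unigram) segmentation!
I assume this requirement can be met?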
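When I use the options
ngram_len = 1
ngram_chars = U+3000..2FA1F
together with SPH_MATCH_ALL matching,
queries such as "中", "西安" and "国陕" all return results, whereas I want results only for queries such as "中国", "陕西" and "安全部"; in other words, I want tokenization to follow the spaces.
(I am using the official, unmodified sphinx 0.9.9-rc2.)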
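To make the difference concrete, here is a minimal Python sketch of the two tokenization strategies. It is only a toy model, not how Sphinx is implemented: the document text "中国 陕西 安全部" is an assumption inferred from the queries above, the function names are made up for illustration, and match_all only imitates the "every query token must be present" idea behind SPH_MATCH_ALL.

```python
def one_gram_tokens(text):
    # 1-gram model: every non-space character becomes its own token.
    return {ch for ch in text if not ch.isspace()}


def whitespace_tokens(text):
    # Pre-segmented model: tokens are exactly the space-separated words.
    return set(text.split())


def match_all(query, document, tokenize):
    # Toy stand-in for SPH_MATCH_ALL: all query tokens must occur in the document.
    doc_tokens = tokenize(document)
    return all(tok in doc_tokens for tok in tokenize(query))


# Assumed example document, already segmented with spaces.
DOC = "中国 陕西 安全部"

for q in ["中", "西安", "国陕", "中国", "陕西", "安全部"]:
    print(q,
          "1-gram:", match_all(q, DOC, one_gram_tokens),
          "whitespace:", match_all(q, DOC, whitespace_tokens))
```

Under the 1-gram model "西安" matches because both 西 and 安 occur somewhere in the document, even though they belong to different words; under whitespace tokenization only the whole words match.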
On Sep 4, 9:15 am, Zhuguo Shi <bluefl...@gmail.com> wrote:
> Hi bamboo.hey,
>
> I am sorry that I cannot type Chinese at the moment. As for your question,
> you can define how documents are indexed in the configuration file,
> using either Chinese segmentation or 1-gram, as follows:
>
> charset_type = utf-8
> chinese_dictionary = /path/to/xdict
>
> These two lines tell sphinx-for-chinese to use Chinese segmentation.
>
> ngram_len = 1
> ngram_chars = U+3000..2FA1F
>
> These two lines tell sphinx-for-chinese to use the 1-gram method.
>
> But please note that, due to the internal settings of sphinx-for-chinese,
> the Chinese segmentation method takes priority if all of the above definitions
> are present in one configuration file.
>
> So far sphinx-for-chinese does not support customizable delimiters when
> doing Chinese segmentation, but a space is always treated as a delimiter by
> default. So I don't see any problem in using sphinx-for-chinese for your
> requirements.
>
> Feel free to ask if you still have questions about that.
>
> Thanks
>
> 2009/9/3 bamboo.hey <hewy...@126.com>
>
>
>
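> > For example, my documents have already been segmented by a word-segmentation program, with spaces separating the words. So when I use sphinx, wherever tokenization is involved, splitting on spaces is enough and there is no need for 1-gram (i.e. unigram) segmentation!
> > I assume this requirement can be met?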