How to support word segmentation with a custom delimiter

bamboo.hey

Sep 3, 2009, 11:01:06 AM
to sphinx-for-chinese
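My documents have already been segmented by a word segmentation program, with spaces separating the words. So when I use Sphinx, tokenization only needs to split on spaces; there is no need for 1-gram (unigram) segmentation.
This requirement should be achievable, right?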

Zhuguo Shi

Sep 3, 2009, 9:15:41 PM
to sphinx-fo...@googlegroups.com
Hi bamboo.hey,

I am sorry that I cannot type Chinese at the moment. As for your question, you can define how documents are indexed in the configuration file, using either Chinese segmentation or 1-gram indexing, as follows:

charset_type = utf-8
chinese_dictionary = /path/to/xdict

These two lines tell sphinx-for-chinese to use Chinese segmentation.

ngram_len = 1
ngram_chars = U+3000..U+2FA1F

These two lines tell sphinx-for-chinese to use the 1-gram method.

Please note, however, that because of the internal settings of sphinx-for-chinese, Chinese segmentation takes priority if all of the above settings are present in the same configuration file.

So far sphinx-for-chinese does not support customizable delimiters when doing Chinese segmentation, but a space is always treated as a delimiter by default. So I don't see any problem with using sphinx-for-chinese for your requirements.

Feel free to ask if you still have questions about that.

Thanks






bamboo.hey

Sep 3, 2009, 10:43:58 PM
to sphinx-for-chinese
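Thank you very much for your reply!
For example, take the sentence 中国陕西安全部. I have already segmented it with a segmentation program (space-delimited), and the result, stored in a table column, looks like this:
中国 陕西 安全部

When I use
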
ngram_len = 1
ngram_chars = U+3000..U+2FA1F

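and match with SPH_MATCH_ALL, queries such as "中", "西安", and "国陕" all return results. What I want is for only the queries "中国", "陕西", and "安全部" to return results; in other words, I want tokenization to follow the spaces.

(I am using the official, unmodified Sphinx 0.9.9-rc2.)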
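
The behavior described above can be reproduced with a small Python sketch. It is only a simplified model, not Sphinx's actual implementation: it assumes that 1-gram indexing turns every CJK character into its own token and that SPH_MATCH_ALL requires every query token to be present in the document. The function names `unigram_tokens` and `match_all` are made up for illustration.

```python
def unigram_tokens(text):
    """Split text into single-character tokens, skipping whitespace.

    Simplified stand-in for 1-gram indexing of CJK text.
    """
    return [ch for ch in text if not ch.isspace()]


def match_all(query, document, tokenize):
    """Return True if every token of the query occurs in the document.

    Simplified stand-in for an "all words" match; position is ignored.
    """
    doc_tokens = set(tokenize(document))
    return all(tok in doc_tokens for tok in tokenize(query))


doc = "中国 陕西 安全部"
for query in ["中", "西安", "国陕", "中国"]:
    # Under this model each of these queries matches, because every
    # character of each query appears somewhere in the document.
    print(query, match_all(query, doc, unigram_tokens))
```

Under this model, "西安" matches because both 西 and 安 occur in the document, even though they come from two different pre-segmented words.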



Zhuguo Shi

Sep 3, 2009, 11:15:44 PM
to sphinx-fo...@googlegroups.com
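Hello,

If you want results only for the queries "中国", "陕西", and "安全部", then do not use unigram segmentation; that is, do not put the following in the configuration file:
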

ngram_len = 1
ngram_chars = U+3000..U+2FA1F

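With unigram segmentation, every single character is effectively indexed as its own token, which hurts search quality. That is also why Chinese word segmentation was added to Sphinx.

Just remove the ngram settings from the configuration file and then set up the Chinese segmentation dictionary as described in the sphinx-for-chinese instructions.

Give it a try, and feel free to contact me again if you run into problems. Thanks.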


Zhuguo Shi

Sep 3, 2009, 11:18:24 PM
to sphinx-fo...@googlegroups.com

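One more point: it seems hard to meet your requirement with the unmodified version of Sphinx unless you modify the source yourself so that it uses neither word segmentation nor ngram and simply splits on spaces. If you have some programming experience, the change should not be much trouble.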
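
To make the goal concrete, here is a minimal Python sketch of whitespace-only tokenization. It is not Sphinx code and does not show how to patch the Sphinx source; it only models the desired result, assuming the text was pre-segmented with spaces and that a query matches only when each of its space-separated terms equals a whole document token. The names `whitespace_tokens` and `match_all_terms` are hypothetical.

```python
def whitespace_tokens(text):
    """Split pre-segmented text on runs of whitespace only."""
    return text.split()


def match_all_terms(query, document):
    """Return True if every whitespace-separated query term is a
    whole token of the document (no per-character matching)."""
    doc_tokens = set(whitespace_tokens(document))
    query_tokens = whitespace_tokens(query)
    # An empty query matches nothing.
    return bool(query_tokens) and all(tok in doc_tokens for tok in query_tokens)


doc = "中国 陕西 安全部"
for query in ["中国", "陕西", "安全部", "中", "西安", "国陕"]:
    print(query, match_all_terms(query, doc))
```

In this model only the whole words produced by the external segmenter match; fragments such as "西安" that straddle a word boundary do not.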
