ProtoplasmTokenFilter?

Franz Allan Valencia See

Jan 19, 2010, 8:49:43 AM

to cmecab-j...@googlegroups.com

Good day,

In my Lucene usage, I need to be able to match 引っ越し with 引越し, with 引っ越, and with 引越.

My idea for doing this with CMeCab-Java is to use the MeCab features, particularly 原形, or the Protoplasm (is this the right translation?).

If I'm not mistaken, 原形 is the 'base word', and its derivatives are those in the 1st column, 表層形 (roughly translated as Shaped surface?). If so, I'm thinking of creating a ProtoplasmTokenFilter that takes the tokens produced by a MeCabAnalyzer and converts each token value into its Protoplasm. With this, words such as 引き込む, 引き込ま, 引き込も, etc. would all be transformed into their 原形, which is 引き込む. This in turn would allow searches for 引き込む to match 引き込ま, 引き込も, and its other 表層形.
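To make the idea concrete, here is a minimal sketch in plain Java (no Lucene or CMeCab-Java classes) of the normalization step such a filter would perform. It assumes the filter can see MeCab's comma-separated feature string for each token and that the string follows the IPADIC layout, where 原形 is the seventh field. The class name, method name, and the sample feature string are illustrative only, and falling back to the surface form when 原形 is missing or `*` is a design choice of this sketch.

```java
class BaseFormSketch {
    // Assumption: IPADIC-style features, where 原形 is the 7th field (index 6).
    static final int BASE_FORM_INDEX = 6;

    /**
     * Returns the 原形 found in the feature string, or the surface form
     * when the feature string has no usable 原形 (too short, empty, or "*").
     */
    static String baseForm(String surface, String feature) {
        if (feature == null) {
            return surface;
        }
        // limit -1 keeps trailing empty fields instead of dropping them
        String[] fields = feature.split(",", -1);
        if (fields.length <= BASE_FORM_INDEX) {
            return surface;
        }
        String base = fields[BASE_FORM_INDEX];
        return (base.isEmpty() || base.equals("*")) ? surface : base;
    }

    public static void main(String[] args) {
        // Illustrative feature string for the surface form 引き込ま
        String feature = "動詞,自立,*,*,五段・マ行,未然形,引き込む,ヒキコマ,ヒキコマ";
        System.out.println(baseForm("引き込ま", feature));
    }
}
```

For searches to match, the same normalization would presumably have to be applied both when indexing and when analyzing the query.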

For my specific scenario, I would edit the entries for 引っ越し, 引越し, and 引越, and add an entry for 引っ越, so that their 原形 are all 引越.

For example:
I'll edit

引っ越し,1283,1283,4483,名詞,サ変接続,*,*,*,*,引っ越し,ヒッコシ,ヒッコシ
引越し,1283,1283,4454,名詞,サ変接続,*,*,*,*,引越し,ヒッコシ,ヒッコシ
引越,1285,1285,5624,名詞,一般,*,*,*,*,引越,ヒッコシ,ヒッコシ

To become

引っ越し,1283,1283,4483,名詞,サ変接続,*,*,*,*,引越,ヒッコシ,ヒッコシ
引越し,1283,1283,4454,名詞,サ変接続,*,*,*,*,引越,ヒッコシ,ヒッコシ
引越,1285,1285,5624,名詞,一般,*,*,*,*,引越,ヒッコシ,ヒッコシ

And then I'll add another entry

引っ越,1283,1283,4454,名詞,サ変接続,*,*,*,*,引越,ヒッコシ,ヒッコシ

With this edit in the dictionary, plus the ProtoplasmTokenFilter, I would be able to match 引っ越し with 引越し, with 引っ越, and with 引越.
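As a quick sanity check of the edited entries, the following sketch parses the four dictionary lines given above and confirms that they all share one 原形. It assumes the CSV layout shown in the entries (surface form, three numeric columns, then the features), which puts 原形 in the eleventh column. The class and method names are made up for illustration.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;

class DictionaryEditCheck {
    // The three edited entries plus the newly added one, copied from the post.
    static final String[] EDITED_ENTRIES = {
        "引っ越し,1283,1283,4483,名詞,サ変接続,*,*,*,*,引越,ヒッコシ,ヒッコシ",
        "引越し,1283,1283,4454,名詞,サ変接続,*,*,*,*,引越,ヒッコシ,ヒッコシ",
        "引越,1285,1285,5624,名詞,一般,*,*,*,*,引越,ヒッコシ,ヒッコシ",
        "引っ越,1283,1283,4454,名詞,サ変接続,*,*,*,*,引越,ヒッコシ,ヒッコシ"
    };

    // Column 0 is the surface form; columns 1-3 are numeric; features start
    // at column 4, so 原形 (7th feature) sits at column 10.
    static final int SURFACE_COLUMN = 0;
    static final int BASE_FORM_COLUMN = 10;

    /** Maps each surface form to the 原形 stored in its dictionary line. */
    static Map<String, String> surfaceToBaseForm(String[] csvLines) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String line : csvLines) {
            String[] columns = line.split(",", -1);
            result.put(columns[SURFACE_COLUMN], columns[BASE_FORM_COLUMN]);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> mapping = surfaceToBaseForm(EDITED_ENTRIES);
        System.out.println(mapping);
        // All four surface forms should collapse to a single 原形.
        boolean unified = new HashSet<>(mapping.values()).size() == 1;
        System.out.println("single base form: " + unified);
    }
}
```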

However, since I am not that familiar with Japanese linguistics, I am not sure if this makes sense in terms of the Japanese language :-)

Does this make sense? ...or is there a better way to do this?

Thanks,
Franz Allan Valencia See | Java Software Engineer
fran...@gmail.com
LinkedIn: http://www.linkedin.com/in/franzsee
Twitter: http://www.twitter.com/franz_see

Franz Allan Valencia See

Jan 23, 2010, 10:07:40 AM

to cmecab-j...@googlegroups.com

Good day,

I've just created ProtoplasmTokenFilter, which reads the token's type, extracts the protoplasm, and uses that as the token value.

Attached is the patch. Kindly review.

Thanks,

ProtoplasmTokenFilter.patch

Franz Allan Valencia See

Jan 27, 2010, 10:21:26 AM

to cmecab-j...@googlegroups.com

I've attached a newer version of the patch. This one avoids an additional JNI call (#feature) and instead uses the existing token type.

Cheers,

ProtoplasmTokenFilter.patch
