Good day,
In my Lucene usage, I need to be able to match 引っ越し with 引越し, with 引っ越, and with 引越.
My idea for doing this with CMeCab-Java is to use the MeCab features, particularly 原形, or the Protoplasm (is this the right translation?).
If I'm not mistaken, 原形 is the 'base word' and its derivatives are those in the first column, 表層形 (roughly translated as 'shaped surface'?). If so, I'm thinking of creating a ProtoplasmTokenFilter that takes the tokens produced by a MeCabAnalyzer and converts each token value into its Protoplasm. With this, words such as 引き込む, 引き込ま, 引き込も, etc. will all be transformed into their 原形, which is 引き込む. This in turn would allow searches for 引き込む to match 引き込ま, 引き込も, and its other 表層形.
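As a rough sketch of the normalization step such a filter would perform, the snippet below pulls 原形 out of one line of MeCab-style output ("表層形<TAB>features"). It deliberately avoids the Lucene and CMeCab-Java APIs. BaseFormNormalizer is a hypothetical name, and the position of 原形 (seventh feature) assumes the IPADIC feature layout.

```java
// Hypothetical helper, not part of Lucene or CMeCab-Java.
final class BaseFormNormalizer {
    // Assumption: IPADIC layout, where 原形 is the 7th comma-separated feature.
    private static final int BASE_FORM_INDEX = 6;

    private BaseFormNormalizer() {}

    /** Returns 原形 for one "surface<TAB>features" line, or the surface form if none is given. */
    static String baseForm(String mecabLine) {
        int tab = mecabLine.indexOf('\t');
        if (tab < 0) {
            return mecabLine; // no feature part at all
        }
        String surface = mecabLine.substring(0, tab);
        String[] features = mecabLine.substring(tab + 1).split(",", -1);
        if (features.length <= BASE_FORM_INDEX) {
            return surface; // too few features to contain 原形
        }
        String base = features[BASE_FORM_INDEX];
        // "*" marks a missing value, e.g. for words not found in the dictionary.
        return (base.isEmpty() || base.equals("*")) ? surface : base;
    }
}
```

For an illustrative input line such as "引き込ま<TAB>動詞,自立,*,*,五段・マ行,未然形,引き込む,ヒキコマ,ヒキコマ", baseForm returns 引き込む. A real filter would apply the same mapping to each token's term text at both index time and query time.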
And in terms of my specific scenario, I would edit the entries for 引っ越し, 引越し, and 引越, and add an entry for 引っ越, so that their 原形 are all 引越.
For example:
I'll edit
引っ越し,1283,1283,4483,名詞,サ変接続,*,*,*,*,引っ越し,ヒッコシ,ヒッコシ
引越し,1283,1283,4454,名詞,サ変接続,*,*,*,*,引越し,ヒッコシ,ヒッコシ
引越,1285,1285,5624,名詞,一般,*,*,*,*,引越,ヒッコシ,ヒッコシ
To become
引っ越し,1283,1283,4483,名詞,サ変接続,*,*,*,*,引越,ヒッコシ,ヒッコシ
引越し,1283,1283,4454,名詞,サ変接続,*,*,*,*,引越,ヒッコシ,ヒッコシ
引越,1285,1285,5624,名詞,一般,*,*,*,*,引越,ヒッコシ,ヒッコシ
And then I'll add another entry
引っ越,1283,1283,4454,名詞,サ変接続,*,*,*,*,引越,ヒッコシ,ヒッコシ
With this edit in the dictionary, plus the ProtoplasmTokenFilter, I would be able to match 引っ越し with 引越し, with 引っ越, and with 引越.
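The dictionary edit described above could also be scripted rather than done by hand. This is only a sketch: DictionaryEntryEditor is a hypothetical name, and it assumes the row layout shown in the entries above (表層形, two context ids, cost, then the features, which puts 原形 in the 11th column) with no quoted commas inside fields.

```java
// Hypothetical helper for rewriting the 原形 column of a dictionary CSV row.
final class DictionaryEntryEditor {
    // Assumption: surface, left id, right id, cost, then features; 原形 is column 11 (index 10).
    private static final int BASE_FORM_COLUMN = 10;

    private DictionaryEntryEditor() {}

    /** Returns a copy of the row with its 原形 column replaced by newBaseForm. */
    static String withBaseForm(String csvRow, String newBaseForm) {
        String[] columns = csvRow.split(",", -1); // -1 keeps trailing empty fields
        if (columns.length <= BASE_FORM_COLUMN) {
            throw new IllegalArgumentException("Too few columns: " + csvRow);
        }
        columns[BASE_FORM_COLUMN] = newBaseForm;
        return String.join(",", columns);
    }
}
```

Applied to the 引っ越し row with 引越 as the new value, this yields the edited row listed above. The dictionary would still need to be recompiled afterwards for the change to take effect.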
However, since I am not that familiar with Japanese linguistics, I am not sure if this makes sense in terms of the Japanese language :-)
Does this make sense? ...or is there a better way to do this?
Thanks,
Franz Allan Valencia See | Java Software Engineer
fran...@gmail.com
LinkedIn: http://www.linkedin.com/in/franzsee
Twitter: http://www.twitter.com/franz_see