StandardMeCabAnalyzerTest failing on my machine

Franz Allan Valencia See

Jan 6, 2010, 6:52:09 AM
to cmecab-j...@googlegroups.com
Good day,

I've just installed MeCab, protobuf, and CMeCab on my Xubuntu 9.10 64-bit machine and ran StandardMeCabAnalyzerTest using 64-bit Sun Java 5, with the 'net.moraleboost.mecab.encoding' JVM property set to Shift_JIS and java.library.path pointing to the installation directories of MeCab, protobuf, and CMeCab.

However, I am getting a test failure for both testAnalyze and testSearch.

testAnalyze
java.lang.AssertionError: Index term not found.
    at org.junit.Assert.fail(Assert.java:91)
    at net.moraleboost.lucene.analysis.ja.StandardMeCabAnalyzerTest.testAnalyze(StandardMeCabAnalyzerTest.java:88)

testSearch
java.lang.AssertionError: No hit.
    at org.junit.Assert.fail(Assert.java:91)
    at net.moraleboost.lucene.analysis.ja.StandardMeCabAnalyzerTest.testSearch(StandardMeCabAnalyzerTest.java:118)

Is there anything else I need to configure to make my StandardMeCabAnalyzerTest pass?

My investigation so far:

I also tried running StandardMeCabAnalyzerTest's Japanese inputs against MeCab directly.

Given
本日は晴天なり。
晴天
本日も晴天

I got
本日    名詞,副詞可能,*,*,*,*,本日,ホンジツ,ホンジツ
は    助詞,係助詞,*,*,*,*,は,ハ,ワ
晴天    名詞,一般,*,*,*,*,晴天,セイテン,セイテン
なり    助動詞,*,*,*,文語・ナリ,基本形,なり,ナリ,ナリ
。    記号,句点,*,*,*,*,。,。,。
EOS
晴天    名詞,一般,*,*,*,*,晴天,セイテン,セイテン
EOS
本日    名詞,副詞可能,*,*,*,*,本日,ホンジツ,ホンジツ
も    助詞,係助詞,*,*,*,*,も,モ,モ
晴天    名詞,一般,*,*,*,*,晴天,セイテン,セイテン
EOS

This means my MeCab installation parsed the Japanese input strings correctly (but somehow StandardMeCabAnalyzer didn't).
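
For comparing MeCab's result with the analyzer's tokens, the surface forms can be pulled out of output like the above with a few lines of Java. This is only a sketch: it assumes MeCab's default output format (surface form, a tab, then the comma-separated features, with EOS closing each sentence), and the class name MeCabOutputSurfaces is made up for illustration. It is not part of CMeCab. The sample strings use Unicode escapes so the result does not depend on the source file encoding.

```java
import java.util.ArrayList;
import java.util.List;

public class MeCabOutputSurfaces {
    /**
     * Collects the surface form of every token in MeCab's default output,
     * returning one list per sentence (sentences end at an "EOS" line).
     * Assumption: each token line is "surface<TAB>features".
     */
    public static List<List<String>> surfaces(String mecabOutput) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : mecabOutput.split("\\R")) {
            if (line.equals("EOS")) {
                // End of sentence: store the tokens gathered so far.
                sentences.add(current);
                current = new ArrayList<>();
            } else if (!line.isEmpty()) {
                int tab = line.indexOf('\t');
                // Keep only the text before the first tab (the surface form).
                current.add(tab >= 0 ? line.substring(0, tab) : line);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Shortened sample: "seiten" alone, then "honjitsu mo seiten".
        String output =
              "\u6674\u5929\tnoun\n"
            + "EOS\n"
            + "\u672c\u65e5\tnoun\n"
            + "\u3082\tparticle\n"
            + "\u6674\u5929\tnoun\n"
            + "EOS\n";
        List<List<String>> result = surfaces(output);
        System.out.println(result.size());        // number of sentences
        System.out.println(result.get(1).size()); // tokens in the second sentence
    }
}
```

If the analyzer returns a single token where this returns several, the two sides disagree on tokenization, which is what the search explanation below points to.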

Also, searching for 'text:本*' gave me the following Lucene searcher explanation:

Searching for 'text:本*':
    0.30685282 = (MATCH) fieldWeight(text:本日は晴天なり。 in 0), product of:
  1.0 = tf(termFreq(text:本日は晴天なり。)=1)
  0.30685282 = idf(docFreq=1, numDocs=1)
  1.0 = fieldNorm(field=text, doc=0)

Which means that there was only one document and '本日は晴天なり。' was tokenized as {'本日は晴天なり。'} (instead of {'本日', 'は', '晴天', 'なり', '。' } as my local MeCab did).

Thanks,

--
Franz Allan Valencia See | Java Software Engineer
fran...@gmail.com
LinkedIn: http://www.linkedin.com/in/franzsee
Twitter: http://www.twitter.com/franz_see

Kohei TAKETA

Jan 6, 2010, 7:18:46 AM
to cmecab-j...@googlegroups.com
Hello Franz,

The value of net.moraleboost.mecab.encoding must match the
encoding of your MeCab dictionary.

On Linux systems, that is most likely "UTF-8" or "EUC-JP".
If you installed MeCab with apt-get, the dictionary encoding is
"EUC-JP" by default (at least on my Debian box).
If you also installed the mecab-ipadic-utf8 or mecab-jumandic-utf8
package, the encoding is "UTF-8".
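
A small sketch of why a mismatch matters, assuming the configured charset is what turns Java strings into the bytes handed to MeCab (that is an assumption about CMeCab's internals, not something confirmed in this thread). The class EncodingMismatchDemo and its roundTrip helper are hypothetical. The example encodes the test sentence in one charset and decodes it in another, which is roughly what an EUC-JP dictionary sees when it is fed Shift_JIS bytes.

```java
import java.nio.charset.Charset;

public class EncodingMismatchDemo {
    /** Encodes text with one charset and decodes the bytes with another. */
    public static String roundTrip(String text, String encodeAs, String decodeAs) {
        byte[] bytes = text.getBytes(Charset.forName(encodeAs));
        return new String(bytes, Charset.forName(decodeAs));
    }

    public static void main(String[] args) {
        // The test sentence "honjitsu wa seiten nari.", written with Unicode
        // escapes so the source file encoding does not affect the result.
        String text = "\u672c\u65e5\u306f\u6674\u5929\u306a\u308a\u3002";

        // Matching charsets: the text survives unchanged.
        System.out.println(text.equals(roundTrip(text, "EUC-JP", "EUC-JP")));    // true

        // Shift_JIS bytes read as EUC-JP: the text comes back garbled.
        System.out.println(text.equals(roundTrip(text, "Shift_JIS", "EUC-JP"))); // false
    }
}
```

With garbled input, a morphological analyzer has no dictionary entries to match, so getting the whole sentence back as one token is a plausible outcome.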

Regards,
Kohei Taketa

On Jan 6, 2010, at 20:52, Franz Allan Valencia See <fran...@gmail.com> wrote:

Franz Allan Valencia See

Jan 6, 2010, 7:28:41 AM
to cmecab-j...@googlegroups.com
I see. Thanks for the info.

I just retraced my steps, and you're right.

When I ran the `mecab` command on Linux, my input text was in EUC-JP. However, when I ran StandardMeCabAnalyzerTest, I was using Shift_JIS. Feeding Shift_JIS input to the `mecab` command did not produce readable output (either that, or I couldn't figure out the output encoding). Once I set net.moraleboost.mecab.encoding to 'EUC-JP', StandardMeCabAnalyzerTest passed :-)
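
For anyone hitting the same problem, here is a rough pre-flight check that could run before the tests. It only reads the property and confirms the JVM knows the charset; it cannot tell whether the value matches the dictionary. The class MeCabEncodingCheck and the fallback argument are hypothetical and not part of CMeCab. The property is normally passed on the command line as -Dnet.moraleboost.mecab.encoding=EUC-JP; the System.setProperty call in main is only there to make the sketch self-contained, and whether CMeCab picks up a value set that late is not something I have verified.

```java
import java.nio.charset.Charset;

public class MeCabEncodingCheck {
    // Property name taken from this thread.
    public static final String KEY = "net.moraleboost.mecab.encoding";

    /**
     * Returns the charset named by the property, or by the given fallback
     * when the property is unset. Fails fast on names the JVM cannot handle.
     */
    public static Charset configuredCharset(String fallback) {
        String name = System.getProperty(KEY, fallback);
        if (!Charset.isSupported(name)) {
            throw new IllegalArgumentException(
                "Charset not supported by this JVM: " + name);
        }
        return Charset.forName(name);
    }

    public static void main(String[] args) {
        // Stand-in for -Dnet.moraleboost.mecab.encoding=EUC-JP
        System.setProperty(KEY, "EUC-JP");
        System.out.println(configuredCharset("UTF-8"));
    }
}
```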


Thanks,
--
Franz Allan Valencia See | Java Software Engineer
fran...@gmail.com
LinkedIn: http://www.linkedin.com/in/franzsee
Twitter: http://www.twitter.com/franz_see


2010/1/6 Kohei TAKETA <taks...@gmail.com>