LocalProtobufMeCabTokenizer


Franz Allan Valencia See

Jan 12, 2010, 9:29:25 PM
to cmecab-j...@googlegroups.com
Good day,

If I understood http://code.google.com/p/cmecab-java/wiki/TokenizerComparison correctly, LocalProtobufMeCabTokenizer is faster than StandardMeCabTokenizer but the search results would be the same.

So I tried using LocalProtobufMeCabTokenizer on my local machine. However, it seems to accept only UTF-8, and I need it to work with EUC-JP. Is there any way to work around this?
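For context, here is a minimal sketch of one possible workaround: transcoding the text from EUC-JP to UTF-8 before it reaches the tokenizer. It assumes the input arrives as EUC-JP bytes and that the JDK in use ships the EUC-JP charset. The class and method names (EucJpToUtf8, decodeEucJp, toUtf8) are hypothetical and are not part of CMeCab-Java. It does not help if the MeCab dictionary itself is built as EUC-JP.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of CMeCab-Java.
public class EucJpToUtf8 {
    // Assumption: the runtime provides the EUC-JP charset (standard full JDKs do).
    private static final Charset EUC_JP = Charset.forName("EUC-JP");

    /** Decodes EUC-JP bytes into a Java String. */
    public static String decodeEucJp(byte[] eucJpBytes) {
        return new String(eucJpBytes, EUC_JP);
    }

    /** Re-encodes EUC-JP bytes as UTF-8 bytes. */
    public static byte[] toUtf8(byte[] eucJpBytes) {
        return decodeEucJp(eucJpBytes).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Sample text written with Unicode escapes so the source file encoding does not matter.
        String original = "\u65e5\u672c\u8a9e";
        byte[] eucJp = original.getBytes(EUC_JP);
        byte[] utf8 = toUtf8(eucJp);
        // Round trip: the UTF-8 bytes decode back to the original text.
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(original));
    }
}
```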

Thanks,

--
Franz Allan Valencia See | Java Software Engineer
fran...@gmail.com
LinkedIn: http://www.linkedin.com/in/franzsee
Twitter: http://www.twitter.com/franz_see

Franz Allan Valencia See

Jan 13, 2010, 9:37:52 AM
to cmecab-j...@googlegroups.com
I'm now able to make Protobuf work. All I had to do was configure my MeCab dictionary to use UTF-8 and then reinstall it. After that, Protobuf works fine :-)

However, when I ran StandardTaggerTest and LocalProtobufTaggerTest, both took roughly 3000 ms for 1000 executions, so I did not see any speed gain. Is this expected?
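One thing that might blur a comparison like this is JVM warm-up, since early iterations can run before the JIT has compiled the hot paths. Below is a minimal timing sketch that runs some untimed warm-up iterations first. The class TokenizerTiming and the method timeMillis are hypothetical names, and the Runnable is only a placeholder for a call into either tokenizer. This is not how the project's own tests measure time.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical timing helper, not part of CMeCab-Java.
public class TokenizerTiming {
    /**
     * Runs the task warmUp times without timing it, then returns the
     * elapsed wall-clock time in milliseconds for `runs` timed executions.
     */
    public static long timeMillis(Runnable task, int warmUp, int runs) {
        for (int i = 0; i < warmUp; i++) {
            task.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            task.run();
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        StringBuilder sink = new StringBuilder();
        // Placeholder workload; replace with a call into the tokenizer under test.
        Runnable placeholder = () -> {
            sink.setLength(0);
            sink.append("sample text".toUpperCase());
        };
        long elapsed = timeMillis(placeholder, 200, 1000);
        System.out.println("1000 executions took " + elapsed + " ms");
    }
}
```

Timing both tokenizers through the same helper, with the same input and the same warm-up count, at least keeps the two measurements comparable.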

On a side note: compared to GoSen, CMeCab-Java's StandardMeCabAnalyzer seems to be slower for indexing (I haven't tried searching yet). In my current system, indexing takes about 40 seconds with GoSen, but when I swap in StandardMeCabAnalyzer it jumps to about 80 seconds. Is this expected as well?


Thanks,

--
Franz Allan Valencia See | Java Software Engineer
fran...@gmail.com
LinkedIn: http://www.linkedin.com/in/franzsee
Twitter: http://www.twitter.com/franz_see
