LocalProtobufMeCabAnalyzer speed improvement

Franz Allan Valencia See

Jan 16, 2010, 8:41:28 AM
to cmecab-j...@googlegroups.com
I've done some more testing.

Our GoSen indexing takes about 5 seconds.
LocalProtobufMeCabAnalyzer indexing takes about 80 seconds.

Despite this, I am still leaning towards a CMeCab solution to my problem, since GoSen is a dead project and you can hardly find any information about it (e.g. maintaining its dictionary is not as straightforward as maintaining MeCab's).

In light of this, I tried to find ways to improve the performance of LocalProtobufMeCabAnalyzer. My idea was to reduce the number of JNI invocations by adding a mechanism to LocalProtobufMeCabAnalyzer so that only one LocalProtobufTagger is created and then reused for every invocation of #tokenStream(..).
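A minimal sketch of that reuse idea, for readers following along. This is not the attached patch and not the cmecab-java API: `Tagger` and `TaggerCache` are hypothetical names, and the sketch assumes the tagger is expensive to create and not safe to share between threads, so it keeps one instance per thread.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for an expensive, native-backed tagger (not the cmecab-java API).
interface Tagger {
    String[] parse(String text);
}

// Caches one Tagger per thread so repeated calls do not pay the creation cost again.
final class TaggerCache {
    private final ThreadLocal<Tagger> perThread;

    TaggerCache(Supplier<Tagger> factory) {
        // ThreadLocal.withInitial invokes the factory lazily, once per thread.
        this.perThread = ThreadLocal.withInitial(factory);
    }

    // Returns the calling thread's tagger, creating it on first use.
    Tagger get() {
        return perThread.get();
    }
}
```

An analyzer would hold one `TaggerCache` and call `get()` inside each tokenStream call instead of constructing a new tagger. A native-backed tagger would also need an explicit release step when the thread is done with it, which this sketch leaves out.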

However, I am getting an OutOfMemoryError (I'm guessing because of the std::bad_alloc& thrown in Java_net_moraleboost_mecab_impl_LocalProtobufTagger__1parse).

Any idea on how to fix this?

Or how to improve the performance?

Thanks,

--
Franz Allan Valencia See | Java Software Engineer
fran...@gmail.com
LinkedIn: http://www.linkedin.com/in/franzsee
Twitter: http://www.twitter.com/franz_see
ReusableLocalProtobufTagger.patch

Franz Allan Valencia See

Jan 16, 2010, 11:04:33 AM
to cmecab-j...@googlegroups.com
Just an FYI.

I just did some quick tests.

LocalProtobufTagger creation:
average execution for 100,000 times : 0.30ms
standard deviation for 100,000 times : 0.25ms

LocalProtobufTagger parse (using same test data as testPerf()):
average execution for 100,000 times : 2.6ms
standard deviation for 100,000 times : 1.1ms
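
For anyone who wants to reproduce this kind of measurement, here is a rough sketch of how a mean and standard deviation over repeated runs could be collected. It is not the code used for the numbers above. `TimingStats` is a hypothetical helper that computes the population standard deviation and assumes at least one sample. It does no JIT warm-up, so early runs may skew the results.

```java
// Hypothetical helper: mean and population standard deviation of run times in milliseconds.
final class TimingStats {
    final double meanMs;
    final double stdDevMs;

    private TimingStats(double meanMs, double stdDevMs) {
        this.meanMs = meanMs;
        this.stdDevMs = stdDevMs;
    }

    // Computes the statistics for a non-empty array of samples.
    static TimingStats of(double[] samplesMs) {
        double sum = 0;
        for (double s : samplesMs) {
            sum += s;
        }
        double mean = sum / samplesMs.length;
        double squaredDiffs = 0;
        for (double s : samplesMs) {
            squaredDiffs += (s - mean) * (s - mean);
        }
        return new TimingStats(mean, Math.sqrt(squaredDiffs / samplesMs.length));
    }

    // Runs the task `runs` times and records the wall-clock duration of each run.
    static TimingStats measure(Runnable task, int runs) {
        double[] samples = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = (System.nanoTime() - start) / 1_000_000.0;
        }
        return of(samples);
    }
}
```

Calling `TimingStats.measure(task, 100_000)` with a task that creates a tagger, or one that parses the test data, would give figures comparable in form to the ones above.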

Cheers,




Kohei TAKETA

Jan 18, 2010, 9:06:52 AM
to cmecab-j...@googlegroups.com
Hello,

Sorry for the late reply.
I was away on a business trip last week.

And thank you for the patch. I'll merge it into the code tree.
Actually, I was trying to revamp the analyzers so that they reuse tokenizers.
CJKAnalyzer2 has already been modified to reuse the tokenizer it creates.
Please see
http://code.google.com/p/cmecab-java/source/browse/trunk/src/net/moraleboost/lucene/analysis/ja/CJKAnalyzer2.java

As for the bad performance of LocalProtobufMeCabAnalyzer, could you
tell me how much text you fed to the tokenizer?
Long text may cause OutOfMemoryError or std::bad_alloc because
LocalProtobufMeCabTokenizer buffers all tokens it creates.
If you want to tokenize very long text, I recommend that you use
StandardMeCabTokenizer or break the text into several chunks and
tokenize them one-by-one.
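
The chunking suggestion could look roughly like the sketch below. `TextChunker` is a hypothetical helper, not part of cmecab-java. It assumes the Japanese full stop (U+3002) and newline are acceptable places to cut. When a window contains neither, it falls back to a hard cut, which could in principle split a word or a surrogate pair.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that splits long text into bounded chunks for separate tokenization.
final class TextChunker {
    // Splits text into chunks of at most maxChars chars, preferring to cut just
    // after the last sentence delimiter (U+3002 or '\n') found inside each window.
    static List<String> chunk(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + maxChars, text.length());
            if (end < text.length()) {
                // Search backwards for a sentence boundary inside the window.
                for (int i = end - 1; i > start; i--) {
                    char c = text.charAt(i);
                    if (c == '\u3002' || c == '\n') {
                        end = i + 1;
                        break;
                    }
                }
            }
            chunks.add(text.substring(start, end));
            start = end;
        }
        return chunks;
    }
}
```

Each chunk would then be passed to the tokenizer separately, so the number of buffered tokens is bounded by the chunk size rather than by the length of the whole document.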

Regards,
Kohei Taketa

Franz Allan Valencia See

Jan 18, 2010, 10:57:09 AM
to cmecab-j...@googlegroups.com
Good day,

No worries :-)

Re reusing analyzers: Cool, sounds nice :-)

Re OOME:
Hmm... I'm not sure whether any part of our data contains a long text during indexing. I'll double-check tomorrow. However, what we do have is a LOT of text to be indexed. Out of curiosity, when does the buffer get cleared?

Re performance:
I've attached some info regarding StandardTagger & LocalProtobufTagger performance. I haven't figured out how to improve the performance of these taggers, though :-)


Thanks,


2010/1/18 Kohei TAKETA <taks...@gmail.com>
--
This email was sent to subscribers of the Google Group "cmecab-java-users".
To post to this group, send an email to cmecab-j...@googlegroups.com.
To unsubscribe from this group, send an email to cmecab-java-us...@googlegroups.com.
For more information, visit this group at http://groups.google.com/group/cmecab-java-users?hl=ja.




cmecab-java-1.6-perf.tar.gz

Franz Allan Valencia See

Feb 1, 2010, 8:25:50 AM
to cmecab-j...@googlegroups.com
I found out why my GoSen run lasted only 5 seconds: it was throwing an exception and never finished. My bad :-P

Anyway, as it stands, GoSen is only a bit faster than CMeCab-Java's StandardMeCabAnalyzer (around 80 secs to 120 secs). I am no longer using LocalProtobufMeCabAnalyzer because one of the environments I am going to be working in does not support UTF-8.

Cheers,



Kohei TAKETA

Feb 3, 2010, 9:20:18 AM
to cmecab-j...@googlegroups.com
Thank you very much for the profiling result. That is very valuable information.

Obviously, decoding byte strings to Java strings (CharsetUtils.decode)
is taking a very long time. This is reasonable because encoding
conversion between incompatible character sets is an inherently
expensive operation.

However, there may be room for optimization. For example, if native
code can convert encodings (with libiconv or something) faster than
Java, we can reduce total parsing time by moving encoding conversion
code into the native library. I'll try some possible solutions later.
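
Separately from the native-side idea above, one Java-side option that might be worth measuring is reusing a single CharsetDecoder instead of setting one up for every string. The sketch below is only an illustration: `ReusableDecoder` is a hypothetical class, it is not how CharsetUtils.decode is implemented, and whether it helps at all would have to be confirmed by profiling.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper that keeps one CharsetDecoder and reuses it. Not thread-safe.
final class ReusableDecoder {
    private final CharsetDecoder decoder;

    ReusableDecoder(Charset charset) {
        // Replace bad input with the replacement character instead of failing.
        this.decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
    }

    String decode(byte[] bytes) {
        try {
            // CharsetDecoder.decode(ByteBuffer) resets the decoder first, so the
            // same instance can be used for independent inputs.
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected with the REPLACE actions configured above.
            throw new IllegalStateException(e);
        }
    }
}
```

One decoder per tagger or per thread would be created with the dictionary's charset, for example `Charset.forName("EUC-JP")` where the JVM provides it, and reused for every token.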

By the way, I've revamped the trunk code to support the Lucene 2.9+
API. It also supports reusing tokenizers (see
StandardMeCabAnalyzer.reuseableTokenizer()). I would appreciate it
if you could try it out.

Regards,
Kohei Taketa

Franz Allan Valencia See

Feb 3, 2010, 10:50:03 AM
to cmecab-j...@googlegroups.com
Thanks, I will :-) However, I might not be able to get to it soon, since I'm still preoccupied with something else.

