Updating KoNLPy twitter-korean-text integration to newest tkt version


John Armstrong

Apr 22, 2016, 11:17:06 AM
to KoNLPy

Annyeonghaseyo!

I’m a newbie here. I searched the group but didn’t see anything about my question. Apologies if I missed something. I am just learning Korean, and the answer may be in a Korean thread that I can’t understand.

I’ve been working for several months with jhannanum and I’ve just set up konlpy. My interest is in using konlpy to do cross-tagger accuracy tests, in order to see how jhannanum compares to the other taggers konlpy supports, and possibly to switch to another tagger if I find one that’s significantly more accurate than jhannanum. (By accuracy I mean primarily giving correct morpheme sequences and correct or close-to-correct tags for a given eojeol in context.)

Yesterday I set up the Scala and Java source code for twitter-korean-text in Eclipse, and also the Java part of konlpy. I found that I could not get konlpy's Java wrapper for twitter-korean-text, kr/lucypark/tkt/TktInterface.java, to link properly with the twitter-korean-text Java project that I had set up. I figured out the reason: konlpy targets twitter-korean-text-2.4.3 and the version I got from GitHub is twitter-korean-text-4.4. There have been non-trivial changes to the twitter-korean-text file com/twitter/penguin/korean/TwitterKoreanProcessorJava.java that the konlpy file imports from, and these break the konlpy code. (Most obviously, the Builder static inner class has been eliminated.)

I would like to use the latest twitter-korean-text for my testing. I can do so by modifying TktInterface.java (or by making a separate version for the newer twitter-korean-text release), but I’m not sure of the best way to do it. Before I try, I would like to ask whether others in this group have already done this work and made their changes publicly available.

Thanks for any help you can give me, gomapseumnida!


-- John in Cambridge MA

John Armstrong

Apr 23, 2016, 9:36:52 PM
to KoNLPy

I solved my problem.

The main difference between the twitter-korean-text 2.4 and 4.4 Java wrappers is that the 4.4 one is better decomposed into atomic operations than the 2.4 one seems to be. (I think; I never found the 2.4 source code.) The 4.4 operations are designed to be composed into more complex operations, however, and it turned out to be pretty easy to use them to implement the two APIs of Lucy Park’s TktInterface.

(Some of the 4.4 APIs accept and/or return Scala objects, but these can be treated as magic cookies that you get from one method and pass to another.  The Java wrapper provides methods for converting them to Java objects for direct use in Java code.)
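To make the composition idea concrete, here is a minimal, self-contained sketch of the pattern. It does not use the real twitter-korean-text classes: OpaqueTokens, StubProcessor, and FacadeSketch are hypothetical stand-ins, the normalize and stem bodies are toy placeholders, and the tokenize(text, norm, stem) signature is only an assumption about what a TktInterface-style method might look like. The point is just that the caller passes the opaque token object from one step to the next and converts it to a Java list at the very end.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the Scala token sequence: callers never look
// inside it, they only hand it from one operation to the next.
final class OpaqueTokens {
    private final List<String> items;

    OpaqueTokens(List<String> items) {
        this.items = Collections.unmodifiableList(new ArrayList<>(items));
    }

    List<String> items() {
        return items;
    }
}

// Hypothetical stand-in for a wrapper that exposes small, atomic operations.
final class StubProcessor {
    // Toy normalization: collapse runs of 3+ identical characters to 2.
    static CharSequence normalize(CharSequence text) {
        return text.toString().replaceAll("(.)\\1{2,}", "$1$1");
    }

    // Toy tokenization: split on whitespace, skipping empty pieces.
    static OpaqueTokens tokenize(CharSequence text) {
        List<String> out = new ArrayList<>();
        for (String piece : text.toString().trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return new OpaqueTokens(out);
    }

    // Placeholder for stemming: it only lower-cases each token.
    static OpaqueTokens stem(OpaqueTokens tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens.items()) {
            out.add(t.toLowerCase());
        }
        return new OpaqueTokens(out);
    }

    // Conversion step: turn the opaque object into a plain Java list.
    static List<String> toJavaStringList(OpaqueTokens tokens) {
        return new ArrayList<>(tokens.items());
    }
}

// One coarse-grained call built by composing the atomic steps above.
final class FacadeSketch {
    static List<String> tokenize(String text, boolean norm, boolean stem) {
        CharSequence input = norm ? StubProcessor.normalize(text) : text;
        OpaqueTokens tokens = StubProcessor.tokenize(input);
        if (stem) {
            tokens = StubProcessor.stem(tokens);
        }
        return StubProcessor.toJavaStringList(tokens);
    }
}
```

With the real library, each StubProcessor call would presumably be replaced by the corresponding method of the 4.4 Java wrapper, with the facade keeping the same shape.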

I will be happy to share the code if anyone is interested.   I wanted to be able to run both old and new tkt in the same JVM, and rather than mess with classloaders I simply changed the package names in the 4.4 source and recompiled.   Thus the imports in my 4.4-targeted TktInterface look like this:

import com.twitter.penguin.korean4.KoreanTokenJava;
import com.twitter.penguin.korean4.TwitterKoreanProcessorJava;
import com.twitter.penguin.korean4.phrase_extractor.KoreanPhraseExtractor;
import com.twitter.penguin.korean4.tokenizer.Sentence;
import com.twitter.penguin.korean4.tokenizer.KoreanTokenizer.KoreanToken;

To run with a standard twitter-korean-text-4.4.X.jar file you have to change korean4 back to korean.

-- John
