Mecab-ko-dic Download


Justina Sisti

Jan 20, 2024, 3:45:57 PM
to trandumbbacpa

This dictionary was built with MeCab; it defines a format for the features adapted to the Korean language.
Since the Kuromoji tokenizer uses the same format for the morphological analysis (left cost + right cost + word cost), I tried to adapt the module to handle Korean with the mecab-ko-dic. I started with a POC that copies the Kuromoji module and adapts it for the mecab-ko-dic.
I used the same classes to build and read the dictionary but I had to make some modifications to handle the differences from the IPADIC and Japanese.
The resulting binary dictionary takes 28MB on disk. It is bigger than the IPADIC, but mainly because the source is bigger and contains a lot of
compound and inflect terms that define a group of terms and the segmentation that can be applied.
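
To make the cost model concrete, here is a rough sketch of how a candidate token contributes to a path cost (the names are made up for illustration, not the actual patch code):

final class CostModelSketch {
  // Each dictionary entry stores a left id, a right id and a word cost; a
  // connection matrix gives the cost of joining the right id of the previous
  // morpheme to the left id of the next one.
  private final short[][] connectionCost; // [prevRightId][nextLeftId]

  CostModelSketch(short[][] connectionCost) {
    this.connectionCost = connectionCost;
  }

  /** Cost added when appending a candidate token to a partial path. */
  int appendCost(int prevRightId, int leftId, int wordCost) {
    return connectionCost[prevRightId][leftId] + wordCost;
  }
}

The tokenizer sums these costs along every candidate segmentation and keeps the cheapest one.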
I attached the patch that contains this new Korean module, called godori. It is an adaptation of the Kuromoji module, so currently
the two modules don't share any code. I wanted to validate the approach first and check the relevancy of the results. I don't speak Korean, so I used the relevancy
tests that were added for another Korean tokenizer ( -4956) and tested the output against mecab-ko, the official fork of MeCab for the mecab-ko-dic.
I had to simplify the JapaneseTokenizer: my version removes the nBest output and the decomposition of overly long tokens. I also
modified the handling of whitespace, since it is significant in Korean. Whitespace that appears before a term is attached to that term, and this
information is used to compute a penalty based on the part of speech of the token. The penalty cost is a feature added to mecab-ko to handle
morphemes that should not appear after a whitespace, and it is described on the mecab-ko page.
Handling whitespace this way is also more in line with the official MeCab library, which attaches whitespace to the term that follows.
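
To illustrate, here is a sketch of how such a penalty can be keyed on the part of speech of the token that follows the whitespace (the tags and the constant are illustrative, not the patch's actual values):

final class SpacePenaltySketch {
  enum PosTag { E /* verbal ending */, J /* particle */, NNG /* common noun */, VV /* verb */ }

  private static final int SPACE_PENALTY = 3000; // assumed value for illustration

  static int penalty(PosTag tag, int numPrecedingSpaces) {
    if (numPrecedingSpaces == 0) {
      return 0; // no whitespace attached to this token
    }
    switch (tag) {
      case E: // endings attach to the preceding verb stem
      case J: // particles attach to the preceding noun
        return SPACE_PENALTY;
      default:
        return 0;
    }
  }
}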
I also added a decompound filter that expands the compounds and inflects defined in the dictionary, and a part-of-speech filter, similar to the Japanese one,
that removes the morphemes that are not useful for relevance (suffix, prefix, interjection, ...). These filters don't play well with the tokenizer if it can
output multiple paths (nBest output for instance), so for simplicity I removed this ability and the Korean tokenizer only outputs the best path.
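
To show how the pieces are meant to compose, here is a hypothetical wiring of the chain (the class names are placeholders mirroring the Kuromoji layout, not the patch's final API):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;

public class KoreanAnalyzerSketch extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // Placeholder classes: tokenizer (best path only) -> decompound filter
    // -> part-of-speech filter, in the order described above.
    Tokenizer tokenizer = new KoreanTokenizer();
    TokenStream stream = new KoreanDecompoundFilter(tokenizer); // expand compounds/inflects
    stream = new KoreanPartOfSpeechStopFilter(stream); // drop suffix, prefix, interjection, ...
    return new TokenStreamComponents(tokenizer, stream);
  }
}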
I compared the results with mecab-ko to confirm that the analyzer is working, and ran the relevancy test defined in HantecRel.java, included
in the patch (written by Robert for another Korean analyzer). Here are the results:

I find the results very promising, so I plan to continue working on this project. I started to extract the parts of the code that could be shared with the
Kuromoji module, but I wanted to share the status and this POC first to confirm that the approach is viable. The advantages of using the same model as
the Japanese analyzer are multiple: we don't have a Korean analyzer at the moment, the resulting dictionary is small compared to other libraries that
use the mecab-ko-dic (the FST takes only 5.4MB), and the Tokenizer prunes the lattice on the fly to select the best path efficiently.
The dictionary can be built directly from the godori module with the following commands (the resource directory must be created first, since the dictionary itself is not included in the patch):

  mkdir -p lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict
  ant regenerate
I've also added some minimal tests in the module to play with the analysis.

The number of rules and exceptions in the language makes the development of a rule-based system quite complex, so most successful implementations of Korean morphological analysis rely on probabilistic modeling. The 21st Century Sejong Project, started in 1998 by the Korean government, initiated the creation of a large-scale Korean corpus. As this corpus was published over the years that followed, it became easier to build probabilistic models of Korean morphology. Today almost all the available morphological analyzers for Korean trace their origins to the 21st Century Sejong Project. The mecab-ko-dic is one of them: it uses MeCab (pronounced "mekabu"), a popular open source morphological analysis engine, to train a probabilistic model of Korean morphology from parts of the corpus created by the 21st Century Sejong Project.

There are 11,172 possible syllables in Korean, so the FST is very dense at the root and becomes sparse very quickly. It encodes the 811,757 terms included in the mecab-ko-dic with 171,397 nodes and 826,926 arcs in less than 5.4MB.
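
As a rough illustration of how such an FST can be built with Lucene's FST API (a minimal sketch against the Lucene 7.x API, not the module's actual build code; terms must be added in sorted order):

import java.io.IOException;
import org.apache.lucene.util.IntsRefBuilder;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

public class FstSketch {
  // Maps sorted Korean terms to their ordinals. BYTE2 input means each arc
  // label is a UTF-16 code unit, which gives the dense root described above.
  static FST<Long> build(String[] sortedTerms) throws IOException {
    Builder<Long> builder =
        new Builder<>(FST.INPUT_TYPE.BYTE2, PositiveIntOutputs.getSingleton());
    IntsRefBuilder scratch = new IntsRefBuilder();
    for (int ord = 0; ord < sortedTerms.length; ord++) {
      builder.add(Util.toUTF16(sortedTerms[ord], scratch), (long) ord);
    }
    return builder.finish();
  }
}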

Now that we have a well-constructed binary dictionary that can be used to efficiently look up any term that appears in the mecab-ko-dic, we can analyze text using the Viterbi algorithm to find the most likely segmentation (called the Viterbi path) of any input written in Korean. The figure below shows the Viterbi lattice built from the sentence 21세기 세종계획 (the 21st Century Sejong plan):
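
For readers who want the recurrence itself, here is a compact sketch of a Viterbi pass over such a lattice (illustrative only; for brevity it keeps a single best state per character offset, whereas the full algorithm keeps one state per connection id, and the real tokenizer prunes the lattice on the fly):

import java.util.Arrays;
import java.util.List;

final class ViterbiSketch {
  static final int INF = Integer.MAX_VALUE / 2;

  /** Lattice node spanning [start, end) with MeCab-style costs. */
  static final class Node {
    final int end, leftId, rightId, wordCost;
    Node(int end, int leftId, int rightId, int wordCost) {
      this.end = end; this.leftId = leftId; this.rightId = rightId; this.wordCost = wordCost;
    }
  }

  static int bestPathCost(List<Node>[] nodesByStart, short[][] connCost, int len) {
    int[] best = new int[len + 1];    // cheapest known cost to reach each offset
    int[] rightId = new int[len + 1]; // right id of the node ending the best path there
    Arrays.fill(best, INF);
    best[0] = 0;                      // beginning of sentence, right id 0 by convention
    for (int pos = 0; pos < len; pos++) {
      if (best[pos] == INF) continue; // no path reaches this offset
      for (Node n : nodesByStart[pos]) {
        int cost = best[pos] + connCost[rightId[pos]][n.leftId] + n.wordCost;
        if (cost < best[n.end]) {     // relax: keep the cheaper arrival
          best[n.end] = cost;
          rightId[n.end] = n.rightId;
        }
      }
    }
    return best[len];                 // cost of the most likely segmentation
  }
}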

The Seunjeon plugin also uses the mecab-ko-dic and has several options to customize the output. I used the latest version (6.1.1.1, as of May 29, 2018) with the default analyzer provided in the plugin and tested with and without the compress option (-Dseunjeon.compress=true) on two Elasticsearch configurations, one with a heap size of 512m and one with 4G.
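
For reference, the 512m configuration corresponds to settings along these lines in Elasticsearch's jvm.options (illustrative; the 4G run just swaps the heap values):

  # heap size for the 512m configuration
  -Xms512m
  -Xmx512m
  # Seunjeon's dictionary compression flag, passed as a JVM system property
  -Dseunjeon.compress=true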
