I'd like to announce that I have released two Japanese corpus readers
that you can use with NLTK:
http://lilyx.net/pages/nltkjapanesecorpus.html
The first one is for the KNB Corpus (Annotated blog corpus), which is a
morphologically annotated, parsed corpus of Japanese blog articles.
The corpus reader supports basic methods such as words(), as well as
parsed_sents() for displaying sentence dependency structure.
The second one, JEITA Public Morphologically Tagged Corpus (in ChaSen
format), is a public, automatically tagged (morphologically analyzed)
corpus of Project Sugita Genpaku (http://www.genpaku.org/) and Aozora
Bunko (http://www.aozora.gr.jp/), which themselves are freely
available text collections like Project Gutenberg. I've re-converted
the corpus data into ChaSen format and written a corpus reader for it,
so that the corpus can be accessed via the standard NLTK interface,
with methods such as tagged_words().
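For anyone unfamiliar with ChaSen format, here is a rough, stand-alone
sketch of what a tagged_words()-style view over such data involves. It
is not the released corpus reader. It assumes the usual ChaSen layout
(one morpheme per line, tab-separated fields with the surface form
first and the part of speech fourth, and "EOS" closing each sentence).
SAMPLE and the two helper functions are hypothetical names made up for
this illustration.

```python
# Illustrative sketch only; NOT the corpus reader announced above.
# Assumed ChaSen layout: surface, reading, base form, part of speech, ...
# separated by tabs, one morpheme per line, "EOS" ending each sentence.

# Hypothetical three-morpheme sample, invented for this example.
SAMPLE = (
    "吾輩\tワガハイ\t吾輩\t名詞-代名詞-一般\t\t\n"
    "は\tハ\tは\t助詞-係助詞\t\t\n"
    "猫\tネコ\t猫\t名詞-一般\t\t\n"
    "EOS\n"
)


def tagged_sents(text):
    """Return a list of sentences, each a list of (surface, pos) pairs."""
    sents, current = [], []
    for line in text.splitlines():
        if line.strip() == "EOS":
            # Sentence boundary: flush the morphemes collected so far.
            if current:
                sents.append(current)
                current = []
            continue
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        surface = fields[0]
        # Fourth field is assumed to hold the part-of-speech tag.
        pos = fields[3] if len(fields) > 3 else ""
        current.append((surface, pos))
    if current:
        # Tolerate input whose last sentence lacks a closing EOS.
        sents.append(current)
    return sents


def tagged_words(text):
    """Flatten tagged_sents() into a single list of (surface, pos) pairs."""
    return [pair for sent in tagged_sents(text) for pair in sent]


if __name__ == "__main__":
    for word, tag in tagged_words(SAMPLE):
        print(word, tag)
```

The actual reader may expose more of each line (reading, base form,
conjugation) than this sketch keeps.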
I'd appreciate it if you could take a look at and try out the corpus
data and/or corpus readers, so that we can find problems or bugs before
they're actually included in the NLTK distribution. (For example, as
I wrote on that page, the KNB corpus reader supports only morphology
and dependency information, not other annotations such as case and
anaphora.)
Thanks,
Hagiwara
--
Masato HAGIWARA
Product Planning & Development Department
Baidu Japan Inc.
http://www.baidu.jp/