NLTK Japanese Corpora

Masato Hagiwara

Jul 27, 2010, 10:42:49 AM

to nltk-j...@googlegroups.com

Hi All,

I'd like to announce that I have released two Japanese corpus readers
that you can use with NLTK:

http://lilyx.net/pages/nltkjapanesecorpus.html

The first one is for the KNB Corpus (an annotated blog corpus), which is
a morphologically annotated, parsed corpus of Japanese blog articles.
The corpus reader supports basic methods such as words(), as well as
parsed_sents() for displaying sentence dependency structure.
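
For anyone who wants to try the two methods mentioned above, here is a
minimal sketch of how a reader might be exercised. Only words() and
parsed_sents() come from the announcement. The helper preview_corpus,
the StubReader class and its sample data are hypothetical stand-ins so
that the snippet runs without the corpus installed; the real reader's
tree type will differ from the tuple used here.

```python
from itertools import islice


def preview_corpus(reader, n_words=5):
    """Return the first few words and the first parsed sentence.

    `reader` is any object exposing words() and parsed_sents(),
    the two methods named in the announcement.
    """
    first_words = list(islice(reader.words(), n_words))
    first_parse = next(iter(reader.parsed_sents()), None)
    return first_words, first_parse


class StubReader:
    """Hypothetical stand-in so the sketch runs without corpus data."""

    def words(self):
        return ["これ", "は", "テスト", "です"]

    def parsed_sents(self):
        # Assumption: a (head, dependents) tuple stands in for the
        # dependency tree object that the real reader would return.
        return [("です", ["これは", "テスト"])]


words, parse = preview_corpus(StubReader(), n_words=3)
print(words)
print(parse)
```

With the real data installed, the same function should work by passing
the actual KNB reader object instead of StubReader().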

The second one is for the JEITA Public Morphologically Tagged Corpus
(in ChaSen format), a public, automatically tagged (morphologically
analyzed) corpus of texts from Project Sugita Genpaku
(http://www.genpaku.org/) and Aozora Bunko (http://www.aozora.gr.jp/),
which are themselves freely available text collections like Project
Gutenberg. I've re-converted the corpus data into ChaSen format and
written a corpus reader for it, so that the corpus can be accessed via
the standard NLTK interface, such as tagged_words().
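
To give a rough idea of what reading ChaSen-format data involves, here
is a small standalone sketch. It is not the released corpus reader. It
assumes tab-separated lines with the surface form in the first column
and the part of speech in the fourth, and a line containing only "EOS"
at the end of each sentence. The function name parse_chasen and the
sample text are made up for illustration.

```python
def parse_chasen(text):
    """Parse ChaSen-style output into sentences of (surface, pos) pairs.

    Assumed layout: surface, reading, base form, part of speech, ...
    separated by tabs, with "EOS" closing each sentence.
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line == "EOS":
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        surface = cols[0]
        pos = cols[3] if len(cols) > 3 else ""
        current.append((surface, pos))
    if current:  # keep a final sentence that lacks an EOS marker
        sentences.append(current)
    return sentences


# Made-up sample in the assumed column layout.
SAMPLE = (
    "吾輩\tワガハイ\t吾輩\t名詞-代名詞-一般\t\t\n"
    "は\tハ\tは\t助詞-係助詞\t\t\n"
    "猫\tネコ\t猫\t名詞-一般\t\t\n"
    "EOS\n"
)

tagged_sents = parse_chasen(SAMPLE)
# Flatten the sentences into one list of (word, tag) pairs.
tagged_words = [pair for sent in tagged_sents for pair in sent]
print(tagged_words)
```

The flattened list has the same (word, tag) shape that a
tagged_words() method returns in other NLTK corpus readers, although
the tag string produced by the released reader may contain more fields
than the part of speech alone.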

I'd appreciate it if you could take a look at the corpus data and/or
corpus readers and try them out, so that we can find problems or bugs
before they're actually included in the NLTK distribution. (For example,
as I wrote on that page, the KNB corpus reader doesn't support
information other than morphology and dependency, such as case and
anaphora.)

Thanks,

Hagiwara

--
Masato HAGIWARA
Product Planning & Development Department
Baidu Japan Inc.
http://www.baidu.jp/
