Hello.
I would like to extend NLTK by adding a new corpus, the National Corpus of Polish (NKJP):
http://nkjp.pl/index.php?page=0&lang=1
My suggested name for it is pol_nkjp.
NKJP is a freely available resource (see the 'tools and resources' section), licensed under GNU GPL v.3.
Link to the related issue in nltk_data:
https://github.com/nltk/nltk_data/issues/14
Every source in this corpus is organized into a few files, e.g. ann_morphosyntax.xml, text.xml, ann_words... I plan to create a reader derived from XMLCorpusReader. For each file in a source, I would like to create a view class derived from XMLCorpusView (with an overridden handle_elt function).
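To make the idea concrete, here is a rough sketch of what I have in mind. The class names SegView and ToyNKJPReader, the segments method, and the tiny XML snippet are only made up for illustration; the real NKJP files follow a richer TEI schema (with namespaces), so the actual views would need more work than this.

```python
import os
import tempfile

from nltk.corpus.reader.xmldocs import XMLCorpusReader, XMLCorpusView


class SegView(XMLCorpusView):
    """Hypothetical view that yields the text of every <seg> element."""

    def __init__(self, fileid):
        # The tagspec is a regexp matched against the slash-joined
        # path of tag names, e.g. "text/p/seg".
        XMLCorpusView.__init__(self, fileid, ".*/seg")

    def handle_elt(self, elt, context):
        # Overridden handler: called once per matching element.
        return elt.text


class ToyNKJPReader(XMLCorpusReader):
    """Hypothetical reader; one method per annotation file type."""

    def segments(self, fileid):
        return SegView(self.abspath(fileid))


def demo():
    # Build a toy source directory (NOT the real NKJP layout or schema).
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "ann_morphosyntax.xml")
        with open(path, "w", encoding="utf-8") as f:
            f.write("<text><p><seg>Ala</seg><seg>ma</seg><seg>kota</seg></p></text>")
        reader = ToyNKJPReader(root, r".*\.xml")
        view = reader.segments("ann_morphosyntax.xml")
        try:
            return list(view)
        finally:
            view.close()  # release the file handle before cleanup
```

Calling demo() should return the segment texts in document order. Other files in a source (e.g. text.xml) would get their own view class with a different tagspec and handle_elt.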
ducki13