National Corpus of Polish

Gabriela Kaczka

Nov 11, 2014, 8:49:25 AM

to nltk...@googlegroups.com

Hello.

I would like to extend NLTK by adding a new corpus: the National Corpus of Polish (NKJP), http://nkjp.pl/index.php?page=0&lang=1 .
My suggested name is pol_nkjp.
NKJP is a freely available resource (see the 'tools and resources' section), licensed under GNU GPL v.3.
Link to the issue in nltk_data: https://github.com/nltk/nltk_data/issues/14 .
Every source in this corpus is organized into a few files, e.g. ann_morphosyntax.xml, text.xml, ann_words... I plan to create a reader derived from XMLCorpusReader. For each kind of file in a source, I would like to create a view class derived from XMLCorpusView (with an overridden handle_elt function).
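
To make the plan more concrete, here is a rough sketch of the reader/view split, not the final implementation. The class names SketchNKJPReader and SegmentView, the tag name seg, and the tiny sample file are assumptions made up for illustration; the real NKJP files are richer than this toy document and would need their own tag specifications.

```python
import os
import tempfile

from nltk.corpus.reader.xmldocs import XMLCorpusReader, XMLCorpusView


class SegmentView(XMLCorpusView):
    """Hypothetical view over one annotation file; yields the text of each <seg>."""

    def __init__(self, path):
        # The tagspec is a regular expression matched against the
        # slash-separated path of tags leading to each element.
        XMLCorpusView.__init__(self, path, r".*/seg")

    def handle_elt(self, elt, context):
        # Called once per matching element; the return value becomes one token.
        return elt.text


class SketchNKJPReader(XMLCorpusReader):
    """Hypothetical reader: each source is a directory holding several XML files."""

    def words(self, fileid):
        # One view class per kind of file; only a single kind is sketched here.
        return SegmentView(self.abspath(fileid))


def demo():
    # Build a throwaway corpus with one source directory and one toy file.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "source1"))
        fileid = "source1/ann_words.xml"
        with open(os.path.join(root, fileid), "w", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write("<text><s><seg>Ala</seg><seg>ma</seg><seg>kota</seg></s></text>\n")
        reader = SketchNKJPReader(root, [fileid])
        # Materialize the lazy view before the temporary directory is removed.
        return list(reader.words(fileid))


if __name__ == "__main__":
    print(demo())
```

Each additional file type (e.g. ann_morphosyntax.xml) would get its own view class with a different tag specification and handle_elt, and the reader would expose one method per view.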

ducki13

Gabriela Kaczka

Dec 4, 2014, 11:49:25 AM

to nltk...@googlegroups.com

Hello,

I wanted to open a pull request with the NKJP corpus, but I ran into a problem (copied from my terminal):

remote: error: GH001: Large files detected.
remote: error: Trace: f219e4c978f8903838ffee0f0357bc22
remote: error: See http://git.io/iEPt8g for more information.
remote: error: File packages/corpora/pol_nkjp.zip is 166.49 MB; this exceeds GitHub's file size limit of 100 MB
To https://github.com/ducki13/nltk_data.git
! [remote rejected] gh-pages -> gh-pages (pre-receive hook declined)
error: failed to push some refs to 'https://github.com/ducki13/nltk_data.git'

What is the best solution in this case? Should I try to divide the corpus into several parts, or remove some files from the sources? Or is there another solution?
Thank you in advance!

ducki13

Steven Bird

Dec 6, 2014, 12:01:31 AM

to nltk-dev

Hi Gabriela,

Would you please post a link to the corpus itself, so that I can take a look at it?

Thanks,

-Steven
