National Corpus of Polish


Gabriela Kaczka

Nov 11, 2014, 8:49:25 AM
to nltk...@googlegroups.com
Hello.

I would like to extend NLTK by adding a new corpus: the National Corpus of Polish (NKJP), http://nkjp.pl/index.php?page=0&lang=1 .
My suggested name for it is pol_nkjp.
NKJP is a freely available resource (see the 'tools and resources' section), licensed under GNU GPL v.3.
Link to the issue in nltk_data: https://github.com/nltk/nltk_data/issues/14 .
Every source in this corpus is organized into a few files, e.g. ann_morphosyntax.xml, text.xml, ann_words... I plan to create a reader derived from XMLCorpusReader. For every file in a source, I would like to create a view class derived from XMLCorpusView (with an overridden handle_elt function).
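
A rough sketch of the "one view class per file, each overriding handle_elt" idea. It deliberately uses only the Python standard library as a stand-in, not NLTK's XMLCorpusView itself (whose handle_elt also receives a context argument). ElementView, OrthView, the <orth> tag and the sample XML are all hypothetical simplifications and do not reproduce the real NKJP schema.

```python
import io
import xml.etree.ElementTree as ET


class ElementView:
    """Iterate over one XML file and pass each matching element to handle_elt."""

    def __init__(self, source, tag):
        self.source = source  # a path or a binary file-like object
        self.tag = tag        # tag name of the elements this view yields

    def handle_elt(self, elt):
        # Default behaviour: return the element unchanged.
        return elt

    def __iter__(self):
        # "end" events fire once an element has been read completely.
        for _event, elt in ET.iterparse(self.source, events=("end",)):
            if elt.tag == self.tag:
                yield self.handle_elt(elt)


class OrthView(ElementView):
    """Hypothetical view for one annotation file: returns plain word forms."""

    def handle_elt(self, elt):
        return (elt.text or "").strip()


# Made-up, heavily simplified sample (not the real NKJP markup).
SAMPLE = (
    b"<text>"
    b"<seg><orth>Ala</orth></seg>"
    b"<seg><orth>ma</orth></seg>"
    b"<seg><orth>kota</orth></seg>"
    b"</text>"
)

words = list(OrthView(io.BytesIO(SAMPLE), "orth"))
print(words)
```

Each additional file type (text.xml, ann_morphosyntax.xml, ...) would get its own subclass that changes only handle_elt, while the iteration logic stays in the base class.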

ducki13

Gabriela Kaczka

Dec 4, 2014, 11:49:25 AM
to nltk...@googlegroups.com
Hello,

I wanted to open a pull request with the NKJP corpus, but I ran into a problem (copied from my terminal):

remote: error: GH001: Large files detected.
remote: error: Trace: f219e4c978f8903838ffee0f0357bc22
remote: error: See http://git.io/iEPt8g for more information.
remote: error: File packages/corpora/pol_nkjp.zip is 166.49 MB; this exceeds GitHub's file size limit of 100 MB
To https://github.com/ducki13/nltk_data.git
 ! [remote rejected] gh-pages -> gh-pages (pre-receive hook declined)
error: failed to push some refs to 'https://github.com/ducki13/nltk_data.git'

What is the best solution in this case? Should I try to split the corpus into several parts, or remove some files from the sources? Or is there another solution?
Thank you in advance!
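
For the "split into several parts" option, a minimal sketch of how files could be grouped so that no part exceeds the limit. The function name split_into_parts, the file names and the sizes are hypothetical; the sizes would have to be the compressed sizes, since the 100 MB limit applies to the pushed archive. Whether nltk_data can accept a corpus in several parts is a separate question that this does not answer.

```python
def split_into_parts(sizes, limit):
    """Greedily group (name, size) pairs, in order, into parts of total size <= limit."""
    parts, current, total = [], [], 0
    for name, size in sizes:
        if size > limit:
            raise ValueError(f"{name} alone exceeds the limit")
        # Start a new part when adding this file would go over the limit.
        if current and total + size > limit:
            parts.append(current)
            current, total = [], 0
        current.append(name)
        total += size
    if current:
        parts.append(current)
    return parts


# Hypothetical source names with sizes in MB.
example = [("a", 60), ("b", 50), ("c", 30), ("d", 40)]
parts = split_into_parts(example, limit=100)
print(parts)  # [['a'], ['b', 'c'], ['d']]
```

The grouping is greedy and keeps the original order, so it does not necessarily produce the smallest possible number of parts.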

ducki13

Steven Bird

Dec 6, 2014, 12:01:31 AM
to nltk-dev
Hi Gabriela,

Would you please post a link to the corpus itself, so that I can take a look at it?

Thanks,
-Steven



mmpastuszka

Dec 6, 2014, 12:18:03 AM
to nltk...@googlegroups.com
Hi Steven,

The latest version of the NKJP corpus (ver. 1.2) is available for download here:


Regards,

Maciej.

Gabriela Kaczka

Dec 6, 2014, 1:28:55 AM
to nltk...@googlegroups.com
Of course :)

Link to the NKJP site -> http://nkjp.pl/index.php?page=0&lang=1
Link to the corpus download -> http://clip.ipipan.waw.pl/NationalCorpusOfPolish

ducki13

