NLTK data download


Jeff Vanderdoes

Aug 22, 2018, 8:25:26 PM
to nltk-users
Hi,

Due to network restrictions I have to download and unzip the corpa dataset manually.  Once downloaded, I unzip the files into c:\nltk_data, which is on the search path.  I structured the directory as c:\nltk_data\corpa\... since a web link mentioned it needs to be that way.  However, Python can't seem to find the files.  Is there documentation that explains how the data needs to be set up for Python to find it?  I.e., does Python look into subdirectories to find the files it needs?

Thanks for any insight.
Jeff 

Steven Bird

Aug 22, 2018, 8:49:52 PM
to nltk-users
It should be nltk_data\corpora. Please see nltk.org/data.html for instructions.
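For example, you can check exactly which directories NLTK searches, and add your own, from within Python (a quick sketch; the c:\nltk_data location is whatever you used):

```python
import nltk

# List every directory NLTK will search for data, in order.
for p in nltk.data.path:
    print(p)

# A manually populated location can also be added explicitly:
nltk.data.path.insert(0, r"c:\nltk_data")
```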

--
You received this message because you are subscribed to the Google Groups "nltk-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nltk-users+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jeff VanderDoes

Aug 23, 2018, 12:49:58 PM
to nltk-...@googlegroups.com
Thanks for feedback!

Sorry, a typo on my part.  The data is in c:\nltk_data\corpora.  However, in testing I have put the files at c:\nltk_data as well, just because I'm not having success.  I still get a message that I'm missing the files.

Just having a tough time with this. Thanks for any ideas.

Jeff

---

  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:
import nltk
nltk.download('punkt')
 
  Searched in:
    - 'c:\\nltk_data'
    - 'C:\\Users\\uname/nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - 'C:\\Users\\uname\\AppData\\Local\\Continuum\\anaconda3\\nltk_data'
    - 'C:\\Users\\uname\\AppData\\Local\\Continuum\\anaconda3\\share\\nltk_data'
    - 'C:\\Users\\uname\\AppData\\Local\\Continuum\\anaconda3\\lib\\nltk_data'
    - 'C:\\Users\\uname\\AppData\\Roaming\\nltk_data'
    - ''
**********************************************************************

Jeff Vanderdoes

Aug 24, 2018, 1:39:51 PM
to nltk-users
All,

So I've been able to determine that it finds the directories, but when I download manually the layout must not be the same as what download() produces.  Specifically, I think it had trouble because it was looking for a taggers and a tokenizers directory.  I tried as best I could to get a couple of files into these directories, but I figure I haven't got them set up completely.  In my searching I haven't found anything about setting up the directories after downloading the zip files manually.  Any pointers to where I can learn more?

Thanks,
Jeff

Steven Bird

Aug 24, 2018, 11:05:26 PM
to nltk-users
Hi Jeff,

Sorry to hear about these difficulties.

Under nltk_data there should be folders with these names: chunkers, corpora, grammars, help, misc, models, sentiment, stemmers, taggers, tokenizers. The individual packages live inside these folders. In some cases they need to be unzipped; this is specified in the XML file that comes with each corpus (also found here: https://github.com/nltk/nltk_data/tree/gh-pages/packages).
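In case it helps, here's a rough stdlib-only sketch (not an official NLTK tool) that unzips every package archive in place under nltk_data; an unextracted zip such as tokenizers\punkt.zip is a common reason a manually downloaded resource isn't found:

```python
import zipfile
from pathlib import Path

def unzip_packages(nltk_data):
    """Extract each package zip next to itself, e.g.
    tokenizers/punkt.zip -> tokenizers/punkt/."""
    for zpath in sorted(Path(nltk_data).rglob("*.zip")):
        with zipfile.ZipFile(zpath) as zf:
            zf.extractall(zpath.parent)
        print("extracted", zpath)

root = Path(r"c:\nltk_data")
if root.is_dir():
    unzip_packages(root)
```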

-Steven


Jeff Vanderdoes

Aug 27, 2018, 11:31:24 AM
to nltk-users
So, in general, I understand what GitHub is, but any hints on how to get the files from there to my machine?

Thanks!

Jeff VanderDoes

Aug 27, 2018, 1:17:58 PM
to nltk-...@googlegroups.com
I was able to download and extract the zip files to c:\nltk_data... However, running a simple example of
import nltk
from nltk.util import ngrams

text = "This is a test of ngrams"
tokenize = nltk.word_tokenize(text, 3)
print(tokenize)
bigrams = ngrams(tokenize, 2)
print(bigrams)

I get the following error.  However, punkt is in c:\nltk_data\tokenizers\punkt... sigh, sometimes the simplest things are difficult.
Any ideas?

Thanks,
Jeff

---

runfile('C:/Users/vandeje1/Documents/python/spyder/untitled7.py', wdir='C:/Users/vandeje1/Documents/python/spyder')
Traceback (most recent call last):
  File "<ipython-input-1-0800495a64bc>", line 1, in <module>
    runfile('C:/Users/vandeje1/Documents/python/spyder/untitled7.py', wdir='C:/Users/vandeje1/Documents/python/spyder')
  File "C:\Users\vandeje1\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 668, in runfile
    execfile(filename, namespace)
  File "C:\Users\vandeje1\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 108, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/vandeje1/Documents/python/spyder/untitled7.py", line 11, in <module>
    tokenize = nltk.word_tokenize(text, 3)
  File "C:\Users\vandeje1\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 128, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "C:\Users\vandeje1\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 94, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "C:\Users\vandeje1\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\data.py", line 836, in load
    opened_resource = _open(resource_url)
  File "C:\Users\vandeje1\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\data.py", line 954, in _open
    return find(path_, path + ['']).open()
  File "C:\Users\vandeje1\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\data.py", line 675, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************

  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:
import nltk
nltk.download('punkt')
 
  Searched in:
    - 'c:\\nltk_data'
    - 'C:\\Users\\vandeje1/nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - 'C:\\Users\\vandeje1\\AppData\\Local\\Continuum\\anaconda3\\nltk_data'
    - 'C:\\Users\\vandeje1\\AppData\\Local\\Continuum\\anaconda3\\share\\nltk_data'
    - 'C:\\Users\\vandeje1\\AppData\\Local\\Continuum\\anaconda3\\lib\\nltk_data'
    - 'C:\\Users\\vandeje1\\AppData\\Roaming\\nltk_data'
    - ''
**********************************************************************

Jeff Vanderdoes

Aug 27, 2018, 4:14:02 PM
to nltk-users
Working this further...  I went to another machine not behind a firewall and did nltk.download('all').  Then I took the entire nltk_data directory and moved it to the computer behind the firewall, into c:\nltk_data,

which copied:
chunkers
corpora
grammars
help
misc
models
sentiment
stemmers
taggers
tokenizers (of which punkt is a subdirectory containing the pickle files)

Hmm, I thought that would have done it, but it still gets an error saying it can't find punkt...  I get the feeling nltk doesn't like me :) and that something else is wrong, but I don't know what else to try...
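Maybe I should just ask NLTK directly whether it can see the resource; something like this (guessing "tokenizers/punkt" is the right resource name):

```python
import nltk

# Ask NLTK to locate punkt along its search path; prints the
# resolved location on success, or the full LookupError otherwise.
try:
    print(nltk.data.find("tokenizers/punkt"))
except LookupError as err:
    print(err)
```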

Thanks for any ideas...
Jeff