NLTK data download


Jeff Vanderdoes

Aug 22, 2018, 8:25:26 PM
to nltk-users
Hi,

Due to network restrictions I have to download and unzip the corpa dataset manually.  Once downloaded, I unzip the files into c:\nltk_data, which is on the search path.  I structured the directory as c:\nltk_data\corpa\... since a web link mentioned it needs to be that way.  However, Python can't seem to find the files.  Is there documentation that explains how the data needs to be set up for Python to find it?  I.e., does Python look into subdirectories to find the files it needs?

Thanks for any insight.
Jeff 

Steven Bird

Aug 22, 2018, 8:49:52 PM
to nltk-users
It should be nltk_data\corpora. Please see nltk.org/data.html for instructions.
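For example, you can check exactly which directories NLTK searches, and add your own, from within Python (a quick sketch; the c:\nltk_data location is whatever you used):

```python
import nltk

# List every directory NLTK will search for data, in order.
for p in nltk.data.path:
    print(p)

# A manually populated location can also be added explicitly:
nltk.data.path.insert(0, r"c:\nltk_data")
```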

--
You received this message because you are subscribed to the Google Groups "nltk-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nltk-users+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jeff VanderDoes

Aug 23, 2018, 12:49:58 PM
to nltk-...@googlegroups.com
Thanks for feedback!

Sorry, a typo on my part.  The data is in c:\nltk_data\corpora.  However, in testing I have put the files at c:\nltk_data as well, just because I'm not having success.  I still get a message that I'm missing the files.

Just having a tough time with this. Thanks for any ideas.

Jeff

---

  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:
import nltk
nltk.download('punkt')
 
  Searched in:
    - 'c:\\nltk_data'
    - 'C:\\Users\\uname/nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - 'C:\\Users\\uname\\AppData\\Local\\Continuum\\anaconda3\\nltk_data'
    - 'C:\\Users\\uname\\AppData\\Local\\Continuum\\anaconda3\\share\\nltk_data'
    - 'C:\\Users\\uname\\AppData\\Local\\Continuum\\anaconda3\\lib\\nltk_data'
    - 'C:\\Users\\uname\\AppData\\Roaming\\nltk_data'
    - ''
**********************************************************************

Jeff Vanderdoes

Aug 24, 2018, 1:39:51 PM
to nltk-users
All,

So I've been able to determine that it finds the directories, but when I download manually the layout must not be the same as what download() produces.  Specifically, I think it had trouble because it was looking for a taggers and a tokenizers directory.  I tried as best I could to get a couple of files into these directories, but I figure I haven't got them set up completely.  In my searching I haven't found anything about setting up the directories after downloading the zip files manually.  Any pointers to where I can learn more?

Thanks,
Jeff

Steven Bird

Aug 24, 2018, 11:05:26 PM
to nltk-users
Hi Jeff,

Sorry to hear about these difficulties.

Under nltk_data there should be folders with these names: chunkers, corpora, grammars, help, misc, models, sentiment, stemmers, taggers, tokenizers. The individual packages live inside these folders. In some cases they need to be unzipped; this is specified in the XML file that comes with each corpus (also found here: https://github.com/nltk/nltk_data/tree/gh-pages/packages).
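In case it helps, here's a rough stdlib-only sketch (not an official NLTK tool) that unzips every package archive in place under nltk_data; an unextracted zip such as tokenizers\punkt.zip is a common reason a manually downloaded resource isn't found:

```python
import zipfile
from pathlib import Path

def unzip_packages(nltk_data):
    """Extract each package zip next to itself, e.g.
    tokenizers/punkt.zip -> tokenizers/punkt/."""
    for zpath in sorted(Path(nltk_data).rglob("*.zip")):
        with zipfile.ZipFile(zpath) as zf:
            zf.extractall(zpath.parent)
        print("extracted", zpath)

root = Path(r"c:\nltk_data")
if root.is_dir():
    unzip_packages(root)
```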

-Steven


Jeff Vanderdoes

Aug 27, 2018, 11:31:24 AM
to nltk-users
So, in general, I understand what GitHub is, but any hints on how to get the files from there to my machine?

Thanks!

Jeff VanderDoes

Aug 27, 2018, 1:17:58 PM
to nltk-...@googlegroups.com
I was able to download and extract the zip files to c:\nltk_data... However, running a simple example of
import nltk
from nltk.util import ngrams

text = "This is a test of ngrams"
tokenize = nltk.word_tokenize(text, 3)
print(tokenize)
bigrams = ngrams(tokenize, 2)
print(bigrams)

I get the following error.  However, punkt is in c:\nltk_data\tokenizers\punkt... sigh, sometimes the simplest things are difficult.
Any ideas?

Thanks,
Jeff

---

runfile('C:/Users/vandeje1/Documents/python/spyder/untitled7.py', wdir='C:/Users/vandeje1/Documents/python/spyder')
Traceback (most recent call last):
  File "<ipython-input-1-0800495a64bc>", line 1, in <module>
    runfile('C:/Users/vandeje1/Documents/python/spyder/untitled7.py', wdir='C:/Users/vandeje1/Documents/python/spyder')
  File "C:\Users\vandeje1\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 668, in runfile
    execfile(filename, namespace)
  File "C:\Users\vandeje1\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 108, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/vandeje1/Documents/python/spyder/untitled7.py", line 11, in <module>
    tokenize = nltk.word_tokenize(text, 3)
  File "C:\Users\vandeje1\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 128, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "C:\Users\vandeje1\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 94, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "C:\Users\vandeje1\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\data.py", line 836, in load
    opened_resource = _open(resource_url)
  File "C:\Users\vandeje1\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\data.py", line 954, in _open
    return find(path_, path + ['']).open()
  File "C:\Users\vandeje1\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\data.py", line 675, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************

  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:
import nltk
nltk.download('punkt')
 
  Searched in:
    - 'c:\\nltk_data'
    - 'C:\\Users\\vandeje1/nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - 'C:\\Users\\vandeje1\\AppData\\Local\\Continuum\\anaconda3\\nltk_data'
    - 'C:\\Users\\vandeje1\\AppData\\Local\\Continuum\\anaconda3\\share\\nltk_data'
    - 'C:\\Users\\vandeje1\\AppData\\Local\\Continuum\\anaconda3\\lib\\nltk_data'
    - 'C:\\Users\\vandeje1\\AppData\\Roaming\\nltk_data'
    - ''
**********************************************************************

Jeff Vanderdoes

Aug 27, 2018, 4:14:02 PM
to nltk-users
Working this further...  I went to another machine not behind a firewall and did nltk.download('all').  Then I took the entire nltk_data directory and moved it to the computer behind the firewall, into c:\nltk_data,

which copied:
chunkers
corpora
grammars
help
misc
models
sentiment
stemmers
taggers
tokenizers (of which punkt is a subdirectory containing the pickle files)

Hmm, I thought that would have done it, but it still gets an error saying it can't find punkt...  I get the feeling nltk doesn't like me :) and that something else is wrong, but I don't know what else to try...
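Maybe I should just ask NLTK directly whether it can see the resource; something like this (guessing "tokenizers/punkt" is the right resource name):

```python
import nltk

# Ask NLTK to locate punkt along its search path; prints the
# resolved location on success, or the full LookupError otherwise.
try:
    print(nltk.data.find("tokenizers/punkt"))
except LookupError as err:
    print(err)
```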

Thanks for any ideas...
Jeff