nltk.download() paths and virtualenv


Lachlan Musicman

May 16, 2015, 12:07:53 AM
to nltk-...@googlegroups.com
I have a virtualenv set up within a well formed directory structure.

I'd like the nltk.download() to save the tokenizers and corpora within that directory structure, and I'd like it to *always* download to that directory structure.

I will be calling the download function automatically and semi-periodically via cron/web request, and would like the downloads to *always* go to the same directory.

The cron/web request will be occurring within the virtualenv.

How do I go about making sure that nltk always downloads to that directory, and not to my home directory or anywhere else?

Ubuntu 14.04.2, Python 2.x (2.7 atm), venv 1.11.4

cheers
L.

Alexis Dimitriadis

May 16, 2015, 9:03:29 AM
to nltk-...@googlegroups.com
The command nltk.download() has the following signature:

>>> help(nltk.download)
Help on method download in module nltk.downloader:

download(info_or_id=None, download_dir=None, quiet=False, force=False, prefix='[nltk_data] ',
    halt_on_error=True, raise_on_error=False)

So you can call it with the location of the nltk_data directory as the second argument (the first argument is the package to download, e.g. "book" or "reuters"). Alternatively, the source file nltk/downloader.py has this information about command-line invocation, which you might find more convenient for scripting:

    python nltk/downloader.py [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS

or::

    python -m nltk.downloader [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS

DATADIR is the nltk_data directory to use. The source provides other ways to control the destination of downloaded files when calling from Python, but I trust these will do. If your goal is to patch nltk so that invocations of `nltk.download()` by others also go to the virtualenv, I'd patch nltk.data.path or nltk.downloader.Downloader.default_download_dir().
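For the Python route, a minimal sketch of passing the directory as `download_dir` might look like the following. It assumes the data should live in an `nltk_data` folder directly under the virtualenv root (taken from `sys.prefix`); that location and both helper names are illustrative choices, not something nltk requires.

```python
import os
import sys


def venv_nltk_data_dir():
    """Return <virtualenv root>/nltk_data (an assumed, illustrative location)."""
    # Inside an activated virtualenv, sys.prefix is the environment's root.
    return os.path.join(sys.prefix, "nltk_data")


def download_into_venv(package_id):
    """Download one NLTK package into the virtualenv's nltk_data directory."""
    import nltk  # imported lazily so the path helper works without nltk

    target = venv_nltk_data_dir()
    if not os.path.isdir(target):
        os.makedirs(target)
    # download_dir is the second parameter in the signature quoted above;
    # quiet=True suppresses progress output, which suits a cron job.
    return nltk.download(package_id, download_dir=target, quiet=True)
```

From the cron job or web handler, something like `download_into_venv("stopwords")` would then always write to the same place, provided the job runs with the virtualenv's interpreter.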

Alexis

Dr. Alexis Dimitriadis | Assistant Professor and Senior Research Fellow | Utrecht Institute of Linguistics OTS | Utrecht University | Trans 10, 3512 JK Utrecht, room 2.33 | +31 30 253 65 68 | a.dimi...@uu.nl | www.hum.uu.nl/medewerkers/a.dimitriadis


Lachlan Musicman

May 17, 2015, 4:54:39 AM
to nltk-...@googlegroups.com
Thanks Alexis, appreciated. How do I then get, say, nltk.corpus.stopwords to find the files in those directories?

cheers
L.

------
let's build quiet armies friends, let's march on their glass towers...let's build fallen cathedrals and make impractical plans

- GYBE

Lachlan Musicman

May 17, 2015, 5:11:12 AM
to nltk-...@googlegroups.com
I tried using the corpus_root example from http://www.nltk.org/book/ch02.html but I'm still getting LookupErrors on this:

ignored_words = nltk.corpus.stopwords.words(corpus_root, 'english')


Lachlan Musicman

May 17, 2015, 5:15:54 AM
to nltk-...@googlegroups.com
ah! nltk.data.path.append('path_to_nltk_data') works!
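For anyone finding this later, here is a rough sketch of that fix in context. `nltk.data.path` is an ordinary list of directories that the corpus readers search, and `stopwords.words()` takes only file ids such as "english", not a corpus root, which would explain the LookupError above. The two helper names are hypothetical, and inserting at the front rather than appending is a choice made here so the virtualenv copy is searched first.

```python
def register_nltk_data_dir(data_dir, search_path):
    """Put data_dir at the front of search_path unless it is already listed."""
    if data_dir not in search_path:
        search_path.insert(0, data_dir)
    return search_path


def load_english_stopwords(data_dir):
    """Load the English stopword list from a specific nltk_data directory."""
    import nltk  # imported lazily so the helper above works without nltk
    from nltk.corpus import stopwords

    # Register the directory before the corpus is first accessed.
    register_nltk_data_dir(data_dir, nltk.data.path)
    # Only the file id is passed here; the location comes from nltk.data.path.
    return stopwords.words("english")
```

Setting the `NLTK_DATA` environment variable for the cron job or web process is another way to add a directory to the search path without touching the code.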

cheers
L.

