nltk_data under MS Windows

Steven Bird

Jan 14, 2010, 3:27:18 PM
to nltk-dev
I recently received the following query. Can anyone on this list with
Windows expertise comment? Should we use %PROGRAMFILES% instead of
%APPDATA% for nltk_data?

----
We're digging into NLTK, and we found that the default installation
location for nltk_data under Windows is somewhat infelicitous.  By
default, it goes into %APPDATA%, which is a place designed to store
small amounts of program settings, not large data sets.  In a roaming
user profile setting (such as ours), this results in over 400 MB of
stuff having to be copied back and forth from server to client
whenever a user logs in or logs out.

Normally, under Windows, per-machine software and data go in
%PROGRAMFILES% and per-user files go in %HOMEDRIVE%%HOMEPATH%, which
is the place for large amounts of data and, in a roaming setup,
typically is on a file server.
----

Nigel Randsley

Jan 14, 2010, 7:58:28 PM
to nltk...@googlegroups.com
Steven,

If you use %PROGRAMFILES% for nltk_data that will raise quite a few
issues on Vista and above due to write permission for the directory
being administrator only. If a user wished to install nltk_data or
update it then nltk would have to be started with administrator
privileges or else install and/or update will fail. The correct path is
%APPDATA% for a multiuser single nltk_data install or one of the
subdirectories of %USERPROFILE%\AppData for a per-user install of nltk_data.
Personally I prefer to use the NLTK_DATA environment variable and install
nltk_data under %HOMEDRIVE%\nltk_data as I've been doing since way back :)

Hope that helps,

Nigel Randsley


Steven Bethard

Jan 14, 2010, 8:03:56 PM
to nltk...@googlegroups.com

For what it's worth, I changed the default a few months ago from
%USERPROFILE%, which is what it originally used, to %APPDATA%, which
is the right place for per-user settings (you should pretty much never
use %USERPROFILE%).

Note that in theory, the downloader tries first to "check if we have
sufficient permissions to install in a variety of system-wide
locations" (see default_download_dir() in nltk.downloader, and the
path variable in nltk.data) before it tries to install in %APPDATA%.
Among these are::

    r'C:\nltk_data', r'D:\nltk_data', r'E:\nltk_data',
    os.path.join(sys.prefix, 'nltk_data'),
    os.path.join(sys.prefix, 'lib', 'nltk_data'),

So if they're getting stuff installed in %APPDATA%, then it's because
they already can't write to any of these directories, right?
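The intended fallback order can be sketched roughly as follows. This is only an illustration of the idea, not the actual code in nltk.downloader; the helper names is_writable_location and pick_download_dir are hypothetical, and the check assumes that a directory which does not exist yet is usable if its closest existing ancestor is writable.

```python
import os


def is_writable_location(path):
    """Return True if `path` exists and is writable, or could be created.

    Hypothetical helper: for a directory that does not exist yet, check
    whether its closest existing ancestor is a writable directory.
    """
    path = os.path.abspath(path)
    while not os.path.exists(path):
        parent = os.path.dirname(path)
        if parent == path:  # reached the root (or a missing drive)
            return False
        path = parent
    return os.path.isdir(path) and os.access(path, os.W_OK)


def pick_download_dir(system_candidates, per_user_dir):
    """Return the first usable system-wide candidate, else `per_user_dir`."""
    for candidate in system_candidates:
        if is_writable_location(candidate):
            return candidate
    return per_user_dir
```

Note that os.access() is only an approximation on Windows, where ACLs decide who may write; actually trying to create a file in the directory would be a more reliable test.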

In response to Nigel, I'd just point out that on Vista, %APPDATA%
refers to %USERPROFILE%\AppData\Roaming, so I think %APPDATA% is
always the right answer for per-user, and you don't have to look at
%USERPROFILE% at all.

Steve
--
Where did you get that preposterous hypothesis?
Did Steve tell you that?
--- The Hiphopopotamus

mcovington

Jan 14, 2010, 9:11:09 PM
to nltk-dev
On Jan 14, 7:58 pm, Nigel Randsley <nigel.rands...@gmail.com> wrote:
> If you use %PROGRAMFILES% for nltk_data that will raise quite a few
> issues on Vista and above due to write permission for the directory
> being administrator only. If a user wished to install nltk_data or
> update it then nltk would have to be started with administrator
> privileges or else install and/or update will fail.

Good point. It makes sense for nltk_data to be in the user's file
space, preferably in Documents.

> The correct path is
> %APPDATA% for  a multiuser single nltk_data install or one of the
> subdirectories of %USERPROFILE%\AppData for a per-user install of nltk_data.
> Personally I prefer to use the NLTK_DATA environment variable and install
> nltk_data under %HOMEDRIVE%\nltk_data as I've been doing since way back :)

%APPDATA% is not multiuser, and that's the point. %APPDATA% is per-
user and roams in a roaming user profile environment, and what's more
it's hidden.

%USERPROFILE% also roams, but at least is not hidden from the user's
view.

Best would be to use the operating system call to find out where the
user's Documents or My Documents folder is. There is no environment
variable for this, unfortunately. Second best would be to use
%HOMEDRIVE%%HOMEPATH%.

mcovington

Jan 14, 2010, 9:13:56 PM
to nltk-dev
> For what it's worth, I changed the default a few months ago from
> %USERPROFILE%, which is what it originally used, to %APPDATA%, which
> is the right place for per-user settings (you should pretty much never
> use %USERPROFILE%).
>
> Note that in theory, the downloader tries first to "check if we have
> sufficient permissions to install in a variety of system-wide
> locations" (see default_download_dir() in nltk.downloader, and the
> path variable in nltk.data) before it tries to install in %APPDATA%.
> Among these are::
>
>     r'C:\nltk_data', r'D:\nltk_data', r'E:\nltk_data',
>     os.path.join(sys.prefix, 'nltk_data'),
>     os.path.join(sys.prefix, 'lib', 'nltk_data'),
>
> So if they're getting stuff installed in %APPDATA%, then it's because
> they already can't write to any of these directories, right?

Not my experience. The path to %APPDATA% was what was pre-written
into the box and I didn't try to change it. I definitely had
permission to write in other places.

What you describe sounds like a good idea -- it just isn't what I
observed happening.

> In response to Nigel, I'd just point out that on Vista, %APPDATA%
> refers to %USERPROFILE%\AppData\Roaming, so I think %APPDATA% is
> always the right answer for per-user, and you don't have to look at
> %USERPROFILE% at all.

But users cannot see %APPDATA% or delete files in it easily.

Steven Bethard

Jan 14, 2010, 10:49:27 PM
to nltk...@googlegroups.com
On Thu, Jan 14, 2010 at 6:13 PM, mcovington <m...@uga.edu> wrote:
> But users cannot see %APPDATA% or delete files in it easily.

Why do you need to see the downloaded corpora and delete files?

On Thu, Jan 14, 2010 at 6:11 PM, mcovington <m...@uga.edu> wrote:
> Best would be to use the operating system call to find out where the
> user's Documents or My Documents folder is.  There is no environment
> variable for this, unfortunately.  Second best would be to use
> %HOMEDRIVE%%HOMEPATH%.

These are both definitely wrong according to the Windows Vista standards:

http://blogs.msdn.com/amitava/archive/2007/07/16/certified-for-windows-vista-logo-test-case-faq-test-case-15.aspx

I'd personally be pretty pissed if nltk started writing to my
Documents folder by default. I'd be fine though with an option to
downloader that let people request a particular directory manually.

On Thu, Jan 14, 2010 at 6:13 PM, mcovington <m...@uga.edu> wrote:
>> Note that in theory, the downloader tries first to "check if we have
>> sufficient permissions to install in a variety of system-wide
>> locations" (see default_download_dir() in nltk.downloader, and the
>> path variable in nltk.data) before it tries to install in %APPDATA%.
>
> Not my experience. The path to %APPDATA% was what was pre-written
> into the box and I didn't try to change it. I definitely had
> permission to write in other places.
>
> What you describe sounds like a good idea -- it just isn't what I
> observed happening.

Sounds like a bug then - if you can figure out what's going wrong
there and fix it, I'd certainly have no problem with that.

Michael A. Covington

Jan 14, 2010, 10:54:44 PM
to nltk...@googlegroups.com
Steven Bethard wrote:
> On Thu, Jan 14, 2010 at 6:13 PM, mcovington <m...@uga.edu> wrote:
>> But users cannot see %APPDATA% or delete files in it easily.
>
> Why do you need to see the downloaded corpora and delete files?

Is there a way to uninstall the downloaded corpora? (I'm sorry, I
haven't been using NLTK very long and simply don't know.) We don't want
to compel people to have the files on their computers forever.

In any case, why hide the files from the user?

--
Michael A. Covington, Associate Director www.ai.uga.edu/mc
Institute for Artificial Intelligence
The University of Georgia, Athens, GA 30602-7415 U.S.A.

Steven Bird

Jan 14, 2010, 11:10:58 PM
to nltk...@googlegroups.com
2010/1/15 Steven Bethard <steven....@gmail.com>:

> I'd be fine though with an option to
> downloader that let people request a particular directory manually.

There's already a "-d" flag:

    $ python -m nltk.downloader --help
    Usage: downloader.py [options]

    Options:
      -h, --help           show this help message and exit
      -d DIR, --dir=DIR    download package to directory DIR
      -q, --quiet          work quietly
      -f, --force          download even if already installed
      -e, --exit-on-error  exit if an error occurs

2010/1/15 Michael A. Covington <m...@uga.edu>:


> Is there a way to uninstall the downloaded corpora?

No. I guess we could add uninstall functionality to the downloader.
It would need to behave nicely if a user had downloaded multiple
instances of a corpus.
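
As a rough sketch of what "behaving nicely" with multiple copies could mean: look for the package under every data directory, remove the copies we are allowed to remove, and report the rest. Nothing like this exists in the downloader today; the function names and the layout (a package living at a relative path such as corpora/brown under each data directory) are assumptions for illustration.

```python
import os
import shutil


def find_installed_copies(package_relpath, search_dirs):
    """Return every path under `search_dirs` where the package is present."""
    copies = []
    for base in search_dirs:
        candidate = os.path.join(base, package_relpath)
        if os.path.exists(candidate):
            copies.append(candidate)
    return copies


def uninstall_package(package_relpath, search_dirs):
    """Delete all copies we can; return (removed, skipped) path lists."""
    removed, skipped = [], []
    for copy in find_installed_copies(package_relpath, search_dirs):
        try:
            if os.path.isdir(copy):
                shutil.rmtree(copy)
            else:
                os.remove(copy)
            removed.append(copy)
        except OSError:
            # e.g. a system-wide copy that needs administrator rights
            skipped.append(copy)
    return removed, skipped
```

The skipped list is what the downloader would show to the user, so that a copy in an administrator-only location is reported rather than silently left behind.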

-Steven

Steven Bethard

Jan 14, 2010, 11:13:22 PM
to nltk...@googlegroups.com
On Thu, Jan 14, 2010 at 7:54 PM, Michael A. Covington <m...@uga.edu> wrote:
> Steven Bethard wrote:
>>
>> On Thu, Jan 14, 2010 at 6:13 PM, mcovington <m...@uga.edu> wrote:
>>>
>>> But users cannot see %APPDATA% or delete files in it easily.
>>
>> Why do you need to see the downloaded corpora and delete files?
>
> Is there a way to uninstall the downloaded corpora?  (I'm sorry, I haven't
> been using NLTK very long and simply don't know.)  We don't want to compel
> people to have the files on their computers forever.

If there isn't I'm all for someone adding a way to do so.

> In any case, why hide the files from the user?

I'm certainly fine with them being visible somewhere. But we really
shouldn't be writing stuff where Windows tells us not to. Of the
options in the link I sent, it seems like %ProgramFiles% is the only
choice that both follows best practices and also is visible by
default.

Michael A. Covington

Jan 14, 2010, 11:20:06 PM
to nltk...@googlegroups.com
Steven Bird wrote:

> 2010/1/15 Michael A. Covington <m...@uga.edu>:
>> Is there a way to uninstall the downloaded corpora?
>
> No. I guess we could add uninstall functionality to the downloader.
> It would need to behave nicely if a user had downloaded multiple
> instances of a corpus.

Or simply place them where the user can easily delete them, or where
they will go away if Python is uninstalled.

Not in %APPDATA%, which is user space but is hidden.

Michael A. Covington

Jan 14, 2010, 11:29:31 PM
to nltk...@googlegroups.com
Steven Bethard wrote:

> I'm certainly fine with them being visible somewhere. But we really
> shouldn't be writing stuff where Windows tells us not to. Of the
> options in the link I sent, it seems like %ProgramFiles% is the only
> choice that both follows best practices and also is visible by
> default.

Well, there are two ways to go.

If you view the files as data belonging to the individual user, and want
the user to be able to delete them easily, then actually the user's
Documents or My Documents folder is best. But you have to perform a
system call to find out where it is and what it is called; there isn't
an environment variable, and in non-English editions it's not named
Documents. Second best is to put them one step up from there, in
%homedrive%%homepath%.

To punt, you could try %homedrive%%homepath%\Documents if it exists
(English-language Vista and Win7),
%homedrive%%homepath%\My Documents as second choice (English Windows XP
and some XP-compatible Vista and Win7 setups), and %homedrive%%homepath%
as third choice (mainly non-English versions).

That might actually be a very good kluge, though it is a kluge.
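
In Python the three-step guess might look something like this. It is a minimal sketch using environment variables only; guess_documents_dir is a hypothetical name, and the environ argument exists only so the function can be tried out with a fake environment.

```python
import os


def guess_documents_dir(environ=None):
    """Guess the user's documents folder from environment variables alone.

    Tries <home>\\Documents, then <home>\\My Documents, then <home> itself,
    where <home> is %HOMEDRIVE%%HOMEPATH%. Returns None if those variables
    are not set.
    """
    environ = os.environ if environ is None else environ
    drive = environ.get("HOMEDRIVE")
    path = environ.get("HOMEPATH")
    if drive is None or not path:
        return None
    home = drive + path
    for name in ("Documents", "My Documents"):
        candidate = os.path.join(home, name)
        if os.path.isdir(candidate):
            return candidate
    return home
```

This still misses a Documents folder that has been renamed or redirected elsewhere, which is why asking the operating system remains the better answer.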

If you view the files as being installed like software for all users of
the machine, then Program Files is right, and you should provide an
uninstall method registered with the operating system. (Which can get
complicated.) Or put them among the Python libraries so that
uninstalling Python will also uninstall them (if I'm right in thinking
this is the case). Putting anything in Program Files requires
administrator privileges.

One of the important design principles of Windows is that installations
are reversible, preferably through the Control Panel. There are
utilities such as Inno Setup (freeware) that generate Windows setup
files that install things and tell the OS how to uninstall them.

THANKS VERY MUCH for being willing to give this some consideration, and
let me know if I can help you with Windows testing. I am very new to
Python and NLTK. But Windows is a fine operating system, and if we want
to make computational linguistics tools available to people who aren't
computer geeks, providing good support of Windows is important.

(I am, btw, a UNIX old hand, definitely not a Windows-only user.)

Michael

Steven Bethard

Jan 14, 2010, 11:54:54 PM
to nltk...@googlegroups.com
On Thu, Jan 14, 2010 at 8:29 PM, Michael A. Covington <m...@uga.edu> wrote:
> If you view the files as data belonging to the individual user, and want the
> user to be able to delete them easily...

Then you want the user to specify where they should be put. So if we
go this route, I recommend that we only give examples of using the
downloader with the "-d" flag.

> If you view the files as being installed like software for all users of the
> machine, then Program Files is right, and you should provide an uninstall
> method registered with the operating system.  (Which can get complicated.)

Yep. It's pretty easy to determine if they're an administrator though:

    import ctypes
    print(ctypes.windll.shell32.IsUserAnAdmin())

So we could at least find out if they're an administrator and explain
what they need to do to install to Program Files (start a command
prompt using "Run as administrator").
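
Wrapped up so that it is safe to call on any platform, that check plus the explanation could look roughly like this. The function names and the wording of the messages are made up for illustration; only the IsUserAnAdmin call comes from the snippet above.

```python
import ctypes
import sys


def is_windows_admin():
    """Return True only on Windows when the process has admin rights."""
    if not sys.platform.startswith("win"):
        return False
    try:
        return bool(ctypes.windll.shell32.IsUserAnAdmin())
    except (AttributeError, OSError):
        return False  # treat any failure of the call as "not an admin"


def program_files_advice(is_admin):
    """Return a short hint about installing under Program Files."""
    if is_admin:
        return "You can install nltk_data under Program Files."
    return ('To install under Program Files, start a command prompt '
            'using "Run as administrator" and run the downloader again.')


if __name__ == "__main__":
    print(program_files_advice(is_windows_admin()))
```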

>  Or put them among the Python libraries so that uninstalling Python will
> also uninstall them (if I'm right in thinking this is the case).

I think this is actually not true. If you uninstall Python, it will
leave around the things in site-packages that you installed
separately. I've definitely run into this before.

> One of the important design principles of Windows is that installations are
> reversible, preferably through the Control Panel.  There are utilities such
> as Inno Setup (freeware) that generate Windows setup files that install
> things and tell the OS how to uninstall them.

Yeah, so another option would be to create a separate bdist_msi
installer for each dataset. Then they would install and uninstall like
normal Windows programs.

It would also be possible to create a single MSI where the user can
select which datasets they want. But having written the code for the
Python distutils bdist_msi command, I can tell you that it's not an
easy process.

Michael A. Covington

Jan 15, 2010, 12:38:01 AM
to nltk...@googlegroups.com
Steven Bethard wrote:
> On Thu, Jan 14, 2010 at 8:29 PM, Michael A. Covington <m...@uga.edu> wrote:
>> If you view the files as data belonging to the individual user, and want the
>> user to be able to delete them easily...
>
> Then you want the user to specify where they should be put. So if we
> go this route, I recommend that we only give examples of using the
> downloader with the "-d" flag.

Could they default to the user's My Documents folder or a reasonable
guess as to where it might be?

Steven Bethard

Jan 15, 2010, 2:02:42 AM
to nltk...@googlegroups.com
On Thu, Jan 14, 2010 at 9:38 PM, Michael A. Covington <m...@uga.edu> wrote:
> Steven Bethard wrote:
>>
>> On Thu, Jan 14, 2010 at 8:29 PM, Michael A. Covington <m...@uga.edu> wrote:
>>>
>>> If you view the files as data belonging to the individual user, and want
>>> the
>>> user to be able to delete them easily...
>>
>> Then you want the user to specify where they should be put.  So if we
>> go this route, I recommend that we only give examples of using the
>> downloader with the "-d" flag.
>
> Could they default to the user's My Documents folder or a reasonable guess
> as to where it might be?

I'm still pretty opposed to this. This is one of the mis-features I
hate most about non-standards-compliant Windows applications -
polluting my Documents folder with stuff I didn't ask to be put there.
But if there's consensus from the other Windows users that this is the
best place for NLTK data, I won't stand in the way.

Steven Bird

Jan 15, 2010, 6:28:46 AM
to nltk-dev
Thanks for all this input.

Just to put things into perspective, note that we just need:

(a) reasonable default behaviour for a newbie with a single user setup

(b) flags supporting one or more alternative behaviours, for use in
other contexts (and by power users and sys admins who can be expected
to read installation notes before going ahead).

-Steven

Michael A. Covington

Jan 15, 2010, 7:31:17 AM
to nltk...@googlegroups.com
Steven Bethard wrote:

>> Could they default to the user's My Documents folder or a reasonable guess
>> as to where it might be?
>
> I'm still pretty opposed to this. This is one of the mis-features I
> hate most about non-standards-compliant Windows applications -
> polluting my Documents folder with stuff I didn't ask to be put there.

I can't think of a better place to put files that are thought of as data
owned by the individual user. Of course we won't clutter My Documents
with individual files -- instead, put a directory there so the user sees
only one name unless he digs down into it.

There are two striking advantages of My Documents: (1) the user is sure
to be able to see it and write in it; (2) it will be stored in the right
way (in a place where there is room for large files without unnecessary
copying). As you know, in a roaming user setup, My Documents is usually
redirected to a file server, where Application Data and some other
things are copied from server to client upon login, and Program Files
just sits on the machine and doesn't move with the user.

Another piece of software that keeps substantial data files in My
Documents is TheSky, the astronomy package. So there is a precedent.

As I said, the alternative is that if one thinks of the files as being
(part of) a software package (that belongs to the machine rather than
the user), then they should reside in Program Files and they need an
installer and uninstaller, as well as admin privileges to install or
modify them.

I know this is a part of the Windows Weltanschauung that is quite
different from UNIX -- the file system is heterogeneous depending on the
intended use of the files. Thanks for being willing to accommodate
this; in the long run I think it will help the popularity of NLTK.

All the best,
Michael

Steven Bethard

Jan 15, 2010, 12:19:27 PM
to nltk...@googlegroups.com
On Fri, Jan 15, 2010 at 4:31 AM, Michael A. Covington <m...@uga.edu> wrote:
> Steven Bethard wrote:
>>> Could they default to the user's My Documents folder or a reasonable
>>> guess as to where it might be?
>>
>> I'm still pretty opposed to this. This is one of the mis-features I
>> hate most about non-standards-compliant Windows applications -
>> polluting my Documents folder with stuff I didn't ask to be put there.
>
> I can't think of a better place to put files that are thought of as data
> owned by the individual user.  Of course we won't clutter My Documents with
> individual files -- instead, put a directory there so the user sees only one
> name unless he digs down into it.
>
> There are two striking advantages of My Documents: (1) the user is sure to
> be able to see it and write in it; (2) it will be stored in the right way
> (in a place where there is room for large files without unnecessary
> copying).  As you know, in a roaming user setup, My Documents is usually
> redirected to a file server, where Application Data and some other things
> are copied from server to client upon login, and Program Files just sits on
> the machine and doesn't move with the user.
>
> Another piece of software that keeps substantial data files in My Documents
> is TheSky, the astronomy package.  So there is a precedent.

Yeah, I'm still unconvinced that violating the Windows standards is a
good idea, even if other packages that I've never heard of are doing
it. Everything else that saves to My Documents (Word, Excel, etc.)
prompts the user first with a dialog that shows them that they're
saving to My Documents, lets them choose a name, etc. If the
downloader worked like this, where the user was prompted to choose a
directory to save to, and My Documents/nltk_data was the default, I'd
be fine with it. What I object to is putting stuff into My Documents
behind the user's back.

But, like I said, if there's consensus that we want to violate the
Windows recommendations here, then I won't stand in the way. BTW, for
whoever writes the patch, here's how to get the Documents folder using
ctypes:

    import ctypes
    MAX_PATH = 260 # from winapi C headers
    CSIDL_PERSONAL = 0x0005 # CSIDL constants (from MSDN 2003)
    buf = ctypes.create_unicode_buffer(MAX_PATH)
    if ctypes.windll.shell32.SHGetSpecialFolderPathW(None,buf,CSIDL_PERSONAL,0):
        documents_path = buf.value

> As I said, the alternative is that if one thinks of the files as being (part
> of) a software package (that belongs to the machine rather than the user),
> then they should reside in Program Files and they need an installer and
> uninstaller, as well as admin privileges to install or modify them.

Yep. This is probably my preference because it follows the Windows
application guidelines. And it wouldn't be difficult to write a
setup.py file for each corpus and generate a bunch of bdist_msis that
install them in the usual way.

Michael A. Covington

Jan 15, 2010, 1:02:26 PM
to nltk...@googlegroups.com
Steven Bethard wrote:

>> As I said, the alternative is that if one thinks of the files as being (part
>> of) a software package (that belongs to the machine rather than the user),
>> then they should reside in Program Files and they need an installer and
>> uninstaller, as well as admin privileges to install or modify them.
>
> Yep. This is probably my preference because it follows the Windows
> application guidelines. And it wouldn't be difficult to write a
> setup.py file for each corpus and generate a bunch of bdist_msis that
> install them in the usual way.

I think putting them in Program Files with an .msi file accords best with the way the downloads are actually used (they are "installed" on the machine like software).


Michael A. Covington, Ph.D., Associate Director   www.ai.uga.edu/mc

Michael A. Covington

Jan 15, 2010, 2:03:25 PM
to nltk...@googlegroups.com
A related question, though outside of NLTK's control. I am having
problems with the Matplotlib installer. It apparently won't install in
C:\Program Files\Python26 (though it tries to; then it crashes). Also,
even when successfully installed in C:\Python26 it does not register an
uninstaller with the OS.

Are these known problems with Matplotlib?

Anand Jeyahar

Jan 16, 2010, 12:53:15 PM
to nltk...@googlegroups.com
On 01/15/2010 11:32 PM, Michael A. Covington wrote:
> Yep. This is probably my preference because it follows the Windows
> application guidelines. And it wouldn't be difficult to write a
> setup.py file for each corpus and generate a bunch of bdist_msis that
> install them in the usual way.
>
Hey guys,
I would be willing to do this as a learning exercise if we are
going ahead with this. Just would be glad if you can give a sample though.

--
Thanks and Regards
Anand Jeyahar
http://sites.google.com/a/cbcs.ac.in/students/anand

The man who is really serious,with the urge to find out what truth is, has no style at all. He lives only
in what is.
~Bruce Lee
Truth is a pathless land.
~Jiddu Krishnamurti

Steven Bethard

Jan 17, 2010, 1:02:17 PM
to nltk...@googlegroups.com
On Sat, Jan 16, 2010 at 9:53 AM, Anand Jeyahar <anand....@gmail.com> wrote:
> On 01/15/2010 11:32 PM, Michael A. Covington wrote:
>> Yep. This is probably my preference because it follows the Windows
>> application guidelines. And it wouldn't be difficult to write a
>> setup.py file for each corpus and generate a bunch of bdist_msis that
>> install them in the usual way.
>
>     I would be willing to do this as a learning exercise if we are going
> ahead with this. Just would be glad if you can give a sample though.

I don't know if we're agreed on this yet, and I'm not exactly clear on
where/how the data is downloaded right now, but I assume the setup.py
code would end up looking something like:

    setup(
        name = "nltk-data-brown",
        description = "Natural Language Toolkit Brown Corpus Sample",
        ...
        packages=['nltk'],
        package_dir={'nltk': 'path/to/corpora/directory'},
        package_data = {'nltk': ['corpora/brown/*']},
    )

Where the goal is that corpora get installed to
site-packages/nltk/corpora. My guess is that in addition to the
individual corpus setup.py files, we'd also want one setup.py file
that installed *all* the corpora (for people who don't want to install
each corpus separately).
