Problems creating a word list corpus reader

mschiller

Sep 17, 2011, 6:41:23 PM

to nltk-...@googlegroups.com

First, I create a wordlist.txt file and save it in my nltk_data directory. Then I create the corpus reader:

reader = WordListCorpusReader('.', ['wordlist'])

Then I call the words method on the object:

reader.words()

But, instead of returning the list of words in the file, I get an error message that says:

Traceback (most recent call last):
  File "<pyshell#41>", line 1, in <module>
    reader.words()
  File "C:\Python26\lib\site-packages\nltk\corpus\reader\wordlist.py", line 20, in words
    return line_tokenize(self.raw(fileids))
  File "C:\Python26\lib\site-packages\nltk\corpus\reader\wordlist.py", line 25, in raw
    return concat([self.open(f).read() for f in fileids])
  File "C:\Python26\lib\site-packages\nltk\corpus\reader\api.py", line 193, in open
    stream = self._root.join(file).open(encoding)
  File "C:\Python26\lib\site-packages\nltk\data.py", line 175, in join
    return FileSystemPathPointer(path)
  File "C:\Python26\lib\site-packages\nltk\data.py", line 155, in __init__
    raise IOError('No such file or directory: %r' % path)
IOError: No such file or directory: 'C:\\Python26\\wordlist'

However, when I call reader.fileids(), it returns ['wordlist'] just fine, so it's recognizing the file...just can't read it. Anyone know why it's producing this error?

thnx

Mika

Jacob Perkins

Sep 18, 2011, 12:21:44 PM

to nltk-users

Hi Mika,

The first argument to WordListCorpusReader is the directory containing
your file(s). This is not relative to your nltk_data directory, so you
must either invoke python within your nltk_data directory (if you want
to keep using ".") or pass a more complete path, like
"C:\\path\\to\\nltk_data". fileids() is just returning the list of filenames you
already gave it, so it's not really recognizing your file.
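
To see where a relative root such as "." actually points, here is a minimal sketch using only the standard library. The helper name resolve_root is made up for illustration, and this is not NLTK's own code, just a way to check what a relative path means at the moment you run it:

```python
import os


def resolve_root(root):
    """Return the absolute directory that a (possibly relative) root refers to."""
    # Relative paths are resolved against the current working directory,
    # not against the nltk_data directory.
    return os.path.abspath(root)


print(resolve_root("."))                 # the current working directory
print(os.path.isdir(resolve_root(".")))  # whether that directory exists
```

If the printed directory is not the one that holds your word list, the reader will not find the file there.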

Jacob
---
http://streamhacker.com
http://twitter.com/japerk

mschiller

Sep 18, 2011, 4:28:09 PM

to nltk-users

I can't seem to get around this error, and it should be the simplest
thing in the world. First I tried the file path:

reader = WordListCorpusReader('C:/Users/Mika/nltk_data/corpora/cookbook/wordlist.txt', ['wordlist'])

Same error. Then I thought maybe I don't need the file name in the
first argument:

reader = WordListCorpusReader('C:/Users/Mika/nltk_data/corpora/cookbook', ['wordlist'])

Same error. Then I tried backslashes instead of forward slashes because
that's the actual file path:

reader = WordListCorpusReader('C:\Users\Mika\nltk_data\corpora\cookbook', ['wordlist'])

Same frustrating error. Any thoughts?


Alexis Dimitriadis

Sep 18, 2011, 4:34:23 PM

to nltk-...@googlegroups.com

> you must either invoke python within your nltk_data directory (if you want
> to keep using ".") or pass a more complete path, like
> "C:\\path\\to\\nltk_data".

For portability you can use the function nltk.data.find(). E.g.,
nltk.data.find('corpora') gives
you the full path to the folder nltk_data\corpora.
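
A minimal sketch of that idea, assuming NLTK is installed. The helper name corpora_root is hypothetical. nltk.data.find() raises LookupError when nothing matching is found on NLTK's search path, so the sketch returns None in that case:

```python
import nltk.data


def corpora_root():
    """Return the location of the nltk_data 'corpora' folder, or None if absent."""
    try:
        # Searches the directories listed in nltk.data.path.
        return nltk.data.find("corpora")
    except LookupError:
        return None


root = corpora_root()
print(root)
```

The returned value can be passed as the first argument of a corpus reader in place of a hard-coded path.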

Alexis

Alexis Dimitriadis

Sep 18, 2011, 4:41:46 PM

to nltk-...@googlegroups.com

If your file is named wordlist.txt, use that as the second argument. The
readers will not add .txt for you.
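
Putting the two arguments together, a small self-contained sketch, assuming NLTK is installed. The temporary directory and the four names are throwaway test data, not anything from your setup:

```python
import os
import tempfile

from nltk.corpus.reader import WordListCorpusReader

# Build a throwaway directory containing wordlist.txt (hypothetical data).
root = tempfile.mkdtemp()
with open(os.path.join(root, "wordlist.txt"), "w", encoding="utf-8", newline="\n") as f:
    f.write("kim\nrobert\nmary\njane\n")

# First argument: the directory that holds the file.
# Second argument: the exact file name(s), extension included.
reader = WordListCorpusReader(root, ["wordlist.txt"])
words = reader.words()
print(words)
```

Passing ['wordlist'] here instead would make the reader look for a file literally named wordlist, which does not exist in that directory.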

Incidentally, always use "raw strings" when you have backslashes in a
path: r"C:\Users\Mika\..."
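
A quick illustration of why this matters. The path below is hypothetical, chosen because "\t" and "\n" are both escape sequences in an ordinary string:

```python
# Hypothetical path, for illustration only.
plain = "C:\temp\nltk_data"      # "\t" becomes a tab, "\n" becomes a newline
raw = r"C:\temp\nltk_data"       # raw string: backslashes are kept literally
escaped = "C:\\temp\\nltk_data"  # doubling every backslash also works

print(repr(plain))
print(repr(raw))
print(raw == escaped)
```

Forward slashes, as in 'C:/Users/Mika/...', avoid the problem as well. Note that in Python 3 an ordinary string containing "\U", as in "C:\Users", is rejected as a syntax error, so there a raw string or forward slashes are needed just to get the code to run.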

Alexis

mschiller

Sep 18, 2011, 5:20:22 PM

to nltk-users

Thanks, Alexis. Adding the full file name in the second argument did
it. However, it returns

['kim\r', 'robert\r', 'mary\r', 'jane']

Any idea why the \r occurs after the first three values?

John K Pate

Sep 18, 2011, 5:43:46 PM

to nltk-...@googlegroups.com

The \r has to do with the Windows newline sequence. Unix-style newlines
are typically "\n", but Windows-style newlines are "\r\n". You can just
delete the "\r" at the end if it's there.
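
A short sketch of what is going on, assuming the file holds the four names separated by Windows newlines (that content is inferred from the output above):

```python
# Text as it would look when read from a file with Windows line endings.
raw_text = "kim\r\nrobert\r\nmary\r\njane"

# Splitting on "\n" alone leaves the "\r" attached to each word but the last.
split_on_newline = raw_text.split("\n")

# Option 1: strip a trailing "\r" from each word.
stripped = [word.rstrip("\r") for word in split_on_newline]

# Option 2: str.splitlines() understands "\r\n" as well as "\n".
via_splitlines = raw_text.splitlines()

print(split_on_newline)
print(stripped)
print(via_splitlines)
```

The last word has no "\r" because the file does not end with a newline.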

John

==
http://homepages.inf.ed.ac.uk/s0930006/


ekli...@gmail.com

Oct 15, 2017, 9:46:13 AM

to nltk-users

Mika, did you solve this problem? I have the same error as you.

Dimitriadis, A. (Alexis)

Oct 15, 2017, 10:50:49 AM

to nltk-...@googlegroups.com

Eklil.Zia, maybe you should describe the exact problem you are having. I see two problems with the code in the ancient question you found:

1) If the first (“root”) argument of the corpus reader is a relative path, it is relative to the current directory of the script you run. (Your own corpora don’t belong in `nltk_data` anyway.)

2) `wordlist` and `wordlist.txt` are not the same filename.

If this doesn’t help, please let us know your own situation.

Alexis
