Question about page 51 of NLTK with Python Book


max77

3 Jul 2017, 07:25:40
to nltk-users
Hello all,

I am on page 51 of the NLTK with Python Book but I am having trouble with some commands...

I am working on a Raspberry Pi 3 (Jessie) and don't know how to make the commands match my Linux file system.

This is what I have so far:

>>> from nltk.corpus import BracketParseCorpusReader
>>> corpus_root = r"/home/pi/nltk_data/corpora/penntreebank/parsed/mrg/wsj"
>>> 
>>> file_pattern = r".*/wsj_.*\.mrg"
>>> ptb = BracketParseCorpusReader(corpus_root, file_pattern)

Traceback (most recent call last):
  File "<pyshell#16>", line 1, in <module>
    ptb = BracketParseCorpusReader(corpus_root, file_pattern)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/bracket_parse.py", line 49, in __init__
    CorpusReader.__init__(self, root, fileids, encoding)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/api.py", line 84, in __init__
    root = FileSystemPathPointer(root)
  File "/usr/local/lib/python2.7/dist-packages/nltk/compat.py", line 221, in _decorator
    return init_func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 303, in __init__
    raise IOError('No such file or directory: %r' % _path)
IOError: No such file or directory: '/home/pi/nltk_data/corpora/penntreebank/parsed/mrg/wsj'



Since I am not doing this on Windows and don't have a C: drive, I changed the bold line (the corpus_root path).

Any thoughts, tips, or suggestions as to how I can fix this?

-Thanks!

Dimitriadis, A. (Alexis)

3 Jul 2017, 08:32:08
to nltk-...@googlegroups.com
The format of the path you wrote is correct, so the message must be correct too: you don’t have a folder at the specified path. I assume you actually downloaded the Penn Treebank files? Use a non-Python method (a bash terminal or a GUI file navigator, if your environment provides one) to inspect the folder structure and find out where your files actually are. For example, after downloading the “Penn treebank sample” the `.mrg` files are in .../nltk_data/corpora/treebank/combined.
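
If you would rather stay in Python for that inspection, here is a minimal sketch that walks a folder tree and lists the directories containing `.mrg` files. The helper name `find_mrg_dirs` and the starting path are assumptions for illustration; neither is part of NLTK.

```python
import os


def find_mrg_dirs(start):
    """Return a sorted list of directories under `start` that contain .mrg files."""
    hits = set()
    for dirpath, _dirnames, filenames in os.walk(start):
        if any(name.endswith(".mrg") for name in filenames):
            hits.add(dirpath)
    return sorted(hits)


if __name__ == "__main__":
    # Assumed location of the NLTK data folder; adjust to your system.
    start = os.path.expanduser("~/nltk_data/corpora")
    for directory in find_mrg_dirs(start):
        print(directory)
```

Whatever directory this prints is a candidate for `corpus_root`. If it prints nothing, no `.mrg` files exist under the starting folder.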

Alexis


Dr. Alexis Dimitriadis | Assistant Professor and Senior Research Fellow | Utrecht Institute of Linguistics OTS | Utrecht University | Trans 10, 3512 JK Utrecht, room 2.33 | +31 30 253 65 68 | a.dimi...@uu.nl | www.hum.uu.nl/medewerkers/a.dimitriadis


max77

8 Jul 2017, 15:14:51
to nltk-users
Hello again,

I've been experimenting with different approaches and still am not making progress. The wsj files with the .mrg extension are in fact located in the folder combined, like you said, so that was a good clue. But now the code returns an empty list of file IDs and then fails. Here's the code:
>>> from nltk.corpus import BracketParseCorpusReader
>>> corpus_root = r"/home/pi/nltk_data/corpora/treebank/combined"
>>> file_pattern = r".*/wsj_.*\.mrg"
>>> ptb = BracketParseCorpusReader(corpus_root, file_pattern)
>>> ptb.fileids()
[]
>>> len(ptb.sents())

Traceback (most recent call last):
  File "<pyshell#77>", line 1, in <module>
    len(ptb.sents())
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/api.py", line 414, in sents
    for fileid, enc in self.abspaths(fileids, True)])
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/util.py", line 422, in concat
    raise ValueError('concat() expects at least one object!')
ValueError: concat() expects at least one object!

Dimitriadis, A. (Alexis)

8 Jul 2017, 15:43:52
to nltk-...@googlegroups.com
That’s pretty obvious, if you’ll forgive me for saying so: your `file_pattern` includes a slash, which effectively requires the `mrg` files to be in a subdirectory, but they are not. Just write `file_pattern = r"wsj_.*\.mrg"` and you’re in business.

Incidentally, defining your own reader instance is good practice, but this dataset can be accessed with `from nltk.corpus import treebank`.
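
To see why the slash matters, here is a minimal sketch. It assumes the reader matches `file_pattern` against each file's path relative to `corpus_root`, anchored at the start of the string; the empty `fileids()` result is consistent with that. The helper `matching_fileids` and the sample file names are hypothetical, and the code does not touch NLTK itself.

```python
import re


def matching_fileids(pattern, relative_paths):
    """Keep the relative paths that `pattern` matches from the start."""
    return [path for path in relative_paths if re.match(pattern, path)]


# Hypothetical file ids, relative to the corpus root.
flat = ["wsj_0001.mrg", "wsj_0002.mrg", "README"]      # files directly in the root
nested = ["00/wsj_0001.mrg", "01/wsj_0101.mrg"]        # files inside subfolders

with_slash = r".*/wsj_.*\.mrg"
without_slash = r"wsj_.*\.mrg"

print(matching_fileids(with_slash, flat))       # []
print(matching_fileids(without_slash, flat))    # ['wsj_0001.mrg', 'wsj_0002.mrg']
print(matching_fileids(with_slash, nested))     # ['00/wsj_0001.mrg', '01/wsj_0101.mrg']
print(matching_fileids(without_slash, nested))  # []
```

A pattern with a slash only fits a layout where the files sit in subfolders of the root. A flat folder such as `combined` needs the pattern without it.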

Alexis


max77

9 Jul 2017, 08:46:27
to nltk-users
Terrific! It worked finally. Thanks Alexis for your help. :)