Natural Language Processing with Python Text Exercise 3.42 Creating a Semantic Index

Bio

Jun 10, 2012, 10:27:55 AM

to nltk-users

Hello, I am having difficulty understanding exercise 42 from chapter 3
of the Natural Language Processing with Python text. I am wondering if
anybody understands this exercise and might be able to give me some
guidance. I am not taking an NLTK class or anything; I'm just working
my way through the NLTK book trying to learn NLTK. Exercise 42 says:
Use WordNet to create a semantic index for a text collection. Extend
the concordance search program in Example 3-1, indexing each word
using the offset of its first synset, e.g.,
wn.synsets('dog')[0].offset ...

This is Example 3-1:

import nltk

class IndexedText(object):

    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        # Map each stem to the list of positions where it occurs
        self._index = nltk.Index((self._stem(word), i)
                                 for (i, word) in enumerate(text))

    def concordance(self, word, width=40):
        key = self._stem(word)
        wc = width // 4                # words of context
        for i in self._index[key]:
            lcontext = ' '.join(self._text[i-wc:i])
            rcontext = ' '.join(self._text[i:i+wc])
            ldisplay = '%*s' % (width, lcontext[-width:])
            rdisplay = '%-*s' % (width, rcontext[:width])
            print ldisplay, rdisplay

    def _stem(self, word):
        return self._stemmer.stem(word).lower()

To run the module you set the stemmer to the Porter Stemmer, then set
the text to grail.txt from the nltk corpus and call IndexedText()
using the stemmer and text as your arguments.

porter = nltk.PorterStemmer()
grail = nltk.corpus.webtext.words('grail.txt')
text = IndexedText(porter, grail)

This produces an index of the locations of every word in the text,
keyed by the word's stem. The example then goes on to build a
concordance from this index, but that portion of the example is not
relevant to the exercise.

So I wrote an amended class to try and create a semantic index for
WordNet.

class SymanticIndexedText(object):

    def __init__(self, stemmer):
        self._stemmer = stemmer
        from nltk.corpus import wordnet as wn
        self._wn = wn

        # The following self._index is the original from the exercise
        #self._index = nltk.Index((self._stem(word), i)
        #                         for (i, word) in enumerate(text))

        # The following self._index statements are my unsuccessful
        # attempts to create the WordNet semantic index
        #self._index = nltk.Index((self._stem(word), i)
        #                         for (i, word) in enumerate(self._wn))
        #self._index = nltk.Index((self._stem(word), i)
        #                         for (i, word) in self._wn.synsets(word)[0].offset)
        #self._index = nltk.Index((self._stem(word), i)
        #                         for self._wn.synsets(word)[0].offset in self._wn)

    # The following is the _stem def from the original exercise
    def _stem(self, word):
        return self._stemmer.stem(word).lower()

To run this I just set the stemmer to the Porter Stemmer and call
SymanticIndexedText() with porter as the argument.

porter = nltk.PorterStemmer()
text = SymanticIndexedText(porter)

I get that what I am trying to do is create a list of every word in
WordNet and the indexed location of each occurrence of the word in
the WordNet synset. At least I think that is what I am being asked to
do. Maybe I'm supposed to take a random text and create an index of
each WordNet location of every synset of every word in the text. I
believe some of my difficulty arises from uncertainty about the
nltk.Index(pairs) method. It seems to me that what nltk.Index() does
is take every word in an iterable object (in the case of the example,
grail.txt) and build an index of every word in that iterable object
as well as the index of every matched word from the function that is
paired with that word (in the case of the example, self._stem(word)). I
have been trying to better understand the nltk.Index method. When I
try help(nltk.Index), the results don't really seem to give me any
useful information. I have also been unable to locate any
information on nltk.Index in the NLTK book.
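
For readers with the same question, here is a minimal sketch of what nltk.Index appears to do, judging from the "for key, value in pairs" line that shows up in the nltk/util.py traceback later in this thread: it collects (key, value) pairs into a dictionary of lists. The class name SimpleIndex and the toy stemmer based on rstrip('s') are assumptions made for illustration; they are stand-ins, not NLTK code.

```python
from collections import defaultdict

class SimpleIndex(defaultdict):
    """Illustrative stand-in for nltk.Index: key -> list of paired values."""

    def __init__(self, pairs):
        defaultdict.__init__(self, list)
        for key, value in pairs:
            self[key].append(value)

words = ['the', 'knights', 'say', 'ni', 'the', 'knight']

# Toy "stemmer": strip a trailing 's' (a crude stand-in for the Porter stemmer).
index = SimpleIndex((w.lower().rstrip('s'), i) for (i, w) in enumerate(words))

print(index['the'])     # positions of 'the'
print(index['knight'])  # positions of 'knights' and 'knight'
```

Under this reading, the first element of each pair decides which list the second element is appended to, so swapping the stem for a WordNet offset changes only the key.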

What I am assuming is that I should iterate through each word in
WordNet and create an index of the synset of each word and each word's
stem. However, when I try to iterate through WordNet in the nltk.Index
call, I get error messages telling me:

  File "/Users/georgeorton/textproc3.py", line 590, in __init__
    for (i, word) in enumerate(self._wn))
TypeError: 'LazyCorpusLoader' object is not iterable

or, if I try to iterate through the WordNet synsets, I get an error
message telling me I can't access each of the words in WordNet's
synset:

  File "/Users/georgeorton/textproc3.py", line 592, in __init__
    for (i, word) in self._wn.synsets(word)[0].offset)
NameError: global name 'word' is not defined
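
A small sketch of why a NameError like this can arise: the iterable of a generator expression's "for" clause is evaluated before the loop variable exists, so a name cannot appear in the very iterable that is supposed to define it. The function names and the sample list below are hypothetical and only illustrate the scoping rule.

```python
def build_pairs_broken(text):
    # Mirrors the failing attempt: 'word' is used inside the iterable,
    # but it is only bound by the 'for' clause that consumes that iterable.
    return list((word, i) for (i, word) in enumerate(word))

def build_pairs_fixed(text):
    # Iterate over the text itself; 'word' is bound by the 'for' clause.
    return list((word, i) for (i, word) in enumerate(text))

try:
    build_pairs_broken(['spam', 'eggs'])
    outcome = 'no error'
except NameError:
    outcome = 'NameError'

fixed = build_pairs_fixed(['spam', 'eggs'])
print(outcome)
print(fixed)
```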

It's probably pretty clear from the questions I'm asking that I am not
only confused about how nltk.Index() works but also exactly what the
exercise is asking me to do in the first place. If anybody has any
ideas on how to proceed I would certainly appreciate the help.
Sincerely, George

Bio

Jun 10, 2012, 1:42:58 PM

to nltk-users

Hello, As I think more about this exercise, I am beginning to believe
that it is asking me to find the offset of the first synset
of each word in a text rather than the offset of the first synset of
each word in WordNet. That being the case, I changed my code to:

class SymanticIndexedText(object):

    def __init__(self):

        self._text = nltk.corpus.webtext.words('grail.txt')

        from nltk.corpus import wordnet as wn
        self._wn = wn

        self._index = nltk.Index((self._wn.synsets(word)[0].offset, i)
                                 for (i, word) in enumerate(self._text))

Here I used the same text as in Example 3-1
(nltk.corpus.webtext.words('grail.txt')). When I run this I get the
following error:

Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    SymanticIndexedText()
  File "/Users/georgeorton/textproc3.py", line 598, in __init__
    for (i, word) in enumerate(self._text))
  File "/Library/Python/2.7/site-packages/nltk/util.py", line 118, in __init__
    for key, value in pairs:
  File "/Users/georgeorton/textproc3.py", line 598, in <genexpr>
    for (i, word) in enumerate(self._text))
IndexError: list index out of range

It seems to me that the argument for nltk.Index that I am using,
(self._wn.synsets(word)[0].offset, i)
for (i, word) in enumerate(self._text)
is similar to a list comprehension where
(self._wn.synsets(word)[0].offset, i) is being returned for each word
in the enumerated text.
I don't see how my list index is out of range if I am just stepping
through each word in my text. I may just be holding a conversation
with myself here but if anybody has any thoughts I'd appreciate
hearing them. Sincerely, George

Morten Minde Neergaard

Jun 11, 2012, 9:17:37 AM

to nltk-...@googlegroups.com

At 10:42, Sun 2012-06-10, Bio wrote:
[…]

> self._index = nltk.Index((self._wn.synsets(word)[0].offset, i)
> for (i, word) in enumerate(self._text))

[…]

> IndexError: list index out of range
>
> It seems to me that the argument for nltk.Index that I am using
> (self._wn.synsets(word)[0].offset, i)
> for (i, word) in enumerate(self._text)
> is similar to a list comprehension where
> (self._wn.synsets(word)[0].offset, i) is being returned for each word
> in the enumerated text.

It's a generator expression[0], which is indeed very similar to a list
comprehension. You can view it as a lazy, thus potentially more
efficient, way of doing a comprehension.
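
A short sketch of that laziness, using a toy dictionary (the name lookups and its contents are made up for illustration). It also suggests why the traceback above points into nltk/util.py: the generator's body only runs when something consumes it, so the error surfaces at the point of consumption.

```python
words = ['spam', 'eggs', 'ham']

# List comprehension: every pair is computed immediately.
as_list = [(w.upper(), i) for (i, w) in enumerate(words)]

# Generator expression: nothing is computed until a value is requested.
as_gen = ((w.upper(), i) for (i, w) in enumerate(words))
first = next(as_gen)   # computes only the first pair
rest = list(as_gen)    # computes the remaining pairs

# An error inside a generator expression is raised only on consumption.
lookups = {'spam': [1], 'eggs': []}          # hypothetical lookup table
pairs = ((lookups[w][0], i) for (i, w) in enumerate(['spam', 'eggs']))
try:
    consumed = list(pairs)   # the IndexError for 'eggs' happens here
    error_seen = False
except IndexError:
    error_seen = True
```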

> I don't see how my list index is out of range if I am just stepping
> through each word in my text.

You may get faster responses by asking these questions on, e.g., one of
the Python IRC channels[1]. When you get an IndexError, it's because you
tried to access an indexable object with an index number that doesn't
exist. Without having looked too closely at the code, I'd suspect that
self._wn.synsets(word) is returning an empty list at some point:

>>> import nltk
>>> nltk.corpus.wordnet.synsets('python')
[Synset('python.n.01'), Synset('python.n.02'), Synset('python.n.03')]
>>> nltk.corpus.wordnet.synsets('javascript')
[]
>>> nltk.corpus.wordnet.synsets('python')[0]
Synset('python.n.01')
>>> nltk.corpus.wordnet.synsets('javascript')[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range
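
One way to guard against this, sketched with a hypothetical helper named first_or_none (it is not part of NLTK): an empty list is falsy in Python, so it can be tested before indexing.

```python
def first_or_none(items):
    """Return the first element of a sequence, or None if it is empty."""
    return items[0] if items else None

found = first_or_none(['python.n.01', 'python.n.02'])
missing = first_or_none([])
print(found)
print(missing)
```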

[0]: http://www.python.org/dev/peps/pep-0289/
[1]: http://www.python.org/community/irc/

Kind regards
--
Morten Minde Neergaard

Bio

Jun 12, 2012, 10:17:42 AM

to nltk-...@googlegroups.com

Hello, Thank you for your reply, Morten. Your comment "Without having looked too closely at the code, I'd suspect that self._wn.synsets(word) is returning an empty list at some point" led me to write a loop that checked for the first WordNet synset of each word in the text I was creating the semantic index for. The result showed that if the word was present in WordNet then the offset was returned, but if the word (or, often, punctuation) was not in WordNet then an "IndexError: list index out of range" error was raised.

It was a fairly straightforward exercise to remove all the punctuation from the text being indexed; however, removing the words in the text that are not in WordNet is more problematic. My first thought was just to run a check to see if the word was in WordNet:

if w in wn:

unfortunately this raised an error:

TypeError: argument of type 'WordNetCorpusReader' is not iterable

My next thought was to compare every word in the text being indexed to the words in nltk.corpus.words.words() and remove any word from the text that is not in nltk.corpus.words.words(). This removed several of the words not in WordNet, but not all.

So it seems I can create a semantic index of a text provided it does not contain any words that are not in WordNet, but if the text being indexed contains a word not in WordNet then the indexing operation fails. I don't suppose anybody knows of a way to test WordNet for the presence of a specified word?
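
One possible approach, sketched here with a toy stand-in for WordNet so that it runs without NLTK: since a lookup that finds nothing returns an empty list (as Morten's example with 'javascript' shows), the emptiness of that list can itself serve as the membership test, and unknown words can be skipped while the index is built. The function build_semantic_index, the TOY_SENSES table, and its offsets are all invented for illustration.

```python
from collections import defaultdict

def build_semantic_index(words, lookup):
    """Map the first sense key of each word to the positions where it occurs.

    lookup(word) must return a list of keys (possibly empty). Words with
    no entry are skipped instead of raising IndexError.
    """
    index = defaultdict(list)
    for i, word in enumerate(words):
        senses = lookup(word.lower())
        if senses:                       # empty list: word unknown, skip it
            index[senses[0]].append(i)
    return index

# Toy stand-in for WordNet; these offsets are made up, not real WordNet data.
TOY_SENSES = {'dog': [1001, 1002], 'hound': [1001], 'cat': [2001]}

def toy_lookup(word):
    return TOY_SENSES.get(word, [])

words = ['Dog', ',', 'cat', 'javascript', 'hound']
index = build_semantic_index(words, toy_lookup)
print(dict(index))
```

With NLTK installed, the lookup argument could presumably be something like lambda w: [s.offset for s in wn.synsets(w)]. In the NLTK version used in this thread, offset is an attribute; in later releases it is a method, offset().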

Here is the self._index code that works given the stipulation noted above:

self._index = nltk.Index((self._wn.synsets(re.findall(r'[a-z]+',word[0]))[0].offset, i)