wordnet problem: adding synonyms to synsets


Ben

Mar 8, 2012, 1:40:18 PM
to nltk-users
I'm using the wordnet module to do some synonym analysis. It works
great as it is, but when I try to add my own new synonyms I start
getting errors. I think it has specifically to do with the data.*pos*
files, because I can add things to the index.sense and index.*pos*
files without any problems. Below is an example of what I'm doing and
what the errors are (forgive the silly example).

I want to add the word "thingy" to the "entity" synset,
Synset('entity.n.01'). So I add 'thingy' to the index.noun file with
this line:

thingy n 1 1 ~ 1 1 00001740

Then, I add it to the index.sense file with this line:

thingy%1:03:00:: 00001740 1 11
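
If you want to script those two edits, a minimal sketch (assuming the corpus lives in the standard nltk_data 'corpora/wordnet' directory) is just a pair of appends:

import os
import nltk.data

# Append the two new index entries (a sketch; the path assumes the
# standard nltk_data 'corpora/wordnet' install).
wn_dir = str(nltk.data.find('corpora/wordnet'))

with open(os.path.join(wn_dir, 'index.noun'), 'a') as f:
    f.write('thingy n 1 1 ~ 1 1 00001740\n')

with open(os.path.join(wn_dir, 'index.sense'), 'a') as f:
    f.write('thingy%1:03:00:: 00001740 1 11\n')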

At this point, there are no errors, and I can see that the new word is
in the 'entity.n.01' synset:

>>> from nltk.corpus import wordnet
>>> wordnet.synsets('thingy')
[Synset('entity.n.01')]

But, the word is not listed as a lemma yet because I haven't added it
to the data.noun file (i.e. it just lists ['entity'], not
['entity','thingy']):

>>> wordnet.synset('entity.n.01').lemma_names
['entity']

So, I add 'thingy' to the data.noun file by changing the first line
from this:

00001740 03 n 01 entity 0 003 ~ 00001930 n 0000 ~ 00002137 n 0000 ~ 04424418 n 0000 | that which is perceived or known or inferred to have its own distinct existence (living or nonliving)

to this:

00001740 03 n 02 entity 0 thingy 0 003 ~ 00001930 n 0000 ~ 00002137 n 0000 ~ 04424418 n 0000 | that which is perceived or known or inferred to have its own distinct existence (living or nonliving)
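
The same edit can be scripted too; here is a minimal sketch that rewrites the file with the one changed line (again assuming the standard nltk_data location, and best tried on a copy of the file):

import os
import nltk.data

# Splice 'thingy 0 ' into the 'entity' line of data.noun -- the same
# manual edit as above, just done from Python.  (Assumes the standard
# nltk_data 'corpora/wordnet' location.)
wn_dir = str(nltk.data.find('corpora/wordnet'))
path = os.path.join(wn_dir, 'data.noun')

with open(path) as f:
    text = f.read()

text = text.replace('00001740 03 n 01 entity 0 003',
                    '00001740 03 n 02 entity 0 thingy 0 003', 1)

with open(path, 'w') as f:
    f.write(text)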

Now, everything works fine for Synset('entity.n.01'):

>>> from nltk.corpus import wordnet
>>> wordnet.synsets('thingy')
[Synset('entity.n.01')]
>>> wordnet.synset('entity.n.01').lemma_names
['entity', 'thingy']

BUT, now every other noun fails! For example:

>>> wordnet.synsets('make')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/nltk/corpus/reader/wordnet.py", line 1220, in synsets
    for offset in index[form].get(p, [])]
  File "/Library/Python/2.7/site-packages/nltk/corpus/reader/wordnet.py", line 1078, in _synset_from_pos_and_offset
    synset = self._synset_from_pos_and_line(pos, data_file_line)
  File "/Library/Python/2.7/site-packages/nltk/corpus/reader/wordnet.py", line 1178, in _synset_from_pos_and_line
    raise WordNetError('line %r: %s' % (data_file_line, e))
nltk.corpus.reader.wordnet.WordNetError: line 'unism" \n': need more than 1 value to unpack

And another:

>>> wordnet.synsets('file')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/nltk/corpus/reader/wordnet.py", line 1220, in synsets
    for offset in index[form].get(p, [])]
  File "/Library/Python/2.7/site-packages/nltk/corpus/reader/wordnet.py", line 1078, in _synset_from_pos_and_offset
    synset = self._synset_from_pos_and_line(pos, data_file_line)
  File "/Library/Python/2.7/site-packages/nltk/corpus/reader/wordnet.py", line 1178, in _synset_from_pos_and_line
    raise WordNetError('line %r: %s' % (data_file_line, e))
nltk.corpus.reader.wordnet.WordNetError: line 'rench" \n': need more than 1 value to unpack


I'm pretty sure it has something to do with the formatting of the
data.noun file. I've noticed that adding 'thingy' to that first line
shifts everything after it by a few characters (hence the garbled
fragments above: 'rench" \n' and 'unism" \n'). Could it really be so
inflexible that I can't add anything to it?

Can someone tell me what's going on please? This is driving me mad.

Any help is appreciated.
Thanks,
Ben

Ben

Mar 9, 2012, 2:10:55 PM
to nltk-users
So I've found the problem. What I thought were indexes are actually
byte offsets. Each synset is identified by an 8-digit number that gives
the position of that synset's data in the data.*pos* file. So adding
characters/spaces to a line in the data file means that every synset
following that line needs a new offset, increased by the number of
characters added.
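
A quick way to convince yourself of this (assuming the corpus is in the standard nltk_data location) is to seek to an offset and read a line, which is essentially what the NLTK reader does internally:

import os
import nltk.data

# The 8-digit ID is a byte position in data.noun, not a line number.
wn_dir = str(nltk.data.find('corpora/wordnet'))
with open(os.path.join(wn_dir, 'data.noun'), 'rb') as f:
    f.seek(1740)             # Synset('entity.n.01') lives at offset 00001740
    print(f.readline())      # prints the 'entity' data line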

Using my previous example: if you add the word 'thingy' to the 'entity'
synset, you insert 'thingy 0 ' (the new word, its lex_id, and the
separating space), which is 9 characters, so you have to add 9 to the
offset of every synset on the succeeding lines, i.e.

00001740 03 n 01 entity 0 003 ~ 00001930 n ............
00001930 03 n 01 physical_entity 0 007 @ 00001740 n ............
00002137 03 n 02 abstraction 0 abstract_entity 0 010 @ 00001740 ............
.
.
.

goes to

00001740 03 n 01 entity 0 thingy 0 003 ~ 00001930 n ............
00001938 03 n 01 physical_entity 0 007 @ 00001740 n ............
00002145 03 n 02 abstraction 0 abstract_entity 0 010 @
00001740 ............
.
.
.

because every line after the edited one has moved down by those 9
characters. And, of course, the new offsets must also be re-defined in
the index.*pos* file and the index.sense file, and any pointers inside
the data.*pos* lines that reference a moved synset have to be updated
to match.

Coming up with an automated way to do this could be hard. Adding a new
synset should be relatively easy because all you have to do is add a
new line at the end of the data.*pos* file and a new line anywhere to
the index files (I don't think lines have to be in alphabetical
order). However, adding new words to a synset is very much non-
trivial, since all changes must be cascaded down the list from the
line that is edited.

Does anyone have any tips on how to automate this in an efficient way?
The data.noun file is 15MB!
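
For concreteness, a rough, untested sketch of the cascade might look like the following. The assumptions are mine: the WordNet 3.0 files sit in the standard nltk_data 'corpora/wordnet' directory, only the noun files are rewritten, and every 8-digit field followed by ' n ' in data.noun is treated as a noun synset offset. Try it on a copy of the corpus:

import os
import re
import nltk.data

WN_DIR = str(nltk.data.find('corpora/wordnet'))
NOUN_PTR = re.compile(r'\b\d{8}\b(?= n )')   # noun pointers inside data lines
ANY_OFFSET = re.compile(r'\b\d{8}\b')        # offset fields in the index files

def add_lemma_to_noun_synset(target_offset, lemma):
    data_path = os.path.join(WN_DIR, 'data.noun')
    with open(data_path) as f:
        lines = f.readlines()                # keep '\n'; line lengths matter

    # 1. Splice the new word into the target synset's line and bump w_cnt.
    for i, line in enumerate(lines):
        if line.startswith(target_offset + ' '):
            fields = line.split(' ')
            w_cnt = int(fields[3], 16)       # word count is a 2-digit hex field
            fields[3] = '%02x' % (w_cnt + 1)
            insert_at = 4 + 2 * w_cnt        # just past the last word/lex_id pair
            fields[insert_at:insert_at] = [lemma, '0']
            lines[i] = ' '.join(fields)
            break
    else:
        raise ValueError('no noun synset at offset %s' % target_offset)

    # 2. Recompute every synset's byte offset.  Offsets are fixed-width
    #    8-digit fields, so rewriting them never changes a line's length.
    mapping, position = {}, 0
    for line in lines:
        if not line.startswith('  '):        # skip the copyright header
            mapping[line[:8]] = '%08d' % position
        position += len(line)
    remap = lambda m: mapping.get(m.group(0), m.group(0))

    # 3. Rewrite data.noun: the leading offset of each line, plus any noun
    #    pointers embedded in the rest of the line.
    with open(data_path, 'w') as f:
        for line in lines:
            if not line.startswith('  '):
                line = mapping[line[:8]] + NOUN_PTR.sub(remap, line[8:])
            f.write(line)

    # 4. Patch the noun offsets in the index files.  index.sense mixes all
    #    parts of speech, and offsets can collide numerically across POS
    #    files, so only lines whose sense key marks a noun ('%1:') are touched.
    def patch(name, is_noun_line):
        path = os.path.join(WN_DIR, name)
        with open(path) as f:
            idx_lines = f.readlines()
        with open(path, 'w') as f:
            for line in idx_lines:
                if not line.startswith('  ') and is_noun_line(line):
                    line = ANY_OFFSET.sub(remap, line)
                f.write(line)

    patch('index.noun', lambda line: True)
    patch('index.sense', lambda line: '%1:' in line.split(' ')[0])

add_lemma_to_noun_synset('00001740', 'thingy')

Note that the other data.*pos* files also contain pointers into data.noun (derivationally related forms, for example), so a complete tool would have to remap those with the same mapping; the sketch above ignores them.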

Thanks,
Ben

Keith Stevens

Mar 9, 2012, 8:14:18 PM
to nltk-...@googlegroups.com
Hey Ben,

I wanted to do the same thing a while back, so I ended up writing my own Java library for adding new data to Princeton WordNet. The code for doing this is exceptionally complicated and error-prone, so I wouldn't recommend doing it twice. You're welcome to look through it; most of the code is located here: https://github.com/fozziethebeat/C-Cat/blob/master/wordnet/src/main/java/gov/llnl/ontology/wordnet/WordNetCorpusWriter.java

Alternatively, you could use the library I wrote to load up WordNet, make some changes, and then serialize it. After serialization, you can just use the data files in NLTK. I haven't actually tried testing the serialized files with NLTK, but they should work correctly. The core of the package is here: https://github.com/fozziethebeat/C-Cat/tree/master/wordnet.

Hope this helps!
--Keith

