Sentence-level wordnet similarity measures

Edward Grefenstette

Jun 26, 2009, 11:52:32 AM
to nltk-users
I've been wrestling with trying to create a halfway-decent wordnet-
based similarity measure for sentences.
Here's a sample function which attempts to do this. It takes two
strings and a WordNet metric from nltk.corpus.wordnet (for example,
nltk.corpus.wordnet.path_similarity) as arguments. It tokenizes each
string into a list of tokens, builds a list of synsets for each token
list, then scores every pair of synsets (one taken from each list)
with the metric and returns the average score.

The aim is to extend WordNet-based metrics from words to sentences. I
don't expect amazing results, but I was at least hoping it would
recognise how close two copies of a simple sentence are, which isn't
happening.

Code and example are as follows:

=============================
from nltk.corpus import wordnet as wn
from nltk import word_tokenize as wt
from nltk.corpus import stopwords

def scoreline(line1, line2, metric, ic=None):
    sw = stopwords.words('english')    # load the English stopword list
    t1 = wt(line1)                     # tokenize line1
    t2 = wt(line2)                     # tokenize line2
    # flatten the per-token synset lists into one list per line, skipping stopwords
    syns1 = reduce(lambda x, y: x + y, [wn.synsets(x) for x in t1 if x not in sw])
    syns2 = reduce(lambda x, y: x + y, [wn.synsets(x) for x in t2 if x not in sw])
    runningscore = 0.0
    runningcount = 0
    print "syns1: ", syns1
    print "syns2: ", syns2
    # get the WordNet similarity score under <metric> for every pair of synsets
    # (one taken from each line's synset list)
    for syn1 in set(syns1):
        for syn2 in set(syns2):
            try:
                if ic is not None:
                    mark = metric(syn1, syn2, ic)   # IC-based metrics need an information-content dict
                else:
                    mark = metric(syn1, syn2)
            except Exception:
                mark = 0.0
            if mark is None:                        # incomparable pairs (e.g. across POS) count as 0
                mark = 0.0
            runningscore += mark
            runningcount += 1
    score = runningscore / runningcount    # average over all pair scores
    return score                           # return the overall score
=============================

Running this, however, produces cruddy results:
===
>>> scoreline("The dog bit the man","The dog bit the man",wn.path_similarity)
0.086090771697353574
===

I think the problem has to do with the fact that each token has quite
a few senses, most of which are not related to the meaning intended in
the sentence, and that overall this cascades and produces poor
results. However, if anyone has any bright ideas as to how to get
around, or at least mitigate, this, I'd be very happy to hear them.

Best,
Edward

Jordan Boyd-Graber

Jun 26, 2009, 12:22:59 PM
to nltk-...@googlegroups.com
Averaging over the senses will give poor results for the reasons
you describe. Intuitively, this makes sense; a sentence only
expresses one of the senses, not all of them. You might want to
consider disambiguating the senses before computing the similarity.
There are various ways you might do this:
1) First sense heuristic
2) Maximizing sentence internal similarity before computing similarity
3) Something more elaborate, e.g. with SVMs or an outside package

Failing this, you could use the max similarity for each word (rather
than averaging) to yield an implicit disambiguation.
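
For example, the max-based variant might look roughly like this (an
illustrative sketch only, not tested code; the stopword filtering is
just borrowed from your example):
===
from nltk.corpus import wordnet as wn
from nltk import word_tokenize
from nltk.corpus import stopwords

def scoreline_max(line1, line2, metric=wn.path_similarity):
    sw = set(stopwords.words('english'))
    words1 = [w for w in word_tokenize(line1) if w not in sw]
    syns2 = [s for w in word_tokenize(line2) if w not in sw
               for s in wn.synsets(w)]
    best = []
    for w in words1:
        # best score over any sense pairing for this word: an implicit
        # disambiguation, since only the closest senses contribute
        scores = []
        for s1 in wn.synsets(w):
            for s2 in syns2:
                try:
                    scores.append(metric(s1, s2) or 0.0)   # None -> 0.0
                except Exception:
                    scores.append(0.0)
        if scores:
            best.append(max(scores))
    return sum(best) / len(best) if best else 0.0
===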

Cheers,

Jordan
--
--------------------
Jordan Boyd-Graber
609.384.1417

AIM: ezubaric
j...@princeton.edu
http://www.cs.princeton.edu/~jbg
--------------------

"In theory, there is no difference between theory and practice. But,
in practice, there is."
- Jan L.A. van de Snepscheut

Edward Grefenstette

Jun 26, 2009, 5:43:20 PM
to nltk-users


On Jun 26, 5:22 pm, Jordan Boyd-Graber <j...@princeton.edu> wrote:
> Averaging over the senses will result in poor results for the reasons
> you describe.  Intuitively, this makes sense; a sentence only
> expresses one of the senses, not all of them.  You might want to
> consider disambiguating the senses before computing the similarity.
> There are various ways you might do this:
> 1) First sense heuristic
You mean just something like wn.synsets(someword)[:1] instead of
wn.synsets(someword)?

> 2) Maximizing sentence internal similarity before computing similarity
Could you expand on this a little? Do you mean something like using
synsets that are most likely to co-occur? Without a synset n-gram
corpus, this doesn't seem that easy to me, but I might be missing the
point.

> Failing this, you could use the max similarity for each word (rather
> than averaging) to yield an implicit disambiguation.

I might try this.

Best,
Ed
> j...@princeton.edu
> http://www.cs.princeton.edu/~jbg

Steven Bird

Jun 26, 2009, 6:11:49 PM
to nltk-...@googlegroups.com
2009/6/27 Edward Grefenstette <egr...@gmail.com>:

>> 1) First sense heuristic
> You mean just something like wn.synsets(someword)[:1] instead of
> wn.synsets(someword)?

wn.synsets(someword)[0]

>> 2) Maximizing sentence internal similarity before computing similarity
> Could you expand on this a little?

He means you could pick a single sense for each ambiguous word in such
a way as to maximize the overall similarity of the senses contained in
the sentence. You could do this by brute force, enumerating all
combinations of senses (e.g. sense 4 of w[0], sense 1 of w[1], ...) and
then combining the n(n-1)/2 pairwise similarity scores somehow. There's also a
greedy algorithm for doing this called "lexical chaining".

-Steven Bird

Jordan Boyd-Graber

Jun 26, 2009, 6:18:26 PM
to nltk-...@googlegroups.com
>> 1) First sense heuristic
> You mean just something like wn.synsets(someword)[:1] instead of
> wn.synsets(someword)?

Rather, wn.synsets(word)[0] instead of wn.synsets(word). Limiting by
part of speech would also likely help.
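
i.e. something like:
===
from nltk.corpus import wordnet as wn

first_sense = wn.synsets('dog', pos=wn.NOUN)[0]   # first sense only, restricted to nouns
===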

>> 2) Maximizing sentence internal similarity before computing similarity
> Could you expand on this a little? Do you mean something like using
> synsets that are most likely to co-occur? Without a synset n-gram
> corpus, this doesn't seem that easy to me, but I might be missing the
> point.

One possible way of disambiguating a sentence is:
1. Enumerate all possible synset combinations
2. For each one, compute the sum of pairwise similarities
3. Take the synset combination that maximizes the internal similarity

(If you don't want to enumerate all synset combinations, you could use
a sampling approach instead.)
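
For short sentences this can be done by brute force; a rough sketch of
steps 1-3 (illustrative only, and exponential in sentence length):
===
from itertools import product, combinations
from nltk.corpus import wordnet as wn

def disambiguate(words, metric=wn.path_similarity):
    # one synset list per word, dropping words WordNet doesn't know
    sense_lists = [wn.synsets(w) for w in words]
    sense_lists = [senses for senses in sense_lists if senses]
    best_combo, best_score = None, -1.0
    for combo in product(*sense_lists):          # 1. every synset combination
        score = 0.0
        for s1, s2 in combinations(combo, 2):    # 2. sum of pairwise similarities
            try:
                score += metric(s1, s2) or 0.0
            except Exception:
                pass
        if score > best_score:                   # 3. keep the most coherent combination
            best_combo, best_score = combo, score
    return best_combo
===
The senses picked for each sentence can then be compared across the
two sentences instead of comparing every synset of every word.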

Good luck,

Jordan

Edward Grefenstette

Jun 26, 2009, 9:07:50 PM
to nltk-users
Cheers for the advice. Using only the top 10% of best-matching synset
pairs amongst word pairs improved results drastically in test cases.
I'm now running a full-blown version of my experiment to see if it
does the trick, and will certainly give the other suggestions you
mentioned a shot if I have some time.
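
For concreteness, the idea is roughly the following (a sketch, not the
exact code from my experiment):
===
from nltk.corpus import wordnet as wn
from nltk import word_tokenize
from nltk.corpus import stopwords

def scoreline_top(line1, line2, metric=wn.path_similarity, fraction=0.1):
    sw = set(stopwords.words('english'))
    syns1 = [s for w in word_tokenize(line1) if w not in sw for s in wn.synsets(w)]
    syns2 = [s for w in word_tokenize(line2) if w not in sw for s in wn.synsets(w)]
    scores = []
    for s1 in set(syns1):
        for s2 in set(syns2):
            try:
                scores.append(metric(s1, s2) or 0.0)   # None (e.g. across POS) counts as 0
            except Exception:
                scores.append(0.0)
    if not scores:
        return 0.0
    scores.sort(reverse=True)
    top = scores[:max(1, int(len(scores) * fraction))]   # keep only the best 10% of pair scores
    return sum(top) / len(top)
===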

Once again, greatly appreciated.

Best,
Edward

Pedro Marcal

Jun 27, 2009, 7:41:31 AM
to nltk-...@googlegroups.com, Edward Grefenstette
Hi Edward, Jordan,
I have been watching your discussion with interest. I have been wrestling with the disambiguation of words in a sentence for some time. I approach it from a context-free parse, which removes the need for the first-sense heuristic.
I have gone down the similarity approach and rejected it for the following reason: similarities in WordNet can only be measured within each POS, i.e. nouns with nouns and verbs with verbs, but not verbs with nouns (in any case, the latter has no meaning in a sentence construction).
An approach that seems promising is best explained with a simple first-order predicate discussion, such as your dog-bit-man N-V-N example.
NLTK gives the VerbNet types, e.g. something-verb-something; there are 33 such types. Using the top-level hypernym for each type, we can obtain
statistics for a case such as
[dog-N-physical ent]-[bite-V-Force]-[man-N-phys. ent]
By including the next level of hypernym, we have
[dog-N-animal-phys ent]-[bite-V-Force]-[man-N-human-phys-ent]
This level is usually sufficient to disambiguate all the synsets.
Schank suggested, through Conceptual Dependency, that there are 15 types of verbs; I have extended this with the three top-level hypernyms, viz. physical entity, property entity and mental entity, as a distinction, so I have 45 verb types. Similarly we have 50 noun types, so I needed to build statistics for 50 x 45 x 50 = 112,500 combinations.
In practice it's not as bad as that, since most of the N-V-N combinations don't make sense; it looks like 10,000 combinations should do it.
I have recently developed a Japanese-to-English translator (which Jordan kindly gave me some advice on) and noticed that Japanese is almost sense-free, that is, the kanji bypass any need for disambiguation (almost). I speculate that Chinese would allow an even better one-to-one word translation: Japanese uses only 1,780 kanji characters, while practical Chinese uses about 20,000-50,000. The Japanese legal restriction is referred to as Toyo Kanji; it is interesting to note that Toyo Kanji enabled Japan to quickly develop a literate population. That probably means that we could use Japanese rather than Chinese as a neutral AI language!
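
For reference, the top hypernym levels can be read off NLTK's WordNet interface along these lines (a rough sketch, not my actual implementation):
===
from nltk.corpus import wordnet as wn

def coarse_classes(word, pos=wn.NOUN, depth=3):
    # top `depth` hypernym levels for each sense of `word`;
    # each hypernym path runs from the root (entity) down to the synset itself
    classes = set()
    for syn in wn.synsets(word, pos=pos):
        for path in syn.hypernym_paths():
            classes.add(tuple(path[:depth]))
    return classes
===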
Regards,
Pedro

Jordan Boyd-Graber

Jul 3, 2009, 10:44:09 AM
to nltk-users
Sorry this is such a slow reply. I did some work on creating
cross-part-of-speech WordNet similarity judgments that might be useful:

http://wordnet.cs.princeton.edu/downloads/evocation.zip

We have a larger dataset that we're preparing; we'll be publishing it soon.

Cheers,

Jordan

2009/6/27 Pedro Marcal <marc...@cox.net>:

anand j

Jul 3, 2009, 10:57:16 AM
to nltk-...@googlegroups.com
Hi all,
This is the same as the program I am currently working on.
For starters I have done the WordNet similarity approach (nouns only), using a variance of the sum of WN similarities over all possible paths in the sense graph (the graph of all the senses in the sentence). This was part of my thesis, trying to differentiate puns... (which, as you can imagine, failed :P).
Anyway, I have since made a class out of this program and am trying to build functions to incorporate a couple of things:
   1. ConceptNet DB
   2. ResearchCyc DB
   3. Exploring the VerbNet part of NLTK.

P.S.: I am planning to put the class files and code on some open-source hosting site... suggestions welcome. Just doing the last-minute tests I can think of... I can mail it if you are in a hurry.

Cheers.
==============================================
Anand J
http://sites.google.com/a/cbcs.ac.in/students/anand
==============================================
The man who is really serious,
with the urge to find out what truth is,
has no style at all. He lives only in what is.
                 ~Bruce Lee

Pedro Marcal

Jul 7, 2009, 11:30:51 AM
to nltk-...@googlegroups.com, anand j
Hi Anand,
Your project sounds interesting. I suppose you have to look for unusual similarities. I am interested in your implementation of extracting information from ConceptNet.
Regards,
Pedro

Sangeetha

Jul 9, 2009, 11:26:40 PM
to nltk-users
Hello Edward

I noticed that your stopword filtering does not catch words that start
with a capital letter (for example "The" at the start of a sentence),
since the stopword list is all lowercase. In Python, strings have a
title() method, so you could add title-cased versions of the stopwords
to your stopword set.
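
For example, something like this (a minimal sketch):
===
from nltk.corpus import stopwords

sw = set(stopwords.words('english'))
sw |= set(w.title() for w in sw)   # also catches "The", "A", "It", ... at sentence starts
===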
Regards
Sangeetha