I was wondering whether it is possible for me to use NLTK + WordNet to
group words (nouns) together by similar meaning?
Assuming I have 2000 words or topics, is it possible for me to group
them together according to similar meaning using NLTK?
So that at the end of the day I would have different groups of words
that are similar in meaning? Can that be done in NLTK, and possibly be
used to detect salient emerging patterns (trends in topics, etc.)?
Is there a further need for a word classifier based on the CMU BOW
toolkit to classify the words into categories, or would the groups above
be good enough? Is there a need to classify words further?
How would one classify words in NLTK effectively?
I really hope you can enlighten me.
FM
You could compute pairwise WordNet similarity, so that each word/topic
is represented as a vector of distances to every other word. These
distances could then be discretized, giving each vector a form like
[0,2,3,1,0,0,2,1,3,...]. The vectors could then be clustered using
one of the methods in the NLTK cluster package.
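For example, something along these lines (just a rough sketch: the word
list, the choice of path_similarity, and k=2 clusters are arbitrary
illustrative choices, and I've skipped the discretization step; it
assumes the WordNet corpus has been downloaded and numpy is installed):

    import numpy
    from nltk.corpus import wordnet as wn
    from nltk.cluster import KMeansClusterer, euclidean_distance

    words = ['dog', 'cat', 'car', 'truck', 'apple', 'banana']

    def noun_synset(word):
        # Take the first noun synset, if any (a simplification).
        synsets = wn.synsets(word, pos=wn.NOUN)
        return synsets[0] if synsets else None

    def similarity(w1, w2):
        s1, s2 = noun_synset(w1), noun_synset(w2)
        if s1 is None or s2 is None:
            return 0.0
        return s1.path_similarity(s2) or 0.0

    # Each word becomes a vector of its similarities to every word in the list.
    vectors = [numpy.array([similarity(w, other) for other in words])
               for w in words]

    clusterer = KMeansClusterer(2, euclidean_distance, repeats=10,
                                avoid_empty_clusters=True)
    assignments = clusterer.cluster(vectors, assign_clusters=True)
    for word, label in zip(words, assignments):
        print(word, label)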
> So that at the end of the day I would have different groups of words
> that are similar in meaning? Can that be done in NLTK, and possibly be
> used to detect salient emerging patterns (trends in topics, etc.)?
This suggests a temporal dimension, which might mean recomputing the
clusters as more words or topics come in.
It might help to read the NLTK book sections on WordNet and on text
classification, and also some of the other cited material.
-Steven Bird
I just want to double-check that I am on the right track with
grouping/clustering similar documents/sentences/words.
I hope you can clarify whether I am on the right track or off on a tangent.
Assume that all my words are nouns, and that I have 10 bags of words,
each containing about 8 words.
I pool all the words together to create a master list of words.
Next I will compare each of my 10 bags of words with the master list.
By going through each bag I will know which of its words correspond to
entries in the master list, and I will be able to create for each bag a
binary vector, e.g. [1,0,1,0,1,0,1,0,1,0,1,0]. The dimension of the
vector depends on the number of words in the master list.
At the end of the day I will have created a bag-of-words matrix, with one such vector per bag.
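In code, what I have in mind is roughly this (the bags below are made up
just to illustrate the idea):

    bags = [
        ['economy', 'market', 'trade'],
        ['market', 'stock', 'price'],
        ['football', 'league', 'goal'],
    ]

    # Pool every word into a master list, in a fixed order.
    master = sorted(set(word for bag in bags for word in bag))

    # One binary vector per bag: 1 if the master-list word occurs in the bag.
    vectors = [[1 if word in bag else 0 for word in master] for bag in bags]

    for bag, vector in zip(bags, vectors):
        print(bag, vector)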
Am I right to say that with this matrix I can proceed to cluster the bags
using NLTK's clustering tools, or any other clustering tool such as
CLUTO, Weka, etc.?
Hope someone can enlighten me further. Many thanks.
Next, another question: what about words that have the same semantic meaning?
Prior to creating the vector matrix for clustering, is there value in
doing something with WordNet, like finding the root hypernym for all the
words, so that I can reduce the number of words in the master list and
thereby reduce the dimensionality of the vectors for each bag of words?
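Something like the sketch below is what I have in mind, although I
realise the absolute root hypernym would be the same (entity.n.01) for
every noun, so perhaps a fixed depth in the hypernym hierarchy is more
useful; the depth of 4, the example words, and taking only the first
synset of each word are all arbitrary choices on my part:

    from nltk.corpus import wordnet as wn

    def generalise(word, depth=4):
        # Map a noun to an ancestor hypernym a fixed number of steps below the root.
        synsets = wn.synsets(word, pos=wn.NOUN)
        if not synsets:
            return word
        path = max(synsets[0].hypernym_paths(), key=len)  # root ... word
        return path[min(depth, len(path) - 1)].name()

    for word in ['dog', 'cat', 'car', 'truck']:
        print(word, '->', generalise(word))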
I hope you can share your thoughts.
FM
--
Ng Foo Meng
Just gonna keep mugging and playing with NLTK!
FM
On Tue, Mar 23, 2010 at 2:21 PM, Brian <refle...@gmail.com> wrote:
> Hi,
>
> I recently used the NLTK WordNet interface to do some of the things you
> suggest. I computed a bunch of pairwise similarity metrics based on a set of
> words and output them in a matrix format that is suitable for clustering. I
> also did some hypernym stuff, like plotting the hypernym hierarchy of these
> words using graphviz. I used my own clustering tool, but I imagine the NLTK
> clustering facilities are compatible with the output. Maybe you will find it
> useful - I put it up on the web for you. No guarantees!
>
> http://grey.colorado.edu/mingus/index.php/Objrec_Wordnet.py
>
> Cheers,
>
> Brian Mingus
> Professional Research Assistant
> Computational Cognitive Neuroscience Lab
> University of Colorado at Boulder
FM
--
Ng Foo Meng