Intro to Unsupervised Clustering: resources online?

Susan

Jan 4, 2013, 12:46:07 PM

to nltk-...@googlegroups.com

Hi NLTK Users,

This is my first post.

For example, "college", "schoolwork", and "academy" belong in the same cluster. The words "essay", "scholarships", and "money" also belong in the same cluster.

Ambarish Jash

Jan 4, 2013, 12:55:10 PM

to nltk-...@googlegroups.com

Hi Susan

It depends on what kind of relationship you are looking for, e.g. lexical, ontological, or sentiment-based (positive/negative). Whatever the relationship, you should have a few words/phrases to train on. If you do not have a training set, you can start with a seed set and expand it with a bootstrapping technique. Once you have some training data, create a graph where each node is a word/phrase and two nodes are joined if they have a relationship (you need not know the relationship between every pair of words/phrases). After that, you can use label propagation to label the unknown words/phrases, or use diffusion maps on the graph (a nonlinear dimensionality reduction) for clustering.
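As a rough illustration of the graph-plus-label-propagation idea, here is a minimal pure-Python sketch. The edges, the seed labels ("education", "funding"), and the function name `propagate_labels` are all made up for the example, and the majority-vote update is just one simple variant of label propagation, not a reference implementation.

```python
from collections import Counter

def propagate_labels(edges, seeds, max_iters=20):
    """Spread seed labels over an undirected word graph by neighbor majority vote."""
    # Build an adjacency map from the edge list.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    labels = dict(seeds)  # seed words keep their labels throughout
    for _ in range(max_iters):
        updates = {}
        for word in sorted(neighbors):
            if word in seeds:
                continue
            # Count the labels of neighbors that are already labeled.
            votes = Counter(labels[n] for n in neighbors[word] if n in labels)
            if votes:
                # Iterate labels in sorted order so ties break deterministically.
                best = max(sorted(votes), key=lambda lab: votes[lab])
                if labels.get(word) != best:
                    updates[word] = best
        if not updates:
            break  # nothing changed, so the labeling is stable
        labels.update(updates)
    return labels

# Hypothetical graph: an edge means "these two words are related".
edges = [
    ("college", "schoolwork"),
    ("schoolwork", "academy"),
    ("scholarships", "money"),
    ("money", "essay"),
]
# Hypothetical seed set: one labeled word per cluster.
seeds = {"college": "education", "scholarships": "funding"}
labels = propagate_labels(edges, seeds)
```

With real data the edges would come from whatever relationship you chose (co-occurrence, WordNet links, and so on), and a weighted vote would probably work better than a plain count.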

Hope this helps.

--
Ambarish Jash

Nick R

Jan 4, 2013, 12:59:35 PM

to nltk-...@googlegroups.com

You're going to need a knowledge base so that you can find where its items occur in the text. Once you have that, you can use this.

Import re
End_Location = [m.start() for m in text.read()]
Start_location = [m.end() for m in text.read()]

That will tell you where the knowledge occurs in the string. What you want is to find what occurs in its proximity. One approach to finding the most probable sequences occurring near the knowledge is to assign an arbitrary numerical parameter to look at. Another approach is pragmatic and requires a context base. I can explain the latter approach more if you're interested.
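One way to read the "arbitrary numerical parameter" approach is as a fixed window of tokens around each hit. The sketch below assumes that reading; the function name `context_words`, the `window` parameter, and the sample sentence are all hypothetical.

```python
import re
from collections import Counter

def context_words(text, term, window=3):
    """Count words appearing within `window` tokens of each occurrence of `term`."""
    # Crude tokenization: lowercase the text and keep runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == term:
            lo = max(0, i - window)
            # Tokens just before and just after the hit, excluding the hit itself.
            nearby = tokens[lo:i] + tokens[i + 1:i + 1 + window]
            counts.update(nearby)
    return counts

sample = "Write an essay to win the scholarship money for college."
counts = context_words(sample, "scholarship", window=2)
```

Words that keep turning up in the window across many documents are candidates for the same cluster as the term. You would likely want to drop stopwords such as "the" and "for" before counting.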

Nick

Nick R

Jan 4, 2013, 1:21:40 PM

to nltk-...@googlegroups.com

The code I provided is wrong, excuse me.

[m.start() for m in re.finditer(pattern, text)]  # pattern is the term to search for

I'll have to check if this is right too when I get home.

Nick R

Jan 4, 2013, 2:10:00 PM

to nltk-...@googlegroups.com

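import re

knowledge = ['scholarship']
for term in knowledge:  # text is assumed to be the document as a single string
    if term in text:
        # Start offset of every occurrence of the term in text
        start_locations = [m.start() for m in re.finditer(re.escape(term), text)]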