Intro to Unsupervised Clustering: resources online?


Susan

Jan 4, 2013, 12:46:07 PM
to nltk-...@googlegroups.com
    Hi NLTK Users,
    This is my first post. 
    How do you write a program that groups related words into clusters? Which online tutorials would point me in the right direction? I'm new to ML and NLP and to the Python library nltk. Is it the right library for this task? If there is pre-written Python code for clustering, I'd like to know about it, but I'd also like to understand the top-level process and steps (i.e., the awesome math steps!) for writing a clustering function myself, along with the tools I should get familiar with. I'm excited to learn more about clustering methods and the associated programming tools. My language of choice is Python.
    My sample data looks like this: "college", "schoolwork", and "academy" belong in the same cluster. The words "essay", "scholarships", and "money" also belong in the same cluster.


Ambarish Jash

Jan 4, 2013, 12:55:10 PM
to nltk-...@googlegroups.com
Hi Susan
It depends on what kind of relationship you are looking for, e.g. lexical, ontological, or sentiment-based (positive/negative sentiment). Whatever the relationship, you should have a few words/phrases to train on. If you do not have a training set, you can always start with a seed set and expand it using a bootstrapping technique. Once you have some training data, you can start by creating a graph where each node is a word/phrase and two nodes are joined if they have a relationship (you need not know the relationship between every pair of words/phrases). After that you can use label propagation (to label the unknown words/phrases) or diffusion maps on the graph (nonlinear dimensionality reduction) for clustering.
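
To make the graph and label propagation steps concrete, here is a minimal sketch in plain Python. It is a simplified neighbor-majority-vote variant of label propagation and not a reference implementation. The edge list, the seed labels ("education", "funding"), and the function name propagate_labels are all assumptions made up for illustration, using the words from the original question.

```python
from collections import Counter

def propagate_labels(edges, seeds, max_iters=20):
    """Spread seed labels over an undirected word graph by neighbor majority vote."""
    # Build an adjacency map from the undirected edge list.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    labels = dict(seeds)  # seed labels stay fixed throughout
    for _ in range(max_iters):
        updates = {}
        for node in sorted(neighbors):
            if node in seeds:
                continue
            # Only neighbors that already carry a label get to vote.
            votes = Counter(labels[n] for n in neighbors[node] if n in labels)
            if votes:
                # Ties are broken alphabetically so the result is deterministic.
                best = min(votes, key=lambda lab: (-votes[lab], lab))
                if labels.get(node) != best:
                    updates[node] = best
        if not updates:
            break  # nothing changed, so the labeling is stable
        labels.update(updates)
    return labels

# Hypothetical relationships between the example words (assumed, not given data).
edges = [
    ("college", "academy"),
    ("college", "schoolwork"),
    ("essay", "scholarships"),
    ("scholarships", "money"),
]
seeds = {"college": "education", "money": "funding"}
labels = propagate_labels(edges, seeds)
```

With these assumed edges, every word connected to "college" ends up in one group and every word connected to "money" in the other. In practice the edges would come from whatever relationship you chose in the first step.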

Hope this helps.




--
Ambarish Jash

Nick R

Jan 4, 2013, 12:59:35 PM
to nltk-...@googlegroups.com

You're going to need a knowledge base so that you can find where its items occur in the text. Once you have that, you can use this:

import re
End_Location = [m.start() for m in text.read()]
Start_location = [m.end() for m in text.read()]

That will tell you where the knowledge-base items occur in the string. What you want is to find what occurs in their proximity. One approach to finding the most probable sequences occurring near an item is to pick an arbitrary numerical parameter to look at, for example a fixed window around each occurrence. Another approach is pragmatic and requires a context base. I can explain the latter approach more if you're interested.
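
As a rough sketch of the first approach, the snippet below counts the words that fall within a fixed number of tokens of each occurrence of a knowledge-base term. The window size, the simple regex tokenizer, the sample sentence, and the function name context_words are all assumptions chosen for illustration.

```python
import re
from collections import Counter

def context_words(text, term, window=3):
    """Count the words within `window` tokens of each occurrence of `term`."""
    # Lowercase and split into simple word tokens (a deliberately naive tokenizer).
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == term:
            lo = max(0, i - window)
            # Tokens before and after the match, excluding the match itself.
            nearby = tokens[lo:i] + tokens[i + 1:i + 1 + window]
            counts.update(nearby)
    return counts

# Made-up sample text; the window of 2 is the arbitrary numerical parameter.
sample = ("She wrote an essay to win a scholarship and "
          "the scholarship money paid for college")
nearby = context_words(sample, "scholarship", window=2)
```

Words that keep showing up near the same term across a larger corpus are candidates for the same cluster. Filtering out stop words such as "and" or "the" would likely be needed before the counts are useful.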

Nick


Nick R

Jan 4, 2013, 1:21:40 PM
to nltk-...@googlegroups.com

The code I provided is wrong, excuse me.

[m.start() for m in re.finditer(pattern, text)]

I'll have to check if this is right too when I get home.

Nick R

Jan 4, 2013, 2:10:00 PM
to nltk-...@googlegroups.com

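import re

knowledge = ['scholarship']
for term in knowledge:
    if term in text:
        # character offset where each occurrence of the term starts
        start_locations = [m.start() for m in re.finditer(re.escape(term), text)]

And vice versa for the end locations: use m.end() in place of m.start().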

Hope that helps :)
