Hi all,
I found the S-Space package when I was looking for an implementation of the Clustering by Committee algorithm. I am very impressed with the scope of this project, but I'm afraid I don't know where to begin. I only know a little bit of Java (I usually program in Python), and S-Space looks a bit overwhelming.
My goal is to create clusters of similar adjectives. I'd like to categorize as many adjectives as possible, so that I can check for any two adjectives whether or not they fall in the same semantic domain. (It's no problem if words with other parts of speech are clustered as well, as long as I can perform that check. However, it might save some space if the algorithm disregards nouns and verbs.)
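To make the check concrete, here is a minimal Python sketch of what I would like to do once I have the clusters. It is not S-Space code: the `clusters` dictionary, the cluster ids, and the function names `build_membership` and `same_domain` are all made up for illustration, and I am assuming that a word may end up in more than one cluster.

```python
from collections import defaultdict


def build_membership(clusters):
    """Invert {cluster_id: words} into {word: set of cluster_ids}."""
    membership = defaultdict(set)
    for cluster_id, words in clusters.items():
        for word in words:
            membership[word].add(cluster_id)
    return dict(membership)


def same_domain(membership, a, b):
    """Return True if the two words share at least one cluster."""
    return bool(membership.get(a, set()) & membership.get(b, set()))


# Made-up clusters, purely for illustration (not real CBC output).
clusters = {
    "c1": {"red", "green", "blue"},
    "c2": {"happy", "sad", "blue"},
}
membership = build_membership(clusters)

print(same_domain(membership, "red", "blue"))
print(same_domain(membership, "red", "happy"))
```

Words that were never clustered simply have no entry in the index, so the check returns False for them.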
This is what I know from Pantel's thesis:
1. The algorithm works on a dependency-parsed corpus.
2. I need to create a feature vector for each word, with the features weighted by their PMI score.
3. Based on these vectors, a list of similar words is created for each word.
4. Based on those lists and the vectors, committees are formed and words are clustered.
5. I should use soft clustering, so that words can be in multiple clusters.
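To show how I currently understand steps 2 and 3, here is a rough Python sketch (again, not S-Space code). It assumes the parsed corpus has already been reduced to (word, feature) pairs; the feature labels such as "amod:car" and the function names are invented for the example. It uses plain PMI, log(P(w,f) / (P(w)P(f))), keeps only positive scores (my own simplification), and compares words by cosine similarity. It does not attempt the committee step.

```python
import math
from collections import Counter, defaultdict


def pmi_vectors(pairs):
    """Build {word: {feature: pmi}} from (word, feature) co-occurrence pairs."""
    pairs = list(pairs)
    pair_counts = Counter(pairs)
    word_counts = Counter(w for w, _ in pairs)
    feat_counts = Counter(f for _, f in pairs)
    total = len(pairs)
    vectors = defaultdict(dict)
    for (word, feat), n in pair_counts.items():
        # pmi = log(P(w,f) / (P(w) * P(f))) = log(n * total / (c(w) * c(f)))
        pmi = math.log(n * total / (word_counts[word] * feat_counts[feat]))
        if pmi > 0:  # assumption: drop non-positive associations
            vectors[word][feat] = pmi
    return dict(vectors)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v[f] for f, x in u.items() if f in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
        math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def most_similar(vectors, word, k=5):
    """Return the k words most similar to `word`, best first."""
    target = vectors.get(word, {})
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy (adjective, dependency feature) pairs, invented for illustration.
pairs = [
    ("red", "amod:car"), ("red", "amod:apple"),
    ("green", "amod:car"), ("green", "amod:apple"),
    ("happy", "amod:child"), ("happy", "amod:face"),
    ("sad", "amod:child"), ("sad", "amod:face"),
]
vectors = pmi_vectors(pairs)
print(most_similar(vectors, "red", k=1))
```

On this toy data "green" comes out as the nearest neighbour of "red", because the two words share exactly the same features.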
From the source code, it seems that ClusteringByCommittee.java takes care of steps 3 and 4 (though I can't yet figure out how to initialize this class). As for steps 1 and 2, there are many files with the words "dependency", "matrix", or "vector" in their names, and it is unclear to me which ones I should use to get everything up and running.
So: can anyone please help me, ideally with some basic example code, and show me how to load a corpus, run the CBC algorithm, and produce a list of clusters? Of course, I'd be happy to upload my final code to GitHub as an example for future users.