Semantic Representation

sam

Aug 23, 2013, 5:14:06 PM

to semanti...@googlegroups.com

Hi there,

I'm new to this group, and first of all, thank you for this great work.

I've figured out how to build a semantic vectors index and perform some other tasks, such as comparing words, clustering, and so on.

But one thing I couldn't understand is how to retrieve the clusters for the whole corpus. In other words, I want to obtain the synonyms of all similar words and keep the centroid of each cluster (group of words) as a key for them. The main idea is to reduce the dimensionality of the term space by using semantic terms instead of a bag of words (BOW).

Any ideas, please?

Thank you in advance.

Dominic

Aug 24, 2013, 7:35:45 PM

to semanti...@googlegroups.com

Hi there,

There is code to write out centroids of clusters, but it's not very well integrated.

See this thread for the backstory:

https://groups.google.com/forum/#!searchin/semanticvectors/centroids/semanticvectors/lmwKQitif10/Yh7T8PTkpAAJ

To get this working, you should compile from source (see https://code.google.com/p/semanticvectors/wiki/InstallationInstructions#Compiling_from_Source_-_Package_Installation), and then enable the centroid printing by uncommenting the following line:
https://code.google.com/p/semanticvectors/source/browse/trunk/src/pitt/search/semanticvectors/ClusterVectorStore.java#203
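
To illustrate the idea being asked about (cluster the term vectors, then use each cluster as a single "semantic term"), here is a minimal sketch in plain Python. It is not the ClusterVectorStore implementation: the toy vectors, the function names, and the deterministic seeding are all assumptions made for the example.

```python
import math
from collections import Counter

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def cosine(a, b):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20):
    """Plain k-means over unit vectors, assigning by highest cosine similarity."""
    terms = sorted(vectors)
    unit = {t: normalize(vectors[t]) for t in terms}
    # Assumption: seed with the first k terms in sorted order so the run is repeatable.
    centroids = [unit[t] for t in terms[:k]]
    assignment = {}
    for _ in range(iterations):
        new_assignment = {
            t: max(range(k), key=lambda c: cosine(unit[t], centroids[c]))
            for t in terms
        }
        if new_assignment == assignment:
            break  # stable clusters
        assignment = new_assignment
        for c in range(k):
            members = [unit[t] for t in terms if assignment[t] == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return assignment, centroids

def bag_of_clusters(tokens, assignment):
    """Replace each known token by its cluster id and count the ids."""
    return Counter(assignment[t] for t in tokens if t in assignment)

# Toy 2-dimensional "term vectors" (made up for illustration).
term_vectors = {
    "car": [0.9, 0.1], "automobile": [0.8, 0.2], "truck": [1.0, 0.0],
    "apple": [0.1, 0.9], "banana": [0.0, 1.0], "pear": [0.2, 0.8],
}
assignment, centroids = kmeans(term_vectors, k=2)
features = bag_of_clusters(["car", "truck", "apple", "unknown"], assignment)
```

With this approach a document is described by k cluster counts rather than one count per distinct term, which is the dimensionality reduction sam describes. The centroids list holds the "key" vector for each cluster.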

Hope that helps. I will be on vacation from tomorrow for a week, so apologies if I'm slow in getting back to questions. I'll be much more responsive in September.

Best wishes,

Dominic

sam

Aug 25, 2013, 3:07:16 PM

to semanti...@googlegroups.com

Thank you Dominic,

I've tried what you explained. It works for docvectors, but it doesn't seem to work for termvectors. When I applied it to termvectors, I got this error:

Aug 26, 2013 3:04:38 AM pitt.search.semanticvectors.ClusterVectorStore main
INFO: Reading vectors into memory ...
Aug 26, 2013 3:04:38 AM pitt.search.semanticvectors.ClusterVectorStore main
INFO: Clustering vectors ...
Aug 26, 2013 3:04:38 AM pitt.search.semanticvectors.ClusterResults kMeansCluster
INFO: Initializing clusters ...
Aug 26, 2013 3:04:38 AM pitt.search.semanticvectors.ClusterResults kMeansCluster
INFO: Iterating k-means assignment ...
Aug 26, 2013 3:05:02 AM pitt.search.semanticvectors.ClusterResults kMeansCluster
INFO: Got to stable clusters ...
Exception in thread "main" java.lang.NullPointerException
    at java.util.Hashtable.hash(Hashtable.java:262)
    at java.util.Hashtable.put(Hashtable.java:547)
    at pitt.search.semanticvectors.ClusterVectorStore.clusterOverlapMeasure(ClusterVectorStore.java:112)
    at pitt.search.semanticvectors.ClusterVectorStore.main(ClusterVectorStore.java:205)
    at BuildSemanticIndex.main(BuildSemanticIndex.java:65)

This happens even when I specify a small number of clusters.

Thank you.

I hope you enjoy your vacation :)

Dominic

Aug 26, 2013, 12:45:31 PM

to semanti...@googlegroups.com

Uh oh. Try commenting out everything from the call to clusterOverlapMeasure at pitt.search.semanticvectors.ClusterVectorStore.main(ClusterVectorStore.java:205) to the end of that code block at line 212.

It looks like this is specific to some cluster comparison work I was doing on the King James Bible corpus, so it's presuming some hard-coded pathnames.

Best wishes,

Dominic

sam

Aug 27, 2013, 2:49:33 AM

to semanti...@googlegroups.com

Hi Dominic,

Yes, it is working now. Thank you.

I have one more question: can using clusterOverlapMeasure improve term clustering? If so, how can it be applied to term clustering, or does it only work for document clustering? I tried making some changes to the code, but unfortunately I didn't succeed.

Regards,

sam

Aug 27, 2013, 3:43:56 AM

to semanti...@googlegroups.com

Also, why do I get no real clusters when I use idf for term weighting and run ClusterVectorStore on a termvectors.bin generated by LSA? It puts all the terms in a single cluster and leaves the others empty.

Dominic

Sep 3, 2013, 12:57:42 PM

to semanti...@googlegroups.com

Hi Sam,

Apologies for the delayed response - I've been away on vacation.

clusterOverlapMeasure wasn't designed to enhance clustering itself, but to compare the results of clustering with a predefined / external notion of classes (currently hardcoded to the assumption that all the chapters from a particular book form a "class"). As with other forms of external supervision, I guess this could be used to enhance clustering as well, but I haven't really thought about this.
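
As a rough illustration of comparing a clustering against external classes, here is a small purity score in Python. This is a generic measure and not necessarily what clusterOverlapMeasure computes. The document ids of the form Book_Chapter and the function name are assumptions for the example, chosen to mirror the "all chapters of a book form a class" convention mentioned above.

```python
from collections import Counter, defaultdict

def purity(cluster_of, class_of):
    """Fraction of items that belong to the majority class of their cluster."""
    by_cluster = defaultdict(Counter)
    for item, cluster in cluster_of.items():
        by_cluster[cluster][class_of[item]] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(cluster_of)

# Hypothetical clustering output: document id -> cluster id.
cluster_of = {"Genesis_1": 0, "Genesis_2": 0, "Exodus_1": 1, "Exodus_2": 0}

# External classes: the book name is the part of the id before the underscore.
class_of = {doc: doc.split("_")[0] for doc in cluster_of}

score = purity(cluster_of, class_of)
```

A score of 1.0 would mean every cluster contains chapters from a single book. For term vectors there is no such built-in class label, so an external labeling of terms would have to be supplied before a measure like this could be applied.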

I can corroborate your other result: I also get all terms mapped to the same cluster when using ClusterVectorStore with LSA and idf. But when I run the same thing to cluster the document vectors, I get a variety of clusters. I also get a variety of clusters when I run with random projection rather than LSA. I don't know why this particular anomaly would occur with LSA, idf, and term vectors.
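
One check that might help narrow this down (a suggestion only, not a diagnosis confirmed in this thread) is to look at the cluster sizes and at how closely the vectors point in the same direction. If nearly every vector has a cosine close to 1 with the overall mean direction, a cosine-based k-means has little to separate. The sketch below uses made-up vectors and hypothetical helper names.

```python
import math
from collections import Counter

def unit(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def mean_cosine_to_centroid(vectors):
    """Average cosine similarity between each vector and the mean direction."""
    units = [unit(v) for v in vectors]
    mean = unit([sum(col) / len(units) for col in zip(*units)])
    return sum(sum(a * b for a, b in zip(u, mean)) for u in units) / len(units)

def cluster_sizes(assignment, k):
    """Number of items in each of the k clusters, including empty ones."""
    counts = Counter(assignment.values())
    return [counts.get(c, 0) for c in range(k)]

# Made-up vectors: one set is nearly collinear, the other is spread out.
collinear = [[10.0, 0.1], [9.0, 0.2], [11.0, 0.1]]
spread = [[1.0, 0.0], [0.0, 1.0]]

collinear_score = mean_cosine_to_centroid(collinear)
spread_score = mean_cosine_to_centroid(spread)
sizes = cluster_sizes({"a": 0, "b": 0, "c": 0}, k=3)
```

Running the same check on the LSA term vectors with and without idf, and on the random projection vectors, would show whether the collapsed case differs in this respect.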

Best wishes,

Dominic

sam

Sep 9, 2013, 10:01:51 AM

to semanti...@googlegroups.com

Hi Dominic,

So that is why.

Thanks for your response.

Regards,