Hi,
I’ve been exploring the .sspace vectors I’ve built using LSA and Random Indexing, using the SemanticSpaceExplorer class. Is there a way to extend this so that I can determine document clusters? Also, given a document or a bag of words, is there a way to find the closest matching documents?
Essentially, I would like to do the following:
Initial creation phase:
[ corpus of documents] -> [create sspace]
Subsequent exploration phase:
[ document ] -> [ bag of words ]
[ bag of words ] -> [ other similar documents ]
Thanks.
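One way to picture the exploration phase described above is the following minimal Java sketch. It does not use the S-Space API: the two maps (term vectors and document vectors) are hypothetical stand-ins for whatever the .sspace file and the document space provide, and the class and method names are invented for illustration. It sums the vectors of the words in the bag and ranks documents by cosine similarity to that sum.

```java
import java.util.*;

// Hypothetical helper, not part of S-Space: plain maps stand in for the term and document spaces.
class BagOfWordsSearch {

    // Sum the vectors of all words that exist in the term space; unknown words are skipped.
    static double[] composeVector(Collection<String> bagOfWords,
                                  Map<String, double[]> termVectors, int dims) {
        double[] sum = new double[dims];
        for (String word : bagOfWords) {
            double[] v = termVectors.get(word);
            if (v == null) continue;
            for (int i = 0; i < dims; i++) sum[i] += v[i];
        }
        return sum;
    }

    // Cosine similarity; returns 0 when either vector has zero length.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0;
        return dot / Math.sqrt(normA * normB);
    }

    // Return the ids of the k documents whose vectors are most similar to the query vector.
    static List<String> nearestDocuments(double[] query,
                                         Map<String, double[]> docVectors, int k) {
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, double[]> e : docVectors.entrySet()) {
            scored.add(new AbstractMap.SimpleEntry<>(e.getKey(), cosine(query, e.getValue())));
        }
        scored.sort((x, y) -> Double.compare(y.getValue(), x.getValue()));
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < Math.min(k, scored.size()); i++) ids.add(scored.get(i).getKey());
        return ids;
    }
}
```

This scans every document for each query, which is fine for exploration but scales linearly with the number of documents.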
To unsubscribe from this group, send email to s-space-users+unsubscribegooglegroups.com or reply to this email with the words "REMOVE ME" as the subject.
Hi David,
I have looked at the HierarchicalAgglomerativeClustering class in the S-Space package, which is very interesting. However, I noticed that the algorithm is at least O(N**2), and even for fairly small inputs (30K rows, 200 dimensions), the computeSimilarityMatrix() method does not complete in a reasonable time. Do you have recommended sizes and expected times for that portion of clustering? Are there options that may sacrifice a bit of accuracy but work faster?
Thanks,
-venkat
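To put rough numbers on the quadratic cost mentioned above, here is a back-of-the-envelope sketch in Java. It makes no claim about how computeSimilarityMatrix() is implemented; it only counts the work and storage that any full pairwise similarity matrix over dense vectors implies. The class and method names are invented for illustration.

```java
// Hypothetical estimator for a full pairwise similarity matrix; not part of S-Space.
class SimilarityMatrixCost {

    // Number of unordered row pairs: N * (N - 1) / 2.
    static long pairCount(long rows) {
        return rows * (rows - 1) / 2;
    }

    // Multiply-adds for one dot product per pair over dense vectors of the given width.
    static long multiplyAdds(long rows, long dims) {
        return pairCount(rows) * dims;
    }

    // Bytes needed to hold a full N x N matrix at the given value width.
    static long denseMatrixBytes(long rows, int bytesPerValue) {
        return rows * rows * bytesPerValue;
    }
}
```

For 30,000 rows and 200 dimensions this gives 449,985,000 pairs, about 90 billion multiply-adds, and 7.2 GB for a full matrix of 8-byte doubles, so memory pressure may be as much of a problem as CPU time.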
Hi Keith,
Thanks very much for your pointer on Cluto clustering. It is exactly what we were looking for. When we tried to run it, it failed on the Java side with the following error:
Apr 8, 2010 9:44:25 AM edu.ucla.sspace.clustering.ClutoClustering cluster
WARNING: Cluto exited with error status. -1073741819 stderr:
Exception in thread "main" java.lang.Error: Clustering failed
at edu.ucla.sspace.clustering.ClutoClustering.cluster(ClutoClustering.java:171)
at edu.ucla.sspace.clustering.ClutoClustering.agglomerativeCluster(ClutoClustering.java:76)
at pitt.search.semanticvectors.HierarchicalClustering.cluster(HierarchicalClustering.java:47)
at pitt.search.semanticvectors.HierarchicalClusteringTest.main(HierarchicalClusteringTest.java:48)
We then created the Cluto input matrix and ran the command from the command line. It ran for a few seconds and then exited without generating the output file. Do you know if we are using the right Cluto download, and do you have any thoughts on what could be going wrong here?
I really appreciate all the help!
D:\dev\tools\cluto\cluto-2.1.1\Win32>vcluster -clmethod=agglo -clustfile=d:/temp/clust-out.matrix d:/temp/c.matrix 100
********************************************************************************
vcluster (CLUTO 2.1.1) Copyright 2001-03, Regents of the University of Minnesota
Matrix Information -----------------------------------------------------------
Name: d:/temp/c.matrix, #Rows: 36836, #Columns: 200, #NonZeros: 2167030
Options ----------------------------------------------------------------------
CLMethod=AGGLO, CRfun=UPGMA, SimFun=Cosine, #Clusters: 100
RowModel=None, ColModel=IDF, GrModel=SY-DIR, NNbrs=40
Colprune=1.00, EdgePrune=-1.00, VtxPrune=-1.00, MinComponent=5
CSType=Best, AggloFrom=0, AggloCRFun=UPGMA, NTrials=10, NIter=10
Solution ---------------------------------------------------------------------
-venkat
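A note on the exit status in the warning above: negative statuses from Windows processes are usually easier to read in hexadecimal. The small Java sketch below (a hypothetical helper, not part of S-Space or Cluto) does the conversion. The value -1073741819 comes out as 0xC0000005, which is the Windows access-violation status code, so this looks more like the vcluster binary crashing than like it rejecting the input.

```java
// Hypothetical helper for reading Windows process exit statuses.
class ExitStatus {

    // Format a signed 32-bit exit status as an unsigned hexadecimal NTSTATUS-style value.
    static String toHex(int status) {
        return String.format("0x%08X", status);
    }
}
```

Calling ExitStatus.toHex(-1073741819) returns "0xC0000005".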
Thanks David. I changed the Cluto input format to MatrixIO.Format.CLUTO_DENSE, but it fails the same way. The first argument to the ClutoClustering.cluster() method is still a matrix created with the SparseOnDiskMatrix() constructor. Is there a way to create a “dense format” matrix as input?
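For testing vcluster by hand, a dense input file can also be written without going through S-Space at all. The sketch below is plain Java with invented names, and it assumes the dense layout as I understand it from the CLUTO manual: a first line with the row and column counts, then one line per row with all column values separated by spaces. Please check that against the manual for the Cluto version in use.

```java
import java.io.IOException;
import java.io.Writer;

// Hypothetical writer for CLUTO's dense matrix layout (assumed: "rows cols" header, one row per line).
class ClutoDenseWriter {

    static void write(double[][] rows, Writer out) throws IOException {
        int cols = rows.length == 0 ? 0 : rows[0].length;
        out.write(rows.length + " " + cols + "\n");
        for (double[] row : rows) {
            if (row.length != cols) {
                throw new IllegalArgumentException("all rows must have " + cols + " columns");
            }
            StringBuilder line = new StringBuilder();
            for (int c = 0; c < cols; c++) {
                if (c > 0) line.append(' ');
                line.append(row[c]);
            }
            out.write(line.append('\n').toString());
        }
        out.flush();
    }
}
```

Wrapping a FileWriter in a BufferedWriter and passing it to write() would produce a file that can be given to vcluster directly, which separates problems in the matrix export from problems in the binary.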
That is exactly what I tried, and it still failed, with no output and the error as shown below.
David,
I upgraded from Cluto 2.1.1 to Cluto 2.1.2 and used their MSWIN-x86_64 binary and that version seems to work fine.
D:\dev\tools\cluto\cluto-2.1.2\MSWIN-x86_64>vcluster -clmethod=agglo -clustfile=cout.matrix d:\temp\c.matrix 10
********************************************************************************
vcluster (CLUTO 2.1.2) Copyright 2001-06, Regents of the University of Minnesota
Matrix Information -----------------------------------------------------------
Name: d:\temp\c.matrix, #Rows: 36836, #Columns: 200, #NonZeros: 2167030
Options ----------------------------------------------------------------------
CLMethod=AGGLO, CRfun=UPGMA, SimFun=Cosine, #Clusters: 10
RowModel=None, ColModel=IDF, GrModel=SY-DIR, NNbrs=40
Colprune=1.00, EdgePrune=-1.00, VtxPrune=-1.00, MinComponent=5
CSType=Best, AggloFrom=0, AggloCRFun=UPGMA, NTrials=10, NIter=10
Solution ---------------------------------------------------------------------
------------------------------------------------------------------------
10-way clustering: [UPGMA=0.00e+000] [36836 of 36836]
------------------------------------------------------------------------
cid Size ISim ISdev ESim ESdev |
------------------------------------------------------------------------
0 7119 +0.095 +0.074 -0.000 +0.013 |
1 1702 +0.106 +0.058 -0.002 +0.013 |
2 1845 +0.156 +0.069 +0.001 +0.013 |
3 1388 +0.148 +0.073 +0.001 +0.013 |
4 13621 +0.049 +0.033 +0.004 +0.017 |
5 382 +0.173 +0.074 -0.004 +0.013 |
6 358 +0.286 +0.078 -0.000 +0.011 |
7 568 +0.271 +0.125 -0.003 +0.012 |
8 4384 +0.124 +0.071 +0.004 +0.015 |
9 5469 +0.086 +0.060 +0.006 +0.016 |
------------------------------------------------------------------------
Timing Information -----------------------------------------------------------
I/O: 6.645 sec
Clustering: 636.870 sec
Reporting: 0.109 sec
Memory Usage Information -----------------------------------------------------
Maximum memory used: 13592952832 bytes
Current memory used: 20701104 bytes
********************************************************************************
From: Venkat Rangan
Sent: Thursday, April 08, 2010 6:44 AM
To: s-spac...@googlegroups.com
Subject: RE: Looking for a way to extend get-neighbors