Looking for a way to extend get-neighbors

32 views
Skip to first unread message

Venkat Rangan

unread,
Mar 19, 2010, 3:17:38 PM3/19/10
to s-spac...@googlegroups.com

Hi,

 

I’ve been exploring the .sspace vectors I’ve built using LSA and Random Indexing, using the SemanticSpaceExplorer class. Is there a way to extend this so that I can determine document clusters? Also, if one has a document or a bag of words, find other closest documents that match a given document?

 

Essentially, I would like to do the following:

 

Initial creation phase:

[ corpus of documents] -> [create sspace]

 

Subsequent exploration phase:

[ document ] -> [ bag of words ]

 

[ bag of words ] -> [ other similar documents ]

 

Thanks.

 

 

Keith Stevens

unread,
Mar 19, 2010, 3:33:07 PM3/19/10
to s-spac...@googlegroups.com
Hi,

Could you elaborate on this a little bit?  I'm not sure what you mean by a [bag of words].  Currently the sspaces don't store any of the document representations.  There is a Document Vector Builder that creates vector based document representations based on a sspace and a text fragment.  Although we don't have much code that utilizes this, you could do the following:

1) build a sspace from a document collection
2) Build a document space using the same document collection and the generated sspace.
3) Store it in something conforming to the SemanticSpace interface
4) Use the Word Comparator to compare two document vectors in the space, or just use the explorer.

The only issue with this is that it doesn't work so well if you want to compare a new document, i.e. one not in the original document space, to other documents in the space.  We could probably modify or make a class comparable to the Word Comparator that does have this functionality if it'd fit your needs.

To unsubscribe from this group, send email to s-space-users+unsubscribegooglegroups.com or reply to this email with the words "REMOVE ME" as the subject.

David Jurgens

unread,
Mar 19, 2010, 3:33:39 PM3/19/10
to s-spac...@googlegroups.com
Hi Venkat,

I’ve been exploring the .sspace vectors I’ve built using LSA and Random Indexing, using the SemanticSpaceExplorer class. Is there a way to extend this so that I can determine document clusters?

 There currently isn't much support for document-based operations in the S-Space package.  However, there are the tools to do some operations.  The LSA class supports getting the dimensionally reduced document vectors, which can be clustered.  We have a major clustering update coming in the next few days, so we'll have additional support for different clustering operations.  Did you have a specific kind of clustering in mind that you wanted to use?

For RandomIndexing spaces, we don't have any document-based operations.  Our use of Random Indexing is based all on terms.  However, if you're looking to do document-based projection, I highly recommend the Semantic Vectors package ( http://code.google.com/p/semanticvectors/ ), which is entirely focused on document-based operations and information retrieval. 

Also, if one has a document or a bag of words, find other closest documents that match a given document?

For LSA, we don't save the projection matrix, so once the SVD has been performed it's not possible to map a new bag of words into the reduced space.

However, for all the .sspaces, we do support constructing a new "document vector" based on a bag of words.  This essentially sums the word vectors using the .sspace to create a "gist" of what the average semantics are for the document.  The code to do this is in the edu.ucla.sspace.common.DocumentVectorBuilder class.

Let me know if this helps.

  Thanks,
  David
 

Essentially, I would like to do the following:

 

Initial creation phase:

[ corpus of documents] -> [create sspace]

 

Subsequent exploration phase:

[ document ] -> [ bag of words ]

 

[ bag of words ] -> [ other similar documents ]

 

Thanks.

 

 

To unsubscribe from this group, send email to s-space-users+unsubscribegooglegroups.com or reply to this email with the words "REMOVE ME" as the subject.


Venkat Rangan

unread,
Apr 8, 2010, 1:26:52 AM4/8/10
to s-spac...@googlegroups.com

Hi David,

 

I have looked at the HierarchicalAgglomerativeClustering class in the S-Space package, which is very interesting. However, I noticed that the algorithm is at least O(N**2), and even for fairly small sizes (30K rows, 200 dimensions), the computeSimilarityMatrix() method never completes in reasonable time. Do you have recommended sizes and expected times for that portion of clustering? Are there options that may sacrifice a bit of accuracy but work faster?

 

Thanks,

 

-venkat

Keith Stevens

unread,
Apr 8, 2010, 1:35:22 AM4/8/10
to s-spac...@googlegroups.com
If you know the number of clusters you want, the ClutoClustering class will probably provide results much more rapidly.  the ClutoClustering class is really just a wrapper around Cluto, one of the more complete clustering packages around.

If you wish to take this route, we suggest that you have the cluto binaries included in your PATH, in the same way that you include octave, matlab, or svdlibc in your PATH. 

To use it, just call
ClutoClustering.agglomerativeClustering(matrix, numClusters)
and the item assignments will be returned in an array.

If you do not know the number of clusters you want, we have some experimental clustering algorithms that we will be including in the trunk shortly that do this.  We will also look into modifications to the current HAC code to see if we can't speed it up.

Hopefully this helps!

Cheers
--Keith Stevens

Venkat Rangan

unread,
Apr 8, 2010, 3:53:21 AM4/8/10
to s-spac...@googlegroups.com

Hi Keith,

 

Thanks very much for your pointer on cluto clustering. It is exactly what we were looking for. When we tried to run it, it failed on the Java side with the following error:

 

Apr 8, 2010 9:44:25 AM edu.ucla.sspace.clustering.ClutoClustering cluster

WARNING: Cluto exited with error status.  -1073741819 stderr:

 

Exception in thread "main" java.lang.Error: Clustering failed

      at edu.ucla.sspace.clustering.ClutoClustering.cluster(ClutoClustering.java:171)

      at edu.ucla.sspace.clustering.ClutoClustering.agglomerativeCluster(ClutoClustering.java:76)

      at pitt.search.semanticvectors.HierarchicalClustering.cluster(HierarchicalClustering.java:47)

      at pitt.search.semanticvectors.HierarchicalClusteringTest.main(HierarchicalClusteringTest.java:48)

 

We then created the cluto input matrix and ran the command from command line, and it seemed to run for a few seconds, and then exits without generating the output file. Do you know if we are using the right cluto download, and do you have any thoughts on what could be going wrong here?

 

I really appreciate all the help!

 

 

D:\dev\tools\cluto\cluto-2.1.1\Win32>vcluster -clmethod=agglo -clustfile=d:/temp/clust-out.matrix d:/temp/c.matrix 100

********************************************************************************

vcluster (CLUTO 2.1.1) Copyright 2001-03, Regents of the University of Minnesota

 

Matrix Information -----------------------------------------------------------

  Name: d:/temp/c.matrix, #Rows: 36836, #Columns: 200, #NonZeros: 2167030

 

Options ----------------------------------------------------------------------

  CLMethod=AGGLO, CRfun=UPGMA, SimFun=Cosine, #Clusters: 100

  RowModel=None, ColModel=IDF, GrModel=SY-DIR, NNbrs=40

  Colprune=1.00, EdgePrune=-1.00, VtxPrune=-1.00, MinComponent=5

  CSType=Best, AggloFrom=0, AggloCRFun=UPGMA, NTrials=10, NIter=10

 

Solution ---------------------------------------------------------------------

 

-venkat

David Jurgens

unread,
Apr 8, 2010, 4:06:38 AM4/8/10
to s-spac...@googlegroups.com
Hi Venkat,

  We've seen this happen quite often with CLUTO.  I think the program is segfaulting, which is why the output matrix isn't being written.  Do you know whether your input matrix is sparse or dense?  We've seen cluto segfault for fairly large sparse matrices, so perhaps that might be what's going on (a partial solution might be to write the matrix in the dense format).

  Thanks,
  David

David Jurgens

unread,
Apr 8, 2010, 4:15:48 AM4/8/10
to s-spac...@googlegroups.com
Hi Venkat,

  Also, regarding your comment on the Hierarchical Agglomerative Clusterting, computing the similarity matrix is unfortunately an O(n^2) operation where each individual operation takes O(m) where m is the length of the vectors.  For sparse vectors this improves a bit to be O(nz*log(nz)), where nz in the number of non-zero elements in the vector.  I have a patch coming down the line that should improve this to just O(nz) for the Cosine Similarity (the default for the H.A. Clustering), which may improve your performance some.  However, as Keith suggests, CLUTO is much faster, so if you need to scale to larger datasets, it's worth using that.

  Thanks,
  David

Venkat Rangan

unread,
Apr 8, 2010, 4:32:39 AM4/8/10
to s-spac...@googlegroups.com

Thanks David. I changed the Cluto Input to MatrixIO.Format.CLUTO_DENSE which fails the same way. The first-argument input to the ClutoClustering.cluster() method is still a matrix created using the constructor - SparseOnDiskMatrix(). Is there a way to create a “dense format” matrix as input?

David Jurgens

unread,
Apr 8, 2010, 4:41:02 AM4/8/10
to s-spac...@googlegroups.com
I'm not sure if this is the fix you just did, but try changing lines 107-111 in ClutoClustering.java from

        MatrixIO.writeMatrix(m, matrixFile,
                             ((m instanceof SparseMatrix)
                              ? MatrixIO.Format.CLUTO_SPARSE
                              : MatrixIO.Format.CLUTO_DENSE));

to

        MatrixIO.writeMatrix(m, matrixFile, MatrixIO.Format.CLUTO_DENSE);


This will write your data in the dense format, which causes CLUTO to use a different internal representation for it.  Since you're using a sparse matrix, the previous code would have written the matrix file in such a way to cause CLUTO to use a sparse representation internally.  I'm not sure if this is the solution to your problem though, but it's worth trying.

Venkat Rangan

unread,
Apr 8, 2010, 9:44:17 AM4/8/10
to s-spac...@googlegroups.com

That is exactly what I tried, and it still failed, with no output and the error as shown below.

Venkat Rangan

unread,
Apr 8, 2010, 1:00:37 PM4/8/10
to Venkat Rangan, s-spac...@googlegroups.com

David,

 

I upgraded from Cluto 2.1.1 to Cluto 2.1.2 and used their MSWIN-x86_64 binary and that version seems to work fine.

 

D:\dev\tools\cluto\cluto-2.1.2\MSWIN-x86_64>vcluster -clmethod=agglo -clustfile=

cout.matrix d:\temp\c.matrix 10

********************************************************************************

 

vcluster (CLUTO 2.1.2) Copyright 2001-06, Regents of the University of Minnesota

 

 

Matrix Information -----------------------------------------------------------

  Name: d:\temp\c.matrix, #Rows: 36836, #Columns: 200, #NonZeros: 2167030

 

Options ----------------------------------------------------------------------

  CLMethod=AGGLO, CRfun=UPGMA, SimFun=Cosine, #Clusters: 10

  RowModel=None, ColModel=IDF, GrModel=SY-DIR, NNbrs=40

  Colprune=1.00, EdgePrune=-1.00, VtxPrune=-1.00, MinComponent=5

  CSType=Best, AggloFrom=0, AggloCRFun=UPGMA, NTrials=10, NIter=10

 

Solution ---------------------------------------------------------------------

 

------------------------------------------------------------------------

10-way clustering: [UPGMA=0.00e+000] [36836 of 36836]

------------------------------------------------------------------------

cid  Size  ISim  ISdev   ESim  ESdev  |

------------------------------------------------------------------------

  0  7119 +0.095 +0.074 -0.000 +0.013 |

  1  1702 +0.106 +0.058 -0.002 +0.013 |

  2  1845 +0.156 +0.069 +0.001 +0.013 |

  3  1388 +0.148 +0.073 +0.001 +0.013 |

  4 13621 +0.049 +0.033 +0.004 +0.017 |

  5   382 +0.173 +0.074 -0.004 +0.013 |

  6   358 +0.286 +0.078 -0.000 +0.011 |

  7   568 +0.271 +0.125 -0.003 +0.012 |

  8  4384 +0.124 +0.071 +0.004 +0.015 |

  9  5469 +0.086 +0.060 +0.006 +0.016 |

------------------------------------------------------------------------

 

Timing Information -----------------------------------------------------------

   I/O:                                   6.645 sec

   Clustering:                          636.870 sec

   Reporting:                             0.109 sec

Memory Usage Information -----------------------------------------------------

   Maximum memory used:              13592952832 bytes

   Current memory used:                20701104 bytes

********************************************************************************

 

 

From: Venkat Rangan
Sent: Thursday, April 08, 2010 6:44 AM
To: s-spac...@googlegroups.com
Subject: RE: Looking for a way to extend get-neighbors

 

That is exactly what I tried, and it still failed, with no output and the error as shown below.

 

Reply all
Reply to author
Forward
0 new messages