Similarity Matrix


Kalin Stefanov

Feb 9, 2011, 10:03:39 AM
to S-Space Package Users
Hello!

Let me start by saying what wonderful work you do with this package!
Thanks!

I have the following problem: I have a corpus and I want to cluster
it in order to limit the search space for users' queries. Reading the
discussion on a similar topic
(http://groups.google.com/group/s-space-users/browse_thread/thread/87ef56425bc880ab),
I understand that there are several steps involved in completing a task
like that using the package:

1) Build the term-based s-space from the corpus;
2) Build the document-based s-space
(http://code.google.com/p/airhead-research/source/browse/trunk/sspace/src/edu/ucla/sspace/common/DocumentVectorBuilder.java)
using the corpus and the term-based s-space generated in step 1);
3) Use the WordComparator
(http://code.google.com/p/airhead-research/source/browse/trunk/sspace/src/edu/ucla/sspace/common/WordComparator.java)
to compare two document vectors (populate the similarity matrix).

My first question is whether this is a good approach to build a
document similarity matrix using the package.

My second question is whether there is any relation between the number
of terms in the corpus and the dimensions of the random vectors built
in step 1).

Thanks for your advice in advance!

Best,
Kalin

David Jurgens

Feb 9, 2011, 3:51:29 PM
to s-spac...@googlegroups.com
Hi Kalin,

> I have the following problem: I have a corpus and I want to cluster
> it in order to limit the search space for users' queries. Reading the
> discussion on a similar topic
> (http://groups.google.com/group/s-space-users/browse_thread/thread/87ef56425bc880ab),
> I understand that there are several steps involved in completing a task
> like that using the package:
>
> 1) Build the term-based s-space from the corpus;
> 2) Build the document-based s-space
> (http://code.google.com/p/airhead-research/source/browse/trunk/sspace/src/edu/ucla/sspace/common/DocumentVectorBuilder.java)
> using the corpus and the term-based s-space generated in step 1);
> 3) Use the WordComparator
> (http://code.google.com/p/airhead-research/source/browse/trunk/sspace/src/edu/ucla/sspace/common/WordComparator.java)
> to compare two document vectors (populate the similarity matrix).

If you're comparing the rows of a matrix, you'll probably want to use the edu.ucla.sspace.matrix.RowComparator class.  I don't think the WordComparator will do what you want in this case.
 

> My first question is whether this is a good approach to build a
> document similarity matrix using the package.

It's certainly one way to do it :)  However, you might also consider using the document's tokens themselves, which is the Vector Space Model, or performing the SVD on the vector space model to have Latent Semantic Indexing-based documents.  We have support for both in our package.  

The way you mentioned is feasible, but non-traditional.  I would probably try one of the other models first and then test using the DocumentVectorBuilder as a second alternative.
 

> My second question is whether there is any relation between the number
> of terms in the corpus and the dimensions of the random vectors built
> in step 1).

The vectors shouldn't be random (unless you're using Random Indexing), but yes, there is a relation.  If you use the VSM, the dimensionality will be equal to the number of terms.  For LSI, you can specify the dimensionality manually.  If you use the DocumentVectorBuilder, the dimensionality will depend on the dimensionality of the word vectors themselves (so it depends on which SemanticSpace algorithm and parameters you use).
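
To make the dimensionality point concrete, here is a minimal, library-independent sketch in plain Java. It does not call any S-Space classes. The class and method names are hypothetical, and it assumes a document vector is formed by summing the word vectors of its tokens, which may differ from what DocumentVectorBuilder actually does.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the S-Space package.
final class DimensionalitySketch {

    // Vector Space Model: one dimension per distinct term in the vocabulary.
    static double[] termCountVector(List<String> doc, List<String> vocabulary) {
        double[] v = new double[vocabulary.size()];
        for (String token : doc) {
            int idx = vocabulary.indexOf(token);
            if (idx >= 0) v[idx]++;          // tokens outside the vocabulary are skipped
        }
        return v;
    }

    // Summed word vectors (an assumption about how a document vector is built):
    // the result has the dimensionality of the word vectors, not of the vocabulary.
    static double[] sumOfWordVectors(List<String> doc,
                                     Map<String, double[]> wordVectors,
                                     int dims) {
        double[] v = new double[dims];
        for (String token : doc) {
            double[] w = wordVectors.get(token);
            if (w == null) continue;         // unknown tokens contribute nothing
            for (int i = 0; i < dims; i++) v[i] += w[i];
        }
        return v;
    }

    public static void main(String[] args) {
        List<String> vocabulary = Arrays.asList("cat", "dog", "fish", "bird");
        List<String> doc = Arrays.asList("cat", "cat", "dog");

        Map<String, double[]> wordVectors = new HashMap<>();
        wordVectors.put("cat", new double[] {1.0, 2.0});
        wordVectors.put("dog", new double[] {0.5, -1.0});

        // Length follows the vocabulary size (4 here).
        System.out.println(termCountVector(doc, vocabulary).length);
        // Length follows the word-vector size (2 here).
        System.out.println(sumOfWordVectors(doc, wordVectors, 2).length);
    }
}
```
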
 
> Thanks for your advice in advance!

I hope this helps, and please let us know if you have any questions.  Also if you're looking into clustering documents, we have a large number of Clustering algorithms that you can try, rather than using the raw document similarity.

  Thanks,
  David
 
 

Kalin Stefanov

Feb 10, 2011, 10:47:55 AM
to S-Space Package Users
Hi David,

Thanks for the valuable advice!

> If you're comparing the rows of a matrix, you'll probably want to use the
> edu.ucla.sspace.matrix.RowComparator class. I don't think the
> WordComparator will do what you want in this case.

Actually, I need a way to compare the generated document vectors.
Currently I am using the Similarity class to calculate the similarity
between two documents directly, based on one of the supplied metrics.
I then enter the resulting similarity value in the appropriate cell
of the matrix.
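
The procedure described above can be sketched roughly as follows. This is a plain-Java illustration that does not use the S-Space Similarity class; the class and method names are hypothetical, and cosine similarity is assumed as the metric.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of the S-Space package.
final class SimilarityMatrixSketch {

    // Cosine similarity of two equal-length vectors; returns 0 if either is all zeros.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length)
            throw new IllegalArgumentException("vector lengths differ");
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0;
        return dot / Math.sqrt(normA * normB);
    }

    // Fills a symmetric n-by-n matrix, computing each pair of documents once.
    static double[][] build(double[][] docs) {
        int n = docs.length;
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double s = cosine(docs[i], docs[j]);
                sim[i][j] = s;
                sim[j][i] = s;   // the metric is symmetric, so mirror the value
            }
        }
        return sim;
    }

    public static void main(String[] args) {
        // Three toy document vectors; the second is a scaled copy of the first.
        double[][] docs = { {1, 0, 2}, {2, 0, 4}, {0, 3, 0} };
        for (double[] row : build(docs))
            System.out.println(Arrays.toString(row));
    }
}
```

Computing only the upper triangle halves the number of comparisons, which matters because the work grows quadratically with the number of documents.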

> It's certainly one way to do it :) However, you might also consider using
> the document's tokens themselves, which is the Vector Space
> Model<http://en.wikipedia.org/wiki/Vector_space_model>,
> or performing the SVD on the vector space model to have Latent Semantic
> Indexing <http://en.wikipedia.org/wiki/Latent_semantic_indexing>-based
> documents. We have support for both in our package.
>
> The way you mentioned is feasible, but non-traditional. I would probably
> try one of the other models first and then test using the
> DocumentVectorBuilder as a second alternative.

Thanks for clearing this up for me. I will certainly have a look at
the mentioned models.

> The vectors shouldn't be random (unless you're using Random Indexing), but
> yes, there is a relation. If you use the VSM, the dimensionality will be
> equal to the number of terms. For LSI, you can specify the dimensionality
> manually. If you use the DocumentVectorBuilder, the dimensionality will
> depend on the dimensionality of the word vectors themselves (so it depends
> on which SemanticSpace algorithm and parameters you use).

Sorry, I should have mentioned this first: I do want to use Random
Indexing for the terms in the corpus. What is the relationship between
the vector length and the number of terms in this case?
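
As background on this question: in Random Indexing as it is commonly described, each term is assigned a sparse index vector whose length is a parameter chosen up front, so it does not grow with the number of terms. A minimal sketch of generating such a vector follows. The class name, the length of 8, and the 2 nonzero entries are hypothetical values chosen for illustration, not S-Space defaults.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration only; not part of the S-Space package.
final class RandomIndexSketch {

    // Returns a ternary vector of length `dims` with exactly `nonZero`
    // entries set to +1 or -1 at distinct random positions; the rest are 0.
    static int[] indexVector(int dims, int nonZero, Random rng) {
        if (nonZero > dims)
            throw new IllegalArgumentException("nonZero exceeds dims");
        int[] v = new int[dims];
        int placed = 0;
        while (placed < nonZero) {
            int pos = rng.nextInt(dims);
            if (v[pos] != 0) continue;       // position already used, draw again
            v[pos] = rng.nextBoolean() ? 1 : -1;
            placed++;
        }
        return v;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        // The vector length stays 8 no matter how many terms are indexed.
        for (String term : new String[] {"corpus", "cluster", "query"})
            System.out.println(term + " " + Arrays.toString(indexVector(8, 2, rng)));
    }
}
```
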

> I hope this helps, and please let us know if you have any questions. Also
> if you're looking into clustering documents, we have a large number of
> Clustering algorithms that you can try, rather than using the raw document
> similarity.

I just finished the non-traditional implementation of the procedure
and arrived at the similarity matrix. Now I will have a look at the
clustering offered by the lib. Thanks again for the help!

Regards,
Kalin
