Hi David,
Thanks for the valuable advice!
> If you're comparing the rows of a matrix, you'll probably want to use the
> edu.ucla.sspace.matrix.RowComparator class. I don't think the
> WordComparator will do what you want in this case.
Actually I need a way to compare the generated document vectors.
Currently I am using the Similarity class to calculate the similarity
between two documents directly, based on one of the supplied metrics.
Then I enter the resulting similarity value in the appropriate cell
of the matrix.
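Roughly, the procedure amounts to the sketch below. It is plain Java and does not call the S-Space API: the class name, the method names and the toy vectors are made up for illustration, and cosine similarity stands in for whichever metric is actually chosen.

```java
import java.util.Arrays;

class SimilarityMatrixSketch {

    // Cosine similarity of two equal-length vectors; returns 0 if either is all zeros.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Fills a symmetric n x n matrix with the pairwise document similarities.
    static double[][] similarityMatrix(double[][] docs) {
        int n = docs.length;
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double s = cosine(docs[i], docs[j]);
                sim[i][j] = s;   // upper triangle (and diagonal)
                sim[j][i] = s;   // mirror into the lower triangle
            }
        }
        return sim;
    }

    public static void main(String[] args) {
        // Hypothetical document vectors, one row per document.
        double[][] docs = { {1, 0}, {0, 1}, {1, 1} };
        for (double[] row : similarityMatrix(docs)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Since the matrix is symmetric, only the upper triangle is computed and then mirrored, which halves the number of comparisons.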
> It's certainly one way to do it :) However, you might also consider using
> the document's tokens themselves, which is the Vector Space Model
> <http://en.wikipedia.org/wiki/Vector_space_model>,
> or performing the SVD on the vector space model to have Latent Semantic
> Indexing <http://en.wikipedia.org/wiki/Latent_semantic_indexing>-based
> documents. We have support for both in our package.
>
> The way you mentioned is feasible, but non-traditional. I would probably
> try one of the other models first and then test using the
> DocumentVectorBuilder as a second alternative.
Thanks for clearing this up for me. I will certainly have a look at
the mentioned models.
> The vectors shouldn't be random (unless you're using Random Indexing), but
> yes there is a relation. If you use VSM, the dimensionality will be
> equal to the number of terms. For LSI, you can specify the dimensionality
> manually. If you use the DocumentVectorBuilder, the dimensionality will
> depend on the dimensionality of the word vectors themselves (so it depends
> on which SemanticSpace algorithm and parameters).
Sorry about that, I should have started with this information.
I do want to use Random Indexing for the terms in the corpus.
What is the relationship between the vector length and the number of
terms in this case?
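For context, Random Indexing as it is usually described gives every term a sparse random index vector whose length is picked up front, so the length is a chosen parameter rather than something derived from the vocabulary size. A minimal sketch of that idea follows. It is plain Java rather than the S-Space implementation, and the class name, method names and values (length 16, 4 non-zero entries) are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

class RandomIndexSketch {

    // Builds a sparse ternary vector: exactly `nonZero` positions hold +1 or -1, the rest 0.
    static int[] indexVector(int length, int nonZero, Random rng) {
        if (nonZero > length) {
            throw new IllegalArgumentException("nonZero cannot exceed length");
        }
        int[] v = new int[length];
        int placed = 0;
        while (placed < nonZero) {
            int pos = rng.nextInt(length);
            if (v[pos] == 0) {               // skip positions that are already filled
                v[pos] = rng.nextBoolean() ? 1 : -1;
                placed++;
            }
        }
        return v;
    }

    // Assigns one index vector per distinct term; every vector has the same fixed length.
    static Map<String, int[]> indexVectors(String[] terms, int length, int nonZero, long seed) {
        Random rng = new Random(seed);
        Map<String, int[]> vectors = new HashMap<>();
        for (String term : terms) {
            if (!vectors.containsKey(term)) {
                vectors.put(term, indexVector(length, nonZero, rng));
            }
        }
        return vectors;
    }

    public static void main(String[] args) {
        String[] terms = { "cat", "dog", "fish", "cat" };
        Map<String, int[]> vectors = indexVectors(terms, 16, 4, 42L);
        // The vector length stays 16 no matter how many distinct terms there are.
        System.out.println(vectors.size() + " terms, length " + vectors.get("cat").length);
    }
}
```

In this sketch, adding more terms only adds more entries to the map; the length of each vector does not change.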
> I hope this helps, and please let us know if you have any questions. Also
> if you're looking into clustering documents, we have a large number of
> Clustering algorithms that you can try, rather than using the raw document
> similarity.
I just finished the non-traditional implementation of the procedure
and arrived at the similarity matrix. Now I will have a look at the
clustering offered by the lib. Thanks again for the help!
Regards,
Kalin
On Feb 9, 9:51 pm, David Jurgens <david.jurg...@gmail.com> wrote:
> Hi Kalin,
>
> > I have the following problem - I have a corpus and I want to cluster
> > it in order to limit the search space for users' queries. Reading the
> > discussion on a similar topic
> > (http://groups.google.com/group/s-space-users/browse_thread/thread/87ef56425bc880ab)
> > I understand that there are several steps involved in completing a
> > task like that using the package:
>
> > 1) Build the term-based s-space from the corpus;
> > 2) Build the document-based s-space
> > (http://code.google.com/p/airhead-research/source/browse/trunk/sspace/src/edu/ucla/sspace/common/DocumentVectorBuilder.java)
> > using the corpus and the generated term-based s-space in step 1);
> > 3) Use the Word Comparator
> > (http://code.google.com/p/airhead-research/source/browse/trunk/sspace/src/edu/ucla/sspace/common/WordComparator.java)
> > to compare two document vectors (populate the similarity matrix).
>
> If you're comparing the rows of a matrix, you'll probably want to use the
> edu.ucla.sspace.matrix.RowComparator class. I don't think the
> WordComparator will do what you want in this case.
>
> > My first question is whether this is a good approach to build a
> > document similarity matrix using the package.
>
> It's certainly one way to do it :) However, you might also consider using
> the document's tokens themselves, which is the Vector Space Model
> <http://en.wikipedia.org/wiki/Vector_space_model>,
> or performing the SVD on the vector space model to have Latent Semantic
> Indexing <http://en.wikipedia.org/wiki/Latent_semantic_indexing>-based