Getting "true" document vectors from a VectorSpaceModel

Johann Petrak

Oct 3, 2014, 1:16:22 PM
to s-spac...@googlegroups.com
Is there a way, and if yes, what is the most efficient way, to get from a VectorSpaceModel a vector that represents the counts (or, after a transform, the tf*idf scores) of each term in a new or existing document? What DocumentVectorBuilder.buildVector creates is not really a document vector in that sense, but the linear (or some other) combination of the vectors of all the terms that occur in a document.
What I would like to get instead is something that would give me for an existing document each of the tf*idf scores of all the terms that occur in the document or would create a new vector from a string of terms for a new document.
So the number of elements in the vector I want would be the total number of different terms that occur in the corpus for a dense vector and the number of different terms that occur in the document for a sparse vector. The number of elements the DocumentVectorBuilder.buildVector method gives me is, if I understood correctly, the number of documents in the corpus for the dense vector and the number of documents in which any of the terms occurs for the sparse vector.
Thanks and sorry if I am missing something blatantly obvious here,
 
   Johann

David Jurgens

Oct 3, 2014, 5:45:57 PM
to s-spac...@googlegroups.com
Hi Johann,

Is there a way, and if yes, what is the most efficient way,  to get from a VectorSpaceModel a vector that represents the counts (or, after a transform, the tf*idf scores) of each term in a new or existing document?

As of five minutes ago, there is now a way to do this. :) I added new functionality to the class to expose the underlying document vectors. Because of how we implemented the class, the vector values will be the values after whatever transform has been applied. If no transform is applied, they are the raw frequency values. If you grab the latest code from GitHub, it should have this new functionality.

I realize now that figuring out which vector dimensions correspond to which words isn't obvious either. To keep track of these, you'll need to construct the VSM with your own BasisMapping instance, which records the dimension assigned to each word. The code would look something like this:

        // Supply your own BasisMapping so you can look up each word's dimension later
        BasisMapping termToIndex = new StringBasisMapping();
        VectorSpaceModel vsm = new VectorSpaceModel(false, termToIndex, new SvdlibcSparseBinaryMatrixBuilder());
        // Process your documents here
        vsm.processSpace(new java.util.Properties());
        // Vector for document 0
        DoubleVector docVec = vsm.getDocumentVector(0);
        // Use the basis mapping to figure out which dimension is assigned to each word
        int dimension = termToIndex.getDimension("test");
        // Value of "test" in document 0: the raw count, or the transformed score if a transform was applied
        double termValue = docVec.get(dimension);


What I would like to get instead is something that would give me for an existing document each of the tf*idf scores of all the terms that occur in the document or would create a new vector from a string of terms for a new document.

I think this new functionality should support the former. Creating the tf-idf scores for a new document would require a bit more hacking to keep the tf-idf values around. However, if you really need this, I can look into what it would take.
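
To make "keeping the values around" concrete, here is a minimal standalone sketch of the idea, not S-Space code: it stores per-term document frequencies while the corpus is read, then uses them to weight a new document. The class TfIdfSketch and its methods are hypothetical names, and the weighting (count / length) * ln(N / df) is just one common tf-idf variant, which may differ from what the package's transform computes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

/** Hypothetical helper, not part of S-Space: keeps document frequencies for later use. */
final class TfIdfSketch {
    private final Map<String, Integer> docFreq = new HashMap<>();
    private int numDocs = 0;

    /** Records one corpus document: each distinct term raises its document frequency by one. */
    void addDocument(String[] tokens) {
        numDocs++;
        for (String t : new HashSet<String>(Arrays.asList(tokens))) {
            docFreq.merge(t, 1, Integer::sum);
        }
    }

    /** Weights a new document; terms never seen in the corpus are skipped. */
    Map<String, Double> weigh(String[] tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            Integer df = docFreq.get(e.getKey());
            if (df == null) {
                continue; // no idf available for an unseen term
            }
            double tf = e.getValue() / (double) tokens.length;
            double idf = Math.log(numDocs / (double) df);
            weights.put(e.getKey(), tf * idf);
        }
        return weights;
    }

    public static void main(String[] args) {
        TfIdfSketch model = new TfIdfSketch();
        model.addDocument("the cat sat".split(" "));
        model.addDocument("the dog ran".split(" "));
        // "the" occurs in every document, so its weight is 0; "zebra" is unseen and dropped.
        System.out.println(model.weigh("the cat cat zebra".split(" ")));
    }
}
```

The point of the sketch is only that the document frequencies (and the document count) have to outlive the corpus pass; how unseen terms should be handled is a separate design choice.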
 
So the number of elements in the vector I want would be the total number of different terms that occur in the corpus for a dense vector and the number of different terms that occur in the document for a sparse vector.

The length of the vector should be the number of unique terms in the entire corpus, and the number of non-zero entries should be the number of distinct terms that appeared in the document. I think the underlying DoubleVector implementation returned by the VSM should be a SparseDoubleVector, so you can cast it and use getNonZeroIndices() to get these quickly.
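
As a rough illustration of that length versus non-zero distinction, here is a small standalone sketch that does not use S-Space at all. SparseDocVectorSketch is a hypothetical stand-in that plays the role of both the basis mapping (term to dimension) and the sparse vector (dimension to count), assuming dimensions are handed out in order of first appearance.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in, not part of S-Space: term-to-dimension map plus sparse count vectors. */
final class SparseDocVectorSketch {
    // Each distinct term gets the next free dimension, in order of first appearance.
    private final Map<String, Integer> termToIndex = new LinkedHashMap<>();

    /** Returns the dimension for a term, assigning a new one if the term is unseen. */
    int getDimension(String term) {
        Integer idx = termToIndex.get(term);
        if (idx == null) {
            idx = termToIndex.size();
            termToIndex.put(term, idx);
        }
        return idx;
    }

    /** Total number of dimensions, i.e. distinct terms seen so far. */
    int numDimensions() {
        return termToIndex.size();
    }

    /** Builds a sparse count vector (dimension -> raw frequency) for one tokenized document. */
    Map<Integer, Double> countVector(String[] tokens) {
        Map<Integer, Double> vec = new TreeMap<>();
        for (String t : tokens) {
            vec.merge(getDimension(t), 1.0, Double::sum);
        }
        return vec;
    }

    public static void main(String[] args) {
        SparseDocVectorSketch basis = new SparseDocVectorSketch();
        Map<Integer, Double> doc0 = basis.countVector("the cat sat on the mat".split(" "));
        Map<Integer, Double> doc1 = basis.countVector("the dog sat".split(" "));
        System.out.println("vector length = " + basis.numDimensions());      // 6
        System.out.println("doc1 non-zero entries = " + doc1.size());        // 3
        System.out.println("count of 'the' in doc0 = "
                + doc0.get(basis.getDimension("the")));                       // 2.0
    }
}
```

Here the keys of each map correspond to what the non-zero indices of a sparse vector would give you, while the vector length is shared by every document in the corpus.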

The number of elements the DocumentVectorBuilder.buildVector method gives me is, if I understood correctly, the number of documents in the corpus for the dense vector and the number of documents in which any of the terms occurs for the sparse vector.

DocumentVectorBuilder is a simple aggregation function that creates a document representation by summing the vectors of the words in that document (where the word vectors come from an existing SemanticSpace). The dimensions of the document's vector depend entirely on the SemanticSpace and may not correspond to the number of documents or the number of words (as in the case of LSA).
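
For comparison, the summing approach can be sketched in a few standalone lines. This is not the DocumentVectorBuilder source: SumOfWordVectorsSketch and its buildVector signature are hypothetical, and a plain map from word to double[] stands in for the SemanticSpace.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch, not the S-Space implementation: sums word vectors into a document vector. */
final class SumOfWordVectorsSketch {
    /** Sums the vectors of all tokens that have one; the result has the word vectors' length. */
    static double[] buildVector(String[] tokens, Map<String, double[]> wordVectors, int length) {
        double[] doc = new double[length];
        for (String t : tokens) {
            double[] v = wordVectors.get(t);
            if (v == null) {
                continue; // words missing from the space contribute nothing
            }
            for (int i = 0; i < length; i++) {
                doc[i] += v[i];
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        // A toy two-dimensional "semantic space"; the dimensions are not terms or documents.
        Map<String, double[]> space = new HashMap<>();
        space.put("cat", new double[] {1.0, 0.0});
        space.put("dog", new double[] {0.5, 2.0});
        double[] doc = buildVector("cat dog cat bird".split(" "), space, 2);
        System.out.println(Arrays.toString(doc)); // [2.5, 2.0]
    }
}
```

The result always has as many dimensions as the word vectors, however many terms or documents the corpus contains, which is why it cannot be read as a per-term tf-idf vector.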
 
Thanks and sorry if I am missing something blatantly obvious here,

Not a problem! This kind of functionality should have been better supported, so your questions are most welcome! Please let us know if we can improve the package further or if this change doesn't meet your needs.
 
  Thanks,
  David


 