questions about LSA representation


Carmen Torres López

Nov 25, 2015, 4:31:21 PM
to s-space-re...@googlegroups.com
Hello,

I'm getting document vectors from the LSA semantic space representation,
and the matrix contains negative values. What do these negative values
mean? I'm using SVDLIBC as the SVD implementation. I changed the
Transform class, but I keep getting negative values.

I show an example here:
Output for document vector matrix (4x4)

5.8129043355E-17 6.876023488999999E-16 3.6747604991999997E-16 2.23607
3.0794180147100003 -3.5372667954999995E-16 0.7085365055999999 -5.699496462299999E-17
4.1436065311000003E-16 2.91755 5.177445627200001E-16 -6.8374324853E-16
0.8338194216 -2.3803181781E-16 -2.61672974136 2.3510263586999997E-16

Also, I would like to use the LSA matrix as input for a clustering
algorithm. Could these negative values affect the clustering?
I ran some tests with the VSM and LSA representations, but I got the same
clusters for both methods. Shouldn't the results differ, since I'm using
two different semantic spaces?

Thanks in advance,
Carmen

Zehner, Fabian

Nov 25, 2015, 11:39:06 PM
to s-space-re...@googlegroups.com
Dear Carmen,

I hope it's okay for me to jump in here. Anyone should feel free to correct me.

1) Negative values in an SVD are common and meaningful. Think of the columns in your document vector matrix as (more or less) independent semantic concepts. For example, the first dimension (column) refers to nature, the second one to motor sport, and so on. You can then interpret a positive value in your vector for the first dimension as "this document tends to contain words related to nature" (the values are also referred to as "loadings": how much of this semantic concept the document carries). A value of zero means the document's words have nothing in particular to do with nature, although they could also occur in the context of nature (there is no relationship). A negative value, in turn, means that the document especially contains words that deal with "the opposite" of nature (whatever that is): words that typically come up in contexts that do not deal with nature and that do not show up when the topic is nature. You can interpret the loadings more or less like correlation coefficients (it is of course somewhat more complex than with correlation coefficients, because the SVD considers indirect relations across contexts).

2) Whether negative values affect your clustering depends on the distance metric you are using. Typically, in LSA you would use cosine (to put it more precisely: arccosine), which deals with negative values perfectly well. Right now I can't even think of a distance metric that can't deal with them, apart from metrics for binary variables, which of course are not applicable here at all. But make sure you know which computations your distance metric carries out; then you can decide whether negative values would be a problem.

3) I am not sure what you mean by VSM, because LSA is also a vector space model. But I guess in the VSM you "only" use the term document frequency? In that case, I'd say it would be rather a special case for there to be no difference between LSA and VSM, and you should stick to the more parsimonious model without latent concepts. But there might also be ways to perform a better LSA so that the LSA model improves.
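
To illustrate point 2, here is a minimal Java sketch of cosine similarity on vectors with negative components. It is plain Java, not S-Space code: the class and method names are made up for illustration, and the two vectors are rows 2 and 4 of the matrix posted above, with the entries of magnitude around 1e-16 written as 0 on the assumption that they are floating-point rounding noise. The sketch also flips the sign of one dimension in both vectors to show that the cosine does not change, which is worth knowing because the sign of an SVD dimension is arbitrary.

```java
public class CosineWithNegatives {
    /** Cosine similarity of two equal-length vectors; negative components are fine. */
    public static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns a copy of v with the sign of one dimension flipped. */
    public static double[] flip(double[] v, int dim) {
        double[] out = v.clone();
        out[dim] = -out[dim];
        return out;
    }

    public static void main(String[] args) {
        // Rows 2 and 4 of the posted matrix; ~1e-16 entries treated as 0 (assumption).
        double[] doc2 = {3.0794180147100003, 0.0, 0.7085365055999999, 0.0};
        double[] doc4 = {0.8338194216, 0.0, -2.61672974136, 0.0};

        double original = cosine(doc2, doc4);
        // Flip dimension 2 in BOTH vectors: the products a[i] * b[i] are unchanged.
        double flipped = cosine(flip(doc2, 2), flip(doc4, 2));

        System.out.println("cosine(doc2, doc4)        = " + original);
        System.out.println("after flipping dimension 2 = " + flipped);
    }
}
```

The result always lies in [-1, 1], so a clustering algorithm that uses cosine needs no special handling for negative loadings.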

Best regards,
Fabian


carmentor...@gmail.com

Nov 26, 2015, 12:40:10 PM
to Semantic Space Research - Development
Hello Fabian,

Thank you very much for your answer. I'm doing research for my master's thesis, and I have found the S-Space library very useful for research in the text mining field.

1. I now have a clearer idea of the meaning of the values in the LSA matrix. Thanks.

2. I'm using the cosine distance metric for clustering.

3. Yes, I know LSA is a vector space model (VSM), but I want to experiment with both S-Space implementations: the VectorSpaceModel class and the LatentSemanticAnalysis class. For VSM I only used the Tf-Idf transform. I thought that if I gave the VSM matrix and the LSA matrix as input to the HierarchicalAgglomerativeClustering algorithm (the one I'm using), I would get different clusters, and maybe better clusters from the LSA representation, because that model reduces the dimensionality with SVD, unlike VSM.
For LSA I only use two parameters: the number of dimensions, and true to retain the document vectors.
LatentSemanticAnalysis ss = new LatentSemanticAnalysis(dimensions, true);
I also tried other transforms such as PointWiseMutualInformationTransform, but the results were the same for both models.

If anyone could suggest which other parameters I could change to improve the LSA model, I would really appreciate it.

Best regards,
Carmen

David Jurgens

Nov 26, 2015, 12:46:02 PM
to s-space-re...@googlegroups.com
Hi Carmen,

 How many dimensions are you using, relative to the number of documents?  

  Thanks,
  David

carmentor...@gmail.com

Nov 26, 2015, 12:59:03 PM
to Semantic Space Research - Development
Hello David,

The number of dimensions is equal to the number of documents. Otherwise I get an exception saying that SVDLIBC generated the incorrect number of dimensions.

Best regards,
Carmen

David Jurgens

Nov 26, 2015, 1:03:53 PM
to s-space-re...@googlegroups.com
Ah! That's what I was suspecting. In this case, the SVD isn't actually projecting the term-document matrix into a lower number of dimensions, so its vectors should be roughly equivalent to the VSM ones. How many documents do you have? Usually that SVDLIBC error happens when there are very few documents.
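
The equivalence can be made explicit with a short derivation. This is a sketch under assumptions not confirmed in this thread: A is the (transformed) term-by-document matrix whose columns are the VSM document vectors, A = U Σ Vᵀ is its thin SVD, and the LSA document vectors are the rows of D_k = V_k Σ_k. S-Space's exact scaling of the document space may differ.

```latex
A = U \Sigma V^{\top}, \qquad D_k = V_k \Sigma_k

% No reduction: k = \operatorname{rank}(A), so V_k = V and \Sigma_k = \Sigma
D_k D_k^{\top} = V \Sigma^{2} V^{\top}
             = V \Sigma U^{\top} U \Sigma V^{\top}
             = A^{\top} A

% Real reduction: k < \operatorname{rank}(A)
D_k D_k^{\top} = V_k \Sigma_k^{2} V_k^{\top} \neq A^{\top} A \quad \text{(in general)}
```

Under these assumptions, when no dimensions are dropped every pairwise inner product between documents is the same in both spaces, and so are all cosines and Euclidean distances. A clustering algorithm that only looks at those quantities would then return the same clusters for VSM and LSA. The two representations start to differ only when k is smaller than the rank of A.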

carmentor...@gmail.com

Nov 26, 2015, 1:58:20 PM
to Semantic Space Research - Development
Hi David,

Well, the number of documents I have is low (about 10 for each semantic space), because in the first case study of my research I treat each text segment as one document, and each segment may have 1, 2, or 3 sentences. The texts in the documents I used are generally short.

So, if I understood you correctly, SVDLIBC shouldn't give me that error if I have more documents? Approximately how many? I have a corpus of 800 documents; what size do you recommend I try? I saw that the LSA default number of dimensions is 300; is this the minimum? I also have a second case study where I think I could use LSA: there I'll represent my whole corpus in one semantic space, so I'll have nearly 8000 segments.

So for the first case study I think I'll stay with the VSM representation, and I could try LSA for the second case study.

Thanks,
Carmen

Ahmed Jabbar Obaid

Oct 30, 2017, 10:22:59 PM
to Semantic Space Research - Development
No. I recently used LSA result matrices as input to clustering algorithms such as K-means. Squared Euclidean distance works on squared values, so don't worry about negative values. You should pay attention to the evaluation metrics instead, because noise may exist in your data.

Ahmed Jabbar Obaid

Oct 30, 2017, 10:22:59 PM
to Semantic Space Research - Development
Hello Carmen,

I recently applied LSA (the SVD method) to my matrices and then used the resulting matrices as input to my clustering algorithm. If you are using squared Euclidean distance as the proximity measure, you should not worry about negative values (the squared differences are never negative). What you should worry about is the evaluation metrics: if outliers exist in your data, you will get poor values.

I hope this helps.

Dr. Ahmed J.   
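
A minimal Java sketch of squared Euclidean distance may make the point above concrete. It is plain Java with made-up class and method names, not S-Space or K-means library code. Note that the squaring is applied to the component differences, not to the components themselves: the distance is never negative, but two vectors that differ only in sign are still a positive distance apart.

```java
public class SquaredEuclideanDemo {
    /** Squared Euclidean distance: the sum of squared component differences. */
    public static double squaredEuclidean(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];  // may be negative
            sum += diff * diff;         // the square never is
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] p = {1.0, 2.0};
        double[] q = {-1.0, 2.0};  // same magnitudes as p, opposite sign in dimension 0

        System.out.println(squaredEuclidean(p, p)); // prints 0.0
        System.out.println(squaredEuclidean(p, q)); // prints 4.0
    }
}
```

So negative entries need no preprocessing before clustering, but their signs still carry information that the distance takes into account.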

