Hi Radim,
Sorry for replying to this ancient post, but it seemed like a good idea to keep topical things together.
Regarding the term-term similarity, I have the following questions:
1) In the example below, why are you using lsi.projection.u.T instead of (lsi.projection.u * lsi.projection.s).T, whereas in the document implementation, scaling seems to be False by default?
2) You say that you want to address this functionality explicitly in the 0.8.x series. I don't want to complain, but gensim is currently at 0.8.6. Has there been some work in this direction? If not, I wouldn't mind spending some time on this.
3) Not really limited to term-term similarity, but I noticed that the sparsesvd you use actually outputs u.T instead of u, after which you still have to transpose it. This seems strange. Do you know why this is the case?
> Regarding the term-term similarity, I have the following questions:
> 1) In the example below, why are you using lsi.projection.u.T instead of (lsi.projection.u * lsi.projection.s).T, whereas in the document implementation, scaling seems to be False by default?
Yes, you're right, Deerwester et al. recommended u*s for term-term comparison.
Re. scaled=False: it simplifies processing of doc-doc comparisons. Instead of returning `lsi[doc] = v^-1 = s^-1 * u^-1 * doc` and then doing `lsi[doc].T * s^2 * lsi[doc]` to compare documents, gensim computes `lsi[doc] = s * v^-1 = u^-1 * doc` and `lsi[doc].T * lsi[doc]`. See also https://groups.google.com/d/msg/gensim/1pUz_CIMNIU/7-Fy5czjALsJ
> 2) You say that you want to address this functionality explicitly in the 0.8.x series. I don't want to complain, but gensim is currently at 0.8.6. Has there been some work in this direction? If not, I wouldn't mind spending some time on this.
Sure, would be great!
> 3) Not really limited to term-term similarity, but I noticed that the sparsesvd you use actually outputs u.T instead of u, after which you still have to transpose it. This seems strange. Do you know why this is the case?
gensim doesn't use sparsesvd, what do you mean? And I'd recommend against using sparsesvd, as it relies on SVDLIBC, which has a serious bug (and has had it for many, many years, or decades...). See https://github.com/piskvorky/sparsesvd/issues/3
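To make the scaling argument in the answer to question 1 concrete, here is a minimal numpy sketch. It does not use gensim at all: the toy matrix `X`, the rank `k` and every variable name are made up for illustration, and `U_k.T` plays the role of what the formulas above write as `u^-1`. Under those assumptions it checks two things: that term vectors scaled as u*s reproduce the term-term products of the rank-k approximation, and that unscaled document vectors weighted by s^2 give the same doc-doc products as scaled document vectors with a plain dot product.

```python
import numpy as np

# Toy term-document matrix (hypothetical data): 6 terms x 4 documents.
rng = np.random.default_rng(0)
X = rng.random((6, 4))

# Thin SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of latent dimensions kept (arbitrary choice for the demo)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
X_k = U_k @ np.diag(s_k) @ Vt_k  # rank-k approximation of X

# Term-term: rows of u*s are the term vectors (broadcasting scales column j by s_k[j]).
term_vecs = U_k * s_k
term_gram = term_vecs @ term_vecs.T
terms_match = np.allclose(term_gram, X_k @ X_k.T)

# Doc-doc, unscaled variant: s^-1 * u^T * doc, compared through s^2.
docs_unscaled = (U_k.T @ X) / s_k[:, None]
gram_unscaled = docs_unscaled.T @ np.diag(s_k ** 2) @ docs_unscaled

# Doc-doc, scaled variant: u^T * doc, compared with a plain dot product.
docs_scaled = U_k.T @ X
gram_scaled = docs_scaled.T @ docs_scaled
docs_match = np.allclose(gram_unscaled, gram_scaled)

print(terms_match, docs_match)
```

Both variants carry the same information; the scaled one just moves the singular values into the vectors so that the comparison step needs no extra weighting.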
>> 2) You say that you want to address this functionality explicitly in the 0.8.x series. I don't want to complain, but gensim is currently at 0.8.6. Has there been some work in this direction? If not, I wouldn't mind spending some time on this.
> Sure, would be great!
OK, I'll start by reading https://github.com/piskvorky/gensim/wiki/Developer-page ;-). Should I start a new discussion on the mailing list to discuss the actual implementation?
>> 3) Not really limited to term-term similarity, but I noticed that the sparsesvd you use actually outputs u.T instead of u, after which you still have to transpose it. This seems strange. Do you know why this is the case?
> gensim doesn't use sparsesvd, what do you mean? And I'd recommend against using sparsesvd, as it relies on SVDLIBC, which has a serious bug (and has had it for many, many years, or decades...). See https://github.com/piskvorky/sparsesvd/issues/3
Hmm, I'm running 0.8.5 and it has `ut, s, vt = sparsesvd.sparsesvd(docs, k + 30)` in lsimodel.py, line 126. This seems unchanged in version 0.8.6, except that it's now line 128.
Joris
Hi Radim,
I was playing with term-term similarity and I noticed that I have a matrix "index", term by term, containing the cosine similarity between the two terms in each cell, generated by:
>>> index = gensim.similarities.MatrixSimilarity(termcorpus)
To query this matrix, I need a list containing the term representation, such as:
>>> query = list(termcorpus)[10]
where "10" is the index of the term, as in the example in the previous email. Then I can get the vector of similarities to the query:
>>> sims = index[query]
Now, imagine that I have a lot of queries to run. For each one, I have to load a list into "query" and then compute the vector "sims". This process, at least in my case, is very time-consuming. So I was wondering whether I can generate a matrix like "index" (term by term) whose rows and columns are indexed by the id of the term in the dictionary, instead of by a list containing the term representation.
> >>> index = gensim.similarities.MatrixSimilarity(termcorpus)
No, that doesn't create any matrix with cosine similarities between terms. It creates a matrix where each term is one row = one vector.
Sure. The syntax `for sims in index:` will go over ALL similarities of the first record (term), then the second, then the third etc. It is optimized, so that's what you're looking for.
If you want this matrix as a numpy 2d array, you can do just `pairwise_sims = numpy.vstack(index)`.
Check out the "special syntax" part in http://radimrehurek.com/gensim/similarities/docsim.html#how-it-works
Hi Radim,
>> >>> index = gensim.similarities.MatrixSimilarity(termcorpus)
> No, that doesn't create any matrix with cosine similarities between terms. It creates a matrix where each term is one row = one vector.
Hmmm, so I didn't get it right. I thought that when I do `print [row for row in index]`, it would print, for each term, the vector of distances to the other terms, with the distance to itself being 1. So what is the meaning of the value in each cell?
> Sure. The syntax `for sims in index:` will go over ALL similarities of the first record (term), then the second, then the third etc. It is optimized, so that's what you're looking for.
> If you want this matrix as a numpy 2d array, you can do just `pairwise_sims = numpy.vstack(index)`.
> Check out the "special syntax" part in http://radimrehurek.com/gensim/similarities/docsim.html#how-it-works
Yeah, I think that `pairwise_sims = numpy.vstack(index)` is what I need. Since I don't have to go through all terms of the matrix, it would be easier to load the whole matrix into memory and access each entry directly instead of iterating over all terms.
My question now is: if each term of the matrix 'index' is one row, why can't I access the row of a term directly by its id? I mean, is there a reason why I cannot access the row of the term with id=10 just by doing index[10]?
No, that is correct. That's exactly the `for sims in index:` syntax I pointed out in my previous email.
I think there's confusion between "distances" (=computed on the fly, using the `index[query]` or `for sims in index:` syntax) and input vectors (=stored in RAM with MatrixSimilarity, as a 2d numpy matrix, inside `index.index`).
It's only a syntactic problem. The `index[something]` syntax is already reserved for queries: `sims = index[query]`. If you want the vector associated with document #10, you'd do `row = index.index[10]` (note the double "index"). And then you can use that row as query: `sims = index[row]`.
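The difference between the stored vectors and the similarities computed from them can be sketched with plain numpy. This is only an illustration, not gensim's code: the toy `term_vectors` array and the names `stored`, `row`, `sims` and `pairwise_sims` are hypothetical stand-ins, and the sketch assumes the stored rows are normalized to unit length so that a dot product equals the cosine similarity.

```python
import numpy as np

# Hypothetical stand-in for `termcorpus`: 4 terms, each a dense 3-dimensional vector.
term_vectors = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
])

# Stand-in for `index.index`: the input vectors, one row per term,
# normalized to unit length (an assumption of this sketch).
stored = term_vectors / np.linalg.norm(term_vectors, axis=1, keepdims=True)

# Analogous to `row = index.index[1]`: this is an input vector, not a similarity.
row = stored[1]

# Analogous to `sims = index[row]`: cosine similarity of term 1 to every term.
sims = stored @ row

# Analogous to `numpy.vstack(index)`: all pairwise similarities at once.
pairwise_sims = stored @ stored.T

print(sims)
print(pairwise_sims.shape)
```

Once `pairwise_sims` is in memory, the similarities for the term with id 10 would simply be row 10 of that array, which is the direct lookup by id asked about above. The cost is a dense terms-by-terms array, so this only works when the vocabulary is small enough to fit in RAM.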