word-word similarities


Radim

Feb 10, 2011, 3:08:45 PM
to gensim
Currently gensim has no explicit word-word similarity support, only
document-document. Word-word similarities are probably the single most
common request in gensim at the moment, so they are at the top of the
to-do list. In the meantime, I'll post an example here.

Note that I want to address this functionality explicitly in the 0.8.x
series, so the code below is just a "hack" for the current 0.7.x
version.

Continuing from the tutorial at http://nlp.fi.muni.cz/projekty/gensim/tut2.html:

First, create a corpus where each sparse vector represents a term.
This is perhaps confusing: when you iterate over standard corpora,
each vector represents a document in some space. Here, each vector
will represent a term in some space. For LSA, this is easy: the
(numTerms x numTopics) matrix of left singular vectors `U` stores
exactly these term representations:

>>> termcorpus = gensim.matutils.Dense2Corpus(lsi.projection.u.T)
>>> print(list(termcorpus))
[[(0, 0.049408586658972846), (1, -0.2928797224395232)],
[(0, 0.029696164515475736), (1, -0.28040380186841701)],
[(0, 0.035224170943406664), (1, -0.327504711863077)],
[(0, 0.059512387926986592), (1, -0.320496100456862)],
[(0, 0.18693110133806212), (1, -0.17065511337612463)],
[(0, 0.061357233545267967), (1, -0.46024665781206781)],
[(0, 0.059512387926986605), (1, -0.320496100456862)],
[(0, 0.058237243948987057), (1, -0.37268380161520187)],
[(0, 0.034908973567213772), (1, -0.33236749732611731)],
[(0, 0.70321089393783109), (1, 0.16115180214025807)],
[(0, 0.53773147948920075), (1, 0.075854931645231147)],
[(0, 0.40171367087101967), (1, 0.029409895802591044)]]

(There are 12 vectors, one per term, each with 2 numbers.)
From now on, treat it as a normal corpus, i.e. perform standard
queries as in the tutorial:

>>> # create a helper structure for similarity queries
>>> # (this also normalizes each corpus vector to unit length)
>>> index = gensim.similarities.MatrixSimilarity(termcorpus)

>>> def printsims(query):
...     # get cosine similarity of the query to each one of the 12 terms
...     sims = index[query]
...     # print the result, converting ids (integers) to words (strings)
...     fmt = ["%s(%f)" % (dictionary.id2word[idother], sim) for idother, sim in enumerate(sims)]
...     print("the query is similar to " + ', '.join(fmt))
...

>>> # let's use the first term ("computer") as the query
>>> query = list(termcorpus)[0]
>>> printsims(query)
the query is similar to computer(1.000000), human(0.998102),
interface(0.998201), response(0.999864), survey(0.787683),
system(0.999402), time(0.999864), user(0.999927), eps(0.998049),
trees(-0.058117), graph(0.026982), minors(0.093906)

>>> # now the same thing with "graph" (term id 10)
>>> query = list(termcorpus)[10]
>>> printsims(query)
the query is similar to computer(0.026982), human(-0.034621),
interface(-0.032993), response(0.043444), survey(0.637110),
system(-0.007608), time(0.043444), user(0.014870), eps(-0.035485),
trees(0.996378), graph(1.000000), minors(0.997752)
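
For readers without the 0.7.x setup, the numbers above can be approximately reproduced with plain NumPy. This is only a sketch: the term vectors are copied from the `termcorpus` listing, the term order is taken from the printed output, and `cosine_to_all` is a hypothetical helper written for illustration, not a gensim function.

```python
import numpy as np

# Term order as printed by printsims() above (term id 0 = "computer").
terms = ["computer", "human", "interface", "response", "survey", "system",
         "time", "user", "eps", "trees", "graph", "minors"]

# The 12 x 2 term vectors copied from the termcorpus listing above.
term_vectors = np.array([
    [0.049408586658972846, -0.2928797224395232],
    [0.029696164515475736, -0.28040380186841701],
    [0.035224170943406664, -0.327504711863077],
    [0.059512387926986592, -0.320496100456862],
    [0.18693110133806212, -0.17065511337612463],
    [0.061357233545267967, -0.46024665781206781],
    [0.059512387926986605, -0.320496100456862],
    [0.058237243948987057, -0.37268380161520187],
    [0.034908973567213772, -0.33236749732611731],
    [0.70321089393783109, 0.16115180214025807],
    [0.53773147948920075, 0.075854931645231147],
    [0.40171367087101967, 0.029409895802591044],
])


def cosine_to_all(vectors, query):
    """Hypothetical helper: cosine similarity of `query` to every row of `vectors`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ (query / np.linalg.norm(query))


# Same query as above: the term "graph" (term id 10).
sims = cosine_to_all(term_vectors, term_vectors[terms.index("graph")])
print(", ".join("%s(%f)" % (term, sim) for term, sim in zip(terms, sims)))
```

Small differences in the last digits are possible, since this sketch works in double precision.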

_________

So in the toy Deerwester example space, the terms "computer", "human",
"interface", ..., "eps" are nearly synonymous with each other. Same
for the terms "trees", "graph", "minors". The only ambivalent word is
"survey", which is somewhat similar to both of these clusters.

Now that's the technical side of things; but how useful this will be
on non-toy corpora in practice, I'm not sure. The same trick (making a
corpus of terms instead of documents) can be used with the other
transformation algorithms too, not just LSA.

Radim Řehůřek

Jan 12, 2013, 10:54:57 AM
to gensim
Hello Nicole,

instead of `dictionary.id2word[id]`, you would now do just
`dictionary[id]`.

Gensim has been simplified since the post above (which was 2 years
ago).

HTH,
Radim


On Jan 11, 3:49 pm, Nicole <nicole.pan1...@gmail.com> wrote:
> I have been trying to run this code, but an error comes up and says that
> Dictionary has no attribute id2word. I also haven't found the function
> id2word anywhere.. Do you know what I am doing wrong?

Nicole

Jan 17, 2013, 2:20:46 PM
to gen...@googlegroups.com
Hello Radim,

Thank you so much!

-Nicole

James Ryan

Mar 25, 2013, 10:26:44 PM
to gen...@googlegroups.com
Hi, Radim,

How could one go about computing word-word similarity using Random Projection with gensim?

Thanks!

James

niefpaarschoenen

Aug 21, 2013, 6:48:10 AM
to gen...@googlegroups.com
Hi Radim,

Sorry for replying to this ancient post, but it seemed like a good idea to keep topical things together.

Regarding the term-term similarity, I have the following questions:
1) In the below example, why are you using lsi.projection.u.T instead of (lsi.projection.u * lsi.projection.s).T, whereas in the document implementation, scaling seems to be False by default?
2) You say that you want to address this functionality explicitly in the 0.8.x series. I don't want to complain, but gensim is currently at 0.8.6, has there been some work in this direction? If not, I wouldn't mind spending some time on this.
3) Not really limited to term-term similarity, but I noticed that the sparsesvd you use, actually outputs u.T instead of u, after which you still have to transpose it. This seems strange, do you know why this is the case?

Greetings,

Joris

On Thursday, February 10, 2011 at 21:08:45 UTC+1, Radim wrote:

Radim Řehůřek

Aug 22, 2013, 6:22:54 AM
to gen...@googlegroups.com
Hello Joris!


On Wednesday, August 21, 2013 12:48:10 PM UTC+2, niefpaarschoenen wrote:
Hi Radim,

Sorry for replying to this ancient post, but it seemed like a good idea to keep topical things together.

Regarding the term-term similarity, I have the following questions:
1) In the below example, why are you using lsi.projection.u.T instead of (lsi.projection.u * lsi.projection.s).T, whereas in the document implementation, scaling seems to be False by default?

Yes, you're right, Deerwester et al. recommended u*s for term-term comparison.

Re. scaled=False: it simplifies processing of doc-doc comparisons. Instead of returning `lsi[doc] = v^-1 = s^-1 * u^-1 * doc` and then computing `lsi[doc].T * s^2 * lsi[doc]` to compare documents, gensim computes `lsi[doc] = s * v^-1 = u^-1 * doc` and then `lsi[doc].T * lsi[doc]`. See also https://groups.google.com/d/msg/gensim/1pUz_CIMNIU/7-Fy5czjALsJ
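
The doc-doc identity above can be checked numerically. The following is a minimal NumPy sketch, not gensim code: it assumes `^-1` in the formulas is shorthand for the transpose (the columns of `u` and `v` are orthonormal), and it uses a small random matrix `A` as a hypothetical term-document matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))            # hypothetical matrix: 6 terms x 4 documents
k = 2                             # number of topics kept

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k]

# Unscaled projection of every document: u^T * doc.
docs_projected = U_k.T @ A
# The same thing written as s * v^T.
docs_from_v = s_k[:, None] * Vt_k
print(np.allclose(docs_projected, docs_from_v))

# Plain dot products of the projected documents ...
gram_plain = docs_projected.T @ docs_projected
# ... equal the s^2-weighted products of the v^T representation.
gram_weighted = Vt_k.T @ np.diag(s_k ** 2) @ Vt_k
print(np.allclose(gram_plain, gram_weighted))
```

Both checks hold for any `A`, because `U_k.T @ U` selects the first `k` singular directions exactly.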

 
2) You say that you want to address this functionality explicitly in the 0.8.x series. I don't want to complain, but gensim is currently at 0.8.6, has there been some work in this direction? If not, I wouldn't mind spending some time on this.

Sure, would be great!
 
3) Not really limited to term-term similarity, but I noticed that the sparsesvd you use, actually outputs u.T instead of u, after which you still have to transpose it. This seems strange, do you know why this is the case?

gensim doesn't use sparsesvd, what do you mean? And I'd recommend against using sparsesvd, as it relies on SVDLIBC, which has a serious bug (=has had for many many years or decades...). See https://github.com/piskvorky/sparsesvd/issues/3

Best,
Radim

niefpaarschoenen

Aug 22, 2013, 8:06:07 AM
to gen...@googlegroups.com
Radim,

On Thursday, August 22, 2013 at 12:22:54 UTC+2, Radim Řehůřek wrote:

Regarding the term-term similarity, I have the following questions:
1) In the below example, why are you using lsi.projection.u.T instead of (lsi.projection.u * lsi.projection.s).T, whereas in the document implementation, scaling seems to be False by default?

Yes, you're right, Deerwester et al recommended u*s for term-term comparison.

Re. scaled=False: it simplifies processing of doc-doc comparisons. Instead of returning `lsi[doc] = v^-1 = s^-1 * u^-1 * doc`, and then doing `lsi[doc].T * s^2 * lsi[doc]` to compare documents, gensim computes `lsi[doc] = s * v^-1 = u^-1 * doc` and `lsi[doc].T * lsi[doc]`. See also https://groups.google.com/d/msg/gensim/1pUz_CIMNIU/7-Fy5czjALsJ

Yes, this makes sense and is in fact also why I suggest using lsi[term] = u*s, since this allows the use of the existing cossim Similarity classes for "correct" calculations (u * s^2 * u.T).
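
A small NumPy sketch of the term-term case, under the same assumptions as any toy check: `A` is a hypothetical random term-document matrix and nothing here is gensim code. Scaling the term vectors as `u * s` makes their plain dot products equal `u * s^2 * u.T`, which is the term-term product of the rank-k approximation of `A`.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((6, 4))            # hypothetical matrix: 6 terms x 4 documents
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k]
A_k = U_k @ np.diag(s_k) @ Vt_k   # rank-k approximation of A

# Scaled term vectors: column j of U_k multiplied by s_k[j].
term_vectors = U_k * s_k

# Dot products of scaled term vectors equal u * s^2 * u.T ...
term_gram = term_vectors @ term_vectors.T
print(np.allclose(term_gram, U_k @ np.diag(s_k ** 2) @ U_k.T))
# ... which is the term-term product of the rank-k matrix.
print(np.allclose(term_gram, A_k @ A_k.T))

# Cosine similarities between terms, as a cosine index would compute them.
unit = term_vectors / np.linalg.norm(term_vectors, axis=1, keepdims=True)
term_cosines = unit @ unit.T
```

With unscaled `u`, the cosines generally differ, because each topic dimension then counts equally regardless of its singular value.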
 
2) You say that you want to address this functionality explicitly in the 0.8.x series. I don't want to complain, but gensim is currently at 0.8.6, has there been some work in this direction? If not, I wouldn't mind spending some time on this.

Sure, would be great!

OK, I'll start by reading https://github.com/piskvorky/gensim/wiki/Developer-page ;-). Should I start a new discussion on the mailing list to discuss the actual implementation?
 
 
3) Not really limited to term-term similarity, but I noticed that the sparsesvd you use, actually outputs u.T instead of u, after which you still have to transpose it. This seems strange, do you know why this is the case?

gensim doesn't use sparsesvd, what do you mean? And I'd recommend against using sparsesvd, as it relies on SVDLIBC, which has a serious bug (=has had for many many years or decades...). See https://github.com/piskvorky/sparsesvd/issues/3

Hmm, I'm running 0.8.5 and it has ut, s, vt = sparsesvd.sparsesvd(docs, k + 30) in lsimodel.py, line 126. This seems unchanged in version 0.8.6 except that it's now line 128.

Joris

Radim Řehůřek

Aug 22, 2013, 11:36:34 AM
to gen...@googlegroups.com
Hello,


2) You say that you want to address this functionality explicitly in the 0.8.x series. I don't want to complain, but gensim is currently at 0.8.6, has there been some work in this direction? If not, I wouldn't mind spending some time on this.

Sure, would be great!

OK, I'll start by reading https://github.com/piskvorky/gensim/wiki/Developer-page ;-). Should I start a new discussion on the mailing list to discuss the actual implementation?


Let's use github issues/pull requests for implementation discussions. The mailing list is more for announcements when something's ready and/or requested.
 

3) Not really limited to term-term similarity, but I noticed that the sparsesvd you use, actually outputs u.T instead of u, after which you still have to transpose it. This seems strange, do you know why this is the case?

gensim doesn't use sparsesvd, what do you mean? And I'd recommend against using sparsesvd, as it relies on SVDLIBC, which has a serious bug (=has had for many many years or decades...). See https://github.com/piskvorky/sparsesvd/issues/3

Hmm, I'm running 0.8.5 and it has ut, s, vt = sparsesvd.sparsesvd(docs, k + 30) in lsimodel.py, line 126. This seems unchanged in version 0.8.6 except that it's now line 128.


That code is conditional, under a branch that amounts to `if False: `, so it's never executed :)
It was used historically when comparing SVD accuracy between different implementations.

Best,
Radim



Joris

Roger Leitzke

Oct 7, 2013, 9:17:13 PM
to gen...@googlegroups.com
Hi Radim,

I was playing with term-term similarity and I noticed that I have a matrix "index" containing term by term with the cosine similarity between both terms in each cell, generated by:

>>> index = gensim.similarities.MatrixSimilarity(termcorpus) 

To query this matrix, I should get a list containing the term representation, such as:

>>> query = list(termcorpus)[10]

where "10" is the index of the term, as the example in previous email. Thus, I can get the similarities vector to the query:

>>> sims = index[query]

Now, imagine that I have a lot of queries to run. For each one, I have to load a list into "query" and then load the vector "sims". This process, at least in my case, is very time-consuming. So I was wondering whether I can generate a matrix like "index" (term by term) whose rows and columns are indexed by the id of the term in the dictionary, instead of querying with a list containing the term representation.

Thanks in advance,
Roger



Radim Řehůřek

Oct 8, 2013, 12:13:41 PM
to gen...@googlegroups.com
Hello Roger,


On Tuesday, October 8, 2013 3:17:13 AM UTC+2, Roger Leitzke wrote:
Hi Radim,

I was playing with term-term similarity and I noticed that I have a matrix "index" containing term by term with the cosine similarity between both terms in each cell, generated by:

>>> index = gensim.similarities.MatrixSimilarity(termcorpus) 

No, that doesn't create any matrix with cosine similarities between terms. It creates a matrix where each term is one row = one vector.
 

To query this matrix, I should get a list containing the term representation, such as:

>>> query = list(termcorpus)[10]

Don't do this; it will consume a lot of RAM and be very slow.
 

where "10" is the index of the term, as the example in previous email. Thus, I can get the similarities vector to the query:

>>> sims = index[query]

Now, Imagine that I have a lot of queries to do. So, I have to load a list in "query" and then load the vector "sims". This process, at least in my case, is very time consuming. Thus, I was wondering if I can generate a matrix like "index" (term by term) but containing the index of each row and column as the id of the term in the dictionary instead of a list containing the term representation.

Sure. The syntax `for sims in index:` will go over ALL similarities of the first record (term), then second, then third etc. It is optimized, so that's what you're looking for.

If you want this matrix as numpy 2d array, you can do just `pairwise_sims = numpy.vstack(index)`
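
To illustrate the idea of iterating over all similarities and stacking them, here is a NumPy-only sketch. `iter_similarities` is a hypothetical generator standing in for `for sims in index:`; it is not how gensim implements the iteration, and the random `vectors` are placeholder data.

```python
import numpy as np


def iter_similarities(unit_vectors, chunk_size=4):
    """Hypothetical stand-in: yield one row of cosine similarities per stored vector."""
    for start in range(0, len(unit_vectors), chunk_size):
        chunk = unit_vectors[start:start + chunk_size]
        # One matrix product per chunk instead of one product per query.
        for row in chunk @ unit_vectors.T:
            yield row


rng = np.random.default_rng(42)
vectors = rng.normal(size=(12, 2))   # placeholder for 12 term vectors
unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Stack the per-term similarity rows into one dense 12 x 12 array.
pairwise_sims = np.vstack(list(iter_similarities(unit_vectors)))
print(pairwise_sims.shape)
```

The dense result needs memory quadratic in the number of terms, so for a large vocabulary it may be preferable to consume the rows one at a time.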


Let me know if there's anything unclear. Best,
Radim

Roger Leitzke

Oct 8, 2013, 9:34:21 PM
to gen...@googlegroups.com
Hi Radim,


>>> index = gensim.similarities.MatrixSimilarity(termcorpus) 

no, that doesn't create any matrix with cosine similarities between terms. It creates a matrix where each term is one row = one vector.

Hmmm, so I didn't get it right. I thought that when I do `print [row for row in index]`, it would print the vector of each term with the distances to the other terms, with the distance to itself as 1. So what is the meaning of the value in each cell?

 
Sure. The syntax `for sims in index:` will go over ALL similarities of the first record (term), then second, then third etc. It is optimized, so that's what you're looking for.

If you want this matrix as numpy 2d array, you can do just `pairwise_sims = numpy.vstack(index)`



Yeah, I think that `pairwise_sims = numpy.vstack(index)` is what I need. As I don't have to pass through all terms of the matrix, it would be easier to load the whole matrix in memory and access each row directly instead of iterating over all terms.

My question now is if each term of the matrix 'index' is one row, why can't I access directly the row of the term by its id? I mean, is there a reason why I cannot access the row of the term with id=10 just doing index[10]?

Thanks for helping me again! =)
Best,
Roger


Radim Řehůřek

Oct 9, 2013, 4:22:28 AM
to gen...@googlegroups.com


On Wednesday, October 9, 2013 3:34:21 AM UTC+2, Roger Leitzke wrote:
Hi Radim,


>>> index = gensim.similarities.MatrixSimilarity(termcorpus) 

no, that doesn't create any matrix with cosine similarities between terms. It creates a matrix where each term is one row = one vector.

Hmmm, so I didn't get it right. I thought that when I do `print [row for row in index]`, it would print the vector of each term with the distances to the other terms, with the distance to itself as 1. So what is the meaning of the value in each cell?

No, that is correct. That's exactly the `for sims in index:` syntax I pointed out in my previous email.
 
I think there's confusion between "distances" (=computed on the fly, using `index[query]` or `for sims in index:` syntax) and input vectors (=stored in RAM with MatrixSimilarity, as 2d numpy matrix, inside `index.index`).


 
Sure. The syntax `for sims in index:` will go over ALL similarities of the first record (term), then second, then third etc. It is optimized, so that's what you're looking for.

If you want this matrix as numpy 2d array, you can do just `pairwise_sims = numpy.vstack(index)`



Yeah, I think that `pairwise_sims = numpy.vstack(index)` is what I need. As I don't have to pass through all terms of the matrix, it would be easier to load the whole matrix in memory and access each index directly instead of iterate over all terms.

My question now is if each term of the matrix 'index' is one row, why can't I access directly the row of the term by its id? I mean, is there a reason why I cannot access the row of the term with id=10 just doing index[10]?


It's only a syntactic problem. The `index[something]` syntax is already reserved for queries: `sims = index[query]`. If you want the vector associated with document #10, you'd do `row = index.index[10]` (note the double "index"). And then you can use that row as query: `sims = index[row]`.
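
The distinction between stored vectors and on-the-fly similarities can be sketched with a toy class. `TinyIndex` below is a hypothetical stand-in written for illustration only; it mimics the two access patterns described here (`index.index[i]` for a stored row, `index[query]` for similarities) and makes no claim about gensim's internals.

```python
import numpy as np


class TinyIndex:
    """Hypothetical dense cosine index; an illustration, not gensim's code."""

    def __init__(self, vectors):
        vectors = np.asarray(vectors, dtype=float)
        # Stored input vectors, normalized to unit length.
        self.index = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

    def __getitem__(self, query):
        # Similarities are computed on the fly for each query.
        query = np.asarray(query, dtype=float)
        return self.index @ (query / np.linalg.norm(query))


index = TinyIndex([[1.0, 0.0], [0.6, 0.8], [0.0, 2.0]])

row = index.index[1]   # the stored vector for record #1
sims = index[row]      # similarities of record #1 to all records
print(sims)
```

Reading a row from `index.index` avoids materializing the whole corpus as a Python list just to pick one vector.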

Hope that helps :)
Radim

Roger Leitzke

Oct 9, 2013, 5:15:00 AM
to gen...@googlegroups.com
Hi Radim,

No, that is correct. That's exactly the `for sims in index:` syntax I pointed out in my previous email.
 
I think there's confusion between "distances" (=computed on the fly, using `index[query]` or `for sims in index:` syntax) and input vectors (=stored in RAM with MatrixSimilarity, as 2d numpy matrix, inside `index.index`).

Ok, I got it! :-)
 

It's only a syntactic problem. The `index[something]` syntax is already reserved for queries: `sims = index[query]`. If you want the vector associated with document #10, you'd do `row = index.index[10]` (note the double "index"). And then you can use that row as query: `sims = index[row]`.


Perfect! It's even better because I don't have to load the whole matrix in memory, thus accessing each row directly from the disk.


Thanks again!
Best,
Roger