Hello Radim and bluish green,
I looked it up and you are right, Radim: the feature spaces are slightly different. Not by much, though; in all, 367 terms differ. Still, such poor retrieval results warrant more investigation, so I tried three variations of the basic gensim retrieval pipeline. Although they should in theory give the same rankings, all three give different ones.
This is how I use gensim for retrieval: after training an LSA model, the query is mapped into the LSA space and its similarity to each document is computed. The similarities serve as scores, which are then used to rank the documents. The ranks of the relevant documents give an idea of how good the model and/or the similarity function is. The elements of the tables below are the ranks of the "relevant" documents for the query; the lower the value, the better the algorithm.
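For concreteness, this is roughly how I turn scores into ranks (argsort of the negated scores). The function and variable names here are illustrative, not from my actual script:

```python
import numpy as np

def ranks_of_relevant(scores, relevant_ids):
    """Rank documents by descending score (rank 1 = best match)
    and return the ranks of the relevant documents."""
    order = np.argsort(-scores)                  # document ids, best score first
    rank = np.empty_like(order)
    rank[order] = np.arange(1, len(scores) + 1)  # rank held by each document id
    return rank[relevant_ids]

scores = np.array([0.1, 0.9, 0.4, 0.7])   # toy similarity scores
print(ranks_of_relevant(scores, [1, 3]))  # ranks of toy "relevant" docs 1 and 3
```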
I tried playing around with the way I compute the scores of the documents vis-a-vis the query; it turns out there are several ways. TextCorpus.nm is the Matrix Market file for the corpus, QueryFile is the location of the query file, and K is the number of topics.
a) The most basic option.
import gensim
from gensim import similarities

myC = gensim.corpora.MmCorpus(resultsdir + 'TextCorpus.nm')  # read the corpus
Queries = open(QueryFile).readlines()  # read the query, a single document
Qtext = [q.lower().split() for q in Queries]
# myCorpus is the original TextCorpus built earlier; its dictionary maps words to ids
myQueries = [myCorpus.dictionary.doc2bow(text) for text in Qtext]
gensim.models.lsimodel.P2_EXTRA_ITERS = 4
lsi = gensim.models.LsiModel(corpus=myC, id2word=myCorpus.dictionary, num_topics=K, distributed=False, power_iters=4)  # train LSI
index = similarities.Similarity(resultsdir + "indexsim", lsi[myC], K)  # build the similarity index
scores = index[lsi[myQueries]]  # compute scores
b) Variation 1: compute the similarity without using the index.
lsi_maps_corpus = gensim.matutils.corpus2dense(lsi[myC], K).T  # dense version of the mapped corpus (docs x K)
lsi_maps_q = gensim.matutils.corpus2dense(lsi[myQueries], K).T  # dense version of the mapped query
dotproduct = numpy.dot(lsi_maps_corpus, lsi_maps_q.T)  # dot product of each document with the query
denom = numpy.diag(numpy.dot(lsi_maps_corpus, lsi_maps_corpus.T))  # squared norms of the documents, for normalization
scores = numpy.divide(dotproduct.T, denom).T  # normalize
scores[numpy.where(denom == 0)] = 0  # take care of NaNs from zero-norm documents
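For reference, my understanding is that gensim's Similarity index computes cosine similarity, i.e. it normalizes both the document and the query vectors to unit length. A minimal NumPy sketch of that computation (with made-up toy data, not my corpus):

```python
import numpy as np

def cosine_scores(docs, query):
    """Cosine similarity of each row of `docs` against `query`."""
    doc_norms = np.linalg.norm(docs, axis=1)  # document lengths
    q_norm = np.linalg.norm(query)            # query length
    scores = docs.dot(query) / (doc_norms * q_norm)
    return np.where(doc_norms * q_norm == 0, 0.0, scores)  # guard zero norms

docs = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])  # toy mapped corpus (docs x K)
query = np.array([1.0, 0.0])                           # toy mapped query
print(cosine_scores(docs, query))
```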
c) Variation 2: compute the mapping and the similarity yourself, using the U and s from gensim's computed LSI model.
U = lsi.projection.u  # gensim's computed U (terms x K)
s = lsi.projection.s  # gensim's computed singular values
# dense is the dense term-document matrix of the corpus; q is the dense term vector of the query
V = numpy.dot(numpy.dot(numpy.linalg.inv(numpy.diag(s)), U.T), dense)  # my own mapping of the documents (K x docs)
q_lsi = numpy.dot(numpy.dot(numpy.linalg.inv(numpy.diag(s)), U.T), q)  # my own mapping of the query
dotproduct = numpy.dot(V.T, q_lsi)  # dot product of each document with the query
denom = numpy.diag(numpy.dot(V.T, V))  # squared norms of the documents, for normalization
scores = numpy.divide(dotproduct.T, denom).T
scores[numpy.where(denom == 0)] = 0  # get rid of NaNs
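One thing I still need to double-check is whether lsi[...] applies the same Sigma^-1 factor as my manual mapping of V, because a per-dimension scaling like that is not cosine-invariant. A self-contained toy sketch (random data, not my corpus) showing that two mappings differing only by such a scaling produce different similarity scores and so can rank documents differently:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(5, 3)               # toy mapped documents (docs x K)
q = rng.rand(3)                  # toy mapped query
w = np.array([1.0, 10.0, 0.1])   # a per-dimension scaling, standing in for 1/s

def cos(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

plain = np.array([cos(x, q) for x in X])           # cosine in the unscaled space
scaled = np.array([cos(x * w, q * w) for x in X])  # cosine after scaling each dimension

# the two score vectors differ, so in general they rank the documents differently
print(np.argsort(-plain), np.argsort(-scaled))
```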
I have attached a zip file to this post containing the text corpus, the query file, the dictionary, and the script to run the retrieval example.
Basic option: ranks of relevant documents
[ 60. 139. 380.]
Variation 1: ranks of relevant documents
[ 384. 199. 430.]
Variation 2: ranks of relevant documents
[ 72. 31. 94.]
I am not able to figure out why essentially the same pipeline gives three different answers.
More analysis will follow
Shivani