Hi,
I am a new gensim user and I am trying the LDA tutorial with French Wikipedia data (and after that the similarity tutorial).
Thank you very much for gensim and all the material available!
I have built and loaded the LDA model, and also the similarity index.
Now, in order to assess the results, I am trying to print the most similar documents for a sentence (a new doc) that I have entered.
I manage to get the most similar docs in the form of 2-tuples (document id, score), but not the Wikipedia documents themselves.
My goal is to get the original document (before the bag-of-words processing).
The best I have managed to do is print the words of each similar doc.
So, what should I change in my code (see the final loop at the end)?
Indeed, I have not seen a method like Id2doc (or something like that) which would allow me to retrieve a doc from its id.
Here is an extract of my code:
import logging, gensim, bz2file
import os, csv, codecs
from pprint import pprint
FoundFile=True
PathCorpus='/home//wikipedia/'
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# first we need to load the list of users...
File=raw_input("enter the name of the file containing the LDA model produced by gensim : \n")
print(File)
fname=PathCorpus+'/'+File
if os.path.isfile(fname):
    print("the file exists")
else:
    print("the file does not exist")
    FoundFile=False
# load id->word mapping (the dictionary), one of the results of step 2 above
id2word = gensim.corpora.Dictionary.load_from_text(bz2file.BZ2File(PathCorpus+'_wordids.txt.bz2'))
print('idwords=%s'%id2word)
# load corpus iterator
mm = gensim.corpora.MmCorpus(PathCorpus+'_tfidf.mm')
print('mm=%s'%mm)
if (FoundFile):
    # load the model
    lda=gensim.models.LdaModel.load(fname)
    lda.print_topics()
    # test on a text
    TrainAMEA1=['Avoir pour la zone MEA une comprehension globale du domaine de ses enjeux des architectures existantes des evolutions court/moyen terme et des architectures cibles']
    FormTrainAMEA1='zone MEA comprehension globale domaine enjeux architectures evolutions court/moyen terme'
    pprint(FormTrainAMEA1)
    new_vec = id2word.doc2bow(FormTrainAMEA1.lower().split())
    print(new_vec) # the word "court/moyen" does not appear in the dictionary and is ignored
    my_projection=lda[new_vec]
    print(lda.print_topic(my_projection[0][0]))
    print(lda.print_topic(my_projection[1][0]))
    print(lda[new_vec])
    # load the similarity index file
    index=gensim.similarities.Similarity.load(PathCorpus+'frwiki_lda_index')
    print("done")
    sims = index[my_projection]
    print(list(enumerate(sims)))
    # for each similar doc, print the words of its bag-of-words vector
    for i in range(len(sims)):
        mylist=mm[sims[i][0]]
        print("*****************************")
        for j in range(len(mylist)):
            print(id2word[mylist[j][0]])
print("finished")
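To make the question more concrete, here is a minimal sketch of the kind of lookup I have in mind. It is only an illustration under assumptions: the helper name id2doc and the list original_docs are hypothetical (they are not part of gensim), and it assumes that the raw texts are kept separately in a list whose order matches the document numbers of the corpus, and that the similarity results are (document id, score) pairs.

```python
# Hypothetical helper, NOT a gensim function: it assumes the original texts
# are stored separately, in the same order as the documents in the corpus.

def id2doc(similar, original_docs, top_n=3):
    """Return (doc_id, score, text) for the best-scoring documents.

    similar       -- list of (doc_id, score) pairs (assumed result format)
    original_docs -- list of raw texts, where position k is document k
    """
    # Sort by score, highest first, then look each id up in the text list.
    ranked = sorted(similar, key=lambda pair: pair[1], reverse=True)
    return [(doc_id, score, original_docs[doc_id])
            for doc_id, score in ranked[:top_n]]


# Toy data standing in for the Wikipedia articles (assumption).
original_docs = [
    "Paris est la capitale de la France.",
    "Le reseau est organise en plusieurs couches.",
    "L'architecture cible du systeme est decrite ici.",
]
similar = [(0, 0.12), (2, 0.91), (1, 0.47)]

results = id2doc(similar, original_docs, top_n=2)
for doc_id, score, text in results:
    print("%d  %.2f  %s" % (doc_id, score, text))
```

With the real data, original_docs would have to be replaced by something that maps a document number to the Wikipedia article (or at least its title), which is exactly the part I do not know how to obtain.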