How to retrieve a document from a corpus by its ID?

boardw...@gmail.com

Jan 10, 2017, 9:42:22 AM

to gensim

Hi,

I am a new gensim user and am trying the LDA tutorial with French Wikipedia data (and, after that, the similarity tutorial).

Thank you very much for gensim and all the material available!

I have built and loaded the LDA model, as well as the similarity index.

To evaluate the results, I am now trying to print the documents most similar to a sentence (a new doc) that I have entered.

I manage to get the most similar docs as 2-tuples (document ID, score), but not the Wikipedia documents themselves.
My goal is to get the original document (before bag-of-words processing).

The best I have managed is to print the words of each similar doc.
So, what should I change in my code (see the red part at the end)?

I have not seen a method like Id2doc (or something similar) that retrieves a doc from its ID.

Here is an extract of my code:

import logging, gensim, bz2file
import os, csv, codecs
from pprint import pprint


FoundFile=True
PathCorpus='/home//wikipedia/'

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# first, the list of users must be loaded...
File=raw_input("enter the name of the file containing the LDA model produced by gensim: \n")
print(File)
fname=PathCorpus+'/'+File
if os.path.isfile(fname):
    print("the file exists")
else:
    print("the file does not exist")
    FoundFile=False


# load id->word mapping (the dictionary), one of the results of step 2 above
id2word = gensim.corpora.Dictionary.load_from_text(bz2file.BZ2File(PathCorpus+'_wordids.txt.bz2'))
print('idwords=%s'%id2word)
# load corpus iterator
mm = gensim.corpora.MmCorpus(PathCorpus+'_tfidf.mm')
print('mm=%s'%mm)

if (FoundFile):
    # load the model
    lda=gensim.models.LdaModel.load(fname)
    lda.print_topics()
    # test on a sample text
    TrainAMEA1=['Avoir pour la zone MEA une comprehension globale du domaine de ses enjeux des architectures existantes des evolutions court/moyen terme et des architectures cibles']
    FormTrainAMEA1='zone MEA comprehension globale domaine enjeux architectures evolutions court/moyen terme'
    pprint(FormTrainAMEA1)

    new_vec = id2word.doc2bow(FormTrainAMEA1.lower().split())
    print(new_vec) # the word "court/moyen" does not appear in the dictionary and is ignored
    my_projection=lda[new_vec]
    print(lda.print_topic(my_projection[0][0]))
    print(lda.print_topic(my_projection[1][0]))
    print(lda[new_vec])

    # load the similarity index file
    index=gensim.similarities.Similarity.load(PathCorpus+'frwiki_lda_index')

    print("done")
    sims = index[my_projection]
    print(list(enumerate(sims)))
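    # assumes the index returns (doc_id, score) pairs, as described above
    for doc_id, score in sims:
        # mm holds only the bag-of-words vector of each document
        print("*****************************")
        for word_id, weight in mm[doc_id]:
            print(id2word[word_id])

    print("finished")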

boardw...@gmail.com

Jan 16, 2017, 10:00:40 AM

to gensim

In addition, this is exactly the issue raised here on Stack Overflow: http://stackoverflow.com/questions/28488714/getting-string-version-of-document-by-id-in-gensim

BR.

Lev Konstantinovskiy

Jan 19, 2017, 7:49:44 PM

to gensim

Hi,

Unfortunately, the full raw text of a document is not stored in an MmCorpus. MmCorpus is a matrix format that stores only the bag-of-words representations of documents. As the StackOverflow post suggests, the mapping from document ID to original document is left to the user.
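
One way to keep such a mapping is to write a side file while the corpus is being built, with one record per document in the same order as the corpus, so that the line position matches the document ID. The sketch below is only an illustration of that idea in plain Python 3: the helper names (save_doc_store, load_doc_store, top_documents), the JSON-lines file and the sample documents and scores are all assumptions made for this example, and none of them are part of gensim.

```python
import json
import os
import tempfile


# Hypothetical helpers for illustration; they are not gensim functions.
def save_doc_store(raw_docs, path):
    """Write one JSON record per line, in the order the corpus was serialized."""
    with open(path, "w", encoding="utf-8") as fh:
        for title, text in raw_docs:
            record = {"title": title, "text": text}
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_doc_store(path):
    """Return a list in which index i holds the raw record of corpus document i."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh]


def top_documents(sims, store, topn=3):
    """Map (doc_id, score) pairs back to the stored titles, best score first."""
    ranked = sorted(sims, key=lambda pair: pair[1], reverse=True)[:topn]
    return [(doc_id, score, store[doc_id]["title"]) for doc_id, score in ranked]


# Made-up documents standing in for the Wikipedia articles.
raw_docs = [
    ("Paris", "Paris est la capitale de la France."),
    ("Lyon", "Lyon est une ville du sud-est de la France."),
    ("Architecture", "L'architecture est l'art de concevoir des espaces."),
]
store_path = os.path.join(tempfile.mkdtemp(), "doc_store.jsonl")
save_doc_store(raw_docs, store_path)
store = load_doc_store(store_path)

# Made-up (doc_id, score) pairs in the shape described in the question.
sims = [(0, 0.12), (2, 0.87), (1, 0.45)]
for doc_id, score, title in top_documents(sims, store, topn=2):
    print(doc_id, score, title)
```

This only works if the side file is written in exactly the same pass and order as the corpus, with no documents skipped on one side only. For a corpus the size of Wikipedia, loading every text into a list may use too much memory, so storing only titles, or using a small database keyed by document ID, may be more practical.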