Hi,
I am trying to create a corpus from Wikipedia and then extract some topics from it using LSA and/or LDA. I have downloaded a Wikipedia xml.bz2 dump and used your code here:
wiki = WikiCorpus(path_to_wiki_dump) # create word->word_id mapping, ~8h on full wiki
MmCorpus.serialize(corpus_path, wiki) # another 8h, creates a file in MatrixMarket format and mapping
It worked perfectly and created the wiki-corpus.mm and wiki-corpus.mm.index files. However, I can't see how to use the .mm.index file, which I assume is the mapping. I need an id2word variable for the next stage, to replace this line:
id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
Can I use the wiki-corpus.mm.index file for this? If so, I can't find out how.
Thank you for any guidance,
Best wishes,
Sarah
PS This is for my MSc project.