wiki corpus

Sarah Bloomfield

Jan 7, 2022, 5:01:38 AM
to gen...@googlegroups.com
Hi,
I am trying to create a corpus from Wikipedia and then build some topics using LSA and/or LDA. I have downloaded a Wikipedia xml.bz2 file and used your code here:
wiki = WikiCorpus(path_to_wiki_dump) # create word->word_id mapping, ~8h on full wiki
MmCorpus.serialize(corpus_path, wiki) # another 8h, creates a file in MatrixMarket format and mapping
It worked perfectly and created the wiki-corpus.mm file and wiki-corpus.mm.index file. 
I can query the wiki-corpus.mm file using: 
path = '/users/sarahbloomfield/wiki-corpus.mm'
mm=MmCorpus(path)
However, I can't find how to use the .mm.index file, which must be the mapping. I need an id2word variable for the next stage, to replace this line:
id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
I am assuming I can use the wiki-corpus.mm.index file, but I can't find out how.
Thank you for any guidance, 
Best wishes,
Sarah
PS. I am trying to follow "Experiments on the English Wikipedia" (https://radimrehurek.com/gensim/wiki.html), along with the MmCorpus documentation (https://radimrehurek.com/gensim/corpora/mmcorpus.html#gensim.corpora.mmcorpus.MmCorpus.serialize).
PPS. This is for my MSc project.

Radim Řehůřek

Jan 8, 2022, 8:19:43 AM
to Gensim
Hi Sarah,

The .mm.index file contains an index (the byte offset) of each individual vector in the .mm corpus file.
It is completely unrelated to your dictionary (id2word) mapping.
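
To make the idea of a byte-offset index concrete, here is a minimal standard-library sketch. It is an illustration only and does not reproduce Gensim's actual index file format: the toy file contents and the helper names `build_offset_index` and `read_doc` are hypothetical, and empty documents are not handled. It shows that such an index stores positions in the file, not words.

```python
import os
import tempfile

# Toy corpus in Matrix Market coordinate format (illustrative data only).
# Each entry line is "doc_id term_id weight", with 1-based ids.
MM_TEXT = (
    "%%MatrixMarket matrix coordinate real general\n"
    "3 4 5\n"  # 3 documents, 4 terms, 5 non-zero entries
    "1 1 2.0\n"
    "1 3 1.0\n"
    "2 2 1.0\n"
    "3 1 1.0\n"
    "3 4 3.0\n"
)


def build_offset_index(path):
    """Return the byte offset at which each document's first entry starts."""
    offsets = []
    last_doc = None
    with open(path, "rb") as f:
        f.readline()  # skip the banner line
        f.readline()  # skip the dimensions line
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            doc_id = int(line.split()[0])
            if doc_id != last_doc:  # first entry of a new document
                offsets.append(pos)
                last_doc = doc_id
    return offsets


def read_doc(path, offset):
    """Seek straight to one document and return [(term_id, weight), ...] with 0-based term ids."""
    doc = []
    first_doc = None
    with open(path, "rb") as f:
        f.seek(offset)
        for line in f:
            doc_id, term_id, weight = line.split()
            if first_doc is None:
                first_doc = doc_id
            if doc_id != first_doc:  # reached the next document
                break
            doc.append((int(term_id) - 1, float(weight)))
    return doc


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.mm")
    with open(path, "wb") as f:
        f.write(MM_TEXT.encode("ascii"))
    offsets = build_offset_index(path)       # one offset per document
    third_doc = read_doc(path, offsets[2])   # random access, no full scan

print(len(offsets), third_doc)
```

The offsets allow jumping directly to any document, which is all such an index is for. The id-to-word mapping has to come from a separate dictionary file.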

> However, I can't find how to use the .mm.index file, which must be the mapping. I need an id2word variable for the next stage, to replace this line:
> id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')

See the comment for that line:

# load id->word mapping (the dictionary), one of the results of step 2 above
>>> id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')

The "step 2 above" is a few lines higher: the call to the make_wiki script. If you run the command from step 2, it creates several files, including wiki_en_wordids.txt.

Hope that helps!
Radim

Sarah Bloomfield

Jan 24, 2022, 4:28:10 AM
to gen...@googlegroups.com
Thank you so much for getting back to me so quickly!
Unfortunately, I can still only find the .mm file and the .mm.index file; I can't find any other files, including wiki_en_wordids.txt.
I have tried using the shorter wiki file so space is not an issue, changing folders, and running the whole script here
The two lines below seem simple and run with no errors, so I am not sure what I am doing wrong!
wiki = WikiCorpus(path_to_wiki_dump) # create word->word_id mapping, ~8h on full wiki
MmCorpus.serialize(corpus_path, wiki) # another 8h, creates a file in MatrixMarket format and mapping
Thank you and sorry for bothering you again, 
Sarah
PS. I am using Anaconda/Jupyter.

Radim Řehůřek

Jan 24, 2022, 3:52:44 PM
to Gensim
Hi Sarah,

The line in the make_wiki.py script that creates that id2word.txt file is this one:

I see it compresses the output text file as .bz2 – maybe that's what tripped you up?

You can use either the compressed .txt.bz2 file, or decompress it into .txt. Gensim will accept either.

Hope that helps,
Radim