wiki corpus

Sarah Bloomfield

Jan 7, 2022, 5:01:38 AM

to gen...@googlegroups.com

Hi,

I am trying to create a corpus from Wikipedia and then build some topics from it using LSA and/or LDA. I have downloaded a Wikipedia xml.bz2 dump and used your code here:

wiki = WikiCorpus(path_to_wiki_dump) # create word->word_id mapping, ~8h on full wiki

MmCorpus.serialize(corpus_path, wiki) # another 8h, creates a file in MatrixMarket format and mapping

It worked perfectly and created the wiki-corpus.mm file and wiki-corpus.mm.index file.

I can query the wiki-corpus.mm file using:

path = '/users/sarahbloomfield/wiki-corpus.mm'
mm=MmCorpus(path)

However, I can't find how to use the .mm.index file, which must be the mapping. I need an id2word variable for the next stage, to replace this line:

id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')

I assumed I could use the wiki-corpus.mm.index file for this, but I can't work out how.

Thank you for any guidance,

Best wishes,

Sarah

PS. I am trying to follow "Experiments on the English Wikipedia" (https://radimrehurek.com/gensim/wiki.html) and the MmCorpus documentation (https://radimrehurek.com/gensim/corpora/mmcorpus.html#gensim.corpora.mmcorpus.MmCorpus.serialize).

PPS This is for my MSc Project

Radim Řehůřek

Jan 8, 2022, 8:19:43 AM

to Gensim

Hi Sarah,

the .mm.index file contains an index (byte offset) of each individual vector in the .mm corpus file.

It is completely unrelated to your dictionary (id2word) mapping.
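To make the "byte offset" idea concrete, here is a minimal standard-library sketch. It is an illustration only and does not reproduce gensim's actual file layout: the record strings and the helper names `write_with_offsets` and `read_record` are made up for this example. It shows why such an index allows jumping straight to document i, and why it holds no words at all.

```python
import os
import tempfile


def write_with_offsets(path, records):
    """Write one record per line and return the byte offset where each starts."""
    offsets = []
    with open(path, "wb") as f:
        for record in records:
            offsets.append(f.tell())  # position of this record in the file
            f.write(record.encode("utf-8") + b"\n")
    return offsets


def read_record(path, offsets, i):
    """Jump straight to record i using its stored byte offset."""
    with open(path, "rb") as f:
        f.seek(offsets[i])
        return f.readline().decode("utf-8").rstrip("\n")


# Hypothetical stand-ins for serialized document vectors.
records = [
    "doc0: (0, 1.0) (3, 2.0)",
    "doc1: (1, 1.0)",
    "doc2: (2, 4.0) (3, 1.0)",
]
path = os.path.join(tempfile.mkdtemp(), "toy-corpus.txt")
offsets = write_with_offsets(path, records)

# Random access: no need to scan the first two records.
third = read_record(path, offsets, 2)
```

The offsets list is just a list of integers, one per document, which is why it cannot stand in for an id-to-word mapping.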

> However, I can't find how to use the .mm.index file, which must be the mapping. I need an id2word variable for the next stage, to replace this line:
> id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')

See the comment for that line:

# load id->word mapping (the dictionary), one of the results of step 2 above
>>> id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')

The "step 2 above" is a few lines higher up on that page: the call to the make_wiki script. Running the command from step 2 creates several files, including wiki_en_wordids.txt.

Hope that helps!

Radim

Sarah Bloomfield

Jan 24, 2022, 4:28:10 AM

to gen...@googlegroups.com

Thank you so much for getting back to me so quickly!

Unfortunately, I can still only find the .mm file and the .mm.index file. I can't find any other files, including wiki_en_wordids.txt.

I have tried using a shorter wiki file (so space is not an issue), changing folders, and running the whole script here:

https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/corpora/wikicorpus.py

The two lines below seem simple and run with no errors, so I am not sure what I am doing wrong!

wiki = WikiCorpus(path_to_wiki_dump) # create word->word_id mapping, ~8h on full wiki

MmCorpus.serialize(corpus_path, wiki) # another 8h, creates a file in MatrixMarket format and mapping

Thank you and sorry for bothering you again,

Sarah

PS I am using Anaconda/ Jupyter


Radim Řehůřek

Jan 24, 2022, 3:52:44 PM

to Gensim

Hi Sarah,

The line in the make_wiki.py script that creates that id2word.txt file is this one:

https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/scripts/make_wikicorpus.py#L92

I see it compresses the output text file as .bz2 – maybe that's what tripped you up?

You can use either the compressed .txt.bz2 file, or decompress it into .txt. Gensim will accept either.
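As a rough picture of what "accept either" means, here is a standard-library sketch that opens a text file and decompresses it on the fly when the name ends in .bz2. This is not gensim's own loading code: the helper `open_text` and the file contents are invented for the example.

```python
import bz2
import os
import tempfile


def open_text(path):
    """Open a text file, decompressing on the fly if the name ends in .bz2."""
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


tmp = tempfile.mkdtemp()
plain_path = os.path.join(tmp, "toy_wordids.txt")  # hypothetical file name
bz2_path = plain_path + ".bz2"
content = "0\thello\t1\n1\tworld\t1\n"  # made-up rows, for illustration only

# Write the same content once as plain text and once bz2-compressed.
with open(plain_path, "w", encoding="utf-8") as f:
    f.write(content)
with bz2.open(bz2_path, "wt", encoding="utf-8") as f:
    f.write(content)

# Both paths go through the same helper and yield the same text.
with open_text(plain_path) as f:
    from_plain = f.read()
with open_text(bz2_path) as f:
    from_bz2 = f.read()
```

In practice you only need to pass whichever file name you have to `load_from_text`.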

Hope that helps,
Radim
