The following code computes corpus-wide word frequencies, in case someone else has a similar need:
import gensim, csv
from operator import itemgetter
# load id->word mapping (the dictionary)
id2word = gensim.corpora.Dictionary.load_from_text('enwiki_wordids.txt')
# Load BOW matrix
mm = gensim.corpora.MmCorpus('enwiki_bow.mm.bz2')
# Sum up frequency count of each token across documents
n = len(id2word)
freq = [0] * n
for vector in mm:
    for element in vector:
        freq[element[0]] += element[1]
# Sort the tokens alphabetically
freqlist = [None] * n
for i in range(n):
    freqlist[i] = (i, id2word[i], freq[i])
freqlist = sorted(freqlist, key=itemgetter(1))
# Output the token list to a tab delimited text file
with open('wordfreq.txt', 'w') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter='\t')
    for row in freqlist:
        spamwriter.writerow(row)
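
If you want to check the counting step without downloading the Wikipedia dump, here is a minimal sketch on a toy corpus. It assumes only that the corpus iterates as lists of (token_id, count) pairs, which is how the loop above reads each document, and that id2word behaves like a plain dict. The function name word_frequencies and the toy data are made up for illustration and are not part of gensim.

```python
from collections import Counter
from operator import itemgetter


def word_frequencies(corpus, id2word):
    """Sum per-document counts into corpus-wide totals, sorted by token."""
    totals = Counter()
    for doc in corpus:
        for token_id, count in doc:
            totals[token_id] += count
    # Tokens that never occur get a total of 0 (Counter returns 0 for missing keys).
    rows = [(token_id, id2word[token_id], totals[token_id]) for token_id in id2word]
    return sorted(rows, key=itemgetter(1))


# Hypothetical toy data standing in for the dictionary and the BOW matrix.
toy_id2word = {0: 'wiki', 1: 'corpus', 2: 'apple'}
toy_corpus = [
    [(0, 2), (1, 1)],
    [(1, 3), (2, 1)],
    [(0, 1)],
]

rows = word_frequencies(toy_corpus, toy_id2word)
for row in rows:
    print('\t'.join(str(field) for field in row))
```

Using a Counter instead of a preallocated list means the token ids do not have to be contiguous, at the cost of a little extra memory per entry.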
On Tuesday, April 21, 2015 at 11:46:34 AM UTC-4, Radim Řehůřek wrote: