corpus word frequency

Yisheng Xu

Apr 20, 2015, 11:27:29 AM
to gen...@googlegroups.com
I created a corpus using the make_wiki script provided by Gensim:

python -m gensim.scripts.make_wiki enwiki-latest-pages-articles.xml.bz2 enwiki

The script generates an enwiki_wordids.txt file. What are the numbers in the third column of that file?

The BOW Matrix Market file (enwiki_bow.mm) should contain the frequency count of each word in each document. I summed the counts across documents to produce a word frequency file for the entire corpus. The results correlate with the third column of enwiki_wordids.txt, but the frequency counts derived from the BOW file are much larger than the numbers in enwiki_wordids.txt.

Would you help explain this difference?

Thanks!

Yisheng Xu

Radim Řehůřek

Apr 21, 2015, 11:46:34 AM
to gen...@googlegroups.com
Hello Yisheng Xu,

the `wordids.txt` file is generated by the save_as_text() method of Dictionary.

So, the third column is the document frequency of the given word.
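
To illustrate why the two numbers differ, here is a minimal sketch in plain Python (no gensim required). The toy corpus and the variable names `docs`, `doc_freq`, and `term_freq` are made up for illustration. Document frequency counts each document at most once per word, while summing the BOW counts gives the total number of occurrences, so the summed value can never be smaller than the document frequency.

```python
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["cat", "cat", "dog"],
    ["cat", "bird"],
    ["dog", "dog", "dog"],
]

doc_freq = Counter()   # number of documents that contain the token
term_freq = Counter()  # total occurrences of the token across all documents

for doc in docs:
    term_freq.update(doc)      # every occurrence counts
    doc_freq.update(set(doc))  # each document counts at most once per token

# Print token, document frequency, corpus-wide frequency
for token in sorted(term_freq):
    print(token, doc_freq[token], term_freq[token])
```

Here "dog" appears in two documents but four times overall, which is the same kind of gap seen between the third column of the wordids file and the sums computed from the BOW file.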

Hope that helps,
Radim

Yisheng Xu

Apr 23, 2015, 4:00:31 PM
to gen...@googlegroups.com
Hi Radim,

Thanks for your clarification! This is the exact answer to my question. It is clear to me now since document frequency should be highly correlated with corpus-wide word frequency.

Regards,

Yisheng

Yisheng Xu

Apr 23, 2015, 4:19:04 PM
to gen...@googlegroups.com
The following code computes corpus-wide word frequency, in case someone else has a similar need:

import gensim, csv
from operator import itemgetter

# load id->word mapping (the dictionary)
id2word = gensim.corpora.Dictionary.load_from_text('enwiki_wordids.txt')
# Load BOW matrix
mm = gensim.corpora.MmCorpus('enwiki_bow.mm.bz2')

# Sum up frequency count of each token across documents
n = len(id2word)
freq = [0] * n
for vector in mm:
    # each document is a list of (token_id, count) pairs
    for token_id, count in vector:
        freq[token_id] += count

# Sort the tokens alphabetically
freqlist = [None] * n
for i in range(n):
    freqlist[i] = (i, id2word[i], freq[i])
freqlist = sorted(freqlist, key=itemgetter(1))

# Output the token list to a tab delimited text file
with open('wordfreq.txt', 'w') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter='\t')
    for row in freqlist:
        spamwriter.writerow(row)



