Wikipedia tfidf Model


Michael Haus

Apr 29, 2014, 9:10:46 AM
to gen...@googlegroups.com
Hi,
 
I downloaded the Wikipedia articles and followed the tutorial (http://radimrehurek.com/gensim/wiki.html).
 
One question remains (and the step takes a while to compute): how do I get a tfidf model for further processing, e.g. to transform new, unseen documents with the trained model? The script
 
python -m gensim.scripts.make_wiki
 
creates the corpus_tfidf, the dictionary and so on, but I can't find a tfidf model. I would build it like this:
 
from gensim import corpora, models

# raw strings, so that '\t' in a path is not read as a tab character
corpus_bow = corpora.MmCorpus(r'\wiki_bow.mm')
tfidf = models.TfidfModel(corpus_bow)
tfidf.save(r'\tfidf.model')

Is that correct? Or is the tfidf model already generated through the script and I didn't see it?

Another question, as already mentioned in the tutorial: does anyone have a cleaner Wikipedia corpus, or a "better" script to compute the Wikipedia corpus?

Thanks in advance.

Radim Řehůřek

Apr 29, 2014, 11:09:52 AM
to gen...@googlegroups.com

 

Is that correct? Or is the tfidf model already generated through the script and I didn't see it?



The tf-idf model is generated, and transformed tf-idf vectors are stored, but it seems the model itself is not:

 

Another question, as already mentioned in the tutorial: Has someone a more clean corpus of the Wikipedia or a "better" script to compute the Wikipedia corpus?


This wiki script from one of my recent blog posts is better, I think (it filters articles better and runs faster):

I've seen nicely pre-processed Wikipedia dumps too, but these are usually horribly outdated. Googling around should provide more info.

HTH,
Radim


 




--
Radim Řehůřek, Ph.D.
consultant @ machine learning, natural language processing, data mining
skype "radimrehurek"
 