Wikipedia Dump filtering, preprocessing and TF-IDF creation.

Karsten

Aug 31, 2012, 9:39:52 AM
to gen...@googlegroups.com
Hi,

I am about to implement Explicit Semantic Analysis (http://www.cs.technion.ac.il/~gabr/resources/code/esa/esa.html) for gensim as part of my master's thesis. There is already a standalone Python implementation, but gensim's structure is more flexible and would allow TF-IDF, LSI, or LDA transformations of Wikipedia as the input.
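
For reference, the plain TF-IDF route I have in mind would look roughly like this in gensim (just a sketch assuming the standard WikiCorpus / TfidfModel classes; the file names are placeholders):

from gensim.corpora import WikiCorpus, MmCorpus
from gensim.models import TfidfModel

# Parse the compressed dump; WikiCorpus strips the markup and yields
# bag-of-words vectors, one per article.
wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2')

# Serialize the bag-of-words corpus so the dump only has to be parsed once.
MmCorpus.serialize('wiki_bow.mm', wiki)

# Train the TF-IDF transformation and store the transformed corpus.
bow = MmCorpus('wiki_bow.mm')
tfidf = TfidfModel(bow)
MmCorpus.serialize('wiki_tfidf.mm', tfidf[bow])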

Unfortunately, I am having trouble filtering the Wikipedia dump to reduce the number of articles based on their inter-article references. There is a Perl preprocessor called Wikiprep (http://sourceforge.net/apps/mediawiki/wikiprep/index.php?title=Main_Page), but it is no longer maintained and crashes on my machine. I am still looking into it.

So does anybody know of a program that takes the Wikipedia dump as input, resolves templates, and produces extra output such as inter-article references?
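
To make clearer what I mean by inter-article references: something along these lines would count incoming [[wikilinks]] per target straight from the dump, but it does not resolve templates, which is exactly the part I am missing (a sketch only; the export namespace URI and the threshold are placeholders):

import bz2
import re
from collections import Counter
from xml.etree.ElementTree import iterparse

LINK_RE = re.compile(r'\[\[([^\]|#]+)')  # target part of [[target|label]]
NS = '{http://www.mediawiki.org/xml/export-0.6/}'  # adjust to the dump's schema version

incoming = Counter()
with bz2.BZ2File('enwiki-latest-pages-articles.xml.bz2') as f:
    for _, elem in iterparse(f):
        if elem.tag == NS + 'page':
            text_el = elem.find(NS + 'revision/' + NS + 'text')
            if text_el is not None and text_el.text:
                for target in LINK_RE.findall(text_el.text):
                    incoming[target.strip()] += 1
            elem.clear()  # drop the processed page subtree

# Keep only articles that are referenced often enough, e.g. >= 5 incoming links.
frequent = set(title for title, n in incoming.items() if n >= 5)

That would give me the link counts for filtering, but the template-resolved article text is what I would still need Wikiprep (or a replacement) for.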

Thx a lot,
Karsten

Radim Řehůřek

Aug 31, 2012, 11:58:06 AM
to gensim
Hello Karsten,

There's DBpedia (http://dbpedia.org), and there are parsers listed at
http://www.mediawiki.org/wiki/Alternative_parsers .

I also saw some large datasets at http://select.cs.cmu.edu/code/graphlab/datasets.html,
but I don't think their Wikipedia files have what you need.

Other than that, good luck! MediaWiki is a very pleasant format to
work with :) (not)

Radim

Karsten

Sep 7, 2012, 10:37:01 AM
to gen...@googlegroups.com
Thanks a lot. I found JWPL on the alternative parsers list. It is nearly perfect.