Hello,
My name is Shreyas Pai. I am a second-year undergraduate student in Computer Engineering at VJTI, Mumbai, and
I am interested in working on the Cassovary GSoC project (Wikipedia analysis and entity extraction).
I have a few questions:
1. Wikipedia provides a monthly SQL dump of the page-to-page link records for the English version
(for example:
http://dumps.wikimedia.org/enwiki/20140304/)
(or more specifically:
http://dumps.wikimedia.org/enwiki/20140304/enwiki-20140304-pagelinks.sql.gz).
Is this the data format that will be used for the analysis, or is a "non-SQL version" available somewhere?
To clarify what I mean, I've included a rough sketch below of how I imagine turning the dump into a plain edge list.
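The sketch assumes each row in the pagelinks INSERT statements looks roughly like (pl_from, pl_namespace, 'pl_title', ...) — I haven't verified the exact schema for the 20140304 dump, so please treat the column layout and the PagelinksToEdges name as my own assumptions:

    import scala.io.Source

    // Rough sketch (hypothetical): extract (pl_from, pl_title) pairs from the
    // pagelinks SQL dump's INSERT statements and print them as a tab-separated
    // edge list. Assumes each value tuple starts with
    // (<from_page_id>,<namespace>,'<title>',...), which needs to be checked
    // against the actual dump schema.
    object PagelinksToEdges {
      // Matches the start of one parenthesised value tuple inside an INSERT line.
      private val Row = """\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)'""".r

      def main(args: Array[String]): Unit = {
        val dumpPath = args(0) // path to a decompressed pagelinks .sql file
        for {
          line <- Source.fromFile(dumpPath, "UTF-8").getLines()
          if line.startsWith("INSERT INTO")
          m <- Row.findAllMatchIn(line)
        } {
          val fromId = m.group(1) // pl_from: id of the linking page
          val title  = m.group(3) // pl_title: title of the target page
          println(s"$fromId\t$title")
        }
      }
    }

If a pre-built edge-list (non-SQL) version already exists, something like this would of course be unnecessary.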
2. Are there any (much) smaller datasets available for testing the prototypes first?
Working on the entire dataset would consume a lot of memory and time.
Thanks!