[GSoC2014] Questions about the project

Shreyas Pai

Mar 13, 2014, 4:13:48 AM

to twitter-...@googlegroups.com

Hello,

My name is Shreyas Pai. I am a second-year undergraduate student in computer engineering at VJTI, Mumbai,
and I am interested in working on the Cassovary GSoC project (Wikipedia analysis and entity extraction).

I have a few questions:

1. Wikipedia provides a monthly SQL dump of the page-to-page link records for the English version
    (for example: http://dumps.wikimedia.org/enwiki/20140304/)
    (or more specifically: http://dumps.wikimedia.org/enwiki/20140304/enwiki-20140304-pagelinks.sql.gz).
    Is this the data format that will be used for the analysis, or is a non-SQL version available somewhere?

2. Are there any (much) smaller datasets available for testing prototypes first?
    Working on the entire dataset will consume a lot of memory and time.

Thanks!

Ajeet Grewal

Mar 14, 2014, 10:12:15 PM

to twitter-...@googlegroups.com

On Thu, Mar 13, 2014 at 1:13 AM, Shreyas Pai <shreya...@gmail.com> wrote:

1. Wikipedia provides a monthly SQL dump of the page-to-page link records for the English version
    (for example: http://dumps.wikimedia.org/enwiki/20140304/)
    (or more specifically: http://dumps.wikimedia.org/enwiki/20140304/enwiki-20140304-pagelinks.sql.gz).
    Is this the data format that will be used for the analysis, or is a non-SQL version available somewhere?

There is an XML dump of the pages: http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 . This might be a good place to start.
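
As a rough illustration of how link records could be pulled out of that XML dump without loading it all into memory, here is a minimal Python sketch. It is not Cassovary code: the function names (`iter_edges`, `edges_from_dump`) are made up for this example, it assumes the usual MediaWiki export layout of `<page>` elements containing `<title>` and `<text>`, and the regex is only a crude approximation of wikilink syntax (it also picks up category, file, and interwiki links).

```python
import bz2
import re
import xml.etree.ElementTree as ET

# Crude wikilink matcher: captures the target of [[Target]], [[Target|label]]
# and [[Target#Section]]. It does not distinguish namespaces such as Category:.
WIKILINK = re.compile(r"\[\[([^\]\|#]+)")


def local(tag):
    """Strip the XML namespace, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_edges(stream):
    """Yield (source_title, target_title) pairs from an export XML stream."""
    title, text = None, ""
    for _, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            for target in WIKILINK.findall(text):
                yield title, target.strip()
            title, text = None, ""
            elem.clear()  # drop the finished page's children to limit memory use


def edges_from_dump(path):
    """Stream edges straight from a .xml.bz2 file (path is an assumption)."""
    with bz2.open(path, "rb") as stream:
        yield from iter_edges(stream)
```

Something like `edges_from_dump("enwiki-latest-pages-articles.xml.bz2")` would then produce title-to-title pairs one at a time; mapping titles to integer node ids for a graph library would be a separate step.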

2. Are there any (much) smaller datasets available for testing prototypes first?
    Working on the entire dataset will consume a lot of memory and time.

Hmm, not sure about a smaller version of the dataset. One possibility is to limit yourself to certain categories of pages.
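
A minimal Python sketch of what limiting to certain categories might look like, assuming you already have (title, wikitext) pairs and a list of (source, target) links. All names here (`categories_of`, `select_pages`, `induced_edges`) are hypothetical, and only categories tagged directly in the page text are considered; subcategories are not followed.

```python
import re

# Matches [[Category:Name]] and [[Category:Name|sort key]] in page wikitext.
CATEGORY = re.compile(r"\[\[Category:([^\]\|]+)", re.IGNORECASE)


def categories_of(wikitext):
    """Return the set of category names tagged directly in the wikitext."""
    return {name.strip() for name in CATEGORY.findall(wikitext)}


def select_pages(pages, wanted):
    """Return titles of pages that belong to at least one wanted category.

    pages: iterable of (title, wikitext) pairs.
    """
    wanted = set(wanted)
    return {title for title, text in pages if categories_of(text) & wanted}


def induced_edges(edges, keep):
    """Keep only links whose source and target are both in the selected set."""
    return [(src, dst) for src, dst in edges if src in keep and dst in keep]
```

The result is a much smaller subgraph that keeps real link structure, which should be enough for testing a prototype before running on the full dump.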

--
Regards,
Ajeet

Shreyas Pai

Mar 15, 2014, 9:43:06 AM

to twitter-...@googlegroups.com

That's great!
The overall XML file is a combination of many smaller files (the ones listed below it).
Can't believe I overlooked it. :P
Thanks!
