[GSoC2014] Questions about the project


Shreyas Pai

Mar 13, 2014, 4:13:48 AM
to twitter-...@googlegroups.com
Hello,

My name is Shreyas Pai. I am a second-year undergraduate student in Computer Engineering at VJTI, Mumbai, and
I am interested in working on the Cassovary GSoC project (Wikipedia analysis and entity extraction).

I have a few questions:

1.  Wikipedia provides a monthly SQL dump of the page-to-page link records for the English version
    (for example: http://dumps.wikimedia.org/enwiki/20140304/,
    or more specifically: http://dumps.wikimedia.org/enwiki/20140304/enwiki-20140304-pagelinks.sql.gz).
    Is this the data format that will be used for the analysis, or is a non-SQL version available somewhere?

2.  Are there any (much) smaller datasets available for testing prototypes first?
    Working on the entire dataset will consume a lot of memory and time.

Thanks!

Ajeet Grewal

Mar 14, 2014, 10:12:15 PM
to twitter-...@googlegroups.com
On Thu, Mar 13, 2014 at 1:13 AM, Shreyas Pai <shreya...@gmail.com> wrote:

1.  Wikipedia provides a monthly SQL dump of the page-to-page link records for the English version
    (for example: http://dumps.wikimedia.org/enwiki/20140304/,
    or more specifically: http://dumps.wikimedia.org/enwiki/20140304/enwiki-20140304-pagelinks.sql.gz).
    Is this the data format that will be used for the analysis, or is a non-SQL version available somewhere?

There is an XML dump of the pages: http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
This might be a good place to start.
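To give a rough idea of how page-to-page links could be pulled out of that XML dump, here is a minimal Python sketch. It assumes the standard MediaWiki export layout (a `<page>` element containing `<title>` and `<revision><text>`), and the function names (`open_dump`, `iter_pages`, `iter_edges`) are made up for illustration. The link regex is deliberately crude: it ignores templates, redirects, and namespace prefixes such as `File:`.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# Matches [[Target]], [[Target|label]] and [[Target#Section]]; captures the target only.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)")


def open_dump(path):
    """Open a .xml.bz2 dump as a binary stream without unpacking it to disk."""
    return bz2.open(path, "rb")


def local(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) for each <page>, one page at a time."""
    title, text = None, ""
    for _, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # drop the finished page's children to limit memory use


def iter_edges(stream):
    """Yield (source_title, target_title) for every wiki link found."""
    for title, text in iter_pages(stream):
        for match in WIKILINK.finditer(text):
            target = match.group(1).strip()
            if target:
                yield title, target
```

Usage would look something like `for src, dst in iter_edges(open_dump("enwiki-latest-pages-articles.xml.bz2")): ...`. The titles would still need to be mapped to integer ids before being loaded as a graph.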
 
   
2.  Are there any (much) smaller datasets available for testing prototypes first?
    Working on the entire dataset will consume a lot of memory and time.

Hmm, not sure about a smaller version of the dataset. One possibility is to limit yourself to certain categories of pages.
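As a sketch of what limiting to certain categories might look like, the snippet below keeps only pages whose wikitext contains a `[[Category:...]]` tag from a chosen set. It assumes pages arrive as `(title, wikitext)` pairs; `page_categories` and `filter_pages` are hypothetical helper names. It will miss categories that are added through templates and does not follow subcategories.

```python
import re

# Matches [[Category:Name]] and [[Category:Name|sort key]]; captures Name.
CATEGORY = re.compile(r"\[\[\s*Category\s*:\s*([^\[\]|]+)", re.IGNORECASE)


def page_categories(wikitext):
    """Return the set of category names tagged directly in the wikitext."""
    return {m.group(1).strip() for m in CATEGORY.finditer(wikitext)}


def filter_pages(pages, wanted):
    """Yield only the (title, wikitext) pairs tagged with a wanted category."""
    wanted = set(wanted)
    for title, text in pages:
        if page_categories(text) & wanted:
            yield title, text
```

For example, `filter_pages(pages, {"Graph algorithms"})` would keep only pages tagged with that (illustrative) category, which gives a much smaller graph to prototype on.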
 

--
Regards,
Ajeet

Shreyas Pai

Mar 15, 2014, 9:43:06 AM
to twitter-...@googlegroups.com
That's great!
The overall XML file is a combination of many smaller files (the ones listed below it).
Can't believe I overlooked it. :P
Thanks!