TimeMachine on smaller Wikipedia subset

Ivo Arabadjiev

Dec 31, 2018, 11:08:02 AM
to jwpl-users
Dear all,

I am currently not working on NLP directly, but rather on the network that Wikipedia forms, with the articles as nodes and the links between them as edges. I want to investigate node formation when a new article is created and, later on, how the network evolves.

Would it be possible to use the TimeMachine on a smaller subset of Wikipedia categories rather than the complete dataset? I was thinking of providing the tool with modified versions of the three main Wikipedia dump files:

[LANGCODE]wiki-[DATE]-pages-meta-history.xml.bz2
[LANGCODE]wiki-[DATE]-pagelinks.sql.gz
[LANGCODE]wiki-[DATE]-categorylinks.sql.gz

What I need to obtain in the end is the list of articles and the list of hyperlinks in each article at various points in time. Any advice would be most welcome!
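
(For concreteness: once a TimeMachine snapshot database exists, the article list and per-article link lists could be pulled out with the JWPL API roughly as in the sketch below. This is an untested sketch following the pattern of the JWPL tutorials; the host, credentials and database name are placeholders.)

// Sketch: dump the article nodes and link edges of one JWPL snapshot database.
// Connection settings below are placeholders, not a working configuration.
import de.tudarmstadt.ukp.wikipedia.api.DatabaseConfiguration;
import de.tudarmstadt.ukp.wikipedia.api.Page;
import de.tudarmstadt.ukp.wikipedia.api.Wikipedia;
import de.tudarmstadt.ukp.wikipedia.api.WikiConstants.Language;

public class SnapshotLinkDump {
    public static void main(String[] args) throws Exception {
        // The TimeMachine produces one database per snapshot; point this
        // configuration at the snapshot you want to export.
        DatabaseConfiguration dbConfig = new DatabaseConfiguration();
        dbConfig.setHost("localhost");
        dbConfig.setDatabase("wiki_snapshot_20080101");  // placeholder name
        dbConfig.setUser("user");
        dbConfig.setPassword("password");
        dbConfig.setLanguage(Language.english);

        Wikipedia wiki = new Wikipedia(dbConfig);

        // Print "source -> target" pairs, i.e. the edge list of this snapshot.
        for (Page page : wiki.getPages()) {
            if (page.isRedirect() || page.isDisambiguation()) {
                continue;  // keep only proper articles as nodes
            }
            String source = page.getTitle().getPlainTitle();
            for (Page target : page.getOutlinks()) {
                System.out.println(source + " -> " + target.getTitle().getPlainTitle());
            }
        }
    }
}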

Regards,
Ivo

Johannes Daxenberger

Jan 2, 2019, 2:35:22 AM
to jw...@googlegroups.com

Hi Ivo,

You can limit the dump/database creation by timestamps, but not by articles (see https://dkpro.github.io/dkpro-jwpl/TimeMachine/). Once you have set up a (TimeMachine) database, you can of course limit the articles/pages and the respective in-/outlinks to be processed programmatically (see https://github.com/dkpro/dkpro-jwpl/blob/master/de.tudarmstadt.ukp.wikipedia.tutorial/src/main/java/de/tudarmstadt/ukp/wikipedia/tutorial/api/T3_PageDetails.java).
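
(A rough illustration of the programmatic restriction Johannes describes: process only the articles of one category and read their in-/outlinks via the JWPL API. The category title and connection settings are placeholders, and the snippet is an untested sketch modelled on the JWPL tutorials, not an official example.)

// Sketch: restrict processing to the articles of a single category
// instead of the whole dump; all names below are placeholders.
import de.tudarmstadt.ukp.wikipedia.api.Category;
import de.tudarmstadt.ukp.wikipedia.api.DatabaseConfiguration;
import de.tudarmstadt.ukp.wikipedia.api.Page;
import de.tudarmstadt.ukp.wikipedia.api.Wikipedia;
import de.tudarmstadt.ukp.wikipedia.api.WikiConstants.Language;

public class CategorySubsetLinks {
    public static void main(String[] args) throws Exception {
        DatabaseConfiguration dbConfig = new DatabaseConfiguration();
        dbConfig.setHost("localhost");
        dbConfig.setDatabase("wiki_timemachine_db");  // one snapshot database
        dbConfig.setUser("user");
        dbConfig.setPassword("password");
        dbConfig.setLanguage(Language.english);

        Wikipedia wiki = new Wikipedia(dbConfig);

        // Only the pages belonging to this category are visited.
        Category category = wiki.getCategory("Network theory");  // placeholder title
        for (Page page : category.getArticles()) {
            System.out.println(page.getTitle().getPlainTitle());
            System.out.println("  outlinks: " + page.getOutlinks().size());
            System.out.println("  inlinks:  " + page.getInlinks().size());
        }
    }
}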

Best regards,

Johannes
