Java (or other) resources for Wikipedia


Santiago M. Mola

Jul 31, 2010, 11:06:58 AM
to pan-works...@googlegroups.com
Hi all,

I am planning to rewrite my vandalism detection system from scratch.
It was previously written in Python with some bits of bash, C++ and
Java. A real mess. Now I plan to write it in Java so I can integrate
it more tightly with Weka, Lucene and others.

So I've got a question for you: do you know about Java libraries for
handling Wikipedia? I have found just Wikipedia Miner and JWPL, and I
am discarding the latter because I want to stick to free software. Do
you know about others? Any experiences with Wikipedia Miner?

I will need classes to model articles, revisions, edits and
users; a good wiki-syntax parser that can deal with templates; a
wiki-syntax to plain-text converter; an XML dump reader; a Wikipedia
API interface; etc.
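
As a rough illustration of two items on that list (the revision model and the XML dump reader), here is a minimal sketch using only the JDK's StAX API. It is not code from any of the libraries mentioned in this thread. The names `DumpSketch`, `Revision` and `readRevisions` are hypothetical, and the element names assume the usual MediaWiki export layout (`page`, `title`, `revision`, `id`, `timestamp`, `contributor` with `username` or `ip`, `text`).

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: a tiny revision model plus a streaming dump reader.
class DumpSketch {

    /** One revision of a page: who saved which text, and when. */
    static final class Revision {
        final String pageTitle;
        final long id;
        final String timestamp;
        final String user; // user name, or IP address for anonymous edits
        final String text;

        Revision(String pageTitle, long id, String timestamp, String user, String text) {
            this.pageTitle = pageTitle;
            this.id = id;
            this.timestamp = timestamp;
            this.user = user;
            this.text = text;
        }
    }

    /** Reads every revision from a MediaWiki-style XML export. */
    static List<Revision> readRevisions(Reader in) {
        List<Revision> out = new ArrayList<>();
        try {
            XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);
            String title = null, timestamp = null, user = null, text = null;
            long revId = -1;
            boolean inRevision = false, inContributor = false;
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    switch (xml.getLocalName()) {
                        case "title":
                            title = xml.getElementText();
                            break;
                        case "revision": // reset per-revision state
                            inRevision = true;
                            revId = -1;
                            timestamp = user = text = null;
                            break;
                        case "contributor":
                            inContributor = true;
                            break;
                        case "id": // skip the page id and the contributor id
                            if (inRevision && !inContributor && revId < 0) {
                                revId = Long.parseLong(xml.getElementText().trim());
                            }
                            break;
                        case "timestamp":
                            if (inRevision) timestamp = xml.getElementText();
                            break;
                        case "username":
                        case "ip":
                            if (inContributor) user = xml.getElementText();
                            break;
                        case "text":
                            if (inRevision) text = xml.getElementText();
                            break;
                        default:
                            break;
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    String name = xml.getLocalName();
                    if ("contributor".equals(name)) {
                        inContributor = false;
                    } else if ("revision".equals(name)) {
                        out.add(new Revision(title, revId, timestamp, user, text));
                        inRevision = false;
                    }
                }
            }
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed dump XML", e);
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<mediawiki><page><title>Example</title><id>1</id>"
                + "<revision><id>10</id><timestamp>2010-07-31T11:06:58Z</timestamp>"
                + "<contributor><username>Alice</username><id>7</id></contributor>"
                + "<text>Hello</text></revision></page></mediawiki>";
        for (Revision r : readRevisions(new StringReader(xml))) {
            System.out.println(r.pageTitle + " rev " + r.id + " by " + r.user);
        }
    }
}
```

Collecting revisions into a list keeps the sketch short. For a full history dump, the same loop would more realistically hand each revision to a callback, so that the whole dump never has to fit in memory.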

Thank you in advance,
--
Santiago M. Mola
Jabber ID: cool...@gmail.com

Amir Hossein Jadidinejad

Jul 31, 2010, 11:25:02 AM
to pan-works...@googlegroups.com
WikipediaMiner is great, but it doesn't support revisions, users or different edit versions. It is just a simple OO model of articles, categories, etc. A bunch of Perl scripts is also provided to parse the Wikipedia XML dump. Its parser is also not very precise for templates and the like. Please see the following paper for more details:
http://www.cs.waikato.ac.nz/~dnk2/publications/AnOpenSourceToolkitForMiningWikipedia.pdf





From: Santiago M. Mola <cool...@gmail.com>
To: pan-works...@googlegroups.com
Sent: Sat, July 31, 2010 7:36:58 PM
Subject: [PAN'10] Java (or other) resources for Wikipedia
--
You received this message because you are subscribed to the Google Group "PAN".
Visit this group at http://groups.google.com/group/pan-workshop-series
To unsubscribe send email to pan-workshop-series+unsub...@googlegroups.com.

dmtr

Aug 1, 2010, 5:31:39 PM
to PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software Misuse.
> I am planning to rewrite my vandalism detection system from scratch.
> It was previously written in Python with some bits of bash, C++ and
> Java. A real mess. Now I plan to write it in Java so I can integrate
> it more tightly with Weka, Lucene and others.

Have you considered using NLTK (http://www.nltk.org/book) instead of
Weka?

-- Dmitry

Fabien Poulard

Aug 2, 2010, 9:24:17 AM
to pan-works...@googlegroups.com
Hi,

On 31/07/2010 17:06, Santiago M. Mola wrote:


> So I've got a question for you: do you know about Java libraries for
> handling Wikipedia? I have found just Wikipedia Miner and JWPL, and I
> am discarding the latter because I want to stick to free software. Do
> you know about others? Any experiences with Wikipedia Miner?
>

I've developed a UIMA component to load articles from Wikipedia XML
dumps. This component lets you load Wikipedia pages and revisions into a
UIMA CPE (collection processing engine) so that you can use all UIMA
components. The old release is presented here:

http://www.fabienpoulard.info/index.php?tpost/en/2010/03/04/Release-of-the-Wikipedia-collection-reader-v04

A new, somewhat more robust version is coming. It is now developed
on Google Code so that anybody can contribute:

http://code.google.com/p/uima-mediawiki-engine/

--
Please do not send me Word or PowerPoint documents, or any other
proprietary format. Prefer OpenDocument and PDF.
See http://www.gnu.org/philosophy/no-word-attachments.html

Fabien Poulard
LINA (UMR CNRS 6241) / Université de Nantes


Santiago M. Mola

Aug 2, 2010, 4:27:56 PM
to pan-works...@googlegroups.com
Thanks for the replies!

On Mon, Aug 2, 2010 at 3:24 PM, Fabien Poulard
<fabien....@univ-nantes.fr> wrote:
>
> I've developed a UIMA component to load articles from Wikipedia XML dumps.
> This component lets you load Wikipedia pages and revisions into a UIMA CPE
> so that you can use all UIMA components. The old release is presented here:
>

Great. I'll give it a shot. I didn't even know about UIMA in the first
place, sounds very interesting.

On Sun, Aug 1, 2010 at 11:31 PM, dmtr <dchi...@gmail.com> wrote:
>
> Have you considered using NLTK (http://www.nltk.org/book) instead of
> Weka?

Yes, I did, although until now I didn't realize it had a Weka
wrapper! However, there are other reasons for me to write it in Java,
one of them being that I feel that I write more robust code in
statically-typed languages than in dynamically-typed ones, and with
the recent discovery of Scala, I can use it on top of Java whenever I
want a more expressive language or an interactive shell. But, I
digress...

On Sat, Jul 31, 2010 at 5:25 PM, Amir Hossein Jadidinejad
<amir....@yahoo.com> wrote:
> WikipediaMiner is great, but it doesn't support revisions, users or
> different edit versions. It is just a simple OO model of articles,
> categories, etc. A bunch of Perl scripts is also provided to parse the
> Wikipedia XML dump. Its parser is also not very precise for templates and
> the like. Please see the following paper for more details:
> http://www.cs.waikato.ac.nz/~dnk2/publications/AnOpenSourceToolkitForMiningWikipedia.pdf

Thanks. In the end, I've discarded Wikipedia Miner.

Best regards,

Dmitry Chichkov

Aug 2, 2010, 7:44:41 PM
to pan-works...@googlegroups.com
> Yes, I did, although until now I didn't realize it had a Weka
> wrapper! However, there are other reasons for me to write it in Java,
> one of them being that I feel that I write more robust code in
> statically-typed languages than in dynamically-typed ones, and with
> the recent discovery of Scala, I can use it on top of Java whenever I
> want a more expressive language or an interactive shell. But, I
> digress...


Well, in my opinion, unknown or unpopular languages and static typing (read: custom/unknown types) can be very detrimental to the (re)usability of your code. And while statically typed code can be very robust, it tends to get bloated and unapproachable rather quickly, and it is also less fun to work with.

-- Dmitry