I am planning to rewrite my vandalism detection system from scratch.
It was previously written in Python with some bits of bash, C++ and
Java. A real mess. Now I plan to write it in Java so I can integrate
it more tightly with Weka, Lucene and others.
So I've got a question for you: do you know about Java libraries for
handling Wikipedia? I have found just Wikipedia Miner and JWPL, and I
am discarding the latter because I want to stick to free software. Do
you know about others? Any experiences with Wikipedia Miner?
I will need classes to model articles, revisions, edits and users; a
good wiki-syntax parser that can deal with templates; a wiki-syntax to
plain-text converter; an XML dump reader; a Wikipedia API interface;
etc.
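A minimal sketch of what the model classes and the XML dump reader from that list could look like, using only the StAX parser shipped with the JDK. The class names (Revision, Article, DumpReader) and their fields are hypothetical and not taken from Wikipedia Miner, JWPL or any other library. It assumes the usual MediaWiki export layout (page > title, page > revision > id / contributor / text) and a Java version that supports switch on strings.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical model classes: names and fields are illustrative only.
final class Revision {
    final long id;
    final String contributor; // username, or IP address for anonymous edits
    final String text;        // raw wiki markup of this revision

    Revision(long id, String contributor, String text) {
        this.id = id;
        this.contributor = contributor;
        this.text = text;
    }
}

final class Article {
    final String title;
    final List<Revision> revisions;

    Article(String title, List<Revision> revisions) {
        this.title = title;
        this.revisions = revisions;
    }
}

final class DumpReader {
    // Reads every <page> of a MediaWiki XML export into memory.
    static List<Article> read(Reader in) throws XMLStreamException {
        XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<Article> articles = new ArrayList<>();
        List<Revision> revisions = new ArrayList<>();
        String title = null;
        String contributor = null;
        String text = "";
        long revisionId = -1;
        boolean inRevision = false;
        boolean inContributor = false;

        while (xml.hasNext()) {
            int event = xml.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                switch (xml.getLocalName()) {
                    case "page":
                        title = null;
                        revisions = new ArrayList<>();
                        break;
                    case "title":
                        title = xml.getElementText();
                        break;
                    case "revision":
                        inRevision = true;
                        revisionId = -1;
                        contributor = null;
                        text = "";
                        break;
                    case "contributor":
                        inContributor = true;
                        break;
                    case "id":
                        // <id> also appears under <page> and <contributor>;
                        // only the one directly under <revision> is kept.
                        if (inRevision && !inContributor) {
                            revisionId = Long.parseLong(xml.getElementText().trim());
                        }
                        break;
                    case "username":
                    case "ip":
                        if (inContributor) {
                            contributor = xml.getElementText();
                        }
                        break;
                    case "text":
                        if (inRevision) {
                            text = xml.getElementText();
                        }
                        break;
                    default:
                        break;
                }
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                switch (xml.getLocalName()) {
                    case "contributor":
                        inContributor = false;
                        break;
                    case "revision":
                        revisions.add(new Revision(revisionId, contributor, text));
                        inRevision = false;
                        break;
                    case "page":
                        articles.add(new Article(title, revisions));
                        break;
                    default:
                        break;
                }
            }
        }
        xml.close();
        return articles;
    }
}
```

Calling DumpReader.read(new FileReader("dump.xml")) would return one Article per page, each with its revisions in dump order. Collecting everything in a list only suits small exports; for a full-history dump, a callback invoked per page or per revision would likely be the better design.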
Thank you in advance,
--
Santiago M. Mola
Jabber ID: cool...@gmail.com
On 31/07/2010 17:06, Santiago M. Mola wrote:
> So I've got a question for you: do you know about Java libraries for
> handling Wikipedia? I have found just Wikipedia Miner and JWPL, and I
> am discarding the latter because I want to stick to free software. Do
> you know about others? Any experiences with Wikipedia Miner?
>
I've developed a UIMA component to load articles from Wikipedia XML
dumps. This component allows you to load Wikipedia pages and revisions
into a UIMA CPE (Collection Processing Engine) so that you can use all
UIMA components. The old release is presented here:
There is a new version coming (a bit more robust), now developed
on Google Code so that anybody can contribute:
http://code.google.com/p/uima-mediawiki-engine/
--
Please do not send me Word or PowerPoint documents, or files in any
other proprietary format. Prefer OpenDocument and PDF.
See http://www.gnu.org/philosophy/no-word-attachments.html
Fabien Poulard
LINA (UMR CNRS 6241) / Université de Nantes
On Mon, Aug 2, 2010 at 3:24 PM, Fabien Poulard
<fabien....@univ-nantes.fr> wrote:
>
> I've developed a UIMA component to load articles from Wikipedia XML dumps.
> This component allows you to load Wikipedia pages and revisions into a UIMA CPE
> so that you can use all UIMA components. The old release is presented here:
>
Great. I'll give it a shot. I didn't even know about UIMA in the first
place, sounds very interesting.
On Sun, Aug 1, 2010 at 11:31 PM, dmtr <dchi...@gmail.com> wrote:
>
> Have you considered using NLTK (http://www.nltk.org/book) instead of
> Weka?
Yes, I did, although until now I didn't realize it had a Weka
wrapper! However, there are other reasons for me to write it in Java,
one of them being that I feel that I write more robust code in
statically-typed languages than in dynamically-typed ones, and with
the recent discovery of Scala, I can use it on top of Java whenever I
want a more expressive language or an interactive shell. But, I
digress...
On Sat, Jul 31, 2010 at 5:25 PM, Amir Hossein Jadidinejad
<amir....@yahoo.com> wrote:
> WikipediaMiner is great, but it doesn't support revisions, users or
> different edit versions. It is just a simple OO model of articles, categories,
> etc. A bunch of Perl scripts is also provided to parse the Wikipedia XML
> dump. Its parser is also not very precise for templates and the like. Please
> see the following paper to learn more:
> http://www.cs.waikato.ac.nz/~dnk2/publications/AnOpenSourceToolkitForMiningWikipedia.pdf
Thanks. In the end, I've discarded Wikipedia Miner.
Best regards,