Using Wikipedia as a corpus


Erick Fonseca

Jan 11, 2012, 10:32:23 PM
to nltk-users
Greetings,

I would like to use the whole of Wikipedia as a corpus for
training a language model. I downloaded the XML dump file, but I'm
not sure how to remove the MediaWiki markup.
I don't know if NLTK has any functions to help me specifically with
that, but since using Wikipedia as a corpus is a growing trend lately,
I thought I could find some directions here.
Thanks,

Erick Rocha Fonseca
M.Sc. Candidate
University of São Paulo, Brazil

Richard Careaga

Jan 11, 2012, 11:53:09 PM
to nltk-...@googlegroups.com
You'll want to preprocess with lxml.

Tim McNamara described the general approach with lxml on 2010-07-30, and it would work similarly for xml declogging. He said:

One of the neat features of PlainTextCorpusReader is that it knows about paragraphs. But it took me a little while to figure out how to give it paragraphs correctly. As is often the case, it's trivial to implement.

I use lxml.html heavily, but I'm sure there are other libraries that offer a similar ability to iterate over elements in a tree. The main trick is to use the '\n\n'.join().

  >>> import urllib2, lxml.html
  >>> page = lxml.html.document_fromstring(urllib2.urlopen('http://the/internet.html').read())
  >>> # cssselect() returns a list; assuming each paragraph is a <p> inside div.post
  >>> '\n\n'.join(el.text_content() for el in page.cssselect('div.post p'))
  u'...'

Tim McNamara

Jan 12, 2012, 2:06:35 AM
to nltk-...@googlegroups.com
Thanks for the mention Richard.

Erick-

I recommend taking some time to look at MediaWiki parsers that have been written: http://www.mediawiki.org/wiki/Alternative_parsers. Choose whichever you prefer, but I would go with mwlib (http://mwlib.readthedocs.org/en/latest/index.html). It is a little more complex than the others, but is well designed for transforming things to different output formats.
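
To give a feel for what "removing the MediaWiki markup" involves, here is a minimal sketch in Python 3. It is not mwlib and does not use mwlib's API; it is a rough regex pass, and the function name strip_wikitext is made up for this illustration. It assumes only the most common constructs (templates, refs, links, bold/italic, headings) matter, which is exactly where a real parser from the list above does much better.

```python
import re

def strip_wikitext(text):
    """Very rough wikitext-to-plain-text pass (hypothetical helper, not a real parser)."""
    # Drop templates such as {{citation needed}}; loop so nested ones peel off inside-out.
    previous = None
    while previous != text:
        previous = text
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # Drop self-closing <ref ... /> tags, then <ref>...</ref> footnotes.
    text = re.sub(r"<ref[^>/]*/>", "", text)
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    # [[target|label]] -> label, [[target]] -> target.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # '''bold''' and ''italic'' -> plain text.
    text = re.sub(r"'{2,}", "", text)
    # == Heading == -> Heading
    text = re.sub(r"^=+[ \t]*(.*?)[ \t]*=+[ \t]*$", r"\1", text, flags=re.MULTILINE)
    return text

sample = ("'''Python''' is a [[programming language|language]] created by "
          "[[Guido van Rossum]].{{citation needed}}<ref>Some source</ref>")
print(strip_wikitext(sample))
```

Tables, image links, parser functions and HTML comments are all left untouched by this sketch, so for a whole-Wikipedia corpus a dedicated parser remains the safer choice.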

--
You received this message because you are subscribed to the Google Groups "nltk-users" group.
To post to this group, send email to nltk-...@googlegroups.com.
To unsubscribe from this group, send email to nltk-users+...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/nltk-users?hl=en.

Erick Fonseca

Jan 12, 2012, 4:30:41 PM
to nltk-users
Thank you for the answers.
I'm pretty comfortable with parsing XML, and I thought of using the
xml module. My real problem is the MediaWiki annotations. I'll have a
look at the parsers that Tim suggested.
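
For the XML side, a full dump is too large to load as a tree, so a streaming parse is the usual route. The following is a minimal Python 3 sketch using the standard library's xml.etree.ElementTree.iterparse. The helper names (local_name, iter_pages) and the sample document are made up for illustration, and the export namespace version in the sample is an assumption; the code ignores the namespace so it should not matter which version a given dump declares.

```python
import io
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]

def iter_pages(source):
    """Yield (title, wikitext) pairs from a MediaWiki XML dump, one page at a time."""
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            # With full-history dumps this keeps the last revision seen for the page.
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the page's children so memory use stays modest

# Tiny stand-in for a real dump file (assumed structure: page > title, revision > text).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
<page><title>Foo</title><revision><text>'''Foo''' text</text></revision></page>
<page><title>Bar</title><revision><text>Bar text</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
```

iterparse also accepts an open file object, so a compressed dump can probably be read without unpacking it first by passing the result of bz2.open(path, "rb"). The text that comes out is still raw wikitext, which is where the parsers mentioned above come in.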


David Gerő

Jan 13, 2012, 11:24:02 AM
to nltk-users
Hi,

I think you should try the MediaWiki API! It is very simple
and useful.

http://www.mediawiki.org/wiki/API
or
http://en.wiktionary.org/w/api.php

Best,
David
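
As an illustration of the API route, here is a minimal Python 3 sketch that only builds a query URL for the current content of a few pages. The endpoint and the helper name build_content_query are assumptions made for this example, and the parameter set (action=query, prop=revisions, rvprop=content) should be checked against the API documentation linked above. The content returned this way is still wikitext, and the API is better suited to sampling pages than to fetching all of Wikipedia.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint; other wikis expose the same script under their own host.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def build_content_query(titles, endpoint=API_ENDPOINT):
    """Return a URL asking the API for the latest wikitext of the given page titles."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "format": "json",
        # Multiple titles are separated with a pipe character.
        "titles": "|".join(titles),
    }
    return endpoint + "?" + urlencode(params)

url = build_content_query(["Python (programming language)", "NLTK"])
print(url)
```

The URL can then be fetched with urllib.request.urlopen and decoded with the json module.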