Using Wikipedia as a corpus

Erick Fonseca

Jan 11, 2012, 10:32:23 PM

to nltk-users

Greetings,

I would like to use the whole of Wikipedia as a corpus for
training a language model. I downloaded the XML dump file, but I'm
unsure how to remove the MediaWiki markup.
I don't know if NLTK has any functions to help with that
specifically, but since using Wikipedia as a corpus is a growing trend,
I thought I could find some directions here.
Thanks,

Erick Rocha Fonseca
M.Sc. Candidate
University of São Paulo, Brazil

Richard Careaga

Jan 11, 2012, 11:53:09 PM

to nltk-...@googlegroups.com

You'll want to preprocess with lxml.

Tim McNamara described the general approach with lxml on 2010-07-30, and it would work similarly for stripping the XML. He said:

One of the neat features of PlainTextCorpusReader is that it knows about paragraphs. But it took me a little while to figure out how to give it paragraphs correctly. As is often the case, it's trivial to implement.

I use lxml.html heavily, but I'm sure there are other libraries that offer a similar ability to iterate over elements in a tree. The main trick is to use the '\n\n'.join().

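>>> import urllib2
>>> import lxml.html
>>> page = lxml.html.document_fromstring(urllib2.urlopen('http://the/internet.html').read())
>>> # cssselect() returns a list, so walk the children of each matched div
>>> '\n\n'.join(el.text_content() for post in page.cssselect('div.post') for el in post.iterchildren())
u'...'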
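
For the Wikipedia dump itself, the same idea can be sketched with the standard library's streaming parser, so the whole file never has to sit in memory. This is only a rough sketch under assumptions: `iter_pages` and `local_name` are hypothetical helper names, the sample document is made up, and the export namespace shown may differ from the one in your dump (the code ignores the namespace for that reason).

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit('}', 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    title, text = None, ''
    for _event, elem in ET.iterparse(source, events=('end',)):
        name = local_name(elem.tag)
        if name == 'title':
            title = elem.text
        elif name == 'text':
            text = elem.text or ''
        elif name == 'page':
            yield title, text
            title, text = None, ''
            elem.clear()  # release the finished page's children


# Tiny made-up stand-in for a dump file (the namespace version is an assumption).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>'''Example''' is a [[word]].</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
```

With a real dump you would pass an open file object instead of the `io.BytesIO` wrapper. The text that comes out is still raw wiki markup, so it needs a second pass to remove the MediaWiki annotations.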

Tim McNamara

Jan 12, 2012, 2:06:35 AM

to nltk-...@googlegroups.com

Thanks for the mention, Richard.

Erick-

I recommend taking some time to look at MediaWiki parsers that have been written: http://www.mediawiki.org/wiki/Alternative_parsers. Choose whichever you prefer, but I would go with mwlib (http://mwlib.readthedocs.org/en/latest/index.html). It is a little more complex than the others, but is well designed for transforming things to different output formats.
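
To give a feel for what those parsers have to deal with, here is a deliberately crude regex pass over wiki markup. It is a sketch only, not mwlib's API and not a substitute for a real parser: `strip_wikitext` is a hypothetical name, and the patterns will mishandle tables, image links with several `|` fields, and other constructs.

```python
import re

# Innermost {{...}} template: no braces allowed inside the match.
TEMPLATE = re.compile(r'\{\{[^{}]*\}\}')


def strip_wikitext(text):
    """Very rough wikitext-to-plain-text pass (illustrative only)."""
    text = re.sub(r'<!--.*?-->', '', text, flags=re.DOTALL)            # HTML comments
    text = re.sub(r'<ref[^>]*/>', '', text)                            # self-closing refs
    text = re.sub(r'<ref[^>]*>.*?</ref>', '', text, flags=re.DOTALL)   # paired refs
    # Remove innermost templates repeatedly so nested ones disappear too.
    while TEMPLATE.search(text):
        text = TEMPLATE.sub('', text)
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r'\[\[(?:[^|\]]*\|)?([^\]]*)\]\]', r'\1', text)
    text = re.sub(r"'{2,}", '', text)                                  # bold/italic quotes
    # == Heading == -> Heading
    text = re.sub(r'^=+[ \t]*(.*?)[ \t]*=+[ \t]*$', r'\1', text, flags=re.MULTILINE)
    return text.strip()


sample = ("'''Python''' is a [[programming language|language]] created by "
          "[[Guido van Rossum]].{{citation needed|date={{CURRENTYEAR}}}}"
          "<ref>Some source</ref>")
plain = strip_wikitext(sample)
```

The number of special cases this misses is the main argument for picking one of the existing parsers rather than maintaining regexes by hand.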

--
You received this message because you are subscribed to the Google Groups "nltk-users" group.
To post to this group, send email to nltk-...@googlegroups.com.
To unsubscribe from this group, send email to nltk-users+...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/nltk-users?hl=en.


Erick Fonseca

Jan 12, 2012, 4:30:41 PM

to nltk-users

Thank you for the answers.
I'm pretty comfortable with parsing XML, and I thought of using the
xml module. My real problem is the MediaWiki annotations; I'll have a
look at the parsers that Tim suggested.

David Gerő

Jan 13, 2012, 11:24:02 AM

to nltk-users

Hi,

I think you should try the MediaWiki API! It is very simple
and useful.

http://www.mediawiki.org/wiki/API
or
http://en.wiktionary.org/w/api.php

Best,
David
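
As a rough sketch of what an API request for one page's wikitext might look like: the endpoint, the helper names `build_query_url` and `wikitext_from_response`, and the hand-written sample response below are all assumptions for illustration, and the response shape should be checked against the API documentation before relying on it.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint; other wikis expose their own api.php.
API_ENDPOINT = 'https://en.wikipedia.org/w/api.php'


def build_query_url(title, endpoint=API_ENDPOINT):
    """Build a URL asking for the current wikitext of one page."""
    params = {
        'action': 'query',
        'prop': 'revisions',
        'rvprop': 'content',
        'rvslots': 'main',
        'titles': title,
        'format': 'json',
        'formatversion': '2',
    }
    return endpoint + '?' + urlencode(params)


def wikitext_from_response(body):
    """Pull the wikitext out of a JSON response (shape assumed, see above)."""
    data = json.loads(body)
    page = data['query']['pages'][0]
    return page['revisions'][0]['slots']['main']['content']


# Hand-written stand-in for a server reply, not real API output.
SAMPLE_BODY = json.dumps({
    'query': {'pages': [{
        'title': 'Example',
        'revisions': [{'slots': {'main': {'content': "'''Example''' text."}}}],
    }]}
})

url = build_query_url('Natural language processing')
wikitext = wikitext_from_response(SAMPLE_BODY)
```

To actually fetch a page you could pass `url` to `urllib.request.urlopen` and feed the decoded body to `wikitext_from_response`. The result is still wiki markup, so the parsing question above applies here as well.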
