Thank you for the answers.
I'm pretty comfortable with parsing XML, and I had thought of using the
xml module. My real problem is the mediawiki annotations, so I'll have a
look at the parsers that Tim suggested.
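
For the XML side, something like the sketch below should be enough. It streams the dump with the standard library's xml.etree.ElementTree.iterparse rather than loading it whole. The helper names (local_name, iter_pages) and the tiny sample document are made up for illustration, and the namespace URL in the sample is only a placeholder, since the code ignores namespaces. The wiki markup itself is left untouched here; removing it is the part that still needs a dedicated parser.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    # ElementTree reports namespaced tags as '{namespace}name'; keep only 'name'.
    return tag.rsplit('}', 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    title, text = None, ''
    for _event, elem in ET.iterparse(source, events=('end',)):
        name = local_name(elem.tag)
        if name == 'title':
            title = elem.text
        elif name == 'text':
            text = elem.text or ''
        elif name == 'page':
            yield title, text
            title, text = None, ''
            # Drop the finished page's children so memory use stays bounded.
            elem.clear()


# Hypothetical miniature dump, just to exercise the function.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.6/">
  <page>
    <title>Example</title>
    <revision><text>'''Example''' is a [[page]].</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
```

With the real dump, the same function could be given an open file object instead of the in-memory sample.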
> On 12 January 2012 17:53, Richard Careaga <leuc...@gmail.com> wrote:
>
> > You'll want to preprocess with lxml.
>
> > Tim McNamara described the general approach with lxml on 2010-07-30, and
> > it would work similarly for xml declogging. He said:
>
> > One of the neat features of PlainTextCorpusReader is that it knows about
> > paragraphs. But it took me a little while to figure out how to give it
> > paragraphs correctly. As is often the case, it's trivial to implement.
>
> > I use lxml.html heavily, but I'm sure there are other libraries that offer
> > a similar ability to iterate over elements in a tree. The main trick is to
> > use the '\n\n'.join().
>
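> > >>> import urllib2
> > >>> import lxml.html
> > >>> page = lxml.html.document_fromstring(urllib2.urlopen('http://the/internet.html').read())
> > >>> # cssselect() returns a list, so walk the descendants of each match
> > >>> '\n\n'.join(el.text_content() for post in page.cssselect('div.post') for el in post.iterdescendants())
> > u'...'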
>
> > Erick Fonseca <erickrfons...@gmail.com>
> > January 11, 2012 10:32 PM
> > Greetings,
>
> > I would like to use the whole of Wikipedia as a corpus for
> > training a language model. I downloaded the XML dump file, but I'm
> > unsure how to remove the mediawiki tags.
> > I don't know if NLTK has any functions to help me specifically with
> > that, but since using Wikipedia as a corpus is a growing trend lately,
> > I thought I could find some directions here.
> > Thanks,
>
> > Erick Rocha Fonseca
> > M.Sc. Candidate
> > University of São Paulo, Brazil