How do I correctly download Wikipedia pages?

Steven D'Aprano

Nov 25, 2009, 10:45:19 PM

I'm trying to scrape a Wikipedia page from Python. Following instructions
here:

http://en.wikipedia.org/wiki/Wikipedia:Database_download
http://en.wikipedia.org/wiki/Special:Export

I use the URL "http://en.wikipedia.org/wiki/Special:Export/Train" instead
of just "http://en.wikipedia.org/wiki/Train". But instead of getting the
page I expect, and can see in my browser, I get an error page:

>>> import urllib
>>> url = "http://en.wikipedia.org/wiki/Special:Export/Train"
>>> print urllib.urlopen(url).read()
...
Our servers are currently experiencing a technical problem. This is
probably temporary and should be fixed soon
...

(Output is obviously truncated for your sanity and mine.)

Is there a trick to downloading from Wikipedia with urllib?

--
Steven

ShoqulKutlu

Nov 25, 2009, 10:58:57 PM

Hi,

Try not to be caught if you send multiple requests :)

Have a look here: http://wolfprojects.altervista.org/changeua.php

Regards
Kutlu

On Nov 26, 5:45 am, Steven D'Aprano <ste...@REMOVE.THIS.cybersource.com.au> wrote:
> I'm trying to scrape a Wikipedia page from Python. Following instructions
> here:
>
> http://en.wikipedia.org/wiki/Wikipedia:Database_download
> http://en.wikipedia.org/wiki/Special:Export

Steven D'Aprano

Nov 25, 2009, 11:59:06 PM

On Wed, 25 Nov 2009 19:58:57 -0800, ShoqulKutlu wrote:

> Hi,
>
> Try not to be caught if you send multiple requests :)
>
> Have a look here: http://wolfprojects.altervista.org/changeua.php

Thanks, that seems to work perfectly.

--
Steven
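
For anyone reading this thread later: the fix above seems to come down to sending an explicit User-Agent header, since Wikipedia appears to reject the default one that urllib sends. That reading is an assumption based on the name of the linked page (changeua), not on its contents. Below is a minimal sketch of the idea in Python 3, where urllib.request replaces the Python 2 urllib.urlopen used in the original post. The names USER_AGENT, build_request and fetch, and the User-Agent string itself, are hypothetical and chosen only for illustration.

```python
# Python 3 sketch; the session quoted in this thread used Python 2's urllib.urlopen.
import urllib.request

# Hypothetical identifying string (an assumption, not from the thread);
# substitute your own tool name and contact details.
USER_AGENT = "ExampleWikiFetcher/0.1 (contact: you@example.org)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request for `url` that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=USER_AGENT, timeout=30):
    """Download `url` and return the raw response body as bytes."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Calling fetch("http://en.wikipedia.org/wiki/Special:Export/Train") should then return the exported page as bytes, provided the server accepts the request. Keeping build_request separate from fetch makes it possible to inspect the headers without touching the network.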

Cousin Stanley

Nov 26, 2009, 12:38:00 PM

> I'm trying to scrape a Wikipedia page from Python.

> ....

On occasion I use a program under Debian Linux
called wikipedia2text that is very handy
for downloading wikipedia pages as plain text files ....

Description: displays Wikipedia articles on the command line

This script fetches Wikipedia articles (currently supports
around 30 Wikipedia languages) and displays them as plain text
in a pager or just sends the text to standard out. Alternatively
it opens the Wikipedia article in a (possibly GUI) web browser
or just shows the URL of the appropriate Wikipedia article.

Example directed through the lynx browser ....

wp2t -b lynx gorilla > gorilla.txt

--
Stanley C. Kitching
Human Being
Phoenix, Arizona
