Scraper exiting with 'redirect error' on 301 responses

Will Abson

Feb 7, 2012, 12:57:14 PM
to scrap...@googlegroups.com
Hi,

I'm getting some odd errors from my scraper, which runs weekly to
collate the results of a running event that my friends and I take part
in, part of the parkrun series (http://www.parkrun.org.uk/richmond).

https://scraperwiki.com/scrapers/richmond_parkrun_results/

The scraper initially ran fine, but it now exits with the following error:

--BEGIN ERROR--

*** Exception ***
Line 76: main();
Line 14: scrape_results_html('richmond') -- main(())
Line 18: results_html =
lxml.html.fromstring(scraperwiki.scrape(race_url)) --
scrape_results_html((race_id='richmond'))
/home/scraperwiki/python/scraperwiki/utils.py:86 --
scrape((url='http://www.parkrun.org.uk/richmond/results/latestresults',
params=None))
/usr/lib/python2.7/urllib2.py:126 --
urlopen((url='http://www.parkrun.org.uk/richmond/results/latestresults',
data=None, timeout=<object object at 0x19b30a0>))

HTTPError: HTTP Error 301: The HTTP server returned a redirect error
that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently

EXECUTIONSTATUS: 1.127 seconds elapsed,

--END ERROR--

I understand what this means, but I don't see this behaviour when I
hit the same URL in my web browser or using wget from the command
line, which both return the page as expected.

Without any visibility of the exact HTTP requests being sent and the
responses coming back, I'm unable to troubleshoot the error any
further. I'm not sure whether it's related to something within the
ScraperWiki infrastructure, or whether the site is somehow refusing
requests originating from SW (which seems unlikely, given that I'm only
making a single request per week).
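
For what it's worth, the closest I can think of to getting that
visibility from the Python side is a sketch like the one below (untested
from inside ScraperWiki), which turns on urllib2/httplib debug output so
the request and response headers get printed:

import urllib2

# Debug output from httplib prints the outgoing request line/headers and
# the incoming status line/headers for every hop, including redirects.
opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))

try:
    opener.open("http://www.parkrun.org.uk/richmond/results/latestresults")
except urllib2.HTTPError, e:
    # The redirect loop still ends in an HTTPError, but by then the debug
    # output above has shown what the server actually sent back.
    print e.code, e.msg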

Can anyone suggest how I can progress this issue?

Cheers,
Will

'Dragon' Dave McKee

Feb 8, 2012, 8:28:48 AM
to scrap...@googlegroups.com
http://stackoverflow.com/questions/554446/how-do-i-prevent-pythons-urllib2-from-following-a-redirect
led me to make
https://scraperwiki.com/scrapers/debug-redirect/

which gives me:
"301: Moved permamently",
"Location: http://www.parkrun.org.uk/scraper.html"

that website contains:
Scraping detected
parkrun results are copyright parkrun, and made available for
personal, non-commercial use.
We have detected that you appear to be using an automated
tool/framework to access our results, which is not currently supported
or permitted.
We would request that you please get in touch to discuss your requirements.
Regards,

parkrun technical team
techs...@parkrun.com
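
For reference, the debug scraper boils down to something like the sketch
below (my reconstruction from the Stack Overflow answer above, not the
scraper's exact code), which stops urllib2 from following the redirect
so the 301 status and its Location header can be printed:

import urllib
import urllib2

class NoRedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        # Hand back the 30x response itself instead of chasing Location.
        result = urllib.addinfourl(fp, headers, req.get_full_url())
        result.code = code
        result.msg = msg
        return result
    http_error_301 = http_error_303 = http_error_307 = http_error_302

opener = urllib2.build_opener(NoRedirectHandler())
response = opener.open("http://www.parkrun.org.uk/richmond/results/latestresults")
print "%s: %s" % (response.code, response.msg)
print "Location:", response.headers.get("Location")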

Using:
import urllib2

# Spoofing a browser User-Agent gets past the redirect to scraper.html.
headers = {'User-Agent': 'Mozilla/5.0'}
request = urllib2.Request("http://www.parkrun.org.uk/richmond/results/latestresults", headers=headers)
print urllib2.urlopen(request).read()

works, but is rather naughty.

Also, when you talk to them, let them know that having the redirect
target itself redirect (hence the infinite loop) is rather antisocial :P

Dave.

Will Abson

Feb 8, 2012, 10:46:07 AM
to scrap...@googlegroups.com
Thanks for the help! I had a nasty feeling that might be the case, but
since it's hardly an unreasonable volume of requests (1/week) I was
surprised that they seem to have blocked ScraperWiki completely.

Anyway, a nicely-worded request has been sent to them. I'll follow up
with their response, since it raises a few issues around scraping of
non-public sites which I'm sure others will be interested in.

Cheers,
Will.

Will Abson

Feb 10, 2012, 2:17:12 PM
to scrap...@googlegroups.com
I received the following response from one of the parkrun admins today
- not good news unfortunately, so I have stopped the scraper from
running again and updated the description text to reflect the current
status - see https://scraperwiki.com/scrapers/richmond_parkrun_results/

They cite load issues, but my suspicion is that it's more about
ownership/rights around the data. There may be an API available in the
future, but probably not any time soon, and it will likely come with a
restrictive set of terms from what I can infer. So lesson learned, and
I'm putting this one to bed.

Hi Will

Sorry about the infinite loop thing - unforeseen effect of a url rewriter.

In terms of screenscraping, we've got to be very strict about this. Even
though you don't envisage what you've done will attract a heavy load, we
must be consistent about it, i.e. either allow everyone or no-one. Since
allowing everyone could be potentially ruinous (we've had to step in on a
number of occasions so far), it has to be no-one.

We realise there are ways to give the appearance of a human-originating
request (i.e. browser), but I'm urging everyone not to do this - as I
say, it's recently become an issue, and we're now looking for tell-tale
patterns of requests from IP addresses etc.

On a positive note, please see my column (top right) in this week's
newsletter (http://www.parkrun.com/about/news in case you've opted out of
the email). By coincidence, I addressed this exact issue and what we plan
to do in the future. Search for "API" in the newsletter. Can't give any firm
timescales yet though.

Also, as the contact for your club, you should be receiving the club
email, which should contain the same info it sounds like you were trying
to mine out of the results page. Let me know if you're not getting that
email. Suspect you possibly are but it's not giving you all the
fields/info you would have liked.

Thanks for your understanding.

<name removed>
parkrun HQ

I did send a reply back, politely disagreeing with their blanket
policy, but agreeing not to scrape the site myself again, and offering
to work with them to help define their API when/if it eventually comes.
I won't bother spamming this list with the full text, but if anyone's
interested I'm happy to forward it on to you individually, along with
any further response that I get.

Thanks,
Will

Thomas Levine

Feb 10, 2012, 5:30:30 PM
to scrap...@googlegroups.com
How many pages would you have downloaded had we not been blocked?

Will Abson

Feb 10, 2012, 5:47:27 PM
to scrap...@googlegroups.com
It's actually the same page[1], which they update each week with fresh results.

But anyway, according to the log it first failed on 30/01 and then
again the week after on 07/02. So it would have run twice, or three
times if you include the manual run I did later on 07/02 while trying
to work out what on earth was going on.

Prior to it getting blocked it had 5 successful runs, all fetching the
same single page.

Does that help?

Cheers,
Will.

[1] http://www.parkrun.org.uk/richmond/results/latestresults

Thomas Levine

Feb 11, 2012, 1:28:19 AM
to scrap...@googlegroups.com
If you visit that web page to check scores anyway and you just run the downloaded page through your parser, you won't be increasing their load.
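
Something like this sketch shows what I mean (the file name and the
row/cell selectors are placeholders, so adjust them to the real markup):

import lxml.html

# Parse a copy of the results page saved from the browser, then reuse the
# same sort of lxml parsing the scraper already does on the live page.
results_html = lxml.html.parse("latestresults.html").getroot()
for row in results_html.cssselect("table tr"):
    print [cell.text_content().strip() for cell in row.cssselect("td")]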

Tom