Global Exception Handling - HttpException


Kyle Clarke

Mar 15, 2010, 6:44:06 PM
to scrapy-users
Hi all, I'm about to put a Scrapy crawler into a production
environment, so I want to track certain events that may happen
while crawling.

I have looked through the tutorials and online examples/recipes to find
a "real life" example of using the logging features and catching
exceptions, especially the HttpExceptions.

The logging aspect I have found to be fine, e.g.
from scrapy import log and then call the log.msg("message for mum") method,
so that's all ok.
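
For reference, the pattern I mean is roughly this (using the log module
from the Scrapy version I'm on; the exact API may differ between releases):

    from scrapy import log

    log.msg("message for mum", level=log.INFO)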

With the exceptions, however, I was thinking that it would be best to
wrap the Scrapy project in a try/except block, check for each type
of exception raised, and then code my project accordingly to deal with
these. At first I thought to put a try/except block around the
execute() method of the scrapy-ctl.py file, though this would then be
a Scrapy-wide catch, not an individual project catch, which you would
need in order to process exceptions differently for each project.

Could someone please show/tell me where the best place to trap
all of the exceptions would be? Unfortunately I am a beginner Python dev. I
imagined that when an exception is thrown, it bubbles its way to the
top of the chain to be caught.

From reading the docs, the HttpException is raised
by the downloader. Does this mean I am required to write my own custom
downloader middleware, add it to the settings and then explicitly check
the status code of each response, letting a 200 response pass through
so that the remaining downloader middlewares continue to run, and
dropping/raising exceptions for anything else? Or can I check this another way, e.g.
in the spiders? As I have many spiders, I imagine it would be best to
have my own custom downloader middleware so the code that checks the
200 OK status is encapsulated and applied to all spiders of the project.
In addition, my main reason for the HttpException check is to find out
whether I have been banned/blocked from the website being scraped,
though I will still get a 200 OK status if the website redirects me to a
"you are banned" page. In that case I presume my only option is to check the
response body for a certain XPath, then raise, log and email the issue.
Any help here would be appreciated.
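
To make that concrete, something like the following downloader middleware
sketch is roughly what I have in mind (the class name, the settings entry
and the "you are banned" marker are made up, and the exact import paths
depend on the Scrapy version):

    from scrapy import log
    from scrapy.exceptions import IgnoreRequest  # scrapy.core.exceptions on older releases

    class BanDetectionMiddleware(object):

        def process_response(self, request, response, spider):
            # Non-200 responses: log them and let the rest of the chain decide.
            if response.status != 200:
                log.msg("Non-200 response (%d) for %s" % (response.status, response.url),
                        level=log.WARNING, spider=spider)
                return response

            # 200 OK, but the site may have redirected us to a ban page: check the body.
            if "you are banned" in response.body.lower():
                log.msg("Ban page detected at %s" % response.url,
                        level=log.ERROR, spider=spider)
                raise IgnoreRequest("banned at %s" % response.url)

            return response

    # Enabled per project in settings.py with something like:
    # DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.BanDetectionMiddleware': 543}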

Additionally, I am now running Scrapy behind a Tor daemon in conjunction
with Privoxy to mask all of my requests. If anyone wants info on this,
contact me directly at kyler...@gmail.com
Thanks
Kyle

Pablo Hoffman

Jun 16, 2010, 3:44:56 PM
to scrapy...@googlegroups.com
Hi Kyle,

I was looking at past emails from this group and found this one that went
unanswered. I was wondering, how did you manage to solve this issue? Did you
end up writing your own middleware for logging this?

What we have used in the past is the DownloaderStats middleware (enabled by
default) from:

scrapy.contrib.downloadermiddleware.stats.DownloaderStats

It records, among other things, the number of responses received per status code.

Perhaps you can run some checks when the spider closes by looking at the stats
and trigger notifications based on some custom criteria.
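
A rough sketch of what I mean, written against the extension/signals API
(the class name, thresholds and enabling snippet are just illustrative, and
the hook mechanism differs a bit between Scrapy versions):

    from scrapy import signals

    class StatsCheckExtension(object):

        def __init__(self, stats):
            self.stats = stats

        @classmethod
        def from_crawler(cls, crawler):
            ext = cls(crawler.stats)
            crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
            return ext

        def spider_closed(self, spider):
            stats = self.stats.get_stats(spider)
            ok = stats.get('downloader/response_status_count/200', 0)
            total = stats.get('downloader/response_count', 0)
            items = stats.get('item_scraped_count', 0)
            # Custom criteria (invented here): no items scraped, or too many
            # non-200 responses, means the run probably needs attention.
            if items == 0 or (total and ok < 0.9 * total):
                spider.log("Run looks unhealthy: %d items, %d/%d 200 responses"
                           % (items, ok, total))

    # Enabled with something like:
    # EXTENSIONS = {'myproject.extensions.StatsCheckExtension': 500}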

I agree it could be worth looking at some common patterns of problems (like no
items scraped, too many non-200 responses, too many responses vs. items, etc.) to
build a default exception handling/notification mechanism. I expect it would
log the results/stats of previous runs and compare them with the current/last run, to
figure out if there is a problem.

I think the idea in the end would be to provide a dashboard that shows which
spiders are running OK, which have some problems, and which are not working at
all. That would be a really nice feature to provide out of the box, one that lots
of projects would benefit from.

Pablo.


Kyle Clarke

Jun 18, 2010, 6:13:44 PM
to scrapy...@googlegroups.com
Hi Pablo, I guess in a nutshell I didn't solve this issue. I currently wrap try/except blocks around the pieces of code that I think are necessary, e.g. database commits and persisting up to Amazon S3. Additionally, as I have a piece of mailer code, I send myself an email if I think the issue is serious enough. I had to create the mailer class to email crawl-percentage stats to my client anyway.
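
Roughly, the mailer side looks like this (a sketch only; my actual mailer
class is my own, and the addresses, subject and helper name are placeholders,
here shown with Scrapy's built-in MailSender for illustration):

    from scrapy.mail import MailSender

    mailer = MailSender(smtphost="localhost", mailfrom="crawler@example.com")

    def notify_failure(spider_name, exc):
        # Called from the except blocks around the database commits
        # and the S3 uploads.
        mailer.send(to=["kyle@example.com"],
                    subject="[%s] crawl problem" % spider_name,
                    body="The crawl hit a problem: %r" % exc)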

At this stage, I do not believe the site I'm scraping will be able to ban my requests while I'm behind Tor, though it has come at a real cost in speed. I have no elegant Scrapy/Python solution, mostly due to this being a "side" project and because I'm a beginner Python dev (though I can PHP anything).

Thanks for your response, though; it would be great to have a default exception handling/notification mechanism. Like all things, I'm sure it's in the development backlog! Not sure what will push the priority higher, ha!

Also, on a side note, thanks for open-sourcing your dev work here with Scrapy. It definitely has some real power under the hood, although I don't understand a lot of it in detail!
Best regards
Kyle