I have looked through the tutorials and online examples/recipes to find
a "real life" example of using the logging features & catching
exceptions, especially HttpException.
The logging aspect I have worked out fine, e.g. from scrapy import log and
then call log.msg("message for mum"), so that is all OK.
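For reference, here is roughly what I mean inside a spider (a minimal sketch
using the scrapy.log API mentioned above; the spider name and URL are made up):

    # minimal logging sketch using the scrapy.log API mentioned above
    from scrapy import log
    from scrapy.spider import BaseSpider

    class MySpider(BaseSpider):
        name = "myspider"                      # hypothetical spider
        start_urls = ["http://example.com/"]

        def parse(self, response):
            # log an informational message for this crawl
            log.msg("Parsed %s" % response.url, level=log.INFO)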
With the exceptions, however, I was thinking it would be best to wrap the
scrapy project in a try/except block, check for each type of exception
raised, and then code my project accordingly to deal with them. At first I
thought to put a try/except block around the execute() method of the
scrapy-ctl.py file - though that would be a scrapy-wide catch, not an
individual project catch, which you would need in order to process
exceptions differently for each project.
Could someone please show/tell me where the best place to trap all of the
exceptions would be? Unfortunately I am a beginner Python dev. I imagined
that when an exception is thrown, it bubbles its way up to the top of the
chain where it can be caught.
From reading the docs regarding HttpException - this is raised by the
downloader. Does this mean I am required to write my own custom downloader
middleware, add it to the settings, and then explicitly check the status
code on the response - returning it untouched if a 200 status code is found
so that the rest of the middleware chain continues to run, and dropping the
request/raising exceptions for anything else? Or can I check this another
way, e.g. in the spiders? Since I have many spiders, I imagine it would be
best to have my own custom downloader middleware, so the 200 OK check is
encapsulated in one place and applied to all spiders of the project.
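For what it's worth, here is roughly what I had in mind - a rough sketch
only, where BanDetectionMiddleware and the myproject.middlewares path are
made-up names, and import paths may differ between Scrapy versions:

    # rough sketch of a downloader middleware that checks response status codes
    # enable it in settings.py, e.g.:
    #   DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.BanDetectionMiddleware': 543}
    from scrapy.exceptions import IgnoreRequest

    class BanDetectionMiddleware(object):

        def process_response(self, request, response, spider):
            if response.status == 200:
                # return the response untouched so the rest of the middleware
                # chain and the spider keep processing it
                return response
            # anything else: log it and drop the request
            spider.log("Got status %d for %s" % (response.status, request.url))
            raise IgnoreRequest("non-200 response: %s" % request.url)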
In addition, my main reason for the HttpException check is to find out
whether I have been banned/blocked from the website being scraped - though
I will still get a 200 OK status if the website redirects me to a "you are
banned" page. In that case I presume my only option is to check the
response body for a certain XPath, then raise, log & email the issue.
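Something along these lines is what I was picturing for that check (the
XPath and marker text are obviously hypothetical and site-specific, and
this uses the newer Selector API - older versions used HtmlXPathSelector):

    # sketch of detecting a "you are banned" page that still returns 200 OK
    from scrapy.exceptions import IgnoreRequest
    from scrapy.selector import Selector

    BAN_XPATH = '//h1[contains(., "You have been banned")]'  # hypothetical marker

    def check_for_ban(request, response, spider):
        if Selector(response).xpath(BAN_XPATH):
            spider.log("Possible ban page at %s" % response.url)
            # could also send a notification here, e.g. via scrapy.mail.MailSender
            raise IgnoreRequest("ban page detected: %s" % response.url)
        return response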
Any help here would be appreciated.
Additionally, I am now running scrapy behind a TOR daemon in conjunction
with privoxy to mask all of my requests. If anyone wants info on this -
contact me directly: kyler...@gmail.com
Thanks
Kyle
I was looking at past emails from this group and found this one that went
unreplied. I was wondering, how did you manage to solve this issue? Did you
end up writing your own middleware for logging this?
What we have used in the past is the DownloaderStats middleware (enabled by
default) from:
scrapy.contrib.downloadermiddleware.stats.DownloaderStats
It records the number of responses received per status code.
Perhaps you can run some checks when the spider closes by looking at the stats
and trigger notifications based on some custom criteria.
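As a rough illustration of that idea (the extension name, threshold and
settings entry below are made up, and this uses the crawler/signals API of
more recent Scrapy versions):

    # rough sketch of an extension that checks downloader stats on spider close
    # enable it in settings.py, e.g.:
    #   EXTENSIONS = {'myproject.extensions.StatusReportExtension': 500}
    from scrapy import signals

    class StatusReportExtension(object):

        def __init__(self, stats):
            self.stats = stats

        @classmethod
        def from_crawler(cls, crawler):
            ext = cls(crawler.stats)
            crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
            return ext

        def spider_closed(self, spider):
            ok = self.stats.get_value('downloader/response_status_count/200', 0)
            total = self.stats.get_value('downloader/response_count', 0)
            if total and ok < 0.9 * total:  # arbitrary example threshold
                spider.log("Only %s of %s responses were 200 OK" % (ok, total))
                # ...trigger an email/notification here...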
I agree it could be worth looking at some common patterns of problems (like no
items scraped, too many non-200 responses, too many responses vs. items, etc.)
to build a default exception handling/notification mechanism. I expect it would
log the results/stats of previous runs and compare them with the current/last
run, to figure out if there is a problem.
I think the idea in the end would be to provide a dashboard that shows which
spiders are running OK, which have some problems, and which are not working at
all. That would be a really nice feature to provide out of the box, one that
lots of projects would benefit from.
Pablo.