Scrapy Errback not triggering

Faheem Nadeem

Jun 18, 2015, 7:02:26 AM
to scrapy...@googlegroups.com
I am trying to write a customised RobotsTxtMiddleware; the following code snippets are from it. I need to fetch robots.txt through an nginx proxy first. My problem is with the errback callback: it seems to be triggered only for some errors, like HTTP 4xx responses, but it is not called at all for failures like DNS errors, connection refused, or a dropped internet connection. The funny thing is that the same pattern works perfectly when I use it in a spider, where the errback captures all kinds of errors. Looking at the downloader stats for the scenario where my proxy is disabled, I see a connection-refused entry, but somehow it is never picked up by the errback callback. I am using Scrapy v0.24. Am I missing something?

"downloader/exception_count": 1, 

"downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError": 1, 


def _download_robots(self, robots_url, req_netloc, spider):
    print "requested download: %s" % robots_url
    robots_req = Request(
        robots_url,
        priority=self.DOWNLOAD_PRIORITY,
        meta={'bypass_robots': True, 'req_netloc': req_netloc}
    )
    dfd = self.crawler.engine.download(robots_req, spider)
    dfd.addCallback(self._download_success)
    dfd.addErrback(self._download_error, robots_req, spider)
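
As a sanity check outside Scrapy entirely: plain Twisted does deliver ConnectionRefusedError to an errback attached like this, so the problem seems specific to how engine.download() hands failures back. A minimal script (assuming Twisted's Agent API):

from twisted.internet import reactor
from twisted.web.client import Agent

def on_response(response):
    print 'callback: HTTP %d' % response.code
    reactor.stop()

def on_failure(failure):
    # A closed port ends up here as twisted.internet.error.ConnectionRefusedError
    print 'errback: %r' % failure.type
    reactor.stop()

agent = Agent(reactor)
d = agent.request('GET', 'http://127.0.0.1:1/robots.txt')
d.addCallback(on_response)
d.addErrback(on_failure)
reactor.run()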
 

def _download_error(self, failure, request, spider):
    # Not called for non-HTTP errors :(
    netloc = request.meta.get('req_netloc')
    url = urlparse_cached(request).geturl()
    print 'download error ' + url

    # Check if we have failed via nginx; if so, we try directly
    if self._downloading_robots[netloc][0] == CacheLocation.nginx:
        self._downloading_robots[netloc] = CacheLocation.nginx, DownloadingStatus.downloaded
        print 'download error nginx'
    else:
        print 'download error direct'
        # We have failed directly too, check response codes and act accordingly
        if isinstance(failure.value, HttpError):
            http_status = failure.value.response.status
            status = http_status if http_status in (401, 403, 404) else 404
            print 'status http %d' % status
        else:
            # Rest of failures, allow fetching ;)
            status = 404
            print 'status failure %d' % status

        # Make a reppy rule and add it
        rules = Rules(netloc, status, '', time() + self._cache_lifespan)
        self._robots_cache.add(rules)

        # Remove from downloading robots
        self._downloading_robots.pop(netloc, None)
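
To make it obvious what (if anything) reaches the errback, I would instrument the top of _download_error with an explicit failure.check; a sketch (the imports and the set of exception classes are just illustrative):

from twisted.internet.error import (DNSLookupError, ConnectionRefusedError,
                                    TimeoutError)

def _download_error(self, failure, request, spider):
    # First thing: record exactly what kind of failure arrived.
    print 'errback got %r for %s' % (failure.type, request.url)
    if failure.check(DNSLookupError, ConnectionRefusedError, TimeoutError):
        # Network-level failure: proxy down, bad DNS, timeout, etc.
        print 'network-level failure'
    # ... rest of the handling as above ...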