Scrapy Errback not triggering

Faheem Nadeem

Jun 18, 2015, 7:02:26 AM
to scrapy...@googlegroups.com
I am trying to write a customised RobotsTxtMiddleware; the following code snippets are from it. I need to fetch robots.txt through an nginx proxy first. My problem is with the errback callback: it seems to be triggered only for some errors, like HTTP 4xx responses, but it is not called at all for failures like DNS errors, connection refused, or a dropped internet connection. The funny thing is that the same pattern works perfectly when I use it in a spider, where the errback captures all kinds of errors. Looking at the downloader stats for the scenario where my proxy is disabled, I see a connection-refused entry, but somehow it is never picked up by the errback callback. I am using Scrapy v0.24. Am I missing something?

"downloader/exception_count": 1, 

"downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError": 1, 


def _download_robots(self, robots_url, req_netloc, spider):
    print "requested download: %s" % robots_url
    robots_req = Request(
        robots_url,
        priority=self.DOWNLOAD_PRIORITY,
        meta={'bypass_robots': True, 'req_netloc': req_netloc}
    )
    dfd = self.crawler.engine.download(robots_req, spider)
    dfd.addCallback(self._download_success)
    dfd.addErrback(self._download_error, robots_req, spider)
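
As a sanity check outside Scrapy entirely: plain Twisted does deliver ConnectionRefusedError to an errback attached like this, so the problem seems specific to how engine.download() hands failures back. A minimal script (assuming Twisted's Agent API):

from twisted.internet import reactor
from twisted.web.client import Agent

def on_response(response):
    print 'callback: HTTP %d' % response.code
    reactor.stop()

def on_failure(failure):
    # A closed port ends up here as twisted.internet.error.ConnectionRefusedError
    print 'errback: %r' % failure.type
    reactor.stop()

agent = Agent(reactor)
d = agent.request('GET', 'http://127.0.0.1:1/robots.txt')
d.addCallback(on_response)
d.addErrback(on_failure)
reactor.run()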
 

def _download_error(self, failure, request, spider):
    # Not called for non-HTTP errors :(
    netloc = request.meta.get('req_netloc')
    url = urlparse_cached(request).geturl()
    print 'download error ' + url

    # Check if we have failed via nginx; if so, we try directly
    if self._downloading_robots[netloc][0] == CacheLocation.nginx:
        self._downloading_robots[netloc] = CacheLocation.nginx, DownloadingStatus.downloaded
        print 'download error nginx'
    else:
        print 'download error direct'
        # We have failed directly too, check response codes and act accordingly
        if isinstance(failure.value, HttpError):
            http_status = failure.value.response.status
            status = http_status if http_status in (401, 403, 404) else 404
            print 'status http %d' % status
        else:
            # Rest of failures, allow fetching ;)
            status = 404
            print 'status failure %d' % status

        # Make a reppy rule and add it
        rules = Rules(netloc, status, '', time() + self._cache_lifespan)
        self._robots_cache.add(rules)

        # Remove from downloading robots
        self._downloading_robots.pop(netloc, None)
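
To make it obvious what (if anything) reaches the errback, I would instrument the top of _download_error with an explicit failure.check; a sketch (the imports and the set of exception classes are just illustrative):

from twisted.internet.error import (DNSLookupError, ConnectionRefusedError,
                                    TimeoutError)

def _download_error(self, failure, request, spider):
    # First thing: record exactly what kind of failure arrived.
    print 'errback got %r for %s' % (failure.type, request.url)
    if failure.check(DNSLookupError, ConnectionRefusedError, TimeoutError):
        # Network-level failure: proxy down, bad DNS, timeout, etc.
        print 'network-level failure'
    # ... rest of the handling as above ...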