Broad crawl - more than half of the responses time out when I use a lot of spider instances

Futile Studio

Apr 3, 2017, 8:10:04 AM
to scrapy-users
Hello,

I have a database of URLs together with the XPaths needed to extract elements from them. One site can contribute many URLs, e.g. example.com/article1, example.com/article2..., which all share the same XPaths.
I want to scrape those URLs as fast as possible, but I don't want to send many requests to a single domain at once; there should be some delay between them.
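
To be concrete, this is roughly the kind of per-domain throttling I have in mind - the setting names are standard Scrapy settings, the numbers are just placeholders:

custom_settings = {
    'CONCURRENT_REQUESTS': 30,            # total parallel requests across all domains
    'CONCURRENT_REQUESTS_PER_DOMAIN': 2,  # never hit one site with more than 2 requests at a time
    'DOWNLOAD_DELAY': 1,                  # roughly 1 second between requests to the same domain
}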

For this purpose I've created a GenericSpider. Each instance of this spider receives a list of URLs and an XPath as arguments. When I want to get the data, I instantiate all the spiders and run them.

The problem is that when I do this for all spider instances at once, there are a lot of timeouts (more than half of the requests end with a timeout).

But when I do it for only 50 spiders, everything works correctly.

So my idea is to instantiate and crawl the first 50 spiders, then the next 50, and so on - but that raises ReactorNotRestartable.
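
From the "running multiple spiders in the same process" part of the docs it looks like CrawlerRunner with chained deferreds would let me run the batches while starting the reactor only once - something roughly like this (untested sketch; crawl_in_batches and batch_size are just names I made up):

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging({'LOG_LEVEL': 'INFO'})
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl_in_batches(sites, batch_size=50):
    for start in range(0, len(sites), batch_size):
        batch = sites[start:start + batch_size]
        # schedule one GenericScraper per site, then wait for the whole batch to finish
        yield defer.DeferredList([runner.crawl(occurence_spider.GenericScraper, site)
                                  for site in batch])
    reactor.stop()

crawl_in_batches(list(Site.objects.all()))
reactor.run()  # the reactor is started exactly once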


I'm new to Scrapy, so I appreciate any advice - maybe this isn't the best solution. Thanks.


import scrapy
from scrapy.selector import HtmlXPathSelector


class GenericScraper(scrapy.Spider):
    download_timeout = 20
    name = 'will_be_overriden'
    custom_settings = {'CONCURRENT_REQUESTS': 30,
                       'DOWNLOAD_DELAY': 1}

    def __init__(self, occs_occurence_scanning_id_map_dict):
        super(GenericScraper, self).__init__()
        ...

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse, errback=self.err,
                                 meta={'handle_httpstatus_all': True})

    def err(self, failure):
        ...

    def parse(self, response):
        ...
        hxs = HtmlXPathSelector(response)
        # extract the elements with the site's XPath and save the result to the database
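
The err callback above is currently just a stub; something along these lines is what it will do (sketch - the timeout check follows the errback example from the Scrapy docs):

from twisted.internet.error import TimeoutError, TCPTimedOutError

def err(self, failure):
    # log which URL timed out so I can see whether particular domains are the problem
    if failure.check(TimeoutError, TCPTimedOutError):
        self.logger.warning('Timeout: %s', failure.request.url)
    else:
        self.logger.error('Other failure %r on %s', failure, failure.request.url)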

And this is the method in which I instantiate the spiders and run the crawl:

def run_spiders():
    from scrapy.crawler import CrawlerProcess
    ...
    process = CrawlerProcess({'TELNETCONSOLE_ENABLED': 0,
                              "EXTENSIONS": {
                                  'scrapy.telnet.TelnetConsole': None
                              },
                              "LOG_FILE": 'scrapylog.log',
                              "CONCURRENT_REQUESTS": 30,
                              'REACTOR_THREADPOOL_MAXSIZE': 20,
                              "ROBOTSTXT_OBEY": False,
                              "USER_AGENT": ua.chrome,
                              "LOG_LEVEL": 'INFO',
                              "COOKIES_ENABLED": False})

    # THIS SCRAPES LESS THAN HALF OF THE URLS, THE REST END WITH TIMEOUTS
    for s in Site.objects.all():  # a Site holds the list of URLs and the XPath
        ...
        process.crawl(occurence_spider.GenericScraper, site)

    process.start()


    # THIS SCRAPES ONLY THE FIRST 50 SITES (without timeouts), THEN IT RAISES:
    #
    #   File "C:\Users..., line 730, in startRunning
    #     raise error.ReactorNotRestartable()
    #   ReactorNotRestartable

    st = 50
    while st < sites_count:
        st += 50
        for s in Site.objects.all()[st:st+50]:
            ...
            process.crawl(occurence_spider.GenericScraper, occs_occurence_scanning_id_map_dict)

        process.start()