Broad crawl - more than half of the responses time out when I use a lot of spider instances

Futile Studio

Apr 3, 2017, 8:10:04 AM
to scrapy-users
Hello,

I have a database of URLs together with the XPaths needed to extract elements from them. One site can contribute many URLs, e.g. example.com/article1, example.com/article2..., which all share the same XPaths.
I want to scrape those URLs as fast as possible, but I don't want to send many requests to a single domain at once; there should be some delay between them.
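
To be concrete, this is roughly the kind of per-domain throttling I have in mind - the setting names are standard Scrapy settings, the numbers are just placeholders:

custom_settings = {
    'CONCURRENT_REQUESTS': 30,            # total parallel requests across all domains
    'CONCURRENT_REQUESTS_PER_DOMAIN': 2,  # never hit one site with more than 2 requests at a time
    'DOWNLOAD_DELAY': 1,                  # roughly 1 second between requests to the same domain
}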

For this purpose I've created a GenericSpider. Each instance of this spider receives a list of URLs and an XPath as arguments. When I want to get the data, I instantiate all the spiders and run them.

The problem is that when I do this for all spider instances at once, there are a lot of timeouts (more than half of the requests end with a timeout).

But when I do it for only 50 spiders, everything works correctly.

So my idea is to instantiate and crawl the first 50 spiders, then the next 50, and so on - but that raises ReactorNotRestartable.
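
From the "running multiple spiders in the same process" part of the docs it looks like CrawlerRunner with chained deferreds would let me run the batches while starting the reactor only once - something roughly like this (untested sketch; crawl_in_batches and batch_size are just names I made up):

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging({'LOG_LEVEL': 'INFO'})
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl_in_batches(sites, batch_size=50):
    for start in range(0, len(sites), batch_size):
        batch = sites[start:start + batch_size]
        # schedule one GenericScraper per site, then wait for the whole batch to finish
        yield defer.DeferredList([runner.crawl(occurence_spider.GenericScraper, site)
                                  for site in batch])
    reactor.stop()

crawl_in_batches(list(Site.objects.all()))
reactor.run()  # the reactor is started exactly once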


I'm new to Scrapy, so I appreciate any advice - maybe this isn't the best solution. Thanks.


import scrapy
from scrapy.selector import HtmlXPathSelector


class GenericScraper(scrapy.Spider):
    download_timeout = 20
    name = 'will_be_overriden'
    custom_settings = {'CONCURRENT_REQUESTS': 30,
                       'DOWNLOAD_DELAY': 1}

    def __init__(self, occs_occurence_scanning_id_map_dict):
        super(GenericScraper, self).__init__()
        ...

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse, errback=self.err,
                                 meta={'handle_httpstatus_all': True})

    def err(self, failure):
        ...

    def parse(self, response):
        ...
        hxs = HtmlXPathSelector(response)
        # extract the elements with the site's XPath and save the result to the database
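
The err callback above is currently just a stub; something along these lines is what it will do (sketch - the timeout check follows the errback example from the Scrapy docs):

from twisted.internet.error import TimeoutError, TCPTimedOutError

def err(self, failure):
    # log which URL timed out so I can see whether particular domains are the problem
    if failure.check(TimeoutError, TCPTimedOutError):
        self.logger.warning('Timeout: %s', failure.request.url)
    else:
        self.logger.error('Other failure %r on %s', failure, failure.request.url)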

And this is the method in which I instantiate the spiders and run the crawl:

def run_spiders():
    from scrapy.crawler import CrawlerProcess
    ...
    process = CrawlerProcess({'TELNETCONSOLE_ENABLED': 0,
                              "EXTENSIONS": {
                                  'scrapy.telnet.TelnetConsole': None
                              },
                              "LOG_FILE": 'scrapylog.log',
                              "CONCURRENT_REQUESTS": 30,
                              'REACTOR_THREADPOOL_MAXSIZE': 20,
                              "ROBOTSTXT_OBEY": False,
                              "USER_AGENT": ua.chrome,
                              "LOG_LEVEL": 'INFO',
                              "COOKIES_ENABLED": False})

    # THIS SCRAPES LESS THAN HALF OF THE URLS, THE REST END WITH TIMEOUTS
    for s in Site.objects.all():  # a Site holds the list of URLs and the XPath
        ...
        process.crawl(occurence_spider.GenericScraper, site)

    process.start()


    # THIS SCRAPES ONLY THE FIRST 50 SITES (without timeouts), THEN IT RAISES:
    #
    #   File "C:\Users..., line 730, in startRunning
    #     raise error.ReactorNotRestartable()
    #   ReactorNotRestartable

    st = 50
    while st < sites_count:
        st += 50
        for s in Site.objects.all()[st:st+50]:
            ...
            process.crawl(occurence_spider.GenericScraper, occs_occurence_scanning_id_map_dict)

        process.start()