I want to scrape these URLs as fast as possible, but I don't want to send many requests to the same domain at once; there should be some delay between them.
For this purpose, I've created a GenericScraper. An instance of this spider gets two arguments: a list of URLs and an XPath. When I want to get the data, I instantiate all the spiders and run them.
The problem is that when I do this for all spider instances at once, I get a lot of timeouts (more than half of the requests end with a timeout).
So my solution was to instantiate and crawl the first 50 spiders, then the next 50, and so on, but this raises ReactorNotRestartable.
I'm new to Scrapy, so I'd appreciate any advice; maybe this isn't the best approach. Thanks.
import scrapy
from scrapy.selector import HtmlXPathSelector


class GenericScraper(scrapy.Spider):
    download_timeout = 20
    name = 'will_be_overridden'
    custom_settings = {'CONCURRENT_REQUESTS': 30,
                       'DOWNLOAD_DELAY': 1}

    def __init__(self, occs_occurence_scanning_id_map_dict):
        super(GenericScraper, self).__init__()
        ...

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse, errback=self.err,
                                 meta={'handle_httpstatus_all': True})

    def err(self, failure):
        ...

    def parse(self, response):
        ...
        hxs = HtmlXPathSelector(response)
        # save the result to the database
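By "some delay" I mean per-domain throttling. As far as I understand from the Scrapy settings documentation, this can be expressed purely through settings; the sketch below is what I'm aiming for (the values are placeholders, not tuned):

custom_settings = {
    'CONCURRENT_REQUESTS': 30,            # global cap across all domains
    'CONCURRENT_REQUESTS_PER_DOMAIN': 2,  # at most 2 requests in flight per domain
    'DOWNLOAD_DELAY': 1,                  # roughly 1 s between requests to the same domain
    'AUTOTHROTTLE_ENABLED': True,         # back off automatically when a site responds slowly
}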
And this is the method in which I instantiate the spiders and run the crawl:
def run_spiders():
    from scrapy.crawler import CrawlerProcess
    ....
    process = CrawlerProcess({'TELNETCONSOLE_ENABLED': 0,
                              "EXTENSIONS": {
                                  'scrapy.telnet.TelnetConsole': None
                              },
                              "LOG_FILE": 'scrapylog.log',
                              "CONCURRENT_REQUESTS": 30,
                              'REACTOR_THREADPOOL_MAXSIZE': 20,
                              "ROBOTSTXT_OBEY": False,
                              "USER_AGENT": ua.chrome,
                              "LOG_LEVEL": 'INFO',
                              "COOKIES_ENABLED": False})
    # THIS SCRAPES LESS THAN HALF OF THE URLS; THE REST END WITH TIMEOUTS
    for s in Site.objects.all():  # each Site contains a list of urls and an xpath
        ...
        process.crawl(occurence_spider.GenericScraper, site)
    process.start()
# THIS SCRAPES ONLY THE FIRST 50 SITES (without timeouts), THEN THE SECOND process.start() RAISES:
st = 0
while st < sites_count:
    for s in Site.objects.all()[st:st + 50]:
        ...
        process.crawl(occurence_spider.GenericScraper, occs_occurence_scanning_id_map_dict)
    st += 50
    process.start()

  File "C:\Users..., line 730, in startRunning
    raise error.ReactorNotRestartable()
ReactorNotRestartable
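Update: the Scrapy documentation on running multiple spiders in the same process shows a CrawlerRunner pattern with chained deferreds, which looks like it could run the batches inside a single reactor run instead of restarting it. This is only my untested sketch of that idea (the batch size and the function name are mine); it reuses my GenericScraper and Site model from above:

from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging({'LOG_LEVEL': 'INFO'})
runner = CrawlerRunner({'CONCURRENT_REQUESTS': 30})  # same settings dict as in run_spiders

@defer.inlineCallbacks
def crawl_in_batches(sites, batch_size=50):
    # Schedule one batch of crawls, wait until every crawl in the
    # batch has finished, then continue with the next batch.
    # Everything happens within a single reactor run.
    for i in range(0, len(sites), batch_size):
        for site in sites[i:i + batch_size]:
            runner.crawl(occurence_spider.GenericScraper, site)
        yield runner.join()  # this deferred fires when the whole batch is done
    reactor.stop()

crawl_in_batches(list(Site.objects.all()))
reactor.run()  # blocks until reactor.stop() is called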