Suppose the code looks like this:
...
@config(age=10 * 24 * 60 * 60)  # task result stays valid for 10 days (age is in seconds)
def index_page(self, response):
    # Follow every absolute http(s) link found on the index page.
    for each in response.doc('a[href^="http"]').items():
        self.crawl(each.attr.href, callback=self.detail_page)

@config(priority=2)  # detail pages are scheduled with higher priority
def detail_page(self, response):
    return {"url": response.url}
...
If I understand it correctly, the logic is:
The program crawls the URL of every detail_page.
If a task's status is "SUCCESS" or "FAILED", the task will be recrawled after 10 days (the age set above).
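(If I read the docs right, the same options can also be passed per request instead of through the @config decorator on the callback; a minimal sketch of what I mean, using the same link-following call as above:)

    # Per-request equivalent: priority and age can be passed directly
    # to self.crawl instead of via @config on the callback.
    self.crawl(each.attr.href,
               callback=self.detail_page,
               priority=2,
               age=10 * 24 * 60 * 60)  # 10 days, in seconds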
I have just started a project that has crawled nearly 100k URLs. Some of them failed due to a proxy problem, so I modified the code like this:
def on_start(self):
    Handler.crawl_config['proxy'] = 'xxx.xxx.xx.xxx:xxxx'
...
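(For reference, pyspider also supports setting this once at class level via crawl_config; a minimal sketch, keeping my placeholder address:)

    from pyspider.libs.base_handler import *

    class Handler(BaseHandler):
        # Applies the proxy to every request the project issues,
        # without mutating crawl_config inside on_start().
        crawl_config = {
            'proxy': 'xxx.xxx.xx.xxx:xxxx',
        }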
Right now I manually change the proxy setting every time the proxy fails, but I want to recrawl only the 'FAILED' tasks. What should I do?
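One direction I have been considering (completely unverified; the database URL, project name, and output file below are placeholders for my setup) is to pull the FAILED tasks out of the taskdb with a one-off script and re-seed them:

    # One-off script, run outside the crawler process.
    # ASSUMPTIONS: default sqlite taskdb location and a project named
    # 'my_project' -- both are placeholders for the real deployment.
    from pyspider.database import connect_database

    taskdb = connect_database('sqlite+taskdb:///data/task.db')

    # load_tasks() yields task dicts for the given status constant.
    failed_urls = [
        task['url']
        for task in taskdb.load_tasks(taskdb.FAILED,
                                      project='my_project',
                                      fields=['url'])
    ]

    with open('failed_urls.txt', 'w') as f:
        f.write('\n'.join(failed_urls))

Then, inside the handler, I could re-crawl each saved URL with something like self.crawl(url, callback=self.detail_page, age=0, force_update=True), so the 10-day age window does not suppress the retry. But I do not know whether this is the intended way, so any pointer would be appreciated.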