How can I recrawl failed tasks only?


zev...@hotmail.com

Oct 5, 2017, 10:09:23 PM
to pyspider-users
Suppose the code looks like this:
    ...
    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        return {"url": response.url}
    ...

If I understand correctly, the logic is:
  The program will crawl the URL of every detail page.
  If a task's status is SUCCESS or FAILED, the task will be recrawled after 10 days.

I just started a project that crawled nearly 100k URLs.
Some of them failed because of proxy problems, so I modified the code like this:

def on_start(self):
    Handler.crawl_config['proxy'] = 'xxx.xxx.xx.xxx:xxxx'
    ...

I manually change the proxy setting every time the proxy fails.
But I want to recrawl only the 'FAILED' tasks. What should I do?
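
For reference, self.crawl() also takes a proxy argument, so the proxy could be set per request instead of changing the shared crawl_config. A minimal sketch, reusing the same placeholder address:

    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            # Per-request proxy override (same placeholder address as above).
            self.crawl(each.attr.href,
                       callback=self.detail_page,
                       proxy='xxx.xxx.xx.xxx:xxxx')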

Roy Binux

Oct 7, 2017, 6:14:12 PM
to zev...@hotmail.com, pyspider-users
> The program will crawl the URL of every detail page.
> If a task's status is SUCCESS or FAILED, the task will be recrawled after 10 days.

No, the index page will be recrawled after 10 days, not the detail pages.

You cannot restart only the failed URLs unless you know which URLs failed.
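
If you do have them (for example, exported from the taskdb or from your own logs), you could re-queue them from on_start. A minimal sketch, assuming force_update makes the scheduler restart a task whose age has not expired yet:

    def on_start(self):
        # Assumed: a list of failed URLs collected on your side
        # (exported from the taskdb, your own logs, etc.).
        failed_urls = [
            'http://example.com/detail/1',
            'http://example.com/detail/2',
        ]
        for url in failed_urls:
            # force_update asks the scheduler to restart an existing task
            # even though its age has not expired yet.
            self.crawl(url, callback=self.detail_page, force_update=True)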



zev...@hotmail.com

Oct 9, 2017, 2:38:25 AM
to pyspider-users
I wanted to override on_result so that when detail_page fails to crawl, the program can export the URL to a DB.
Then I could open a new project to recrawl all the failed URLs.

But here's the problem:
  How do I catch the HTTP error exception? I put try/except in every function but still failed to catch the HTTP error exceptions (503, 599, etc.).
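
For reference, pyspider raises the HTTP error inside the framework before the callback runs, so a try/except inside the callback never sees it. A minimal sketch of an alternative, assuming the @catch_status_code_error decorator from pyspider.libs.base_handler lets the callback receive non-200 responses (save_failed_url is a hypothetical helper):

    from pyspider.libs.base_handler import *

    class Handler(BaseHandler):

        @config(priority=2)
        @catch_status_code_error
        def detail_page(self, response):
            if response.status_code != 200:
                # 503, 599, etc. end up here instead of raising before the
                # callback is ever called.
                self.save_failed_url(response.url)  # hypothetical helper
                return
            return {"url": response.url}

        def save_failed_url(self, url):
            # Hypothetical: record the failed URL somewhere you can read back
            # later (a DB table, a file, ...).
            pass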