how to catch a download error and retry the download?


hongleij

Jan 18, 2011, 3:46:20 AM
to scrapy-users
I hit some errors while downloading, but I can browse the same pages fine in Firefox
or Chrome, and if I download again, the download succeeds.
So how can I catch the error and retry?

2011-01-17 20:35:37+0800 [completedownload.nexusphp] ERROR: Error
downloading <http://pt.sjtu.edu.cn/viewsnatches.php?id=53551>:
[Failure instance: Traceback: <type 'exceptions.IOError'>: Not a
gzipped file
C:\Python26\lib\site-packages\twisted\internet\defer.py:455:_startRunCallbacks
C:\Python26\lib\site-packages\twisted\internet\defer.py:542:_runCallbacks
C:\Python26\lib\site-packages\twisted\internet\defer.py:361:callback
C:\Python26\lib\site-packages\twisted\internet\defer.py:455:_startRunCallbacks
--- <exception caught here> ---
C:\Python26\lib\site-packages\twisted\internet\defer.py:542:_runCallbacks
C:\Python26\lib\site-packages\scrapy-0.13.0-py2.6.egg\scrapy\core\downloader\middleware.py:46:process_response
C:\Python26\lib\site-packages\scrapy-0.13.0-py2.6.egg\scrapy\contrib\downloadermiddleware\httpcompression.py:21:process_response
C:\Python26\lib\site-packages\scrapy-0.13.0-py2.6.egg\scrapy\contrib\downloadermiddleware\httpcompression.py:37:_decode
C:\Python26\lib\gzip.py:212:read
C:\Python26\lib\gzip.py:255:_read
C:\Python26\lib\gzip.py:156:_read_gzip_header
]
2011-01-17 20:38:43+0800 [completedownload.nexusphp] ERROR: Error
downloading <http://pt.sjtu.edu.cn/viewsnatches.php?id=53159>:
[same "Not a gzipped file" traceback as above]
2011-01-17 20:39:10+0800 [completedownload.nexusphp] ERROR: Error
downloading <http://pt.sjtu.edu.cn/viewsnatches.php?id=53067>:
[same "Not a gzipped file" traceback as above]

Shane Evans

Jan 18, 2011, 8:18:22 AM
to scrapy...@googlegroups.com
This looks like an error with gzip decompression, similar to an error reported on this list before.
What version of Scrapy are you using?


--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To post to this group, send email to scrapy...@googlegroups.com.
To unsubscribe from this group, send email to scrapy-users...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/scrapy-users?hl=en.


hongleij

Jan 18, 2011, 7:44:54 PM
to scrapy-users
Scrapy-0.13.0.2539.win32.exe
ActivePython 2.6.6
Win7


hongleij

Jan 21, 2011, 5:38:19 AM
to scrapy-users
Could you give some suggestions?


Shane Evans

Jan 24, 2011, 4:35:40 PM
to scrapy...@googlegroups.com
Hmm... I tried "scrapy shell http://pt.sjtu.edu.cn/viewsnatches.php?id=53551" and it all seemed to work OK for me, as did fetching those pages from within a spider. However, I tried with the latest code on Linux. Do you still get the same problem with "scrapy shell"? How about a simple crawler that just fetches these pages? If you have something simple that reproduces this problem, that would be great.

One thing to try is to check if you still get these errors when you increase the delay between requests. Set http://doc.scrapy.org/0.13/topics/settings.html#concurrent-requests-per-spider to 1 and http://doc.scrapy.org/0.13/topics/settings.html#download-delay to a few seconds and see if you still get problems. It could be that heavy crawling is causing some problems for that website.
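In the project's settings.py that would look something like this (setting names as documented for Scrapy 0.13; the values here are illustrative):

```python
# settings.py -- a minimal sketch of the throttling suggested above
CONCURRENT_REQUESTS_PER_SPIDER = 1  # one request at a time
DOWNLOAD_DELAY = 3                  # wait a few seconds between requests
```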

If this is an intermittent problem on the website, then retrying as you suggest makes sense. You can be notified of errors by passing a function as the errback parameter to the Request object: http://doc.scrapy.org/0.13/topics/request-response.html#scrapy.http.Request . These requests are made by the spider. If you're using CrawlSpider, it manages creating these requests, so you may need to override a method or write a custom Rule (I think there should probably be an 'errback' parameter on the Rule object; it seems to be missing).
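Independent of Scrapy, the retry idea sketched above amounts to re-attempting a download a bounded number of times (the function names here are hypothetical, not part of any Scrapy API):

```python
def download_with_retries(fetch, url, max_retries=3):
    """Call fetch(url), retrying up to max_retries times on IOError."""
    last_error = None
    for _ in range(max_retries):
        try:
            return fetch(url)
        except IOError as exc:  # e.g. "Not a gzipped file"
            last_error = exc
    # All attempts failed; re-raise the last error
    raise last_error
```

In Scrapy itself the re-attempt would be done by re-scheduling the Request from the errback (or by the built-in RetryMiddleware), but the bounded-attempts logic is the same.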


hongleij

Mar 3, 2011, 9:51:49 AM
to scrapy-users
Many thanks! But how can I get the original request URL inside the
Request's errback function?


hongleij

Mar 3, 2011, 10:09:47 AM
to scrapy-users
My code looks like this:

def requestError(self, failure):
    pass  # I want to get the URL here

def parse(self, response):
    yield Request("http://www.hust.edu.cn/err",
                  callback=self.parse, errback=self.requestError)
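Depending on the Scrapy version, the failed request may or may not be exposed on the Failure object itself, so one version-independent trick is to bind the URL into the errback when you create the Request. A minimal sketch (handle_error is a hypothetical name; in the real spider you would pass errback=partial(self.handle_error, url) when yielding the Request):

```python
from functools import partial

def handle_error(url, failure):
    # In Scrapy, 'failure' would be a twisted.python.failure.Failure;
    # here we only show that the bound URL is recoverable on error.
    return "download failed for %s" % url

# Simulated call, as Scrapy would invoke the errback on failure:
errback = partial(handle_error, "http://www.hust.edu.cn/err")
print(errback(None))
```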

surfeurX

Apr 6, 2011, 4:52:42 AM
to scrapy-users
Hi all,

I had the same problem yesterday, but I avoided it by turning off
scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware
in the settings file:

DOWNLOADER_MIDDLEWARES_BASE = {
    'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
    'scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware': 550,
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750,
    ##### turned off:
    # 'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': 800,
    'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850,
    'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
}

Pablo Hoffman

Apr 6, 2011, 12:04:03 PM
to scrapy...@googlegroups.com
You can turn off that specific middleware by adding this to your project
settings:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': None,
}

Which is better than replicating the whole DOWNLOADER_MIDDLEWARES_BASE setting
in your project settings.

Btw, what Scrapy version are you using? I couldn't reproduce this with
0.12.2539.

Pablo.


surfeurX

Apr 6, 2011, 5:33:54 PM
to scrapy-users
I had the error on scrapy versions (0, 13, 0) and (0, 12, 0).
scrapy shell "http://nf.nfdaily.cn/spqy/default.htm"

