scrapy downloading Error -3 while decompressing: invalid stored block lengths


shiva krishna

Jun 6, 2012, 5:55:19 AM
to scrapy...@googlegroups.com
Hi, I am scraping a website that contains many URLs from which I need to fetch data.
I used XPath to extract all the hrefs (the URLs) and saved them into a list. I loop over this list and yield a Request for each URL. Below is my spider code:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request


    class ExampledotcomSpider(BaseSpider):
        name = "exampledotcom"
        allowed_domains = ["www.example.com"]
        start_urls = ["http://www.example.com/movies/city.html"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            cinema_links = hxs.select('//div[@class="contentArea"]/div[@class="leftNav"]/div[@class="cinema"]/div[@class="rc"]/div[@class="il"]/span[@class="bt"]/a/@href').extract()
            for cinema_hall in cinema_links:
                yield Request(cinema_hall, callback=self.parse_cinema)

        def parse_cinema(self, response):
            hxs = HtmlXPathSelector(response)
            cinemahall_name = hxs.select('//div[@class="companyDetails"]/div[@itemscope=""]/span[@class="srchrslt"]/h1/span/text()').extract()
            ........

Here, for example, I had 60 URLs in the list, and about 37 of them were not downloaded; for each of these URLs an error message appeared as below:

    2012-06-06 14:00:12+0530 [exampledotcom] ERROR: Error downloading <GET http://www.example.com/city/Cinema-Hall-70mm-%3Cnear%3E-place/040PXX40-XX40-000147377847-A6M3>: Error -3 while decompressing: invalid stored block lengths
    2012-06-06 14:00:12+0530 [exampledotcom] ERROR: Error downloading <GET http://www.example.com/city/Cinema-Hall-35mm-%3Cnear%3E-place/040PXX40-XX40-000164969686-H9C5>: Error -3 while decompressing: invalid stored block lengths
    
Scrapy downloads only some of the URLs and shows the errors above for the rest. I really cannot understand what is happening or what is wrong with my code.

Can anyone please suggest how to get rid of these errors?

Thanks in advance.


Shane Evans

Jun 6, 2012, 10:41:46 PM
to scrapy...@googlegroups.com

This looks more like an error downloading and decompressing the page than a problem in your code. You can try `scrapy fetch URL` to see whether Scrapy downloads the page correctly (and maybe try a few times, in case the failure is intermittent).

The error appears to happen while decompressing the response, so you could try disabling the compression middleware, or modifying the request headers to see if the website will give you the page uncompressed.
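Either suggestion can be expressed in the project's settings.py. This is only a sketch: the middleware path assumes the Scrapy 0.14 module layout used in this thread, and it assumes the DEFAULT_REQUEST_HEADERS setting is available in that version; 'identity' is the standard HTTP token for "no content coding".

```python
# settings.py sketch (middleware path assumes Scrapy 0.14, as in this thread)

# Option 1: disable the HTTP compression middleware entirely, so Scrapy
# never tries to gunzip the response body itself (setting it to None
# removes a built-in middleware).
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': None,
}

# Option 2: ask the server not to compress responses at all.
DEFAULT_REQUEST_HEADERS = {
    'Accept-Encoding': 'identity',
}
```

Note that with option 1, if the server sends a gzipped body anyway, it will reach your spider still compressed; option 2 depends on the server honoring the Accept-Encoding header.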

If it were possible to share the URL, we'd probably be able to help much more easily.

shiva krishna

Jun 7, 2012, 2:06:40 AM
to scrapy...@googlegroups.com
Yes, you are absolutely right: the problem occurs during decompression.

When I tried scrapy shell "URL", I got the following error:

    ERROR: Shell error
    Traceback (most recent call last):
      File "/usr/lib64/python2.7/threading.py", line 503, in __bootstrap
        self.__bootstrap_inner()
      File "/usr/lib64/python2.7/threading.py", line 530, in __bootstrap_inner
        self.run()
      File "/usr/lib64/python2.7/threading.py", line 483, in run
        self.__target(*self.__args, **self.__kwargs)
    --- <exception caught here> ---
      File "/usr/lib64/python2.7/site-packages/twisted/python/threadpool.py", line 207, in _worker
        result = context.call(ctx, function, *args, **kwargs)
      File "/usr/lib64/python2.7/site-packages/twisted/python/context.py", line 59, in callWithContext
        return self.currentContext().callWithContext(ctx, func, *args, **kw)
      File "/usr/lib64/python2.7/site-packages/twisted/python/context.py", line 37, in callWithContext
        return func(*args,**kw)
      File "/usr/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/shell.py", line 47, in _start
        self.fetch(url, spider)
      File "/usr/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/shell.py", line 88, in fetch
        self._schedule, request, spider)
      File "/usr/lib64/python2.7/site-packages/twisted/internet/threads.py", line 118, in blockingCallFromThread
        result.raiseException()
      File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 542, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/usr/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/core/downloader/middleware.py", line 46, in process_response
        response = method(request=request, response=response, spider=spider)
      File "/usr/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/contrib/downloadermiddleware/httpcompression.py", line 20, in process_response
        decoded_body = self._decode(response.body, encoding.lower())
      File "/usr/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/contrib/downloadermiddleware/httpcompression.py", line 36, in _decode
        body = gunzip(body)
      File "/usr/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/utils/gz.py", line 15, in gunzip
        chunk = f.read(8196)
      File "/usr/lib64/python2.7/gzip.py", line 252, in read
        self._read(readsize)
      File "/usr/lib64/python2.7/gzip.py", line 303, in _read
        uncompress = self.decompress.decompress(buf)
    zlib.error: Error -3 while decompressing: invalid distance too far back

So can we fix this by disabling the middleware, or by changing the headers? Can you tell me how to disable it in my spider code above?
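Besides disabling the middleware, the failure seen in the traceback can also be worked around at the zlib level: decompress incrementally and keep whatever bytes were recovered before the corrupt block, instead of letting one exception discard the whole response. A hedged sketch (the helper name gunzip_tolerant is hypothetical; 16 + zlib.MAX_WBITS tells zlib to expect a gzip wrapper around the deflate stream):

```python
import gzip
import zlib


def gunzip_tolerant(data, chunk_size=8192):
    # hypothetical helper: decompress gzip data chunk by chunk so that a
    # zlib error partway through keeps the output produced so far,
    # rather than raising and losing everything
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16+: expect gzip header
    out = []
    try:
        for i in range(0, len(data), chunk_size):
            out.append(d.decompress(data[i:i + chunk_size]))
        out.append(d.flush())
    except zlib.error:
        # keep the chunks that decompressed cleanly before the bad block
        pass
    return b''.join(out)


# intact data round-trips normally
print(gunzip_tolerant(gzip.compress(b'hello world')))  # prints b'hello world'
```

A truncated or corrupted body then yields a partial (possibly empty) byte string instead of an exception, which a spider could inspect or log rather than losing the request.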