scrapy downloading Error -3 while decompressing: invalid stored block lengths


shiva krishna

Jun 6, 2012, 5:55:19 AM
to scrapy...@googlegroups.com
Hi, I am scraping a website that contains many URLs from which I need to fetch data.
I used XPath to extract all the hrefs (the URLs) and saved them into a list. I loop over this list and yield a Request for each URL. Below is my spider code:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request


    class ExampledotcomSpider(BaseSpider):
        name = "exampledotcom"
        allowed_domains = ["www.example.com"]
        start_urls = ["http://www.example.com/movies/city.html"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            cinema_links = hxs.select('//div[@class="contentArea"]/div[@class="leftNav"]/div[@class="cinema"]/div[@class="rc"]/div[@class="il"]/span[@class="bt"]/a/@href').extract()
            for cinema_hall in cinema_links:
                yield Request(cinema_hall, callback=self.parse_cinema)

        def parse_cinema(self, response):
            hxs = HtmlXPathSelector(response)
            cinemahall_name = hxs.select('//div[@class="companyDetails"]/div[@itemscope=""]/span[@class="srchrslt"]/h1/span/text()').extract()
            ........

Here, for example, I had 60 URLs in the list, and about 37 of them were not downloaded; for each of these URLs an error message appeared as below:

    2012-06-06 14:00:12+0530 [exampledotcom] ERROR: Error downloading <GET http://www.example.com/city/Cinema-Hall-70mm-%3Cnear%3E-place/040PXX40-XX40-000147377847-A6M3>: Error -3 while decompressing: invalid stored block lengths
    2012-06-06 14:00:12+0530 [exampledotcom] ERROR: Error downloading <GET http://www.example.com/city/Cinema-Hall-35mm-%3Cnear%3E-place/040PXX40-XX40-000164969686-H9C5>: Error -3 while decompressing: invalid stored block lengths
    
Scrapy downloads only some of the URLs and shows the errors above for the rest. I really cannot understand what is happening or what is wrong with my code.

Can anyone please suggest how to get rid of these errors?

Thanks in advance.


Shane Evans

Jun 6, 2012, 10:41:46 PM
to scrapy...@googlegroups.com

This looks more like an error downloading and decompressing the page than a problem in your code. You can try `scrapy fetch URL` to see whether Scrapy downloads the page correctly (and maybe try a few times, in case the failure is intermittent).

The error appears to happen while decompressing the response, so you could try disabling the compression middleware, or modifying the request headers to see if the website will give you the page uncompressed.
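Either suggestion can be expressed in the project's settings.py. This is only a sketch: the middleware path assumes the Scrapy 0.14 module layout used in this thread, and it assumes the DEFAULT_REQUEST_HEADERS setting is available in that version; 'identity' is the standard HTTP token for "no content coding".

```python
# settings.py sketch (middleware path assumes Scrapy 0.14, as in this thread)

# Option 1: disable the HTTP compression middleware entirely, so Scrapy
# never tries to gunzip the response body itself (setting it to None
# removes a built-in middleware).
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': None,
}

# Option 2: ask the server not to compress responses at all.
DEFAULT_REQUEST_HEADERS = {
    'Accept-Encoding': 'identity',
}
```

Note that with option 1, if the server sends a gzipped body anyway, it will reach your spider still compressed; option 2 depends on the server honoring the Accept-Encoding header.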

If it were possible to share the URL, we'd probably be able to help much more easily.

shiva krishna

Jun 7, 2012, 2:06:40 AM
to scrapy...@googlegroups.com
Yes, you are absolutely right: the problem occurs during decompression.

When I tried scrapy shell "URL", I got the following error:

    ERROR: Shell error
    Traceback (most recent call last):
      File "/usr/lib64/python2.7/threading.py", line 503, in __bootstrap
        self.__bootstrap_inner()
      File "/usr/lib64/python2.7/threading.py", line 530, in __bootstrap_inner
        self.run()
      File "/usr/lib64/python2.7/threading.py", line 483, in run
        self.__target(*self.__args, **self.__kwargs)
    --- <exception caught here> ---
      File "/usr/lib64/python2.7/site-packages/twisted/python/threadpool.py", line 207, in _worker
        result = context.call(ctx, function, *args, **kwargs)
      File "/usr/lib64/python2.7/site-packages/twisted/python/context.py", line 59, in callWithContext
        return self.currentContext().callWithContext(ctx, func, *args, **kw)
      File "/usr/lib64/python2.7/site-packages/twisted/python/context.py", line 37, in callWithContext
        return func(*args,**kw)
      File "/usr/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/shell.py", line 47, in _start
        self.fetch(url, spider)
      File "/usr/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/shell.py", line 88, in fetch
        self._schedule, request, spider)
      File "/usr/lib64/python2.7/site-packages/twisted/internet/threads.py", line 118, in blockingCallFromThread
        result.raiseException()
      File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 542, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/usr/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/core/downloader/middleware.py", line 46, in process_response
        response = method(request=request, response=response, spider=spider)
      File "/usr/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/contrib/downloadermiddleware/httpcompression.py", line 20, in process_response
        decoded_body = self._decode(response.body, encoding.lower())
      File "/usr/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/contrib/downloadermiddleware/httpcompression.py", line 36, in _decode
        body = gunzip(body)
      File "/usr/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/utils/gz.py", line 15, in gunzip
        chunk = f.read(8196)
      File "/usr/lib64/python2.7/gzip.py", line 252, in read
        self._read(readsize)
      File "/usr/lib64/python2.7/gzip.py", line 303, in _read
        uncompress = self.decompress.decompress(buf)
    zlib.error: Error -3 while decompressing: invalid distance too far back

So can we fix this by disabling the middleware, or by changing the headers? Can you tell me how to disable it in my spider code above?
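Besides disabling the middleware, the failure seen in the traceback can also be worked around at the zlib level: decompress incrementally and keep whatever bytes were recovered before the corrupt block, instead of letting one exception discard the whole response. A hedged sketch (the helper name gunzip_tolerant is hypothetical; 16 + zlib.MAX_WBITS tells zlib to expect a gzip wrapper around the deflate stream):

```python
import gzip
import zlib


def gunzip_tolerant(data, chunk_size=8192):
    # hypothetical helper: decompress gzip data chunk by chunk so that a
    # zlib error partway through keeps the output produced so far,
    # rather than raising and losing everything
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16+: expect gzip header
    out = []
    try:
        for i in range(0, len(data), chunk_size):
            out.append(d.decompress(data[i:i + chunk_size]))
        out.append(d.flush())
    except zlib.error:
        # keep the chunks that decompressed cleanly before the bad block
        pass
    return b''.join(out)


# intact data round-trips normally
print(gunzip_tolerant(gzip.compress(b'hello world')))  # prints b'hello world'
```

A truncated or corrupted body then yields a partial (possibly empty) byte string instead of an exception, which a spider could inspect or log rather than losing the request.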