Getting AttributeError: 'Response' object has no attribute 'body_as_unicode' on some sites

André Bergonse

Aug 11, 2015, 2:40:37 PM
to scrapy-users
Hi,

I'm developing a spider that will crawl some 320k different URLs, and along the way I'm running into edge cases. Right now I have some cases where Scrapy doesn't seem to detect the correct response type and returns a plain Response object instead of an HtmlResponse (I've already read http://doc.scrapy.org/en/latest/topics/request-response.html?#response-objects).

In my parse method I'm actually not selecting anything from the page. The purpose of this spider is to send the whole body of the response to the Wappalyzer library (https://github.com/scrapinghub/wappalyzer-python/) to detect the apps and technologies used. I'm just taking advantage of the Scrapy architecture for the crawling part, instead of building my own.

Here's an example of a website where this happens: http://boucheriesaintroch.webs.com. If you do scrapy shell http://boucheriesaintroch.webs.com:

In [1]: type(response)
Out[1]: scrapy.http.response.Response
In [2]: response.headers
Out[2]:
{'Date': 'Tue, 11 Aug 2015 16:57:27 GMT',
 'Server': 'Webs.com/1.0',
 'Set-Cookie': 'fwww=b4e6b552bf12b31f11fd753117ad163ea80e738c7fe8587bfd2eebc489eb9921; Path=/',
 'X-Robots-Tag': 'nofollow'}

In Chrome dev tools you can see that the page reports itself as text/html with ISO-8859-1 encoding:

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
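Note that the header dump above contains no Content-Type entry at all; the text/html declaration only appears in the meta tag inside the body. A minimal sketch of checking both places with the standard library follows. The names MetaContentType and content_type_sources are made up for illustration, and this is not how Scrapy itself picks a response class.

```python
from html.parser import HTMLParser


class MetaContentType(HTMLParser):
    """Collects the content of the first <meta http-equiv="Content-Type">, if any."""

    def __init__(self):
        super().__init__()
        self.value = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.value is not None:
            return
        # Attribute values can be None (bare attributes), so normalise them
        attrs = {name: (value or "") for name, value in attrs}
        if attrs.get("http-equiv", "").lower() == "content-type":
            self.value = attrs.get("content")


def content_type_sources(headers, body):
    """Return (type from the HTTP header, type from the meta tag); either may be None."""
    header_value = None
    for name, value in headers.items():
        if name.lower() == "content-type":
            header_value = value
    parser = MetaContentType()
    # latin-1 maps every byte to a character, so decoding never fails
    parser.feed(body.decode("latin-1"))
    return header_value, parser.value


# Headers and markup shaped like the example in this post (body shortened)
headers = {"Date": "Tue, 11 Aug 2015 16:57:27 GMT", "Server": "Webs.com/1.0"}
body = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=ISO-8859-1"></head></html>')
print(content_type_sources(headers, body))
```

When the first element of the result is None, the server sent no Content-Type header, so any type detection has to rely on something other than that header.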

So my question is: why doesn't Scrapy give me an HtmlResponse object back, and how can I fix this?

Thanks for any tips!


soundjack

Aug 18, 2015, 8:52:55 PM
to scrapy-users
Just for the record for those who encounter this in the future, I found a solution. BTW, I forgot to say I'm using version 0.24.6. Here's how I forced a response into being an HtmlResponse type (seen here - http://git.io/v3zoP - and very slightly adapted):

    # At the top of the spider module (Scrapy 0.24.x)
    from scrapy import log
    from scrapy.http import HtmlResponse

    # Inside the spider class
    def parse(self, response):
        # Scrapy doesn't return an HtmlResponse for some sites, which makes loading items fail.
        # This forces the response to be an HtmlResponse.
        # As seen here: http://git.io/v3zoP
        if response.status == 200 and not isinstance(response, HtmlResponse):
            try:
                # Build a new flag list so the original response is not mutated
                flags = [flag for flag in response.flags if flag != 'partial']
                flags.append('fixed')
                response = HtmlResponse(response.url,
                                        headers=response.headers,
                                        body=response.body,
                                        flags=flags,
                                        request=response.request)
                log.msg('Response transformed into HtmlResponse for %s' % response.url, level=log.WARNING)
            except Exception:
                # Keep the original response if the conversion fails
                pass

        l = WaLoader(item=WaItem(), response=response)

I traced the decision about the response type as far as https://github.com/scrapy/scrapy/blob/master/scrapy/responsetypes.py, but I couldn't figure out why this case wasn't returned as an HtmlResponse.

Cheers
soundjack