In the docs, it says:
"TextResponse objects adds encoding capabilities to the base Response class, which is meant to be used only for binary data, such as images, sounds or any media file."
I understood this to mean that the base Response class is meant to be used only for binary data. However, I also read:
"TextResponse Objects are used for binary data such as images, sounds etc which has the ability to encode the base Response class."
which is of course exactly the opposite of how I interpreted it. Would someone here please clarify? Thanks.
As additional background, I am scraping text, not photos or media files. So it makes sense to me that something called TextResponse would be intended for use with text, but I didn't write it, so I don't know. That's why I am asking for clarification.
Ordinarily, when I download, it is a bytes object which I then have to convert to unicode. If I can set it up to come to me as unicode in the first place, that would save me a step and be great. But that leads me to my second question: How exactly are we supposed to implement TextResponse?
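To make the bytes-versus-unicode point concrete, here is a minimal sketch in plain Python of what "encoding capabilities" roughly amounts to: the raw body stays a bytes object, and a text view is produced by decoding it with a known encoding. The function name `decode_body` and the sample data are made up for illustration. This is not Scrapy's implementation, which also has to work out the encoding from the headers and the page itself.

```python
def decode_body(body: bytes, encoding: str = "utf-8") -> str:
    """Decode a raw response body (bytes) into a unicode string."""
    # errors="replace" substitutes a placeholder for undecodable bytes
    # instead of raising UnicodeDecodeError.
    return body.decode(encoding, errors="replace")


# Hypothetical sample body, standing in for what a download returns.
raw_body = "<p>café</p>".encode("utf-8")
text = decode_body(raw_body)

print(type(raw_body).__name__)  # the raw download is bytes
print(type(text).__name__)      # the decoded view is str
```

The design point is that the bytes are never lost: the text is a decoded view of the same body, so a binary-safe base class and a text-aware subclass can share one underlying payload.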
I am in 100% agreement with the OP here: https://groups.google.com/forum/#!msg/scrapy-users/-ulA_0Is1Kc/oZzM2kuTmd4J;context-place=forum/scrapy-users and I don't think he (or I) got a sufficient answer.
"HTML pages are the most common response types spiders deal with, and their class is HtmlResponse, which inherits from TextResponse, so you can use all its features."
Well, if that's so, then TextResponse would be the default and we'd get back unicode strings, right? But that's not what happens. We get byte strings.
And despite the answer found there, it is not at all clear how we can use these response subclasses if we are told the middleware does it all automatically, as if we aren't supposed to worry about it. If that were so, why tell us about, or even have, the subclass at all?
Here's an error I got: TypeError: TextResponse url must be str, got list:
The list the error is referring to is my start_urls variable that I've been using without issue until I tried to use TextResponse. So if we can't use a list, are we supposed to only feed it one url at a time? Manually?
Your patient, thorough, and detailed explanation of these issues is greatly appreciated.
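The TypeError above suggests that the whole start_urls list was passed where a single URL string is expected. The sketch below uses a hypothetical stand-in class, `SketchTextResponse` (not Scrapy's own TextResponse), to show the difference between passing the list and creating one response per URL. It assumes that in Scrapy the spider turns each entry of start_urls into its own request and the framework builds the response objects, so they normally would not be constructed by hand.

```python
class SketchTextResponse:
    """Hypothetical stand-in for a text response; NOT Scrapy's class."""

    def __init__(self, url, body=b"", encoding="utf-8"):
        # A response belongs to exactly one URL, so a list is rejected.
        if not isinstance(url, str):
            raise TypeError(f"url must be str, got {type(url).__name__}")
        self.url = url
        self.body = body
        self.encoding = encoding

    @property
    def text(self):
        # Decoded (unicode) view of the raw bytes body.
        return self.body.decode(self.encoding)


# Example URLs, made up for illustration.
start_urls = ["http://example.com/a", "http://example.com/b"]

# Passing the whole list reproduces the kind of error reported above.
try:
    SketchTextResponse(start_urls)
    list_error = None
except TypeError as exc:
    list_error = str(exc)

# One response per URL works; a crawler does this looping for you.
responses = [SketchTextResponse(url, body=b"<p>hi</p>") for url in start_urls]
```

So the list itself is fine as a spider attribute. Only the response constructor wants a single string.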
2017-05-28 05:00:18 [scrapy.core.scraper] ERROR: Spider must return Request, BaseItem, dict or None, got 'str' in <GET.html>
But I want a string!
That's why I redefined the items in my spider this way:
item['textbody'] = response.text
And besides, isn't item['textbody'] a dict or dict-like object?
How do I get a string?!
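The log message says the callback handed back a bare str, while only a Request, an item, a dict, or None is accepted. A minimal sketch of a callback that keeps the string but wraps it in a dict is shown below. The `fake_response` object is a hypothetical stand-in built with `types.SimpleNamespace` so the example runs without a crawl, and the field names are only illustrative.

```python
from types import SimpleNamespace


def parse(response):
    """Spider-style callback: yield a dict that carries the string."""
    # Yielding response.text directly would be a bare str, which is
    # what the error complains about. Wrapping it in a dict is accepted.
    yield {"url": response.url, "textbody": response.text}


# Hypothetical stand-in for the response object passed to the callback.
fake_response = SimpleNamespace(
    url="http://example.com/page.html",
    text="<p>Hello</p>",
)

items = list(parse(fake_response))
textbody = items[0]["textbody"]  # the string is still there, inside the dict
```

The string is not lost by doing this: it is the value stored under the key, and it can be read back out of the item wherever the item is processed.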