TextResponse


Malik Rumi

May 22, 2017, 10:54:40 PM
to scrapy-users
In the docs, it says:

TextResponse objects adds encoding capabilities to the base Response class, which is meant to be used
only for binary data, such as images, sounds or any media file.

I understood this to mean that the base Response class is meant to be used only for binary data. However, I also read:

TextResponse Objects are used for binary data such as images, sounds etc which has the ability to encode the base Response class. 


which is of course exactly the opposite of how I interpreted it. Would someone here please clarify? Thanks.

As additional background, I am scraping text, not photos or media files. So it makes sense to me that something called TextResponse would be 
intended for use with text, but I didn't write it, so I don't know. That's why I am asking for clarification.

Ordinarily, when I download, it is a bytes object which I then have to convert to unicode. If I can set it up to come to me as unicode in the first place, 
that would save me a step and be great. But that leads me to my second question: How exactly are we supposed to implement TextResponse? 


and I don't think he (or I) got a sufficient answer.

HTML pages are the most common response types spiders deal with, and their class is HtmlResponse, which inherits from TextResponse, so you can use all its features.

Well, if that's so, then TextResponse would be the default and we'd get back unicode strings, right? But that's not what happens. We get byte strings.

And despite the answer found there, it is not at all clear how we can use these response subclasses if we are told the middleware does it all automatically, as if we aren't
supposed to worry about it. If that were so, why tell us about, or even have, the subclass at all?

Here's an error I got: TypeError: TextResponse url must be str, got list:
The list the error is referring to is my start_urls variable that I've been using without issue until I tried to use TextResponse. So if we can't use a list, are we supposed to only feed it
one url at a time? Manually? 

Your patient, thorough, and detailed explanation of these issues is greatly appreciated. 

Paul Tremberth

May 23, 2017, 10:43:22 AM
to scrapy-users
Hello Malik,


On Tuesday, May 23, 2017 at 4:54:40 AM UTC+2, Malik Rumi wrote:
In the docs, it says:

TextResponse objects adds encoding capabilities to the base Response class, which is meant to be used
only for binary data, such as images, sounds or any media file.

I understood this to mean that the base Response class is meant to be used only for binary data. However, I also read:


 
TextResponse Objects are used for binary data such as images, sounds etc which has the ability to encode the base Response class. 


which is of course exactly the opposite of how I interpreted it. Would someone here please clarify? Thanks.


This line is not from the official docs. And I believe it is neither correct nor clear.

 
As additional background, I am scraping text, not photos or media files. So it makes sense to me that something called TextResponse would be 
intended for use with text, but I didn't write it, so I don't know. That's why I am asking for clarification.

Ordinarily, when I download, it is a bytes object which I then have to convert to unicode. If I can set it up to come to me as unicode in the first place, 
that would save me a step and be great. But that leads me to my second question: How exactly are we supposed to implement TextResponse? 


The Scrapy framework will instantiate the correct Response class or subclass and pass it as argument to your spider callbacks.

If the framework receives an HTML or XML response, it will create an HtmlResponse or XmlResponse respectively, by itself, without you needing to do anything special.

Both HtmlResponse and XmlResponse are subclasses of TextResponse. (See https://docs.scrapy.org/en/latest/topics/request-response.html#htmlresponse-objects)

The real distinction between a plain, raw Response and a TextResponse is that
on a TextResponse you can call .xpath() and .css() directly, without needing to create a Selector explicitly.

XPath and CSS selectors only make sense for text content such as HTML or XML. That's why .xpath() and .css() are available on TextResponse and its subclasses HtmlResponse and XmlResponse, but not on a plain Response.

ALL responses, TextResponse or not, carry the raw body received from the server,
accessible via the .body attribute:
response.body gives you raw bytes.

What TextResponse adds is a .text attribute containing the Unicode string of the raw body,
decoded with the detected encoding of the page:
response.text is a Unicode string.

response.text is NOT available on plain (non-Text) Response objects.

 

and I don't think he (or I) got a sufficient answer.

HTML pages are the most common response types spiders deal with, and their class is HtmlResponse, which inherits from TextResponse, so you can use all its features.

Well, if that's so, then TextResponse would be the default and we'd get back unicode strings, right? But that's not what happens. We get byte strings.


Usually (but not always), you expect HTML back from a Request,
and one usually writes callbacks with that assumption.
Under that assumption, you rarely need to worry about the raw bytes or the encoding: you trust Scrapy and use response.css() or response.xpath().

And if you need to access the (decoded) Unicode content, you use response.text.

If your callbacks can, for some (perhaps valid) reason, receive responses of mixed types,
that is, responses that are NOT always text (images, zip files, etc.),
then you can test the response type with isinstance() and use response.body to get the raw bytes when you need them.

 
And despite the answer found there, it is not at all clear how we can use these response subclasses if we are told the middleware does it all automatically, as if we aren't
supposed to worry about it. If that were so, why tell us about, or even have, the subclass at all?


As I mentioned above, one usually writes spider callbacks for a specific type of Response, usually HtmlResponse.
But you can absolutely work with non-TextResponse objects in Scrapy, if you need to.

One area where the response type matters more is middlewares.
These are generic components and may need to handle different types of responses (or skip processing if the type is not the one they're supposed to work on).

You may not need to write your own middlewares, but if you do, have a look at Scrapy's source code,
for example AjaxCrawlMiddleware.
 
Here's an error I got: TypeError: TextResponse url must be str, got list:
The list the error is referring to is my start_urls variable that I've been using without issue until I tried to use TextResponse. So if we can't use a list, are we supposed to only feed it
one url at a time? Manually? 

Your patient, thorough, and detailed explanation of these issues is greatly appreciated. 

I hope this explained the different response types clearly enough.
If not, feel free to ask.

Cheers,
/Paul. 

Malik Rumi

May 27, 2017, 2:44:09 PM
to scrapy-users
Dear Paul,
thank you for the explanation. I'm not sure I understand, to be honest, but let me try a few things and see if it gets clearer. If not, I'll be back. 

Malik Rumi

May 28, 2017, 1:08:24 AM
to scrapy-users
OK, I'm back:

2017-05-28 05:00:18 [scrapy.core.scraper] ERROR: Spider must return Request, BaseItem, dict or None, got 'str' in <GET.html>


But I want a string!

That's why I redefined the items in my spider this way: 


item['textbody'] = response.text


And besides, isn't item['textbody'] a dict or dict-like object?


How do I get a string?!


Paul Tremberth

May 31, 2017, 10:32:23 AM
to scrapy-users
Hi Malik,

Scrapy callbacks MUST return a Request, an Item, a dict, or a list of those (or be a generator of these types, if you use yield), as the error says.
That's part of the Scrapy framework's API contract with spider classes.

If you did 
item['textbody'] = response.text

then item['textbody'] contains a Unicode string,
and your callback would return item, not item['textbody'].

You can then process your output items to get the "textbody" field of each.

Scrapy is about outputting structured data. Plain strings are less structured than a dict or an XML element with a "textbody" field.

Is that clearer?

If not, you can post your spider code.

Best,
Paul.

Note that we're moving the community questions and discussion to Reddit.