TextResponse


Malik Rumi

May 22, 2017, 10:54:40 PM
to scrapy-users
In the docs, it says:

TextResponse objects adds encoding capabilities to the base Response class, which is meant to be used
only for binary data, such as images, sounds or any media file.

I understood this to mean that the base Response class is meant to be used only for binary data. However, I also read:

TextResponse Objects are used for binary data such as images, sounds etc which has the ability to encode the base Response class. 


which is of course exactly the opposite of how I interpreted it. Would someone here please clarify? Thanks.

As additional background, I am scraping text, not photos or media files. So it makes sense to me that something called TextResponse would be 
intended for use with text, but I didn't write it, so I don't know. That's why I am asking for clarification.

Ordinarily, when I download, it is a bytes object which I then have to convert to unicode. If I can set it up to come to me as unicode in the first place, 
that would save me a step and be great. But that leads me to my second question: How exactly are we supposed to implement TextResponse? 


and I don't think he (or I) got a sufficient answer.

HTML pages are the most common response types spiders deal with, and their class is HtmlResponse, which inherits from TextResponse, so you can use all its features.

Well, if that's so, then TextResponse would be the default and we'd get back unicode strings, right? But that's not what happens. We get byte strings.

And despite the answer found there, it is not at all clear how we can use these response subclasses if we are told the middleware does it all automatically, as if we aren't
supposed to worry about it. If that were so, why tell us about, or even have, the subclass at all?

Here's an error I got: TypeError: TextResponse url must be str, got list:
The list the error is referring to is my start_urls variable that I've been using without issue until I tried to use TextResponse. So if we can't use a list, are we supposed to only feed it
one url at a time? Manually? 

Your patient, thorough, and detailed explanation of these issues is greatly appreciated. 

Paul Tremberth

May 23, 2017, 10:43:22 AM
to scrapy-users
Hello Malik,


On Tuesday, May 23, 2017 at 4:54:40 AM UTC+2, Malik Rumi wrote:
In the docs, it says:

TextResponse objects adds encoding capabilities to the base Response class, which is meant to be used
only for binary data, such as images, sounds or any media file.

I understood this to mean that the base Response class is meant to be used only for binary data. However, I also read:


 
TextResponse Objects are used for binary data such as images, sounds etc which has the ability to encode the base Response class. 


which is of course exactly the opposite of how I interpreted it. Would someone here please clarify? Thanks.


This line is not from the official docs. And I believe it is neither correct nor clear.

 
As additional background, I am scraping text, not photos or media files. So it makes sense to me that something called TextResponse would be 
intended for use with text, but I didn't write it, so I don't know. That's why I am asking for clarification.

Ordinarily, when I download, it is a bytes object which I then have to convert to unicode. If I can set it up to come to me as unicode in the first place, 
that would save me a step and be great. But that leads me to my second question: How exactly are we supposed to implement TextResponse? 


The Scrapy framework will instantiate the correct Response class or subclass and pass it as argument to your spider callbacks.

If the framework receives an HTML or XML response, it will create an HtmlResponse or XmlResponse respectively, by itself, without you needing to do anything special.

Both HtmlResponse and XmlResponse are subclasses of TextResponse. (See https://docs.scrapy.org/en/latest/topics/request-response.html#htmlresponse-objects)

The real distinction between a plain, raw Response and a TextResponse is that
on a TextResponse you can call .xpath() and .css() directly, without needing to create a Selector explicitly.

XPath and CSS selectors only make sense for text content such as HTML or XML. That's why .xpath() and .css() are available on TextResponse and its subclasses HtmlResponse and XmlResponse, but not on a plain Response.

ALL responses, TextResponse or not, carry the raw body received from the server,
accessible via the .body attribute:
response.body gives you raw bytes.

What TextResponse adds is a .text attribute containing the Unicode string of the raw body,
decoded with the detected encoding of the page:
response.text is a Unicode string.

response.text is NOT available on plain (non-Text) Response objects.

 

and I don't think he (or I) got a sufficient answer.

HTML pages are the most common response types spiders deal with, and their class is HtmlResponse, which inherits from TextResponse, so you can use all its features.

Well, if that's so, then TextResponse would be the default and we'd get back unicode strings, right? But that's not what happens. We get byte strings.


Usually (but not always), you expect HTML back from a Request,
and one usually writes callbacks with that assumption.
Under that assumption, you rarely need to worry about the raw bytes or the encoding: you trust Scrapy and use response.css() or response.xpath().

And if you need to access the (decoded) Unicode content, you use response.text.

If your callbacks can, for some (perhaps valid) reason, receive responses of mixed types,
that is, responses that are NOT always text (images, zip files, etc.),
then you can test the response type with isinstance() and use response.body to get the raw bytes when you need them.

 
And despite the answer found there, it is not at all clear how we can use these response subclasses if we are told the middleware does it all automatically, as if we aren't
supposed to worry about it. If that were so, why tell us about, or even have, the subclass at all?


As I mentioned above, one usually writes spider callbacks for a specific type of Response, usually HtmlResponse.
But you can absolutely work with non-TextResponse objects in Scrapy, if you need to.

One area where the response type matters more is middlewares.
These are generic components and may need to handle different types of responses (or skip processing if the type is not the one they're supposed to work on).

You may not need to write your own middlewares, but if you do, have a look at Scrapy's source code,
for example AjaxCrawlMiddleware.
 
Here's an error I got: TypeError: TextResponse url must be str, got list:
The list the error is referring to is my start_urls variable that I've been using without issue until I tried to use TextResponse. So if we can't use a list, are we supposed to only feed it
one url at a time? Manually? 

Your patient, thorough, and detailed explanation of these issues is greatly appreciated. 

I hope this explained the different response types clearly enough.
If not, feel free to ask.

Cheers,
/Paul. 

Malik Rumi

May 27, 2017, 2:44:09 PM
to scrapy-users
Dear Paul,
thank you for the explanation. I'm not sure I understand, to be honest, but let me try a few things and see if it gets clearer. If not, I'll be back. 

Malik Rumi

May 28, 2017, 1:08:24 AM
to scrapy-users
OK, I'm back:

2017-05-28 05:00:18 [scrapy.core.scraper] ERROR: Spider must return Request, BaseItem, dict or None, got 'str' in <GET.html>


But I want a string!

That's why I redefined the items in my spider this way: 


item['textbody'] = response.text


And besides, isn't item['textbody'] a dict or dict-like object?


How do I get a string?!


Paul Tremberth

May 31, 2017, 10:32:23 AM
to scrapy-users
Hi Malik,

Scrapy callbacks MUST return a Request, an Item, a dict, or a list of those (or be a generator of these types, if you use yield), as the error says.
That's part of the Scrapy framework's API contract with spider classes.

If you did 
item['textbody'] = response.text

then item['textbody'] contains a Unicode string,
and your callback would return item, not item['textbody'].

You can then process your output items to get the "textbody" field of each.

Scrapy is about outputting structured data. Plain strings are less structured than a dict or an XML element with a "textbody" field.

Is that clearer?

If not, you can post your spider code.

Best,
Paul.

Note that we're moving the community questions and discussion to Reddit.