Streaming HttpResponse revisted. Any core devs, please take a look if you can :)

1,501 views
Skip to first unread message

Tai Lee

unread,
Aug 20, 2012, 11:39:07 AM8/20/12
to django-d...@googlegroups.com
I'd like to re-visit the discussion surrounding #7581 [1], a ticket about streaming responses that is getting quite long in the tooth now, which Jacob finally "accepted" 11 months ago (after a long time as DDN) and said that it is clear we have to do *something*, but *what* remains to be seen.

I'd like to try provide a little refresher and summarise the options that have been suggested, and ask any core devs to please weigh in with their preference so that I can work up a new patch that will be more likely to gain acceptance.


THE PROBLEM:

1. There are bugs and surprising behaviour that arise when creating an HttpResponse with a generator as its content, as a result of "quantum state" of `HttpResponse.content` (measuring it changes it).

>>> from django.http import HttpResponse
>>> r = HttpResponse(iter(['a', 'b'])) 
>>> r.content 
'ab'
>>> r.content 
''
>>> r2 = HttpResponse(iter(['a', 'b'])) 
>>> print r2.content
ab
>>> print r2.content

>>> r3 = HttpResponse(iter(['a', 'b']))
>>> r3.content == r3.content
False

2. Some middleware prematurely consume generator content by accessing `HttpResponse.content`, which can use a lot of memory and cause browser timeouts when attempting to stream large amounts of data or slow-to-generate data.

There have been several tickets [2] [3] [4] and django-developers discussions [5] [6] [7] [8] about these issues.


SOME USE CASES FOR STREAMING RESPONSES:

A. Generating and exporting CSV data directly from the database.

B. Restricting file access to authenticated users for files that may be hosted on external servers.

C. Drip-feeding chunks of content to prevent timeout when requesting a page that takes a long time to generate.


OPTION 1:

Remove support for "streaming" responses. If an iterator is passed in as content to `HttpResponse`, consume it in `HttpResponse.__init__()` to eliminate buggy behaviour. Middleware won't have to worry about what type of content a response has.

Now that Jacob has accepted #7581 and said that it is clear we need to do *something*, I hope we can rule out this option.


OPTION 2:

Make `HttpResponse.__init__()` consume any iterator content, and add an `HttpResponseStreaming` class or an `HttpResponse(streaming=False)` argument. Allow middleware to check via `hasattr()` or `isinstance()` whether or not the response has generator content, and conditionally skip code that is incompatible with streaming responses.

Some middleware will have to be updated for compatibility with streaming responses, and any 3rd party middleware that prematurely consumes generator content will continue to work, only without the bugs (and potentially with increased memory usage and browser timeouts).


OPTION 3:

Build a capabilities API for `HttpResponse` objects, and have middleware inspect responses to determine "can I read content?", "can I replace content?", "can I change etag?", etc. This will likely become a bigger and more complicated design decision as we work out what capabilities we want to support. Some have argued that it should be sufficient to know if we have generator content or not, for all the cases that people have reported so far.


OPTION 4:

Provide a way for developers to specify on an `HttpResponse` object or subclass that specific middleware should be skipped for that response. This would be problematic because 3rd party views won't know what other middleware is installed in a project in order to name them for exclusion.


OPTION 5:

Add Yet Another Setting that would allow developers to define `CONDITIONAL_MIDDLEWARE_CLASSES` at a project level. At the project level, developers would know which middleware classes they are using, and when they should be executed or skipped. This would give very fine grained control at a project level to match middleware conditionally with `HttpResponse` subclasses, without requiring any changes to existing or 3rd party middleware. This could look something like this:

MIDDLEWARE_CLASSES = (
    'django.middleware.common',
)

CONDITIONAL_MIDDLEWARE_CLASSES = {
    'exclude': {
        'django.http.HttpResponseStreaming': ['django.middleware.common', 'otherapp.othermiddleware', ...],
    },
    'include': {
        'myapp.MyHttpResponse': ['myapp.mymiddleware', ...],
    },
}


MY TAKE:

I think that option 1 and option 4 are non-starters.

I think option 3 is perhaps a little overkill and will be more difficult to get committed once we start thinking about what capabilities we want to support.

I think option 2 is probably going to be the easiest solution. It's practically implemented and up-to-date already (missing docs and tests).

Although it does involve Yet Another Setting, I think option 5 provides the most flexibility, where it is most needed. It gives developers working at the project level a way to override and conditionally skip or execute 3rd party middleware without having to make any changes to 3rd party middleware.

I would be happy with either option 2 or 5, or a variation.


NEXT STEPS:

I'd really like to see this and the related tickets closed (preferably marked "fixed"!) :)

I'm specifically looking for opinions and direction from any of the core devs, especially those who have previously commented on the ticket or in the discussions, even if it is just to permanently reject some of the options.

I'd also like to hear from anyone who has a new use case or a new solution to suggest.

I will happily work up a new patch if required (including docs and tests), even if just a proof of concept. I just need to know which way the core devs would like me to go.

Thanks.
Tai.





Anssi Kääriäinen

unread,
Aug 20, 2012, 3:19:08 PM8/20/12
to Django developers
My initial feeling is that anything accessing "large content" in
Python is broken and we should not try to make sure these things
continue to work. The problem is how we define if the content is too
large to access. Assuming any generator passed to the __init__ is
large maybe too drastic.

We should either remove the content attribute altogether from
generator based responses, or at least raise an error if the content
is accessed a second time. The current behaviour is definitely broken,
and we should not silently return corrupt data, instead we should
complain loudly.

To me option 5 feels like we are trying to work around the symptoms
instead of solving the underlying problem.

- Anssi

Alex Gaynor

unread,
Aug 20, 2012, 3:21:02 PM8/20/12
to django-d...@googlegroups.com

--
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to django-d...@googlegroups.com.
To unsubscribe from this group, send email to django-develop...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en.


My inclination is that we should kill .content entirely, middleware that wants to rewrite the response will have two choices:

1) explicitly evaluate the entire response, and return a new HttpResponse object
2) return s anew HttpResponse object that has a generator that iteratively evaluates the previous one and rewrite it incrementally.

Alex

--
"I disapprove of what you say, but I will defend to the death your right to say it." -- Evelyn Beatrice Hall (summarizing Voltaire)
"The people's good is the highest law." -- Cicero

Anssi Kääriäinen

unread,
Aug 20, 2012, 4:47:20 PM8/20/12
to Django developers
On 20 elo, 22:21, Alex Gaynor <alex.gay...@gmail.com> wrote:
> My inclination is that we should kill .content entirely, middleware that
> wants to rewrite the response will have two choices:
>
> 1) explicitly evaluate the entire response, and return a new HttpResponse
> object
> 2) return s anew HttpResponse object that has a generator that iteratively
> evaluates the previous one and rewrite it incrementally.

+1

The above still leaves one part of the problem unsolved. If you have a
middleware that wants to access the content of the response for some
reason, it has no way to know if the content is too large to be
accessed safely. I do think we need in addition some way to tell the
middleware that the content is too large to be handled at all in
Python. As an addition to the already suggested ideas maybe a view
decorator could work?

- Anssi

Alex Ogier

unread,
Aug 20, 2012, 8:43:29 PM8/20/12
to django-d...@googlegroups.com
On Mon, Aug 20, 2012 at 3:21 PM, Alex Gaynor <alex....@gmail.com> wrote:
My inclination is that we should kill .content entirely, middleware that wants to rewrite the response will have two choices:

1) explicitly evaluate the entire response, and return a new HttpResponse object
2) return s anew HttpResponse object that has a generator that iteratively evaluates the previous one and rewrite it incrementally.

Alex

Doesn't that break anyone who hangs arbitrary attributes on their HttpResponse objects for the sake of their middleware? I've never done this on the way out, only on the way in on HttpRequest objects, but I can imagine it being useful if you wanted to conditionally disable middleware or something. You could argue that depending on the same instance making it through several layers of middleware is already iffy, but recommending as a best practice to create new HttpResponses would *really* break this.

Best,
Alex Ogier

Forest Bond

unread,
Aug 20, 2012, 9:40:24 PM8/20/12
to django-d...@googlegroups.com
Hi Tai,

On Mon, Aug 20, 2012 at 08:39:07AM -0700, Tai Lee wrote:
> I'd like to re-visit the discussion surrounding #7581 [1], a ticket about
> streaming responses that is getting quite long in the tooth now, which Jacob
> finally "accepted" 11 months ago (after a long time as DDN) and said that it is
> clear we have to do *something*, but *what* remains to be seen.

Since the last time this came up I've wondered if the simplest solution might be
to make it easier to integrate raw WSGI apps into a site. I still am not 100%
sure it's the right approach but I thought it was worth mentioning.

The argument for this is that the use cases for streaming responses are usually
corner cases that are well outside of Django's sweet spot. They go against the
grain of the Django's flexible request/response handling (middleware in
particular). But they would be trivial to handle in a raw WSGI app.

If Django made it easy to hand off requests to a WSGI app (i.e. mount the app
somewhere in the URLconf and perhaps expose authentication and/or other session
data) the framework could get out of the way and let the WSGI app have more
direct control over response handling. I think there are probably other
scenarios where easy WSGI integration would be beneficial, too.

For what it's worth...

Thanks,
Forest
--
Forest Bond
http://www.alittletooquiet.net
http://www.rapidrollout.com
signature.asc

Tai Lee

unread,
Aug 20, 2012, 10:08:08 PM8/20/12
to django-d...@googlegroups.com
It sounds like the preferred solution is still "don't do that", which I think corresponds most closely with option 1. Simply consuming generator content in `HttpResponse.__init__()` in all cases would prevent some of the surprising behaviour we are seeing, but it would completely break any support for streaming responses.

I thought that Jacob's "It's clear to me that we have to do *something* about this" comment when he marked #7581 as "accepted" would take "don't do that" off the table.

It might not count as a "backwards incompatible" change to revoke all support for streaming responses, as I couldn't find explicit support for them mentioned in the docs, but streaming responses have been around for a long time and people are relying on on this behaviour, as evidenced by several tickets and reported use cases (not limited to "large" responses, but also including "long running" responses).

Those people just have to avoid some core middleware that currently break streaming responses by prematurely consuming content, and Django doesn't provide a generic or documented way for people to hook into this part of the framework on a per-response basis. They just have to choose "all or nothing" when it comes to those middleware, unless they also have control over the middleware (which they don't for core, contrib and third party middleware).

I don't see a huge difference between wrapping or replacing the content of response, compared to wrapping or replacing the entire response. The key issue (for me) still remains, being that middleware would still have no way to know when it is appropriate for them to wrap or replace the entire response. It may also cause new complications with headers or properties of the wrapped response not being preserved.

Removing `HttpResponse.content` entirely is probably a non-starter, as it would be a backwards incompatible change. It has been part of the public API since at least 1.0.

Could somebody please explain their primary objections to option 2? This has a working up-to-date implementation already (just missing tests and docs, which I would be happy to contribute), is backwards compatible, fixes buggy behaviour in the general case where iterators are passed to `HttpResponse` as content but the developer does not explicitly want a streaming response, and it provides a hook for developers to explicitly tell middleware that the response should be streamed with `HttpResponse(stream_content=True)`.

I've seen it asked, but not answered yet (unless I missed it), if there is really a need for more fine grained control than "is it streaming, or not?" (e.g. a capabilities based API on responses). If not, I think this is currently the most likely candidate for inclusion.

Thanks.
Tai.

Tai Lee

unread,
Aug 23, 2012, 5:52:14 AM8/23/12
to django-d...@googlegroups.com
After discussion with akaarai and mYk on IRC, I have two updated proposals.


OPTION 6:

1. Consume `HttpResponse.content` on first access, so that it can be read multiple times and eliminate the current buggy behaviour (e.g. `response.content != response.content`) when a generator is passed in as content.

2. Create a new `HttpStreamingResponse` class, which has no `.content` attribute but instead has `.stream` attribute, and can be iterated only once. If it is iterated a second time, an exception will be raised to trigger a loud failure.

3. Middleware that ships with Django that can work with streaming responses, is updated. E.g. GZipMiddleware can do `if isinstance(response, HttpStreamingResponse): response.stream = compress_sequence(response.stream)`.

4. The `MIDDLEWARE_CLASSES` takes a new optional syntax that allows project authors to nominate certain middleware classes for conditional execution. This means we don't need to add Yet Another Setting, but the project author -- who is the only person in a position to know what combination of 3rd party middleware is incompatible with responses returned by other 3rd party views -- is able to conditionally execute middleware classes by listing compatible (or incompatible) view functions and response classes.

MIDDLEWARE_CLASSES = (
'some.middleware',
'another.middleware',
('some.conditional.middleware', {
'exclude': ['some.incompatible.view.function', 'some.incompatible.response.class'],
'include': ['some.compatible.view.function', 'some.compatible.response.class'],
}),
)

This proposal should be backwards compatible, fix the current buggy behaviour for people who pass a generator as content to `HttpResponse` but don't explicitly want a streaming response, and allow people who do explicitly want a streaming responses to have them.

There may be other use cases for conditionally executing middleware, where the decision lies with the project author instead of the middleware author (who won't know in advance which apps their middleware will have to work with). I haven't tried to think of any.

If people return a `HttpStreamingResponse` from their view, and they have enabled some old or 3rd party middleware that doesn't know anything about this, and which expects to access `response.content`, the project author will *have* to nominate that combination of view/response/middleware as incompatible in their `MIDDLEWARE_CLASSES` setting to avoid a loud failure.

Which leads me to my next and favourite option...


OPTION 7:

1-2. As for option 6.

3. Add support for a new middleware method specifically for streaming responses. E.g. have middleware authors create a `.get_streaming_response()` method if their middleware is able to support streaming responses. Only `HttpResponse` objects will be passed to `.get_response()` middleware methods, and only `HttpStreamingResponse` objects will be passed to `.get_streaming_response()` middleware methods.

4. Add `.get_streaming_response()` methods to middleware that ship with Django that are able to support streaming responses, e.g. GZipMiddleware.

This should be backwards compatible. Existing code will behave exactly as it does now with the exception that `response.content == response.content` will be True, whether a generator was passed in as content or not.

No existing middleware will need to be updated unless they explicitly want to add support for the new `HttpStreamingResponse` objects.

It will be much clearer in the documentation what middleware authors need to do to support streaming responses (implement a new method, vs add a conditional branch of execution in their existing method).

It will be much clearer for project authors to see if a middleware class does support streaming responses (check for a `.get_streaming_response()` method vs hunting in the docs or inside the `.get_response()` method for a conditional branch of execution).

It will be clearer for for developers who want to explicitly stream a response (use `HttpStreamingResponse` instead of `HttpResponse`).

In the case of existing response middleware that doesn't care about accessing the content of a response, they will continue to work as they do now for `HttpResponse` objects, but they will need to be updated if they also want to work with `HttpStreamingResponse` objects. In that case, if they don't access the response content, `.get_streaming_response()` could just be an alias for `.get_response()`.


Cheers.
Tai.

Anssi Kääriäinen

unread,
Aug 23, 2012, 8:51:59 AM8/23/12
to Django developers
On 23 elo, 12:52, Tai Lee <real.hu...@mrmachine.net> wrote:
> After discussion with akaarai and mYk on IRC, I have two updated proposals.
>
> OPTION 6:
>
> 1. Consume `HttpResponse.content` on first access, so that it can be read
> multiple times and eliminate the current buggy behaviour (e.g.
> `response.content != response.content`) when a generator is passed in as
> content.
>
> 2. Create a new `HttpStreamingResponse` class, which has no `.content`
> attribute but instead has `.stream` attribute, and can be iterated only
> once. If it is iterated a second time, an exception will be raised to
> trigger a loud failure.

<SNIP>

Consuming the generator on first access and caching the result will
allow streaming to work if nobody accesses the .content. If a
middleware accesses the content, processes it somehow (gzip), and then
assigns back to content we will still not consume any more memory than
currently. The content property will first generate a byte string,
then return it. If we cache it in the response, the only problem is
that the cached value might be garbage collected just a little later,
but it seems we still will use the same amount of maximum memory.

In short: I think this is a good idea.

My current feeling about the options is that this is getting
complex... So, as-simple-as-can-be solution:
- Add a flag to the response which tells if the response is
streaming or not. You can set this in response's.__init__. We might be
able to set this to True by default if the content is generator based
(can we autodetect this reliably?).
- Cache the .content on first access if the content is generator
based. No matter if the response is flagged as streaming or not.
- Update middlewares that need to skip streaming responses to check
for the flag.

This solves the immediate issue. This should be 100% backwards
compatible: If somebody accesses content when they should not, things
will just work. The content will get cached, but this should not
matter - we are already generating the content, and thus waste the
memory. We could add a warning when streaming response's content is
accessed. We might even want to remove the .content totally for
streams after deprecation.

The only backwards compatibility problem I can see is if we update
some middlewares to skip stream based responses, and also set
is_streaming to True in HttpResponse.__init__ if a generator is passed
in. Now, HttpResponse(['1', '2']) will not be processed by the altered
middlewares. Do we care?

If content is accessed, then streaming will not be effective, but this
is not a regression. If nobody touches the content, then streaming
will work.

As for removing the .content altogether: it seems this
requires .clone() API for response so that middleware can safely copy
response subclasses.

We might need to solve the skip/include middlewares conditionally
issue separately. For example, if I am streaming a large zipped file,
I will not want to apply my GZipMiddleware even if it handles stream
based responses correctly.

- Anssi

Anssi Kääriäinen

unread,
Aug 23, 2012, 10:22:48 AM8/23/12
to Django developers
On 23 elo, 15:51, Anssi Kääriäinen <anssi.kaariai...@thl.fi> wrote:
> My current feeling about the options is that this is getting
> complex... So, as-simple-as-can-be solution:
>   - Add a flag to the response which tells if the response is
> streaming or not. You can set this in response's.__init__. We might be
> able to set this to True by default if the content is generator based
> (can we autodetect this reliably?).
>   - Cache the .content on first access if the content is generator
> based. No matter if the response is flagged as streaming or not.
>   - Update middlewares that need to skip streaming responses to check
> for the flag.

I re-read the original ticket, and this is pretty much Tai Lee's
proposed solution in the ticket. My understanding is that by doing the
above we will not break anything, but will make it at least possible
to use streaming responses with certain middleware. So, we gain
something, but lose nothing.

Middleware which do not access content do not need changes. Middleware
which access content will continue to work, but will break streaming
responses. They do that now, too, possibly silently corrupting the
response (ConditionalGetMiddleware for example). In addition, it is
possible to update content accessing middleware to do the right thing
for streaming responses.

We might additionally add capabilities based middleware API, or a way
to include/exclude middlewares based on various conditions. These seem
hard to design. The problem has been there for 4+ years, so maybe it
is time to apply this simple solution?

- Anssi

Tai Lee

unread,
Aug 23, 2012, 11:48:58 AM8/23/12
to django-d...@googlegroups.com
I have updated the patch on my GitHub repository after some discussion on IRC with Anssi. The patch applies on master again now, it consumes generator content on access instead of on assignment, and raises a PendingDeprecationWarning if you create an HttpResponse with stream_content=True and then you try to access response.content. This should alert people to the fact that they have explicitly asked for a streaming response, but some piece of code somewhere has broken the ability to stream the response by accessing it's content directly.

So the end result should be that we fix the buggy response.content != response.content behaviour, we preserve the behaviour of existing responses and middleware when a generator is passed as content (stream if nothing accesses content directly, otherwise don't stream but still execute middleware), and either stream or don't stream but issue a warning when streaming is explicitly requested.

In 1.7 we could raise a loud exception when stream_content=True and response.content is accessed directly. Middleware can do nothing if they don't care about or need to worry about streaming responses. If they are capable of functioning with a streaming response, they can check response.stream_content and wrap the content generator with `response.content = wrap(response)`.

Cheers.
Tai.

Ryan D Hiebert

unread,
Aug 23, 2012, 12:48:53 PM8/23/12
to django-d...@googlegroups.com
> In 1.7 we could raise a loud exception when stream_content=True and response.content is accessed directly. Middleware can do nothing if they don't care about or need to worry about streaming responses. If they are capable of functioning with a streaming response, they can check response.stream_content and wrap the content generator with `response.content = wrap(response)`.

Rather than requiring stream-capable middleware to check if it is stream content, and also require them to handle non-stream content separately, can we have .stream_content return an auto created generator with some sane automatic divisions (at newlines or every x bytes/characters). This would allow stream-aware middleware to act as if everything they do is streams, even if in reality it is not.

If we don't do this, it seems likely enough that middlewares that are stream-aware might end up doing this conversion anyway. If that's the case, moving that work into .stream_content can save them some common boilerplate code.

One possible problem with this approach would be that if there are many middlewares, alternating back and forth between stream aware and stream unaware, that the overhead might be significant. If it turned out that was the case, then proper ordering of middlewares would be part of optimizing a project, if it isn't already.

Ryan

Tai Lee

unread,
Aug 23, 2012, 9:47:24 PM8/23/12
to django-d...@googlegroups.com
Hi Ryan,

I don't think what you are suggesting would be necessary. Stream-capable middleware could avoid having to handle both cases by simply wrapping response.content with a new iterator in all cases, if they want to. Non-streaming responses can still be iterated. They just have fewer iterations (as few as one).

The only reason for a stream-capable middleware to access response.content explicitly in this case is for improved efficiency when the user doesn't care about streaming the response. I think GZipping the whole content is probably more efficient than GZipping chunks of it.

Cheers.
Tai.

Ryan Hiebert

unread,
Aug 24, 2012, 2:21:02 AM8/24/12
to django-d...@googlegroups.com
> I don't think what you are suggesting would be necessary. Stream-capable middleware could avoid having to handle both cases by simply wrapping response.content with a new iterator in all cases, if they want to. Non-streaming responses can still be iterated. They just have fewer iterations (as few as one).
>
> The only reason for a stream-capable middleware to access response.content explicitly in this case is for improved efficiency when the user doesn't care about streaming the response. I think GZipping the whole content is probably more efficient than GZipping chunks of it.

Tai, thanks for your response.

I was more than a bit confused on how the response object is normally consumed, which caused me to write a message that made me look more than a bit silly.

I see that a single-string iterator is a perfectly good way to do things, thank you for helping me see it. It seems obvious in hindsight.

Ryan

Aymeric Augustin

unread,
Aug 25, 2012, 4:22:02 PM8/25/12
to django-d...@googlegroups.com
Hi Tai,

Thanks for your work and sorry for the late answer. I just had the time to review the discussion.


In my opinion, option 2 is reasonable. I'm -0 on a stream_content keyword argument for HttpResponse. I +1 on adding a separate HttpStreamingResponse class, because it better reflects the difference in behavior, and because it makes it possible to define a different API from the regular HttpResponse (for instance, we can chose not to define response.content). Then we can use a deprecation path to remove support for streaming responses in the regular HttpResponse class.

Once the deprecation path completes, I expect that:
- HttpResponse will be simpler and more predictable (as it'll no longer contain code to support streaming)
- both HttpResponse and HttpStreamingResponse will have intuitive and well-behaved APIs (which isn't true right now)
- the implementations will be easier to understand and to maintain — HttpResponse was one of the hardest classes to port to Python 3, so much that I created a ticket saying it needs a refactoring.

At a high level, I think this is a good compromise. It is fully backwards compatible for people that don't use streaming — the vast majority of Django users. It provides a robust foundation for the minority who does. It takes a bit more work than your current patch :)


In addition, I'm -1 on encoding include or exclude filters in the MIDDLEWARE_CLASSES setting. Chains of includes and excludes are never a good idea — anyone who's struggled with a complex Apache configuration knows what I'm talking about!


Best regards,

--
Aymeric.


Anssi Kääriäinen

unread,
Aug 25, 2012, 5:07:31 PM8/25/12
to Django developers
On 25 elo, 23:22, Aymeric Augustin
<aymeric.augus...@polytechnique.org> wrote:
> In my opinion, option 2 is reasonable. I'm -0 on a stream_content keyword argument for HttpResponse. I +1 on adding a separate HttpStreamingResponse class, because it better reflects the difference in behavior, and because it makes it possible to define a different API from the regular HttpResponse (for instance, we can chose not to define response.content).  Then we can use a deprecation path to remove support for streaming responses in the regular HttpResponse class.

I have been advocating for the stream_content approach simply because
of possible problems in using HttpResponse subclasses in streaming
situations. This is not a problem for me, as I usually have only
HttpResponseForbidden and HttpResponseReverse in use, and those
doesn't seem too useful with streaming content... However, if
somebody has a heavily customized HttpResponse subclass in use, and
needs to use it both in streaming and normal content situations, then
that user might have some problems. Are there such uses for
HttpResponse?

There are more votes for the subclass approach. So, if no further
opinions are raised lets go with that one.

- Anssi

Tai Lee

unread,
Sep 22, 2012, 2:12:18 PM9/22/12
to django-d...@googlegroups.com
Hi Aymeric & Anssi,

Thanks for your feedback. I have updated my branch with an implementation of a new `HttpStreamingResponse` class plus tests and updated the ticket. Please take a look and let me know if this is moving in the right direction, and if there is any other feedback. If the approach is OK, I will add docs as well.

https://code.djangoproject.com/ticket/7581#comment:32
https://github.com/thirstydigital/django/tree/tickets/7581-streaming-response-middleware

Thanks.
Tai.

Tai Lee

unread,
Sep 29, 2012, 12:45:50 AM9/29/12
to django-d...@googlegroups.com
Docs have been added (patch should be complete now) and pull request has been opened, at Anssi's request.


Cheers.
Tai.

Anssi Kääriäinen

unread,
Sep 29, 2012, 9:20:28 AM9/29/12
to Django developers
On 29 syys, 07:45, Tai Lee <real.hu...@mrmachine.net> wrote:
> Docs have been added (patch should be complete now) and pull request has
> been opened, at Anssi's request.
>
> https://groups.google.com/d/topic/django-developers/RrNPfuJxnlM/discu...

To me it seems the use of streaming responses in static file serving
will cause problems for upgraders. If using 3rd party middleware
accessing .content, then the following will happen:
- The static files will now automatically use streaming responses
- The middleware will be applied to the static file responses, too
- There is no .content for streaming responses -> problems

The problem stems from using the new streaming response by default in
Django. If it wasn't used, then there would be no problems for
upgrades.

I think we have to put in place something like
BackwardsCompatStreamingResponse - on access of .content raise a
deprecation warning, but still allow .content to work. This could be
used for two releases in static file serving.

Alternate approach is to tell upgraders to wrap non-working middleware
in a subclass skipping the middleware for streaming responses.

- Anssi

Aymeric Augustin

unread,
Oct 19, 2012, 10:23:01 AM10/19/12
to django-d...@googlegroups.com
Hello,

I've just written a (final?) proposal on the ticket:
https://code.djangoproject.com/ticket/7581#comment:36

If you still want to weigh on this issue or issue a veto (there were several -1 over the last 4 years), now is the time.

Otherwise I'll update the patch and try to commit it in 1.5.

--
Aymeric.

Aymeric Augustin

unread,
Oct 20, 2012, 3:44:57 PM10/20/12
to django-d...@googlegroups.com
Hello,

I just added StreamingHttpResponse and closed #7581. There've been many contributions to this ticket over the years; I committed the simplest thing that could possibly work. If I missed something important, please let me know.

The yak shaving goes on with:
- https://code.djangoproject.com/ticket/6527 — that's part 1 of the problem described by Tai Lee, only part 2 is fixed at this time
- https://code.djangoproject.com/ticket/13222
- https://code.djangoproject.com/ticket/18796

Best regards,

--
Aymeric.



Aymeric Augustin

unread,
Oct 22, 2012, 5:50:49 PM10/22/12
to django-d...@googlegroups.com
Hello,

While I'm working on HttpResponse, I'd like to resolve two related tickets. Here's my plan for discussion.


#6527 is about normalizing how HttpResponse deals with iterators. For more background, see part 1 of the first email of this thread.

Let's take an HttpResponse instantiated with an iterator.

It's possible not to access the content until the WSGI server iterates on the response. This allows creating streaming responses in Django 1.4. But it's incompatible with most middleware.

Most often middleware will access the content before the response is finally rendered. The first access to response.content works, but subsequent accesses return an empty string.

The fix is easy: consume the content of the iterator and save it in a list for further use. This can be done at two points:

a) In __init__. This is a simple solution advocated by Malcolm in several tickets and it decreases the complexity of HttpResponse (a worthy goal).

It's backwards incompatible because it breaks the streaming scenario described above. We'd have to raise a deprecation warning when the iterator is consumed before the final iteration by the WSGI server, tell people to use StreamingHttpResponse instead, and make the change after two releases.

b) When required. This is the solution implemented in this patch:
https://code.djangoproject.com/attachment/ticket/6527/6527-v3.patch
Even if the target is a), this is a useful workaround for the length of the deprecation period.


#18796 is about converting the content of an HttpResponse to bytes.

I believe the result should be identical when accessing response.content / response.streaming_content and when iterating on the response. Currently this isn't guaranteed (for historical reasons for .content, by my fault for .streaming_content).

This becomes easier once #6527 is fixed, because we know exactly where to apply the conversion in HttpResponse: in the method that consumes the iterator, and in __next__. This translates easily to StreamingHttpResponse.

The conversion function:
- must accept text and bytes as input;
- historically accepts integers, and some tests rely on this; I can't figure out under which real-life circumstances this would be useful, but it doesn't really hurt either;
- must deal with encoding "appropriately". This is also the topic on #10190.

I have to work on that third point. There's encoding information available both in self._charset and in the Content-Encoding header. When force_bytes is called on a bytes object with a charset other than utf-8, it will decode the bytes as utf-8 and re-encode them with the provided charset. Is this appropriate? I'll probably write down all the combinations of input type and encoding information, and determine what the output should be in each case. In practice everyone must be using utf-8 everywhere anyway…


Finally, I've closed five or six other tickets that were fixed at the same time as #7581 or didn't add anything to the two tickets described above; please tell me if I've missed something.

Thanks,

--
Aymeric.



Anssi Kääriäinen

unread,
Oct 23, 2012, 7:11:36 AM10/23/12
to Django developers
On 23 loka, 00:51, Aymeric Augustin
<aymeric.augus...@polytechnique.org> wrote:
> Hello,
>
> While I'm working on HttpResponse, I'd like to resolve two related tickets. Here's my plan for discussion.
>
> #6527 is about normalizing how HttpResponse deals with iterators. For more background, see part 1 of the first email of this thread.
>
> Let's take an HttpResponse instantiated with an iterator.
>
> It's possible not to access the content until the WSGI server iterates on the response. This allows creating streaming responses in Django 1.4. But it's incompatible with most middleware.
>
> Most often middleware will access the content before the response is finally rendered. The first access to response.content works, but subsequent accesses return an empty string.
>
> The fix is easy: consume the content of the iterator and save it in a list for further use. This can be done at two points:
>
> a) In __init__. This is a simple solution advocated by Malcolm in several tickets and it decreases the complexity of HttpResponse (a worthy goal).
>
> It's backwards incompatible because it breaks the streaming scenario described above. We'd have to raise a deprecation warning when the iterator is consumed before the final iteration by the WSGI server, tell people to use StreamingHttpResponse instead, and make the change after two releases.
>
> b) When required. This is the solution implemented in this patch:https://code.djangoproject.com/attachment/ticket/6527/6527-v3.patch
> Even if the target is a), this is a useful workaround for the length of the deprecation period.

I vote consume on __init__ after deprecation period. During the
deprecation period, cache on first access. The behaviour of (b)
"stream except if accessed" doesn't seem useful. If you need streaming
response, then use HttpStreamingResponse, if not, then you don't get
streaming response ever.

How do we actually deprecate this? Would we raise deprecation for any
iterator passed to HttpResponse?

> Finally, I've closed five or six other tickets that were fixed at the same time as #7581 or didn't add anything to the two tickets described above; please tell me if I've missed something.

Thanks for getting streaming responses committed. I tried to work on
this feature but my HTTP/WSGI skills weren't good enough to actually
commit anything...

- Anssi

Aymeric Augustin

unread,
Oct 24, 2012, 5:52:48 PM10/24/12
to django-d...@googlegroups.com
Le 22 oct. 2012 à 23:50, Aymeric Augustin <aymeric....@polytechnique.org> a écrit :

> There's encoding information available both in self._charset and in the Content-Encoding header.

I was confused — Content-Encoding is typically "gzip" of "deflate"; it has nothing to do with the response charset.

For all practical values of Content-Encoding [1], the content is compressed binary data, which must be represented as bytes. I don't see any reason to support any other data type when a Content-Encoding is set.

I've written a patch for #18796 [2] and I plan to commit it before the feature freeze.

[1] http://en.wikipedia.org/wiki/HTTP_compression#Content-coding_tokens
[2] https://code.djangoproject.com/attachment/ticket/18796/18796.patch

--
Aymeric.



Aymeric Augustin

unread,
Nov 28, 2012, 1:07:02 PM11/28/12
to django-d...@googlegroups.com
2012/11/28 Gavin Wahl <gavi...@gmail.com>
I would like to avoid having two code paths, one with streaming and one
without, in new middleware. 

Hi Gavin,

Could you give an example of a middleware that:
  - needs to alter the content,
  - will work identically both on regular HttpResponses and StreamingHttpResponses,
  - will not consume a streamed content (that's part of the contract of StreamingHttpResponse)?

Thanks,

--
Aymeric.

Gavin Wahl

unread,
Nov 28, 2012, 1:52:52 PM11/28/12
to django-d...@googlegroups.com
- GzipMiddleware (compress_string(x) == compress_sequence([x]))
- StripWhitespaceMiddleware
- ContentDoctorMiddleware (https://github.com/unomena/django-content-doctor)
- A hypothetical middleware that appends debug information to the end
of the response. debug-toolbar may be able to work like this.

Aymeric Augustin

unread,
Nov 28, 2012, 5:30:14 PM11/28/12
to django-d...@googlegroups.com
Hi Gavin,

The whole point of the HttpStreamingResponse API to access the content is to be incompatible with HttpResponse. Otherwise it's too easy to consume streamed content accidentally.

Re-introducing a common API ("streaming_content") would remove that benefit, it would be confusing in the non-streamed case, and it would certainly be misused ("yes I just used ''.join(streamed_content), isn't that the canonical way to access response content in all cases?").

The incompatible API is a hint that you must think about what you're doing when you're writing streaming-aware middleware, and specifically, that you mustn't consume the iterator.


Your goal is interesting but it seems optimistic to me. Out of the four examples you gave, two are very hard to implement correctly for streaming responses.

> - GzipMiddleware (compress_string(x) == compress_sequence([x]))

This one is implemented by Django. If you look at the code, you'll see it's reasonable to use a different implementation for the streamed and not-streamed cases.

> - StripWhitespaceMiddleware

For this one, the pattern would be:

if response.streaming:
response.streaming_content = itertools.imap(strip_whitespace, response.streaming_content)
else:
response.content = strip_whitespace(response.content)

Yes, this is a bit of boilerplate. I'd like to reduce it, but not at the cost of giving a streaming_content attribute to regular responses; it's too confusing. You can probably abstract the pattern in a response middleware base class. That works only when the "transform" function can be applied to any subset of the content. I don't expect many useful middleware to fall in this category

> - ContentDoctorMiddleware (https://github.com/unomena/django-content-doctor)

This won't work on streaming responses without consuming the content, because a match for the regex may overlap a boundary between response chunks.

> - A hypothetical middleware that appends debug information to the end
> of the response. debug-toolbar may be able to work like this.

Looking for the </body> boils down to the same problem as above, and can't be implemented without consuming the content for the same reason.

--
Aymeric.



Gavin Wahl

unread,
Nov 28, 2012, 6:05:05 PM11/28/12
to django-d...@googlegroups.com
>> - A hypothetical middleware that appends debug information to the end
>> of the response. debug-toolbar may be able to work like this.
> Looking for the </body> boils down to the same problem as above, and can't be implemented without consuming the content for the same reason.

It can be done, it's just a little harder for streaming. If the chunk
you are processing ends with any prefix of '</body>', that is, ('<',
'</', '</b', '</bo', ...), you combine it with the next chunk and look
for '</body>' again.. I don't think this counts as 'consuming' the
content, because you'll only combine combine a few chunks, and only
when necessary.

This treatment, while difficult, is totally necessary when dealing
with iterators over responses, no matter the API. Since you have to
write the complex stream processing anyway (assuming you want to work
with streaming responses), HttpResponse should not force you to write
an additional implementation for the plain, non-streaming response.

> You can probably abstract the pattern in a response middleware base class.
That's a good idea, and it's probably what I'll do if this doesn't
make it into Django. I think this use case is common enough for
HttpResponse to handle, though.
Reply all
Reply to author
Forward
0 new messages