Call for comments: CommonMiddleware (and others) break streaming HttpResponse objects.

19 views
Skip to first unread message

Tai Lee

unread,
Jul 1, 2008, 3:03:52 AM7/1/08
to Django developers
http://code.djangoproject.com/ticket/7581

Just posted this ticket with an initial patch (sans documentation
changes and tests). Basically there are several middleware classes
that access HttpResponse.content directly which break streaming
HttpResponse objects that use a generator to yield content
progressively.

I'd like to get some feedback on the direction and implementation of
the patch and proposed solution before I go any further with tests or
documentation changes.

Cheers.
Tai.

Ivan Sagalaev

unread,
Jul 1, 2008, 3:54:22 AM7/1/08
to django-d...@googlegroups.com

In GzipMiddleware you don't touch Content-Length if the response is not
string. This means that if a response has Content-Length already set it
will become incorrect upon gzipping the content. In this case
Content-Length can only be deleted from headers, though sending
responses with no determined length is a bit ugly...

Malcolm Tredinnick

unread,
Jul 1, 2008, 5:15:01 AM7/1/08
to django-d...@googlegroups.com

I don't think we should be supporting on-demand generation like this. Go
back a year or 18 months in the archives and look at where Brian Harring
put up some patches to do that sort of thing. He and I and others spent
ages chasing down unintended side-effects and in the end it was
considered not worth it. That was tip-of-the-iceberg stuff. Yes, it's
probably possible to get it right, but I'm not convinced it's worth it.

It's quite a large restriction to say that no middleware should ever try
to examine the contents of the HttpResponse since it might be an
iterator that shouldn't be consumed. You're proposing a bunch of
specific changes for the middleware we have now to work around that, but
you're also imposing a huge functionality constraint on any future
middleware (all middleware will have to work with the possibility that a
non-consumable generator is there). You're also effectively said that
only certain content types can be streamed (because a whole bunch of
them *will* be affected by consuming middleware). "Just say no" is a
cleaner solution here.

I thought we'd fixed it, but apparently we haven't: if any iterator is
passed into an HttpResponse, it should be converted to a string
immediately so that things can index into it without doing
non-repeatable consumption.

The only thing that might be worth doing in this are is adding a way to
say "middleware should never be called on this response" and then
somebody can write their own HttpResponse subclass and be in complete
control of their destiny.

Adding all the complexity to the existing HttpResponse gets a -1 from
me. It's too easy to have unintended side-effects and not really in the
sweet spot of Django. It also doesn't seem to be a 1.0 requirement.

Malcolm


Ivan Sagalaev

unread,
Jul 1, 2008, 6:02:35 AM7/1/08
to django-d...@googlegroups.com
Malcolm Tredinnick wrote:
> I thought we'd fixed it, but apparently we haven't: if any iterator is
> passed into an HttpResponse, it should be converted to a string
> immediately so that things can index into it without doing
> non-repeatable consumption.

Malcolm, sorry, that won't work. I've implemented iterable HttpResponse
in the first place out of purely practical reason: web server times out
if a particularly long file was passed into an HttpResponse. And also it
was really slow and was consuming all the memory.

In fact this ticket appeared out of our conversation with Tai Lee. I
don't think you argument on adding complexity makes sense because not
all middleware should care about this. Why, say, TransactionMiddleware
should be affected? Or some 3rd-party HTMLValidationMiddleware which
won't touch huge iterators because it checks only HTML? This patch just
establishes a good practice to follow: if you expect a "reasonably
short" response in your middleware -- check for it.

Malcolm Tredinnick

unread,
Jul 1, 2008, 6:07:32 AM7/1/08
to django-d...@googlegroups.com

On Tue, 2008-07-01 at 14:02 +0400, Ivan Sagalaev wrote:
> Malcolm Tredinnick wrote:
> > I thought we'd fixed it, but apparently we haven't: if any iterator is
> > passed into an HttpResponse, it should be converted to a string
> > immediately so that things can index into it without doing
> > non-repeatable consumption.
>
> Malcolm, sorry, that won't work. I've implemented iterable HttpResponse
> in the first place out of purely practical reason: web server times out
> if a particularly long file was passed into an HttpResponse. And also it
> was really slow and was consuming all the memory.

Any middleware that examines the content has to pull the content into
memory in case it's an iterator. If they don't they're buggy because
they're consuming the content ahead of the web server. If somebody has
particular requirements like yours, then, absolutely, they should use a
subclass of HttpResponse that doesn't do that. And they know the
restrictions from that decision. But the default behaviour shouldn't
require repetitive practices from normal, everyday middleware.

> In fact this ticket appeared out of our conversation with Tai Lee. I
> don't think you argument on adding complexity makes sense because not
> all middleware should care about this. Why, say, TransactionMiddleware
> should be affected? Or some 3rd-party HTMLValidationMiddleware which
> won't touch huge iterators because it checks only HTML? This patch just
> establishes a good practice to follow: if you expect a "reasonably
> short" response in your middleware -- check for it.

If I write a third-party middleware that behaves exactly as the
content-length middleware does now, it won't work with something that
can only be used as a generator. Since I won't know ahead of time what
responses will be give to this middleware I am therefore forced to
assume such a generator will be passed in. Thus, a change like this is
saying that from now on, for all time, no middleware can look at the
content of the response unless it fits into the very narrow profile in
that ticket (matching one or other content types, etc).

Malcolm

Ivan Sagalaev

unread,
Jul 1, 2008, 6:42:17 AM7/1/08
to django-d...@googlegroups.com
Malcolm Tredinnick wrote:
> Any middleware that examines the content has to pull the content into
> memory in case it's an iterator. If they don't they're buggy because
> they're consuming the content ahead of the web server.

Agreed.

> But the default behaviour shouldn't
> require repetitive practices from normal, everyday middleware.

What exactly you mean by "repetitive practices"? I'm afraid my English
fails me again, sorry!

> If I write a third-party middleware that behaves exactly as the
> content-length middleware does now, it won't work with something that
> can only be used as a generator. Since I won't know ahead of time what
> responses will be give to this middleware I am therefore forced to
> assume such a generator will be passed in. Thus, a change like this is
> saying that from now on, for all time, no middleware can look at the
> content of the response unless it fits into the very narrow profile in
> that ticket (matching one or other content types, etc).

If a middleware has to look at the content - it will. But why not make
certain standard middleware not required to look at the content? They
will be more useful then, and it doesn't imply, IMHO, that everyone
should do it. No?

This got me thinking that the real problem is, in fact, a bit different.
Right now middleware applies to all responses indifferently. But it
looks like there are at least one more or less common case -- returning
a big iterable -- for which most middleware don't make sense anyway and
it's better handled as an exception. So may be we should devise a
mechanism for such exceptions that won't require a middleware itself to
be aware of it.

Malcolm Tredinnick

unread,
Jul 1, 2008, 6:59:04 AM7/1/08
to django-d...@googlegroups.com

On Tue, 2008-07-01 at 14:42 +0400, Ivan Sagalaev wrote:
> Malcolm Tredinnick wrote:
> > Any middleware that examines the content has to pull the content into
> > memory in case it's an iterator. If they don't they're buggy because
> > they're consuming the content ahead of the web server.
>
> Agreed.
>
> > But the default behaviour shouldn't
> > require repetitive practices from normal, everyday middleware.
>
> What exactly you mean by "repetitive practices"? I'm afraid my English
> fails me again, sorry!

At the moment, lots of pieces of middleware are buggy because they
consume the content from response.content and neither the content
property nor the middleware itself puts the content back into the
response. Those pieces of middleware should, with the current state of
the code, be checking if the response was created from an iterator and,
if so, be careful to replace the content after use by poking into the
internals of the response. Every single piece of middleware that
examines response.content needs to do this (this is the bit that
requires the repetition).

Of course, that's a silly requirement (and requires poking at non-public
pieces of the response). The response should just return a copy of the
content when response.content is accessed, which means turning any
iterator into a proper string.

> > If I write a third-party middleware that behaves exactly as the
> > content-length middleware does now, it won't work with something that
> > can only be used as a generator. Since I won't know ahead of time what
> > responses will be give to this middleware I am therefore forced to
> > assume such a generator will be passed in. Thus, a change like this is
> > saying that from now on, for all time, no middleware can look at the
> > content of the response unless it fits into the very narrow profile in
> > that ticket (matching one or other content types, etc).
>
> If a middleware has to look at the content - it will. But why not make
> certain standard middleware not required to look at the content? They
> will be more useful then, and it doesn't imply, IMHO, that everyone
> should do it. No?

But it *does* require that everybody will have to behave this way. If I
am writing a piece of middleware I have to assume the worst case, which
here means that the content is a non-repeatable generator. Since
generators would be permitted in that world, I have to treat them as
possible in my middleware code. So I cannot examine the content.

Otherwise my middleware is crippled as a reusable piece of code: it
would have to carry a big warning saying "cannot be used unless you know
your responses will never use generators", which means I have to
investigate the details of every third-party app I have installed before
I can safely use my middleware. That simply is not going to be good
practice.

I really don't see that it's worth the enormous complexity it brings at
the moment to introduce this. It's not something that cannot wait and be
trialled in various approaches after 1.0, but rushing to create some
public interface and functionality promises right now so that generators
become fully supported in HttpResponse is going to lead to bugs. The
ticket itself already proposes some things that will lead to bugs: for
example, content-length must *always* be set, since chunked
transfer-encoding isn't permitted with WSGI, so it's not a question of
checking _is_string or not -- which, btw, is an internal attribute that
middleware should never access -- but, rather, a question of whether the
header has already been set.

Malcolm

Ivan Sagalaev

unread,
Jul 1, 2008, 7:23:42 AM7/1/08
to django-d...@googlegroups.com
Malcolm Tredinnick wrote:
> The response should just return a copy of the
> content when response.content is accessed, which means turning any
> iterator into a proper string.

Ouch.. I thought it already does just that. Yeah, it's a bug then.
Though simple to fix.

> But it *does* require that everybody will have to behave this way. If I
> am writing a piece of middleware I have to assume the worst case, which
> here means that the content is a non-repeatable generator.

Now I understand what you don't like. If we simply make HttpResponse to
replace its source generator with a string upon exhaustion, the problem
will disappear.

But it's another bug. I'm talking here along the lines "why don't we
make GzipMiddleware not read response.content if it doesn't have to".

> It's not something that cannot wait and be
> trialled in various approaches after 1.0

Oh, yes, definitely post-1.0.

> The
> ticket itself already proposes some things that will lead to bugs: for
> example, content-length must *always* be set

I was thinking about it myself... But don't things like Comet work with
infinitely long responses that don't have content length?

Tai Lee

unread,
Jul 1, 2008, 10:30:37 AM7/1/08
to Django developers
Certainly the implementation of the patch is not ideal (by accessing
private attributes on HttpResponse), which is why I brought it up for
discussion before going any further.

I do feel that there is a real benefit and even requirement in some
cases to supporting streaming HttpResponse objects. If there was no
streaming output, why would we accept generators (or anything other
than string) as the content to HttpResponse at all? It would be easier
to simply require content to be passed in as text and leave it to the
developer to consume any generator before instantiating a HttpResponse
object.

If there was no streaming output, I don't see a good alternative when
there's a need to return a large amount of dynamic content (e.g. CSV
export) in response to an HTTP request.

Even if users were to subclass HttpResponse on their own, won't they
run into the same problem with 3rd party middleware (and even standard
Django middleware at present) which ignores the fact that generators
can be passed to HttpResponse as content?

The other alternative I can think of is even less appealing - to
trigger some scheduled offline processing through a web request and
notify the user when it's ready to download as static media.

The streaming HttpResponse functionality was initially favoured by
Adrian when Ivan introduced it. It would be a shame to "just say no"
to it, now.

Is it really so bad to add a public attribute on HttpResponse that
indicates whether or not the content is a static string or a
generator, and have the Django standard middleware (especially any
that are enabled by default) be aware of this and only enable features
that require checking the entire content when a generator has not been
used. Features such as: generating an ETag header, compressing the
entire content in one hit instead of iteratively (is there a
significant benefit to doing this in one hit?), and setting the
Content-Length header?

The default behaviour by "unaware" middleware is still to consume any
content generator, so those middleware that are ignorant of streaming
response will continue to work as they do now. But 3rd party
developers who are aware of this can support it if they chose, or not,
and Django should support it as best it can.

Tai Lee

unread,
Jul 1, 2008, 10:45:29 AM7/1/08
to Django developers


On Jul 1, 7:15 pm, Malcolm Tredinnick <malc...@pointy-stick.com>
wrote:
> It's quite a large restriction to say that no middleware should ever try
> to examine the contents of the HttpResponse since it might be an
> iterator that shouldn't be consumed. You're proposing a bunch of
> specific changes for the middleware we have now to work around that, but
> you're also imposing a huge functionality constraint on any future
> middleware (all middleware will have to work with the possibility that a
> non-consumable generator is there).

I don't think that restriction true, and even if it were it wouldn't
be a significant change from the present functionality, or your
suggested functionality of consuming generators immediately.

If someone passes a "non-consumable" generator as content (either to
be consumed immediately or by 3rd party middleware), or simply had
potentially long running code in their views, the end result is the
same. A timeout in the browser. Supporting streaming HttpResponse
objects at least gives developers a *chance* to avoid these timeouts.

3rd party middleware can continue to access HttpResponse.content
directly and consume generators if they like. There doesn't have to be
a requirement that ALL middleware fully support streaming
HttpResponse. Clearly some middleware which need to manipulate or
access the entire content as a string cannot be streamed. But the core
Django middleware shouldn't consume content generators unnecessarily
and 3rd party middleware developers should have the *option* of
allowing content generators to go unmolested.

All that's required is a public attribute on HttpResponse that
middleware developers can access if they want to be aware of content
generators, to fix code Django middleware to be aware of this where
possible and where appropriate, and leave the default as it is - for
HttpResponse.content to consume content generators.

Tai Lee

unread,
Jul 3, 2008, 7:26:41 PM7/3/08
to Django developers
> The only thing that might be worth doing in this are is adding a way to
> say "middleware should never be called on this response" and then
> somebody can write their own HttpResponse subclass and be in complete
> control of their destiny.

Would this disable ALL middleware from running? Or only the
process_response methods on middleware? Either way, this is not ideal
as there are useful middleware that CAN still work with on-demand
content (e.g. the updated GZipMiddleware from the patch).

Aral

unread,
Aug 3, 2008, 3:35:37 PM8/3/08
to Django developers
Hi Ivan,

Another real-world use case: Google App Engine has a 1MB limit on data
structures. In this case, being able to stream the contents of several
variables will allow generating responses larger than 1MB (which is
important for me, for example, as I'm building a datastore backup
script for Google App Engine.)

I know it's an edge case for a single technology but one where being
able to stream content back would be very useful.

Aral

> Malcolm, sorry, that won't work. I've implemented iterable HttpResponse
> in the first place out of purely practical reason: web server times out
> if a particularly long file was passed into an HttpResponse. And also it
> was really slow and was consuming all the memory.
<snip>

mrts

unread,
Sep 22, 2008, 1:54:38 PM9/22/08
to Django developers
On Jul 4, 2:26 am, Tai Lee <real.hu...@mrmachine.net> wrote:
> > The only thing that might be worth doing in this are is adding a way to
> > say "middleware should never be called on this response" and then
> > somebody can write their own HttpResponse subclass and be in complete
> > control of their destiny.
>
> Would this disable ALL middleware from running? Or only the
> process_response methods on middleware? Either way, this is not ideal
> as there are useful middleware that CAN still work with on-demand
> content (e.g. the updated GZipMiddleware from the patch).

+1 for adding a way to say "process_response should never be called on
this response".

Taking a quick look at the source, HttpResponse seems to support
iteration already:

def __iter__(self):
self._iterator = iter(self._container)
return self

def next(self):
chunk = self._iterator.next()
if isinstance(chunk, unicode):
chunk = chunk.encode(self._charset)
return str(chunk)

I'd expect this is actually used in returning content, i.e. if no
middleware kicks in, passing an iterator to HttpResponse should just
work for streaming?

Tai Lee

unread,
Sep 22, 2008, 7:36:19 PM9/22/08
to Django developers
On Sep 23, 3:54 am, mrts <m...@mrts.pri.ee> wrote:
> +1 for adding a way to say "process_response should never be called on
> this response".
>
> Taking a quick look at the source, HttpResponse seems to support
> iteration already:
>
>     def __iter__(self):
>         self._iterator = iter(self._container)
>         return self
>
>     def next(self):
>         chunk = self._iterator.next()
>         if isinstance(chunk, unicode):
>             chunk = chunk.encode(self._charset)
>         return str(chunk)
>
> I'd expect this is actually used in returning content, i.e. if no
> middleware kicks in, passing an iterator to HttpResponse should just
> work for streaming?

I believe that's correct. However, I'm still not so sure about
stopping `process_response` from running for *all* middleware. As
mentioned above, some middleware can work and are useful with a
streaming response. How about these two ideas:

1) Add a `skipped_middleware` attribute to HttpResponse which is empty
by default, and when looping through and executing middleware, skip
those defined in there. That way the developer can either subclass
`HttpResponse` or simply modify an `HttpResponse` instance before
returning and specify exactly which middleware should be skipped?

2) Create an `HttpStreamingResponse` class and add a `can_stream`
attribute to each middleware which defaults to `False`. When you need
streaming functionality use the provided `HttpStreamingResponse` class
which will automatically skip any middleware for which `can_stream` is
`False`.

mrts

unread,
Sep 23, 2008, 6:50:30 AM9/23/08
to Django developers, man...@softwaremaniacs.org
On Jul 1, 1:02 pm, Ivan Sagalaev <man...@softwaremaniacs.org> wrote:
> I've implemented iterable HttpResponse
> in the first place out of purely practical reason: web server times out
> if a particularly long file was passed into an HttpResponse. And also it
> was really slow and was consuming all the memory.

Ivan, does this imply that iteration does not work out of the box,
i.e. ordinary HttpResponse __iter__() and next() are not used in the
expected way? Or were you just using some middleware that consumed the
response and customization was required because of that?
Reply all
Reply to author
Forward
0 new messages