[RFC] Fixing middleware/view decorator handling of content iterators

17 views
Skip to first unread message

Forest Bond

unread,
Feb 15, 2010, 4:16:11 PM2/15/10
to django-d...@googlegroups.com
Hi,

Django allows an iterator to be passed as response content when instantiating an
HttpResponse. However, doing so causes problems with the following classes and
functions:

UpdateCacheMiddleware:
Caches the response object using the configured cache backend, but most
iterators cannot be pickled.
CommonMiddleware:
Calculates a value for the ETag header, causing the iterator to be drained
prematurely.
GzipMiddleware:
Calculates the response content length, causing the iterator to be drained
prematurely.
ConditionalGetMiddleware
Calculates the response content length and sets the Content-Length header,
causing the iterator to be drained prematurely.
patch_response_headers
Calculates a value for the ETag header, causing the iterator to be drained
prematurely.

Some of these problems are discussed in the following tickets:

#6027: FileWrapper iterator drained by GzipMiddleware before content can be returned
#6527: A bug in HttpResponse with iterators
#12214: never_cache decorator breaks HttpResponse with iterator content

Attached is a proof-of-concept patch illustrating how I'd like to tackle this
issue.

Roughly, my approach is as follows:

* Introduce a ResponseContent class with subclasses MutableResponseContent and
IteratorResponseContent to limit operations on the response content depending
on the type of the value provided (string type or iterator).
* Forbid premature draining of the content iterator via response.content by only
evaluating the content iterator if accessed via iter(response) and raising an
exception if it is accessed via response.content.
* Modify middleware/view decorator code that accesses response.content to check
response.content_obj.is_readable() before trying to read the content.

This approach should be backwards compatible, with the exception that if a
middleware or view decorator does inappropriately access response.content, a
TypeError will be raised. Previously, the content iterator would be prematurely
drained and an empty response would be returned. Since the previous behavior
was not usable, it seems unlikely anyone would be depending on it.

I have a few questions:

* Does this approach make sense to others? If so, I'll add a few tests and look
at how the docs might need updating.
* Is this type of change too invasive for 1.2?
* Should I create a "master" ticket to deal with this issue and continue
development work there? The other tickets focus on specific symptoms of this
problem rather than the root cause more generally.

Any comments are appreciated.

Thanks,
Forest
--
Forest Bond
http://www.alittletooquiet.net
http://www.pytagsfs.org

django-response-content-improvements.patch
signature.asc

Jeremy Dunck

unread,
Feb 15, 2010, 4:27:27 PM2/15/10
to django-d...@googlegroups.com
On Mon, Feb 15, 2010 at 3:16 PM, Forest Bond <for...@alittletooquiet.net> wrote:
...

> * Forbid premature draining of the content iterator via response.content by only
>  evaluating the content iterator if accessed via iter(response) and raising an
>  exception if it is accessed via response.content.
...

> * Is this type of change too invasive for 1.2?

I assume you mean 1.3, since 1.2 is in feature freeze.

Anyway, the response.content attribute is documented and client code
may depend on it, making this a backwards-compatibility issue.

http://docs.djangoproject.com/en/dev/ref/request-response/#django.http.HttpResponse.content

For GZip support of iterators and similar large in-flight values, I
think spooling to memory and disk, like file upload does, might be a
reasonable solution, but still, .content is just a string, so it'll be
difficult to keep compatibility.

Just my opinion. I hope a committer overrules me. ;-)

Forest Bond

unread,
Feb 15, 2010, 4:45:20 PM2/15/10
to django-d...@googlegroups.com
Hi,

On Mon, Feb 15, 2010 at 03:27:27PM -0600, Jeremy Dunck wrote:
> On Mon, Feb 15, 2010 at 3:16 PM, Forest Bond <for...@alittletooquiet.net> wrote:
> ...
> > * Forbid premature draining of the content iterator via response.content by only
> >  evaluating the content iterator if accessed via iter(response) and raising an
> >  exception if it is accessed via response.content.
> ...
> > * Is this type of change too invasive for 1.2?
>
> I assume you mean 1.3, since 1.2 is in feature freeze.

... and so this change would be disallowed if considered a new feature, but it's
purpose is to fix bugs, although it is a non-trivial change.

> Anyway, the response.content attribute is documented and client code
> may depend on it, making this a backwards-compatibility issue.
>
> http://docs.djangoproject.com/en/dev/ref/request-response/#django.http.HttpResponse.content

Yes, I did address this in my message. Accessing response.content if an
iterator was passed in would result in an empty response, which doesn't seem
like behavior that is worth preserving in the name of backwards compatibility.

On the other hand, the following pattern is the exception (this would not
produce an empty response):

response.content = some_function(response.content)

I think that pattern is the primary backwards compatibility concern.

> For GZip support of iterators and similar large in-flight values, I
> think spooling to memory and disk, like file upload does, might be a
> reasonable solution, but still, .content is just a string, so it'll be
> difficult to keep compatibility.

If backwards compatibility for this case must be maintained, my approach would
need to be adjusted to either evaluate the iterator and store the result (as you
suggest) or to keep the current behavior WRT response.content and just fix the
offending middleware/view decorators (i.e. use my patch but not throw an
exception when response.content is read).

If we do take the evaluate-and-store approach, I'd want to give the user some
control over this behavior so that he could disable it if the content was going
to be quite large or otherwise inappropriate for temporary storage.

Thanks for your feedback!

-Forest

signature.asc

Russell Keith-Magee

unread,
Feb 18, 2010, 7:00:22 AM2/18/10
to django-d...@googlegroups.com

I think I understand where you're coming from, but I'm not immediately
convinced it's the right approach. In particular, I'm not convinced
we can lock down the ability to iterate over responses more than once
in the way you have done.

I acknowledge that this fixes the problem (in the sense that it makes
the errors caused by multiple iteration go away), but I'm not sure
it's the right solution. Disabling Etags, content-length coding,
gziped content and so on seems like throwing the baby out with the
bathwater.

My immediate reaction is that we should be trying to capture and
preserve the process of evaluating the iterator, not limit the ability
to iterate. That is - the first attempt to evaluate the iterator
should store the content, and that evaluated content should be made
available by anything else that needs to iterate over the content.

Of course, the simplest way to do this would be to just evaluate the
iterator as soon as it is provided -- which kind of negates any
benefit of using an iterator in the first place. So -- I suppose the
bigger question that needs to be asked is what exactly is the use case
for an iterable response? I mean, I understand the general benefit of
using iterators and generators instead of rolled out lists, but what
is the use case in a HTTP context?

> * Is this type of change too invasive for 1.2?

I'm inclined to say yes. I know this is addressing a number of
specific tickets, so it isn't strictly a feature addition, but those
issues have existed for a long time, and the change you're proposing
has a conceptually wide footprint - I'd rather see changes like this
treated like a feature, so they get the benefit of review through
multiple prereleases, etc.

> * Should I create a "master" ticket to deal with this issue and continue
>  development work there?  The other tickets focus on specific symptoms of this
>  problem rather than the root cause more generally.

This would be a good idea. Given that there is a specific course of
action on the table, it's worth having a ticket to track the evolution
of the idea. By way of comparison - this is exactly what Ben Firshman
did with class-based Syndication views in ticket #12403; the ticket
addressed a dozen other tickets, but provided a single place to
coordinate review of the specific solution.

Make sure you reference any other existing tickets, but don't close
those tickets as duplicates - that way, if we find an alternate
approach, we can close this new ticket as wontfix without losing the
original problem reports.

Yours,
Russ Magee %-)

Tom Evans

unread,
Feb 18, 2010, 7:24:23 AM2/18/10
to django-d...@googlegroups.com
On Thu, Feb 18, 2010 at 12:00 PM, Russell Keith-Magee
<freakb...@gmail.com> wrote:
...

> benefit of using an iterator in the first place. So -- I suppose the
> bigger question that needs to be asked is what exactly is the use case
> for an iterable response? I mean, I understand the general benefit of
> using iterators and generators instead of rolled out lists, but what
> is the use case in a HTTP context?
>

From my own experience, we have several web pages that either do a lot
of web service RPC calls (10-12k per request) to other services, or do
a smaller number of very expensive WS calls. Either way, this takes a
lot of time to do this, and so to avoid timeouts, we would iteratively
build up the page, yielding a small HTML comment every n-iterations,
depending on how fast etc the WS calls were.

In a similar manner, we also have some pages where we do some complex
data mining queries thru django's ORM. On these pages as well, we
would use an iterator based response to avoid timing out whilst
building the page. Since generating this data is very expensive time
wise, we use a modified CacheMiddleware to buffer and store the
generated page in the cache. Sample code is on ticket 6527 (1).

Cheers

Tom

(1) http://code.djangoproject.com/ticket/6527#comment:19

Forest Bond

unread,
Feb 18, 2010, 11:34:34 AM2/18/10
to django-d...@googlegroups.com
Hi,

Thanks for your response, Russ.

Let me put my use case out there to ground my responses a bit.

I'm using Rackspace Cloud Files, a cloud storage service, to store large files
(several hundred megabytes up to a few gigabytes in size). I need to serve
these files to authenticated users via a view.

(There are probably other ways to expose the files, but this is by far the
simplest and most obvious. I'd rather not open the discussion up to include
rearchitecting the design to side step the iterator issue.)

Doing this requires being able to stream the data from the storage service to
the user. Reading all data from the storage service and then sending it to the
user would not work as the user's connection would time out.

I don't really care to cache the data on the web server, anyway. The point of
the storage service is to offload the storage of these large files so that the
web server does not need to accommodate them. I don't want to be worried about
the web server running out of disk space if it has to handle too many
simultaneous requests for files.

On Thu, Feb 18, 2010 at 08:00:22PM +0800, Russell Keith-Magee wrote:
> I acknowledge that this fixes the problem (in the sense that it makes
> the errors caused by multiple iteration go away), but I'm not sure
> it's the right solution. Disabling Etags, content-length coding,
> gziped content and so on seems like throwing the baby out with the
> bathwater.

These features are not disabled, of course, they are just not automatically
calculated. If a view needs to return an iterator but it wants these things, it
should be able to calculate Etag and Content-Length headers itself easily
enough.

For instance, Cloud Files provides an Etag and size for each file that is
stored. These can be accessed without retrieving the file and are trivial to
pass along in the HTTPResponse from my view.

> Of course, the simplest way to do this would be to just evaluate the
> iterator as soon as it is provided -- which kind of negates any
> benefit of using an iterator in the first place. So -- I suppose the
> bigger question that needs to be asked is what exactly is the use case
> for an iterable response? I mean, I understand the general benefit of
> using iterators and generators instead of rolled out lists, but what
> is the use case in a HTTP context?

... and I think this hits the nail on the head.

Why would I be looking to use an iterator unless I was dealing with a data
stream that was going to take some time to deliver? In other words, does it
ever make sense to use an iterator except in those situations where capturing
the response content before sending it to the client would be prohibitive?

In my mind, the answer to that question is "no".

However, I do see that enforcing a single iteration may seem restrictive for the
general case. If that is the consensus, then I think there should be some way
to trigger streaming-only, iterate-once handling.

Perhaps one of the following options makes sense:

* StreamingHttpResponse(content = iterator)
* HttpResponse(content = iterator, streaming = True)
* HttpResponse(content = StreamingResponseContent(iterator))

Then, for the general case, we could capture the data from the iterator and
store it on disk or in memory so that it can be accessed arbitrarily in
middleware, etc. Frankly, storing it in memory makes sense to me given that if
there's a lot of data you'll want to use the streaming version to avoid
connection timeouts. Maybe there's some use-case I'm not thinking of, though.

> > * Is this type of change too invasive for 1.2?
>
> I'm inclined to say yes. I know this is addressing a number of
> specific tickets, so it isn't strictly a feature addition, but those
> issues have existed for a long time, and the change you're proposing
> has a conceptually wide footprint - I'd rather see changes like this
> treated like a feature, so they get the benefit of review through
> multiple prereleases, etc.

Makes sense to me.

> > * Should I create a "master" ticket to deal with this issue and continue
> >  development work there?  The other tickets focus on specific symptoms of this
> >  problem rather than the root cause more generally.
>
> This would be a good idea. Given that there is a specific course of
> action on the table, it's worth having a ticket to track the evolution
> of the idea. By way of comparison - this is exactly what Ben Firshman
> did with class-based Syndication views in ticket #12403; the ticket
> addressed a dozen other tickets, but provided a single place to
> coordinate review of the specific solution.

Okay, I'll create a ticket.

> Make sure you reference any other existing tickets, but don't close
> those tickets as duplicates - that way, if we find an alternate
> approach, we can close this new ticket as wontfix without losing the
> original problem reports.

Agreed.

signature.asc

Tai Lee

unread,
Feb 18, 2010, 11:40:55 PM2/18/10
to Django developers
See also #7581, which has a patch that I have been using in my local
Django branch for over a year without issue. My use case is that I
need to generate large amounts of data in CSV format and stream the
response to avoid timeouts. I wouldn't mind `HttpResponse` consuming
`content` immediately if we had a `Http StreamingResponse` or similar
subclass, but the issue I found after discussion with Malcolm and Ivan
Sagalaev and ccahoon who worked on the http-wsgi-improvements branch
relates to middleware needing to be aware of when they should or
should not consume the content in a response, when they should replace
the content or generator in a response, or when they should be skipped
entirely.

Here's my suggestion taken from a recent comment in #7581:

I suggest that we simply add an attribute
`HttpResponse.disabled_middleware` (list or tuple). Any installed
middleware specified there should be bypassed when applying response
middleware. Django could then ship with an `HttpStreamingResponse`
class that disables any included middleware that is known not to work
with generator content, and this could easily be subclassed and
updated by users who need to disable any 3rd party middleware to
create a streaming response.

Cheers.
Tai.

Forest Bond

unread,
Feb 20, 2010, 3:46:52 PM2/20/10
to django-d...@googlegroups.com
Hi,

On Thu, Feb 18, 2010 at 08:40:55PM -0800, Tai Lee wrote:
> See also #7581, which has a patch that I have been using in my local
> Django branch for over a year without issue. My use case is that I
> need to generate large amounts of data in CSV format and stream the
> response to avoid timeouts. I wouldn't mind `HttpResponse` consuming
> `content` immediately if we had a `Http StreamingResponse` or similar
> subclass, but the issue I found after discussion with Malcolm and Ivan
> Sagalaev and ccahoon who worked on the http-wsgi-improvements branch
> relates to middleware needing to be aware of when they should or
> should not consume the content in a response, when they should replace
> the content or generator in a response, or when they should be skipped
> entirely.

Okay, you're ticket is already tracking the main problem here. I won't create a
new one. Instead, I'll add patches/comments to that ticket (after discussion
here).

I didn't realize you had already done some work on this. Let's work together so
to avoid duplicating effort.

Here are the use cases that have been identified:

1. A content iterator was passed for no apparent reason. The view could just
as easily have returned HttpResponse(''.join(iterator)) with no ill effects.
I assume this use-case is what people have in mind when they talk about
evaluating the iterator and storing the content in memory or on disk; doing
this with large responses could cause a connection timeout.

2. Your use case:

* The response content must be streamed.
* The content length is unknown and won't be known until after the headers
are sent.
* An Etag couldn't be calculated because the content is not known until
after the headers are sent.
* The content stream should be gzip compressed.
* Cache headers should be added.
* Response content should not be cached locally.

3. My use case:

* The response content must be streamed.
* The content length is known and the Content-Length header can be set in
the view.
* The Etag is known and the Etag header can be set in the view.
* The streamed content is already compressed, so it should not be compressed
again. Further, compressing the stream would invalidate the previously
set Content-Length and Etag headers.
* Cache headers should be added.
* Response content should not be cached locally.

These are the fundamental ways in which handling of streaming and non-streaming
responses must differ:

=================== =============== =======================================
Feature Non-streaming Streaming
------------------- --------------- ---------------------------------------
Set Content-Length Yes No
------------------- --------------- ---------------------------------------
Set Etag Yes No
------------------- --------------- ---------------------------------------
Gzip compress Yes Yes if it would not invalidate existing
headers (e.g. Content-Length, Etag)
------------------- --------------- ---------------------------------------
Set cache headers Yes Yes
------------------- --------------- ---------------------------------------
Cache response Yes No, the response may be too big for
locally local caching to make sense (maybe we
should cache if we know the
content-length ahead of time and it is
not too large?).
=================== =============== =======================================

Do the above behaviors sound right to everyone?

Then there is the question of how to trigger handling as a streaming response.
Either we require a view to explicitly request it by using a
StreamingHttpResponse instead of a normal HttpResponse, or we assume that if the
view returns an iterator, it wants a streaming response.

Personally, I think that any view that returns an iterator must want a streaming
response. Otherwise, why would it have returned an iterator? I can't think of
any other reason to do that. However, I have no objection to requiring views
use StreamingHttpResponse explicitly.

Once we know that a response must be streamed, I think it makes sense to forbid
middleware from breaking streaming by evaluating the iterator prematurely. I've
advocated throwing an exception when that happens so that the middleware can be
fixed to handle streaming responses correctly.

Hopefully we can get some consensus on the above issues. If we can, I'm more
than happy to whip up a patch implementing the behavior we can all agree on.

> Here's my suggestion taken from a recent comment in #7581:
>
> I suggest that we simply add an attribute
> `HttpResponse.disabled_middleware` (list or tuple). Any installed
> middleware specified there should be bypassed when applying response
> middleware. Django could then ship with an `HttpStreamingResponse`
> class that disables any included middleware that is known not to work
> with generator content, and this could easily be subclassed and
> updated by users who need to disable any 3rd party middleware to
> create a streaming response.

Okay, I think "disabled_middleware" is a bad name because it doesn't actually
imply that all middleware should be disabled. Anyway, it seems like middleware
could just check if isinstance(response, StreamingHttpResponse).

I look forward to your feedback.

signature.asc

Forest Bond

unread,
Feb 20, 2010, 4:14:20 PM2/20/10
to django-d...@googlegroups.com
Hi,

On Sat, Feb 20, 2010 at 03:46:52PM -0500, Forest Bond wrote:
> Then there is the question of how to trigger handling as a streaming response.
> Either we require a view to explicitly request it by using a
> StreamingHttpResponse instead of a normal HttpResponse, or we assume that if the
> view returns an iterator, it wants a streaming response.
>
> Personally, I think that any view that returns an iterator must want a streaming
> response. Otherwise, why would it have returned an iterator? I can't think of
> any other reason to do that. However, I have no objection to requiring views
> use StreamingHttpResponse explicitly.

I should note that in order to maintain backwards compatibility, passing an
iterator to HttpResponse should deliver at least the current level of support
for streaming the response. That's an argument in support of enabling streaming
behavior for a response if an iterator is passed in (i.e. not requiring views to
use StreamingHttpResponse to get streaming behavior).

-Forest

signature.asc

Forest Bond

unread,
Feb 20, 2010, 4:22:44 PM2/20/10
to django-d...@googlegroups.com
Hi,

Sorry for the volume of messages, but I had one more note to add. ;)

On Sat, Feb 20, 2010 at 03:46:52PM -0500, Forest Bond wrote:

> Here are the use cases that have been identified:
>
> 1. A content iterator was passed for no apparent reason. The view could just
> as easily have returned HttpResponse(''.join(iterator)) with no ill effects.
> I assume this use-case is what people have in mind when they talk about
> evaluating the iterator and storing the content in memory or on disk; doing
> this with large responses could cause a connection timeout.

If it wasn't already clear, I don't think that this use case is one that should
be supported. I included it because it seems that there is some expectation
that it is (as evidenced by the proposal that iterators be evaluated and the
result stored in memory or on disk).

-Forest

signature.asc

Tai Lee

unread,
Feb 21, 2010, 4:28:23 AM2/21/10
to Django developers
On Feb 21, 7:46 am, Forest Bond <for...@alittletooquiet.net> wrote:
> Okay, I think "disabled_middleware" is a bad name because it doesn't actually
> imply that all middleware should be disabled.  Anyway, it seems like middleware
> could just check if isinstance(response, StreamingHttpResponse).

I'm not sure I follow your reasoning here. If you just don't like the
name, that's a bike shed argument. I'm proposing that
`disabled_middleware` be a list or tuple of middleware that should NOT
be called on the response, because the author of the `HttpResponse`
subclass that defined `disabled_middleware` knows that those
middleware are not needed or will not work as intended.

The point that I gleaned from earlier discussions is that we cannot
fix this problem in middleware. We cannot have middleware check `if
isinstance(response, StreamingHttpResponse)` and conditionally change
its behaviour because we would force ALL Django and 3rd party
middleware in existence to check for the possibility that a response
has generator or string content and change their behaviour
accordingly.

I believe the answer is to define explicit `HttpResponse` subclasses
and provide a mechanism by which each subclass can explicitly disable
certain middleware. That way, all existing Django and 3rd party
middleware do not need to change or know or care about what kind of
content the response has. Then the developer can simply disable any
middleware that is causing a problem with a particular response,
regardless if it is a problem with streaming, or they don't want an
ETag, or they don't want a response compressed, etc.

Cheers.
Tai.

Russell Keith-Magee

unread,
Feb 21, 2010, 9:16:52 AM2/21/10
to django-d...@googlegroups.com

I agree in spirit, but I'm not sure I can agree in practice. The docs
currently say it's ok to return an iterator in a response. As this
discussion and the various tickets indicate, this isn't entirely true.
However, in order to be backwards compatible with our documented API,
you *should* be able to use an iterator in a normal HTTPResponse.

That doesn't mean it has to be spectacularly efficient -- we could
simply baking an iterator with ''.join(iterator) if it is provided to
HttpResponse. However, it would mean that any existing uses of
iterators would continue to work, and the iterator draining bugs would
disappear.

As a way out of this documentation hole, part of the solution may be
to deprecate the use of iterators in HttpResponse (in favor of
StreamingHttpResponse, or whatever we end up with). If the
''.join(iterator) call also raises a PendingDeprecation warning, we
can drive people to either fix their code (manually evaluating the
iterator before using it), or switch to the streaming response type.

Yours,
Russ Magee %-)

Russell Keith-Magee

unread,
Feb 21, 2010, 9:44:07 AM2/21/10
to django-d...@googlegroups.com
On Sun, Feb 21, 2010 at 5:28 PM, Tai Lee <real....@mrmachine.net> wrote:
> On Feb 21, 7:46 am, Forest Bond <for...@alittletooquiet.net> wrote:
>> Okay, I think "disabled_middleware" is a bad name because it doesn't actually
>> imply that all middleware should be disabled.  Anyway, it seems like middleware
>> could just check if isinstance(response, StreamingHttpResponse).
>
> I'm not sure I follow your reasoning here. If you just don't like the
> name, that's a bike shed argument. I'm proposing that
> `disabled_middleware` be a list or tuple of middleware that should NOT
> be called on the response, because the author of the `HttpResponse`
> subclass that defined `disabled_middleware` knows that those
> middleware are not needed or will not work as intended.
>
> The point that I gleaned from earlier discussions is that we cannot
> fix this problem in middleware. We cannot have middleware check `if
> isinstance(response, StreamingHttpResponse)` and conditionally change
> its behaviour because we would force ALL Django and 3rd party
> middleware in existence to check for the possibility that a response
> has generator or string content and change their behaviour
> accordingly.

Any existing middleware that tries to evaluate response.content will
be draining iterator-based content at present, which will be breaking
responses. If a middleware is currently buggy, providing an API for
developers to make their middlewares not buggy is certainly a viable
option.

That said, a straight up "isinstance" call probably isn't the best
option - more on this in a bit.

> I believe the answer is to define explicit `HttpResponse` subclasses
> and provide a mechanism by which each subclass can explicitly disable
> certain middleware. That way, all existing Django and 3rd party
> middleware do not need to change or know or care about what kind of
> content the response has. Then the developer can simply disable any
> middleware that is causing a problem with a particular response,
> regardless if it is a problem with streaming, or they don't want an
> ETag, or they don't want a response compressed, etc.

The problem with this approach is that it doesn't work well with a
fully reusable stack. If I build an app with the intention of re-use,
I don't know the middleware stack that will exist when my app is
deployed.

If I manually exclude an ETag middleware, there's no guarantee that
middleware will be used. This isn't inherently a problem - we could
just document that any middlware named for exclusion that isn't in use
will be ignored.

However, the converse situation does cause problems. If the developer
using my app has written a custom middleware than tries to twiddle the
content of the response, I won't know the name of that middleware in
advance to put it on a ban list.

To me, it seems like what we need to be able to capture in a response
object is the set of capabilities that a response supports -- For
example:

* Can I determine the length of this response?
* Can I inspect the contents of this response?
* Can I modify the contents of this response?

In the case of a simple static HTTP response, the answer is "yes" to all three.

In the case of Simon Willison's proposed TemplateResponse (see ticket
#12815), the answer to both is maybe, depending on the whether the
template has been baked yet (and there might be certain conditions
under which you might want a length/content request to induce baking)

In Forest's "streaming from Cloudfiles" use case, we can provide the
length of the response; we can't inspect or twiddle the content.

Tai's use case can't provide a length, but will allow content
modification (to provide a GZipped stream, for example)

These capabilities are just a starting point. Defining these
capabilities will be an important design process. It would be easy to
just add "Has a known ETag?" and "Can be cached?" as capabilities, but
we may need to look deeper to see if these questions are really asking
something more significant.

I'm not sure the Etag question has anything deeper behind it, but the
caching restriction really has more to do with whether we can know in
advance whether the content will be large, or whether the entire
response can (or should) be loaded into memory at once. Regardless of
the final form -- it's the deeper concepts that need to be captured as
capabilities.

What I'm driving at here is getting to the core of duck typing. As the
developer of FooMiddleware, I don't really care whether I get a
HttpResponse, a StreamingHTTPResponse, or something else entirely --
what I care about is whether I can iterate over the content, modify
the content, ask for a content length, or whatever it is that my
middleware needs to do. The capabilities is more important than the
specific type -- especially when the capabilities are subject to
change (such as Forrest's example where Etags and content lengths can
be provided, as long as GZipping is disabled).

Yours,
Russ Magee %-)

Forest Bond

unread,
Feb 21, 2010, 3:48:02 PM2/21/10
to django-d...@googlegroups.com
Hi,

On Sun, Feb 21, 2010 at 01:28:23AM -0800, Tai Lee wrote:
> On Feb 21, 7:46 am, Forest Bond <for...@alittletooquiet.net> wrote:
> > Okay, I think "disabled_middleware" is a bad name because it doesn't actually
> > imply that all middleware should be disabled.  Anyway, it seems like middleware
> > could just check if isinstance(response, StreamingHttpResponse).
>
> I'm not sure I follow your reasoning here. If you just don't like the
> name, that's a bike shed argument. I'm proposing that
> `disabled_middleware` be a list or tuple of middleware that should NOT
> be called on the response, because the author of the `HttpResponse`
> subclass that defined `disabled_middleware` knows that those
> middleware are not needed or will not work as intended.

Sorry, I badly misread your message. Now that I've reread it I see that the
name is perfectly appropriate.

signature.asc

Forest Bond

unread,
Feb 21, 2010, 6:07:35 PM2/21/10
to django-d...@googlegroups.com
Hi,

On Sun, Feb 21, 2010 at 10:44:07PM +0800, Russell Keith-Magee wrote:
> To me, it seems like what we need to be able to capture in a response
> object is the set of capabilities that a response supports -- For
> example:
>
> * Can I determine the length of this response?
> * Can I inspect the contents of this response?
> * Can I modify the contents of this response?
>
> In the case of a simple static HTTP response, the answer is "yes" to all three.

I had considered this kind of approach but wondered if the additional complexity
was really worth it. Now that I've thought about it some more, I'm pretty sure
that it isn't.

Consider the following "capabilities":

* "Has known ETag?"
* "Has known content length?"

I can already answer these questions with the current code. I do it like this:

# "Has known ETag?"
if 'ETag' in response:
...

# "Has known content length?"
if 'Content-Length' in response:
...

In other words, many "capabilities" of the form "Has feature foo?" can be
expressed as "Has header bar?".

Now, I'm being a little unfair because when you say "Can I determine the length
of this response?" you clearly mean "Can I do len(response.content)?". However,
consider that HttpResponse could easily do this in __init__:

self['Content-Length'] = len(self.content)

... and the problem simply disappears. Now, middleware that wants to know if
the response has a known length can just look for the Content-Length header and
be done with it (assuming that for StreamingHTTPResponses the view would
explicitly set the Content-Length header if it can).

There are some "capabilities" that clearly aren't synonymous with "Has header
bar?". For instance:

* "Can I read the content?"
* "Can I replace the content?"
* "Can I wrap the content iterator with a new iterator?" (e.g. gzip streams)

But I advocate strongly against creating a capabilities-based API just to answer
these questions, because I think there are only two useful sets of answers: one
set for responses with string content, and one set for responses with iterator
content (but maybe I lack imagination ;).

Anyway, if you accept that there are few (if any) reasons to have these
individual capabilities explicitly query-able, then perhaps you are also willing
to accept that isinstance(response, StreamingHTTPResponse) is a perfectly
adequate way to find the answer to the "can I do X" questions above.

Of course, once you know that you can, you have to decide if you should, which
is a completely separate matter...

> In Forest's "streaming from Cloudfiles" use case, we can provide the
> length of the response; we can't inspect or twiddle the content.

Actually, you *can* inspect the content and even change it (just like in Tai's
use case). I would just rather you didn't because I want to keep my
Content-Length and Etag headers intact. Those headers are more important to me
than any benefit provided by Gzipping, etc.

So it's not strictly a question of capabilities. It's also a question of doing
what I mean. If I set headers in my view (especially important headers like
Content-Length), I sort of expect that they won't just be dropped by some
middleware (although I'm pretty much okay with middleware replacing them with
new values where appropriate).

(Alternatively, it's a question of preference rather than correctness. In other
words, perhaps there should be some way for me to ask that a particular response
not be gzip-compressed. But I think that is middleware-specific and orthogonal
to the issue at hand, so I'd rather just ignore it for now.)

My point is that if GzipMiddleware was modified to do streaming compression if
it can do so without invalidating existing headers, that covers both my use case
and Tai's. Sure, I don't love that this behavior is triggered implicitly, but
again, I think explicitly requesting it should be handled by middleware-specific
mechanisms such as view decorators, etc.

(The never_cache decorator provides a good example. We could add a never_gzip
decorator.)

> I'm not sure the Etag question has anything deeper behind it, but the
> caching restriction really has more to do with whether we can know in
> advance whether the content will be large, or whether the entire
> response can (or should) be loaded into memory at once. Regardless of
> the final form -- it's the deeper concepts that need to be captured as
> capabilities.

On the other hand, you can do it just as easily without capabilities:

if isinstance(response, StreamingHTTPResponse):
try:
content_length = int(response['Content-Length'])
except KeyError:
pass
else:
if content_length < 200000:
# This might be implemented using itertools.tee
add_streaming_response_to_cache(response)

else:
add_normal_response_to_cache(response)

Obviously this is an oversimplification (I ignore never_cache, etc.), but I
think the point stands. Simply knowing whether or not the response is streaming
and whether or not it has a known size is sufficient for knowing whether or not
you should cache it.

(If you wanted to get fancy you could handle local caching of streaming
responses without a known size–you could wrap the content iterator with a
generator function that determines the final size of the content and possibly
caches it.)

> What I'm driving at here is getting to the core of duck typing. As the
> developer of FooMiddleware, I don't really care whether I get a
> HttpResponse, a StreamingHTTPResponse, or something else entirely --
> what I care about is whether I can iterate over the content, modify
> the content, ask for a content length, or whatever it is that my
> middleware needs to do. The capabilities is more important than the
> specific type -- especially when the capabilities are subject to
> change (such as Forrest's example where Etags and content lengths can
> be provided, as long as GZipping is disabled).

I completely agree that middleware should be able to inspect response objects to
employ duck-typing-style checks. But I don't think that a formalized
capabilities-based interface is the only way to do this, and I think that
capabilities ("can I do X") are only half of the picture (the other half being
"should I do X"), so even if response capabilities are known, middleware is
still going to have to infer a lot based on what headers are already set.

But if someone can come up with a use case where "has header bar?" and
isinstance(response, StreamingHTTPResponse) are insufficient and a
capabilities-style interface would work, I will gladly accept that interface.

signature.asc

Luke Plant

unread,
Feb 21, 2010, 7:39:40 PM2/21/10
to django-d...@googlegroups.com
On Sunday 21 February 2010 23:07:35 Forest Bond wrote:

> Simply knowing whether or
> not the response is streaming and whether or not it has a known
> size is sufficient for knowing whether or not you should cache it.

I can't find it now, but someone definitely had a use case where they
definitely wanted local caching *and* a streaming response (it was
views that took a long time to generate a small amount of actual
content - they outputted an occasional HTML comment to stop the
connection from timing out). So I don't think the equation you make
stands - we have to leave this to the developer to tell us.

Implementing Russell's proposal could be as simple as:

if getattr(response, 'can_modify_content', True):
# go ahead and modify content

This:
- is backwards compatible
- is easy to use - just add the attribute to your response,
or even create a decorator for the view
- has far better duck typing properties than using isinstance.
This is really important for people who may have developed
their own HttpResponse subclass.

(That's not necessarily a final API, I'm just illustrating the
advantages over isinstance).

Django should make the defaults for this behaviour amount to 'do what
I mean', but add an API which provides the explicit 'do what I say'
control.

Luke


--
Noise proves nothing. Often a hen who has merely laid an egg
cackles as if she laid an asteroid.
-- Mark Twain

Luke Plant || http://lukeplant.me.uk/

Forest Bond

unread,
Feb 21, 2010, 10:25:09 PM2/21/10
to django-d...@googlegroups.com
Hi,

On Mon, Feb 22, 2010 at 12:39:40AM +0000, Luke Plant wrote:
> On Sunday 21 February 2010 23:07:35 Forest Bond wrote:
>
> > Simply knowing whether or
> > not the response is streaming and whether or not it has a known
> > size is sufficient for knowing whether or not you should cache it.
>
> I can't find it now, but someone definitely had a use case where they
> definitely wanted local caching *and* a streaming response (it was
> views that took a long time to generate a small amount of actual
> content - they outputted an occasional HTML comment to stop the
> connection from timing out). So I don't think the equation you make
> stands - we have to leave this to the developer to tell us.

Caching streamed responses doesn't require a capabilities-style interface. The
CacheMiddleware could do this regardless of interface and either way would have
to wrap the content iterator in a generator.

Also, the view developer can already explicitly request non-caching behavior
using never_cache. This is the established way for the developer to specify
which behavior he wants.

> Implementing Russell's proposal could be as simple as:
>
> if getattr(response, 'can_modify_content', True):
> # go ahead and modify content
>
> This:
> - is backwards compatible
> - is easy to use - just add the attribute to your response,
> or even create a decorator for the view
> - has far better duck typing properties than using isinstance.
> This is really important for people who may have developed
> their own HttpResponse subclass.
>
> (That's not necessarily a final API, I'm just illustrating the
> advantages over isinstance).

I don't think the discussion is over using attributes versus isinstance. I'd be
happy with a response.is_streaming attribute instead of using
isinstance(response, StreamingHttpResponse).

If there is disagreement, it is between the following two view points:

* We require a set of abstract, discrete response "capabilities" to implement
middleware response handling semantics.
* There are two kinds of responses: "streamed" and "non-streamed". Everything
else is just headers.

As Russ noted, the real challenge in implementing a capabilities-based interface
is in deciding what the capabilities should be. Deciding what capabilities
should be supported and how they should be interpreted by middleware is much
more difficult than deciding how they should be accessed.

For instance, can you describe the semantics of this hypothetical
can_modify_content capability? How should middleware interpret it? What use
case in particular does it address?

And even if a response has "can_modify_content" set to True, how will the
middleware know whether the content should be modified using string operations
or by wrapping an iterator? Is that a separate capability?

signature.asc

Tai Lee

unread,
Feb 21, 2010, 11:55:10 PM2/21/10
to Django developers
I didn't think about the author of a `HttpResponse` subclass needing
to know all potential middleware that will not work with that
particular response class. That is definitely a problem. However, I
don't think that exposing a capabilities based API on `HttpResponse`
is the answer.

If we do that, then ALL middleware that already exists or is yet to be
written will need to interrogate every response and conditionally
execute based on the capabilities that the response has (and does not
have), which will be very repetitive and prone to error and still
likely result in 3rd party middleware not working as intended with
some responses.

From what I understand of previous discussions [1], this is what
Malcolm was strongly opposed to. Middleware should not need to examine
the response to determine if they can or cannot execute. E.g. if it is
possible for a response to have generator content that cannot be
consumed, then ALL middleware must check for this possibility.

Perhaps a better alternative is to provide a way to map middleware to
be skipped (or executed) for a specific `HttpResponse` class. I
hesitate to recommend Yet Another Setting, but this would give
developers control over which middleware execute (or do not execute)
for which responses, even if both the middleware and `HttpResponse`
class are 3rd party code.

MIDDLEWARE_CLASSES = (
'django.middleware.common',
)

CONDITIONAL_MIDDLEWARE_CLASSES = {
'exclude': {
'django.http.HttpResponseStreaming':
['django.middleware.common', 'otherapp.othermiddleware', ...],
},
'include': {
'myapp.MyHttpResponse': ['myapp.mymiddleware', ...],
},
}

Middleware specified in `MIDDLEWARE_CLASSES` would run all the time,
except when the middleware function is linked to the `HttpResponse`
class in `CONDITIONAL_MIDDLEWARE_CLASSES['exclude']`. Middleware that
is NOT specified in `MIDDLEWARE_CLASSES` would run only when the
middleware function is linked to the `HttpResponse` class in
`CONDITIONAL_MIDDLEWARE_CLASSES['include']`.

Cheers.
Tai.

[1] http://groups.google.com/group/django-developers/browse_thread/thread/3e0d78b98498c159/f08036fb1e7fe6b9

Reply all
Reply to author
Forward
0 new messages