[Web-SIG] Chunked Transfer encoding on request content.

Graham Dumpleton

Mar 4, 2007, 6:28:26 PM3/4/07
to web...@python.org
The WSGI specification doesn't really say much about chunked transfer encoding
for content sent within the body of a request. The only thing that appears to
apply is the comment:

WSGI servers must handle any supported inbound "hop-by-hop" headers on their
own, such as by decoding any inbound Transfer-Encoding, including chunked
encoding if applicable.

What does this really mean in practice though?

To get feedback on what the correct approach is, I'll go through how the
CherryPy WSGI server handles it. The problem is that the CherryPy approach
raises a few issues which make me wonder whether it is doing it in the most
appropriate way.

In CherryPy, when it sees that the Transfer-Encoding is set to 'chunked' while
parsing the HTTP headers, it will at that point, even before it has called
start_response for the WSGI application, read in all content from the body of
the request.

CherryPy reads in the content like this for two reasons. The first is so that
it can then determine the overall length of the content that was available and
set the CONTENT_LENGTH value in the WSGI environ. The second reason is so that
it can read in any additional HTTP header fields that may occur in the trailer
after the last data chunk and also incorporate them into the WSGI environ.
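To make that concrete, here is a rough Python sketch (a hypothetical helper, not CherryPy's actual code) of the decode-everything-up-front approach: read every chunk, then parse any trailer headers, at which point the total length is known and CONTENT_LENGTH can be set.

```python
import io

def read_chunked_body(rfile):
    """Hypothetical helper: decode an entire chunked body up front,
    returning (body, trailers); len(body) then gives the value a server
    like CherryPy would place in CONTENT_LENGTH."""
    parts = []
    while True:
        # Each chunk starts with a hex size, optionally followed by ";extensions".
        size = int(rfile.readline().split(b";", 1)[0], 16)
        if size == 0:
            break
        parts.append(rfile.read(size))
        rfile.readline()  # consume the CRLF terminating the chunk data
    # Trailer headers follow the zero-length chunk, ending with a blank line.
    trailers = {}
    while True:
        line = rfile.readline().strip()
        if not line:
            break
        name, _, value = line.partition(b":")
        trailers[name.strip().lower()] = value.strip()
    return b"".join(parts), trailers

raw = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\nX-Checksum: abc\r\n\r\n"
body, trailers = read_chunked_body(io.BytesIO(raw))
# body == b"Wikipedia" (so CONTENT_LENGTH would be 9); trailers holds X-Checksum
```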

The first issue is that all of the content has been read in up front. This
denies a WSGI application the ability to stream content from the body of a
request and process it a bit at a time. If the content is huge, the buffering
can also cause the application process size to grow significantly.

The second issue, although I am confused about whether the CherryPy WSGI server
actually implements this correctly, is that if the client was expecting a
100 Continue response, it will need to be sent back to the client before any
content can be read. When chunked transfer encoding is not used, a good WSGI
server would only send such a 100 Continue response when the WSGI application
called read() on wsgi.input for the first time. That is, the 100 Continue
indicates that the application which is consuming the data is actually ready to
start processing it. The CherryPy WSGI server circumvents that, so the client
could think the final consumer application is ready before it actually is.

Note that I am assuming here that 100 Continue is still usable in conjunction
with chunked transfer encoding. The CherryPy WSGI server only actually sends
the 100 Continue after it attempts to read content in the presence of a
chunked transfer encoding header. I'm not sure whether this is actually a bug.

The CherryPy WSGI server also doesn't wait until the first read() by the WSGI
application before sending back the 100 Continue; instead it sends it as soon
as the headers are parsed. This may be fine, but it is possibly not optimal,
as it denies an application the ability to fail a request and thereby avoid
the client sending the actual content.
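A sketch of what deferring the 100 Continue until the first read() might look like (hypothetical wrapper names; a real server would also need to check the HTTP version and handle timeouts):

```python
import io

class ExpectContinueInput:
    """Sketch of a deferred 100 Continue: the interim response is only
    written when the application first calls read(), so the application
    can still fail the request before the client sends any of the body."""

    def __init__(self, rfile, wfile, expects_continue):
        self.rfile = rfile               # request body stream
        self.wfile = wfile               # stream back to the client
        self._pending = expects_continue

    def read(self, size=-1):
        if self._pending:
            # First read(): the consumer is genuinely ready, so tell the
            # client to start sending the body now.
            self.wfile.write(b"HTTP/1.1 100 Continue\r\n\r\n")
            self._pending = False
        return self.rfile.read(size)

wfile = io.BytesIO()
inp = ExpectContinueInput(io.BytesIO(b"data"), wfile, True)
sent_before = wfile.getvalue()   # b"": nothing sent before the app reads
data = inp.read()
sent_after = wfile.getvalue()    # the interim response has now gone out
```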

Now, to my mind, the preferred approach would be that the content would not
be read up front like this and instead CONTENT_LENGTH would simply be unset
in the WSGI environ.

From prior discussions related to input filtering on the list, a WSGI
application shouldn't really pay much attention to CONTENT_LENGTH anyway
and should just use read() to get data until it returns an empty string.
Thus, for chunked data, not knowing the content length up front shouldn't
matter, as the application should just call read() until there is no more.
BTW, it may not be this simple for something like a proxy, but that is a
discussion for another time.

Doing this also means that the 100 Continue only gets sent when the application
is ready, and there is no need for the content to be buffered up.
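The read-until-empty pattern the application would use is a simple loop (a minimal sketch; a real wsgi.input backed by a socket blocks until data arrives, and io.BytesIO merely stands in for it here):

```python
import io

def consume_body(environ, chunk_size=8192):
    """Stream a request body without consulting CONTENT_LENGTH: just
    call read() until it returns an empty string. This works the same
    whether the length was known up front or the body was chunked."""
    total = 0
    while True:
        data = environ["wsgi.input"].read(chunk_size)
        if not data:
            break
        total += len(data)  # a real application would process data here
    return total

# Note there is no CONTENT_LENGTH key in the environ at all.
env = {"wsgi.input": io.BytesIO(b"x" * 20000)}
```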

That it is the actual application consuming the data, and not some
intermediary, means that an application could implement a mechanism whereby
it reads some data, acts on it, and starts sending some data in response. The
client might then send more data based on that response, which the application
only then reads, sends more data in response to, and so on. Thus an end-to-end
communication stream can be established where the overall content length of
the request could never be established up front.

The only problem with deferring any reading of data until the application
actually wants to read it is that there is no way to get access to the
additional headers in the trailer of the request and have them available in
the WSGI environ, since the WSGI environ has already been handed to the
application before any data was read.

So, what gives? What should a WSGI server do for chunked transfer encoding on
a request?

I may not totally understand 100 Continue and chunked transfer encoding, and
am happy to be corrected in my understanding of them, but what the CherryPy
WSGI server does doesn't seem right to me at first look.

Graham
_______________________________________________
Web-SIG mailing list
Web...@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/python-web-sig-garchive-9074%40googlegroups.com

Sidnei da Silva

Mar 4, 2007, 7:55:04 PM3/4/07
to Graham Dumpleton, web...@python.org
I'm not quite aware of the 100 Continue semantics, but I know that
applications which request Transfer-Encoding: chunked should *not*
expect a Content-Length response header, nor should the WSGI thingie
doing the 'chunking' need to know it in advance.

'chunked' is actually very simple. Simplifying it a lot, it basically
needs to output '%x\r\n%s\r\n' % (len(chunk), chunk) for every chunk
of data except the last which should be '0\r\n\r\n'. The only trick
here is ensuring that no chunk of length '0' is written except the
last.
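In code, that framing might be sketched like so (a minimal sketch in modern Python; the names are illustrative):

```python
def encode_chunked(chunks):
    """Chunk-encode an iterable of byte strings. Zero-length chunks are
    skipped so that only the terminating chunk has length zero."""
    out = []
    for chunk in chunks:
        if chunk:
            out.append(b"%x\r\n%s\r\n" % (len(chunk), chunk))
    out.append(b"0\r\n\r\n")  # last chunk, optionally followed by trailers
    return b"".join(out)

encoded = encode_chunked([b"Wiki", b"", b"pedia"])
# encoded == b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"
```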

What might be happening is that CherryPy is outputting the whole response
body as a single chunk and relying on the 'Content-Length' header, which
would be silly. I hope that's not what's happening, though I haven't looked.

--
Sidnei da Silva
Enfold Systems http://enfoldsystems.com
Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

Graham Dumpleton

Mar 4, 2007, 8:33:38 PM3/4/07
to Sidnei da Silva, web...@python.org
Sidnei da Silva wrote ..

> I'm not quite aware of the 100 Continue semantics, but I know that
> applications which request Transfer-Encoding: chunked should *not*
> expect a Content-Length response header, nor should the WSGI thingie
> doing the 'chunking' need to know it in advance.
>
> 'chunked' is actually very simple. Simplifying it a lot, it basically
> needs to output '%x\r\n%s\r\n' % (len(chunk), chunk) for every chunk
> of data except the last which should be '0\r\n\r\n'. The only trick
> here is ensuring that no chunk of length '0' is written except the
> last.
>
> What might be happening is that CherryPy is outputting the whole
> response body as a single chunk, and relying on the 'Content-Length'
> header, which would be silly, I hope that's not what's happening
> though I haven't looked.

I am not talking about the response body. I am talking about the body of
the request. For example, the body of a POST request being sent from
client to server.

Graham

Robert Brewer

Mar 4, 2007, 9:02:25 PM3/4/07
to Graham Dumpleton, web...@python.org

Graham Dumpleton wrote:
> In CherryPy, when it sees that the Transfer-Encoding
> is set to 'chunked' while parsing the HTTP headers,
> it will at that point, even before it has called
> start_response for the WSGI application, read in all
> content from the body of the request.
>
> CherryPy reads in the content like this for two reasons.
> The first is so that it can then determine the overall
> length of the content that was available and set the
> CONTENT_LENGTH value in the WSGI environ.

Right; IIRC the rfile just hangs if you try to read
past Content-Length. Perhaps that can be fixed inside
socket.makefile somewhere?



> The second reason is so that it can read in any
> additional HTTP header fields that may occur in
> the trailer after the last data chunk and also
> incorporate them into the WSGI environ.

Yeah; I didn't see any other way to get Trailers into
the environ. Perhaps that can be added to WSGI 2.0?

I also just haven't had time to write a dechunker
which worked on the fly. Patches welcome ;)



> When chunked transfer encoding is not used, such a
> 100 continue response would in a good WSGI server
> only be sent when the WSGI application called read()
> on wsgi.input for the first time.

Sounds reasonable. Again, patches welcome ;)



> Note that I am assuming here that 100 continue is
> still usable in conjunction with chunked transfer
> encoding. In CherryPy WSGI server it only actually
> sends the 100 continue after it attempts to try
> and read content in the presence of a chunked
> transfer encoding header. Not sure if this is
> actually a bug or not.

It looks like a bug. The Expect header should be
checked before decode_chunked (at least until the
100 response can be moved inside read()).

Thanks for catching those!


Robert Brewer
System Architect
Amor Ministries
fuma...@amor.org

Sidnei da Silva

Mar 4, 2007, 9:13:11 PM3/4/07
to Graham Dumpleton, web...@python.org
On 3/4/07, Graham Dumpleton <gra...@dscpl.com.au> wrote:
> I am not talking about the response body. I am talking about the body of
> the request. For example, the body of a POST request being sent from
> client to server.

Ah, ok. Anyway I don't see why it would need to read the whole body to
do chunked.

--
Sidnei da Silva
Enfold Systems http://enfoldsystems.com
Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

Graham Dumpleton

Mar 4, 2007, 11:50:43 PM3/4/07
to Robert Brewer, web...@python.org
Robert Brewer wrote ..

> Graham Dumpleton wrote:
> > In CherryPy, when it sees that the Transfer-Encoding
> > is set to 'chunked' while parsing the HTTP headers,
> > it will at that point, even before it has called
> > start_response for the WSGI application, read in all
> > content from the body of the request.
> >
> > CherryPy reads in the content like this for two reasons.
> > The first is so that it can then determine the overall
> > length of the content that was available and set the
> > CONTENT_LENGTH value in the WSGI environ.
>
> Right; IIRC the rfile just hangs if you try to read
> past Content-Length. Perhaps that can be fixed inside
> socket.makefile somewhere?
>
> > The second reason is so that it can read in any
> > additional HTTP header fields that may occur in
> > the trailer after the last data chunk and also
> > incorporate them into the WSGI environ.
>
> Yeah; I didn't see any other way to get Trailers into
> the environ. Perhaps that can be added to WSGI 2.0?

Don't know how you could cater for trailers in WSGI 2.0 without coming up with
some totally new scheme of passing such additional information to the WSGI
application.

The first idea I can think of at present is that, when chunked transfer
encoding is used, the WSGI server sets 'wsgi.trailers' to an empty dictionary
which it keeps a reference to and only populates when it actually encounters
the trailers. I.e., it is only guaranteed to be populated once read() finally
returns an empty string. Any middleware would be obligated to pass the
reference through, and not actually copy the dictionary, so that changes made
later at the WSGI server layer would be visible to the application.

The second idea I can think of is a new member function on 'wsgi.input'
called 'trailers()' which could be used to access them. Alternatively,
'wsgi.trailers' could itself be a function. Either way, it could return None
when the trailers are not yet known and a dictionary once they are.
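A rough sketch of that second idea (an entirely hypothetical API, shown with a simple dechunking read for illustration):

```python
import io

class ChunkedInput:
    """Hypothetical wsgi.input wrapper: a trailers() method returns None
    until the last chunk has been read, then a dict of trailer headers."""

    def __init__(self, rfile):
        self.rfile = rfile
        self._trailers = None  # unknown until the body is fully consumed
        self._done = False

    def trailers(self):
        return self._trailers

    def read_chunk(self):
        """Return the next decoded chunk, or b'' at end of body."""
        if self._done:
            return b""
        size = int(self.rfile.readline().split(b";", 1)[0], 16)
        if size == 0:
            self._done = True
            self._trailers = self._read_trailers()
            return b""
        data = self.rfile.read(size)
        self.rfile.readline()  # consume the CRLF terminating the chunk
        return data

    def _read_trailers(self):
        trailers = {}
        while True:
            line = self.rfile.readline().strip()
            if not line:
                break
            name, _, value = line.partition(b":")
            trailers[name.strip().lower()] = value.strip()
        return trailers

raw = b"3\r\nfoo\r\n0\r\nX-Extra: 1\r\n\r\n"
inp = ChunkedInput(io.BytesIO(raw))
before = inp.trailers()        # None: trailers not yet seen
body = b""
while True:
    chunk = inp.read_chunk()
    if not chunk:
        break
    body += chunk
after = inp.trailers()         # populated once read_chunk() returned b""
```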

One problem with this is that in Apache, when the trailers are encountered,
the lower level HTTP filter simply merges them on top of the existing input
headers. You wouldn't want to pass the full set of input headers again, which
means the WSGI adapter for Apache would need to remember what headers it
originally put in the environ, and include in the trailers only what had
changed and thus was actually in the trailer.

Anyway, it looks for the time being that if I am going to support streaming of
chunked data, I will have to state as a limitation that trailers aren't
available, as WSGI doesn't provide a way of getting them.

BTW, I looked around at the various packages trying to provide a WSGI server
and I can't find one besides CherryPy WSGI server that even attempts to support
chunked encoding on input. Makes it hard to use what other people did as a
guide. :-(

Graham

Mark Nottingham

Sep 4, 2007, 10:04:15 PM9/4/07
to Graham Dumpleton, web...@python.org
Are you actually seeing chunked request bodies in the wild? If so,
from what UAs?

IME they're not very common, because of lack of support in most
servers, and some interop issues with proxies (IIRC).

Cheers,



--
Mark Nottingham http://www.mnot.net/

Graham Dumpleton

Sep 5, 2007, 7:55:14 AM9/5/07
to Mark Nottingham, web...@python.org
On 05/09/07, Mark Nottingham <mn...@mnot.net> wrote:
> Are you actually seeing chunked request bodies in the wild? If so,
> from what UAs?
>
> IME they're not very common, because of lack of support in most
> servers, and some interop issues with proxies (IIRC).

It has come up as an issue on mod_python list a couple of times. Agree
though that it isn't common. From memory the people were using custom
user agents designed for a special purpose.

Just because it isn't common doesn't mean that an attempt shouldn't be
made to support it, especially if it is part of the HTTP standard.

The same solution for handling this would also be applicable in cases where
mutating input filters are used which change the length of the request
content but are unable to update the Content-Length header. Thus, as with
chunked encoding, a way is needed in this circumstance to indicate that there
is content, but that the length isn't known.

Graham
