HttpRequest.read() issues

HttpRequest.read() issues Tom Christie 4/3/11 7:03 AM
It's not very obvious from the docs or source if HttpRequest.read() can always be safely treated as a limited input stream, or if the developer needs to respect HttpRequest.META['CONTENT_LENGTH'].

As far as I can tell the intention is that it can always be treated as a limited stream (that at least seems to be the implication in WSGIRequest.__init__), in which case it looks to me like there are two bugs.

1. This code should not raise an AssertionError...

>>> from django.test.client import RequestFactory
>>> req=RequestFactory().post('/', {'foo':'bar'})
>>> req.read(9999)
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Users/tomchristie/workspace/django-rest-framework/env/lib/python2.6/site-packages/django/http/__init__.py", line 296, in read
    return self._stream.read(*args, **kwargs)
  File "/Users/tomchristie/workspace/django-rest-framework/env/lib/python2.6/site-packages/django/test/client.py", line 51, in read
    assert self.__len >= num_bytes, "Cannot read more than the available bytes from the HTTP incoming data."
AssertionError: Cannot read more than the available bytes from the HTTP incoming data.

After all, running under the dev server I can do this just fine without causing an exception:

def test(request):
    return HttpResponse("Read data: '%s'\n" % request.read(9999))

(In the first case the underlying stream isn't being wrapped in a LimitedStream; in the second case it is.)
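For anyone following along, the limiting behaviour in question can be sketched with a minimal stand-in for LimitedStream (illustrative only, not Django's actual implementation): an over-long read is truncated at the limit rather than hanging or raising.

```python
from io import BytesIO

class LimitedStream:
    # Minimal sketch of a stream wrapper that refuses to read past
    # a fixed number of bytes (not Django's actual implementation).
    def __init__(self, stream, limit):
        self.stream = stream
        self.remaining = limit

    def read(self, size=-1):
        if self.remaining <= 0:
            return b''
        if size < 0 or size > self.remaining:
            size = self.remaining
        data = self.stream.read(size)
        self.remaining -= len(data)
        return data

# An over-long read is silently truncated instead of hanging or
# raising, which is what you see under runserver's wrapped stream.
limited = LimitedStream(BytesIO(b'foo=bar'), 7)
print(limited.read(9999))  # b'foo=bar'
print(limited.read(10))    # b''
```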

2. Isn't the use of LimitBytes in MultiPartParser.parse() now redundant?

If the intention isn't that HttpRequest.read() can be treated as a limited stream, then shouldn't that be documented? And in any case, wouldn't it be better if it were always a limited stream? There's some duplicated CONTENT_LENGTH parsing across WSGIRequest, HttpRequest and MultiPartParser that looks like it could be avoided.

Am I just fundamentally misunderstanding something?

Cheers,

  Tom
Re: HttpRequest.read() issues Tom Christie 4/5/11 4:33 AM
I've created two tickets for this, with patches and tests...

http://code.djangoproject.com/ticket/15762 - WSGIRequest should wrap the test client wsgi.input in LimitedStream
http://code.djangoproject.com/ticket/15763 - MultiPartParser's LimitBytes is now redundant.

It's possible that I've misunderstood and you can't assume it'll be okay to do request.read(BUFFER_SIZE) without respecting CONTENT_LENGTH,
although if that's the case then presumably that's an issue in itself?

(E.g. because it means you can't hand the request over to some arbitrary parser that just treats it like any other file-like object.)

Cheers,

  t.
Re: HttpRequest.read() issues Ivan Sagalaev 4/6/11 8:00 PM
On 04/03/2011 07:03 AM, Tom Christie wrote:
> It's not very obvious from the docs or source if HttpRequest.read() can
> always be safely treated as a limited input stream, or if the developer
> needs to respect HttpRequest.META['CONTENT_LENGTH'].
>
> As far as I can tell the intention is that it can always be treated as a
> limited stream, that seems to at least be the implication in
> WSGIRequest.__init__, in which case it looks to me like there are two bugs.

Hello Tom,

I'll do my best to remember my intentions when this code was first
written. But anyway it might need some fresh thinking. Thanks!

I decided not to wrap streams in LimitedStream unconditionally because
it would affect performance. I confess I never actually measured the
hit, but reading from the connection seemed to me a hot enough place to
operate as close to the socket as possible, especially because the
whole thing was built for performance-sensitive cases with potentially
big request payloads.

In most practical situations request.read() was working just fine
without the wrapper: mod_wsgi, mod_python, flup. The only exception at
the time was Django's runserver, which would just hang when trying to
read past content_length bytes. There's actually a big comment on this
matter right before the LimitedStream wrapping in WSGIRequest.__init__.
I also thought that one could use LimitedStream explicitly in user code
if needed.
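As a hedged sketch of that tradeoff, the wrapping decision boils down to computing a read limit only for servers known to misbehave on over-reads, and passing the raw stream through otherwise (the boolean flag and function name here are illustrative, not Django's real check):

```python
def effective_read_limit(environ, server_blocks_on_overread):
    """Return the byte limit to enforce on wsgi.input, or None to
    use the raw stream untouched (sketch of the tradeoff described
    above; the flag stands in for Django's real server check)."""
    if not server_blocks_on_overread:
        # Trust the server: read as close to the socket as possible.
        return None
    try:
        return int(environ.get('CONTENT_LENGTH') or 0)
    except ValueError:
        return 0
```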

> 1. This code should not raise an Assertion Error...
>
>  >>> from django.test.client import RequestFactory
>  >>> req=RequestFactory().post('/', {'foo':'bar'})
>  >>> req.read(9999)
> Traceback (most recent call last):
> File "<console>", line 1, in <module>
> File
> "/Users/tomchristie/workspace/django-rest-framework/env/lib/python2.6/site-packages/django/http/__init__.py",
> line 296, in read
> return self._stream.read(*args, **kwargs)
> File
> "/Users/tomchristie/workspace/django-rest-framework/env/lib/python2.6/site-packages/django/test/client.py",
> line 51, in read
> assert self.__len >= num_bytes, "Cannot read more than the available
> bytes from the HTTP incoming data."
> AssertionError: Cannot read more than the available bytes from the HTTP
> incoming data.

Judging by the svn logs, this mock object (FakePayload) was written
back in the days of the uploads refactoring. Judging by its comment, it
is a safety measure to fail early and loudly, to protect client code
from even attempting to read the input stream past content_length.

So from all these here are my thoughts:

- I still believe that not applying the limiter to everything is a good
thing (it doesn't break real code and doesn't degrade performance)

- It may be a good idea for FakePayload to imitate real-world data
providers and just return empty buffers when there's no data left.
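The second idea might look something like this hypothetical stand-in (not the actual django.test.client.FakePayload): an over-read drains whatever is left and then returns empty bytes, like a real socket-backed stream.

```python
from io import BytesIO

class FriendlyFakePayload:
    # Hypothetical test payload that, instead of asserting on an
    # over-read, just returns b'' once the data is exhausted.
    def __init__(self, content):
        self._io = BytesIO(content)
        self._len = len(content)

    def read(self, num_bytes=None):
        if num_bytes is None:
            num_bytes = self._len
        return self._io.read(num_bytes)

p = FriendlyFakePayload(b'foo=bar')
print(p.read(9999))  # b'foo=bar'
print(p.read(10))    # b''
```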

Thoughts?

> 2. Isn't the use of LimitBytes in MultiPartParser.parse() now redundant?

I didn't even know another such thing existed, though I remember
searching for one before writing LimitedStream. Now they just seem to
be duplicates, and it's a good idea to ditch LimitBytes (since
LimitedStream implements a readline() too) and move LimitedStream into
some common place. http.utils seems a good fit for it.

Re: HttpRequest.read() issues Graham Dumpleton 4/6/11 9:31 PM



By reading more than CONTENT_LENGTH you are actually violating the WSGI 1.0 (PEP 333) specification.

In WSGI 1.0.1 (PEP 3333), an additional restriction is placed on WSGI servers/gateways: they themselves must return a proper end sentinel when input is exhausted. That doesn't remove the requirement that WSGI applications must NOT read more than CONTENT_LENGTH, though.

So, the OP should not be trying to read more than CONTENT_LENGTH. User code isn't operating directly on WSGI here, though, but on Django. As such, Django should provide a limited stream to prevent the user from reading more than CONTENT_LENGTH, either by returning an empty end sentinel or, if it wanted to be pedantic, by raising an error.

Graham
Re: HttpRequest.read() issues Tom Christie 4/7/11 1:56 AM
> So, OP should not be trying to read more than CONTENT_LENGTH.

From the underlying stream, sure.  The question is whether it's okay to do on the HttpRequest object.  It's an issue because, now that HttpRequest exposes a file-like interface to the stream, some users are likely to do things like this:

request_data = json.load(request)

And expect that everything should work fine.
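That pattern works because json.load only needs an object with a .read() method; a quick illustration, using a BytesIO as a stand-in for a request that genuinely behaves like a file:

```python
import json
from io import BytesIO

# json.load just calls .read() on whatever it's given, so any
# well-behaved file-like object works; an HttpRequest would need
# to offer the same guarantee for this idiom to be safe.
fake_request = BytesIO(b'{"foo": "bar"}')
print(json.load(fake_request))  # {'foo': 'bar'}
```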

> Django should provide a limited stream to prevent user reading more
> than CONTENT_LENGTH either by returning an empty end sentinel,
> or perhaps if wanted to be pedantic, raise an error.

And it does currently provide a limited stream, in some cases, but for (very reasonable) performance reasons, only when it appears necessary.

> Now they just seem to be duplicates 
> and it's a good idea to ditch LimitBytes (since LimitedStream implements 
> a readline() too) and move it into some common place. http.utils seems a 
> good fit for it.

Moving LimitedStream to somewhere like http.utils sounds like a good plan to me.

> Thoughts?

I think HttpRequest definitely needs to ensure that it's safe to treat it like any other file-like object, now that it exposes the .read()/.readlines() methods, and at the moment it sounds like the behaviour of .read(BUFFER_SIZE) past the end of the input is not well defined.

1. It might be reasonable to defer the creation of HttpRequest._stream, but to _always_ wrap it in LimitedStream when it is created.  (I.e. HttpRequest._stream is a property that's only initialised when it's accessed.)  That'd presumably be a performance gain for any code path that _doesn't_ access .POST/.raw_post_data/.read, and a performance hit for anything that _does_.  In 99% of cases you'll be doing a .read() operation without specifying any length at all, so the performance hit will be the initial creation of the LimitedStream object; you won't actually have subsequent function calls that incur the extra layer of wrapping.

2. Ensure that the .read()/.readline() interfaces will always expose a LimitedStream, but avoid creating a limited stream for .POST and .raw_post_data by reading directly from the underlying stream and enforcing the CONTENT_LENGTH behavior explicitly in those cases.

3. Alter the LimitedStream wrapping behaviour to be more cautious.  The current behaviour is to wrap the stream in LimitedStream if that's known to be necessary.  It could instead wrap it in LimitedStream _unless_ that's known to be _unnecessary_.
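Option (1) might look roughly like the following sketch (all names illustrative; _Limiter is a trivial stand-in for LimitedStream, and this is not the actual WSGIRequest code):

```python
from io import BytesIO

class _Limiter:
    # Trivial stand-in for LimitedStream.
    def __init__(self, stream, limit):
        self._stream, self._left = stream, limit

    def read(self, size=-1):
        if self._left <= 0:
            return b''
        if size < 0 or size > self._left:
            size = self._left
        data = self._stream.read(size)
        self._left -= len(data)
        return data

class Request:
    # Sketch of option (1): _stream is created lazily, but is
    # always a limited stream once it exists.
    def __init__(self, environ):
        self.environ = environ
        self._stream_cache = None

    @property
    def _stream(self):
        if self._stream_cache is None:
            try:
                length = int(self.environ.get('CONTENT_LENGTH') or 0)
            except ValueError:
                length = 0
            self._stream_cache = _Limiter(self.environ['wsgi.input'], length)
        return self._stream_cache

    def read(self, *args):
        return self._stream.read(*args)

req = Request({'wsgi.input': BytesIO(b'abc'), 'CONTENT_LENGTH': '3'})
print(req.read(9999))  # b'abc'
print(req.read(5))     # b''
```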

Do any of those sound like reasonable options?

I guess the other possibility is:

4. Everything is fine as it is, and in real world cases WSGI servers are passing a limited stream on already.

It basically sounds like that's the case, but it's not obvious how much we can rely on that.

It may be that I need to file a separate bug for this:

HttpRequest.read(BUFFER_SIZE) behaviour is unclear.

and place a note on the other two that their resolution is dependent on this. Does that make sense to y'all?

Really appreciate your input on this,

  Tom
Re: HttpRequest.read() issues Tom Christie 4/7/11 2:37 AM
It occurs to me that (1) isn't a hit for accessing multipart POST requests, since we're already wrapping the underlying stream in LimitedStream there, and we'd also then be able to drop MultiPartParser's use of LimitBytes.
Re: HttpRequest.read() issues Graham Dumpleton 4/7/11 3:07 AM
Silly question. Where is the proof that using a limited stream is a performance issue? These sorts of things are never going to be the bottleneck and sounds a bit like premature optimisation to think that wrapping it with a length limiting stream is going to be an issue.

Graham
Re: HttpRequest.read() issues Tom Christie 4/7/11 3:31 AM
> Where is the proof that using a limited stream is a performance issue? These sorts of things are never going to be the bottleneck and sounds a bit like premature optimisation to think that wrapping it with a length limiting stream is going to be an issue.

There isn't any, so good point. :)

Even so, presumably we wouldn't want WSGIRequest to create a LimitedStream object for _every_ single incoming HTTP request, regardless of whether it actually has a payload or not?

In that case deferring the LimitedStream creation, but always ensuring that HttpRequest._stream is a LimitedStream rather than the underlying stream, would be a sensible thing to do.  It's fairly simple logic, and it's guaranteed to be safe.

I'd be happy to code up a patch for this, if we reckon that's a sensible approach.

Re: HttpRequest.read() issues Tom Christie 4/7/11 10:21 AM
Okay, since there's clearly an underlying issue here I've created a separate ticket for this and marked the other two tickets as duplicates of it.

So...

#15785 - HttpRequest.read(NUM_BYTES) can read beyond the end of wsgi.input stream.  (Violation of WSGI spec & under-defined behaviour) [1]

There is a patch (with tests) attached. [2]

* Changes WSGIRequest._stream to be a property that is (always) instantiated as a LimitedStream when first accessed.
* Removes some redundant code in HttpRequest and MultiPartParser.
* Fixes some minor bugs in tests/regressiontests/requests/tests.py
* Adds two tests for MultiPartParser to check graceful behaviour on truncated or empty multipart requests.
* Adds a test for TestClient request.read(LARGE_BUFFER) behaviour.

This fixes #15762 without having to alter any test client code, and includes the fixes for #15763.

This looks like a decent way to do things to me - it's less code than before, it's safer, and it's easy to understand.

I'd appreciate any eyeballs/kicking the tires etc...

Whaddya reckon?

  t.


NB. There's also some behaviour in MultiPartParser that could be adapted slightly to support streaming requests of unknown length.
Obviously that's mostly pointless right now, since WSGI won't support them anyway (yet?...).
<groks around, searches PEP 3333 and PEP 444 etc.>
Right, as far as I understand it, PEP 3333 doesn't address chunked requests, only chunked responses, so I guess this point is moot?
Re: HttpRequest.read() issues Ivan Sagalaev 4/7/11 12:35 PM
Graham Dumpleton:

> Silly question. Where is the proof that using a limited stream is a
> performance issue?

Last night I actually did test it :-). You're right: the difference in
performance is less than the statistical deviation between different
uploads over the network.

Tom Christie:


> Even so, presumably we wouldn't want WSGIRequest to create a
> LimitedStream object for _every_ single incoming HTTP request,
> regardless of if it actually has a payload or not?
>
> In that case deferring the LimitedStream creation, but always
> ensuring that HttpRequest._stream is a LimitedStream rather than the
> underlying stream would be a sensible thing to do.

And this one looks like another premature optimization :-). Creating a
single wrapper object is a negligible hit, but the code bloat from
making it lazy is very real. So I'm -1 on laziness.

> #15785 - HttpRequest.read(NUM_BYTES) can read beyond the end of
> wsgi.input stream.
>
> There is a patch (with tests) attached. [2]

Apart from the laziness issue your patch looks fine to me. Thanks!

Re: HttpRequest.read() issues Tom Christie 4/7/11 1:32 PM
> Last night I actually did test it :-). You're right the difference in
> performance is less than a statistical deviation between different
> uploads over network.


Nice work.  And point well made, Graham!

> Creating a single wrapper object is a negligible hit but the code bloat from making
> it lazy is very real. So I'm -1 on laziness.
>
> Apart from the laziness issue your patch looks fine to me. Thanks!

Great stuff.
New patch added to the ticket, just to keep everything easy.


Coolio, thanks for input peeps...

  t.