Requesting Chunked Tornado Pages w/ urllib2

Jeffrey Jenkins

Sep 21, 2011, 5:32:39 PM
to Tornado Web Server
Hi,

The testing system for the project I'm trying to convert to Tornado
uses urllib2, and it's not feasible for me to change it. The issue
I'm having is that when I return chunked responses, httplib raises
an "IncompleteRead" exception.

I've managed to get a minimal application that reproduces the error.
I tried the client on Python 2.7 and 3.2 and got two different
errors. The server is running on 2.7 in both cases (as that's the
version I'll be running in production). It's also the case that if I
pass a header that makes the handler write out a ton of filler before
the callback runs, everything works OK.

Code for both clients, errors, and the application below. Any help
would be appreciated!

-Jeff

====================

# server.py
import logging
import tornado.ioloop
import tornado.options
import tornado.web

class MainHandler(tornado.web.RequestHandler):
    @tornado.web.asynchronous
    def get(self):
        def request_callback():
            self.flush()
            self.finish()

        # if the magic header is passed, write out 500k of "A"s first
        if 'Magic_urllib_fix' in self.request.headers:
            self.write('A' * 500000)
            self.flush()

        # write first batch
        self.write('B' * 12083)
        self.flush()

        tornado.ioloop.IOLoop.instance().add_callback(request_callback)

    @tornado.web.asynchronous
    def post(self):
        return self.get()


application = tornado.web.Application([
    (r"/.*", MainHandler),
], debug=True)

if __name__ == "__main__":
    import sys
    tornado.options.parse_command_line()
    logging.info("starting tornado web server")
    application.listen(int(sys.argv[1]))
    tornado.ioloop.IOLoop.instance().start()


==============
# python3.2 client (python 2.7 server)

import urllib.request

def call(with_fix):
    op = urllib.request.build_opener()
    url = 'http://local.shopwiki.com:8000/'
    if with_fix:
        op.addheaders = [('MAGIC_URLLIB_FIX', 't')]
    op.open(url).read()

call(with_fix=True)
call(with_fix=False)

# Output:
# Traceback (most recent call last):
#   File "/usr/local/Cellar/python3/3.2/lib/python3.2/http/client.py", line 529, in _read_chunked
#     chunk_left = int(line, 16)
# ValueError: invalid literal for int() with base 16: ''
#
# During handling of the above exception, another exception occurred:
#
# Traceback (most recent call last):
#   File "curl.py", line 10, in <module>
#     call(with_fix=True)
#   File "curl.py", line 8, in call
#     op.open(url).read()
#   File "/usr/local/Cellar/python3/3.2/lib/python3.2/http/client.py", line 489, in read
#     return self._read_chunked(amt)
#   File "/usr/local/Cellar/python3/3.2/lib/python3.2/http/client.py", line 534, in _read_chunked
#     raise IncompleteRead(b''.join(value))
# http.client.IncompleteRead: IncompleteRead(512083 bytes read)

============================
# python 2.7 client (python 2.7 server)

import urllib2

def call(with_fix):
    op = urllib2.build_opener()
    url = 'http://local.shopwiki.com:8000/'
    if with_fix:
        op.addheaders = [('MAGIC_URLLIB_FIX', 't')]
    op.open(url).read()

call(with_fix=True)
call(with_fix=False)

# Output:
# Traceback (most recent call last):
#   File "curl.py", line 11, in <module>
#     call(with_fix=False)
#   File "curl.py", line 8, in call
#     op.open(url).read()
#   File "/usr/local/Cellar/python/2.7.1/lib/python2.7/socket.py", line 351, in read
#     data = self._sock.recv(rbufsize)
#   File "/usr/local/Cellar/python/2.7.1/lib/python2.7/httplib.py", line 533, in read
#     return self._read_chunked(amt)
#   File "/usr/local/Cellar/python/2.7.1/lib/python2.7/httplib.py", line 576, in _read_chunked
#     raise IncompleteRead(''.join(value))
# httplib.IncompleteRead: IncompleteRead(3891 bytes read)

Jeffrey Jenkins

Sep 21, 2011, 5:45:25 PM
to Tornado Web Server
This seems to be fixed in 2.1, which was released today. If someone
knows exactly which bug fix made it work, that would make me feel a
bit more comfortable. Nothing in the release notes jumped out at me.

Ben Darnell

Sep 22, 2011, 1:17:24 AM
to python-...@googlegroups.com
Interesting.  I think the underlying bug may still be there, it's just gotten harder to trigger thanks to the addition of a new fast path for IOStream.write.  There's a race condition involving HTTPConnection._on_write_complete and finish.  In 2.1 I'm not sure if it's still possible to run into this bug, but you're definitely safe if you pass a callback to flush() and don't flush or finish again until that callback has been run.  

I know this is just a reduced example for testing purposes, but just for the record there's no need to call flush in request_callback since there has been no write since the last flush.  If you know when you're writing the last chunk that it will be the last, it's slightly more efficient to just call finish() instead of flush() and then finish() (since this increases the chance that the chunked-encoding footer will be included in the same packet as the last of the data).
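To make that concrete, here's a rough sketch of the safe pattern (the handler and callback names are just for illustration, not from Jeff's code; assumes Tornado 2.x):

class SafeHandler(tornado.web.RequestHandler):
    @tornado.web.asynchronous
    def get(self):
        self.write('B' * 12083)
        # Don't finish until the flush callback confirms this chunk has
        # been handed off; this sidesteps the race between
        # HTTPConnection._on_write_complete and finish.
        self.flush(callback=self._on_flushed)

    def _on_flushed(self):
        # The previous flush has completed, so it's now safe to finish.
        self.finish()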

-Ben

Frank Smit

Sep 22, 2011, 2:26:46 AM
to python-...@googlegroups.com
I had an IncompleteRead exception when I was using Requests to download
a PDF file after a POST request with chunked encoding. This piece of
code fixed it:

import httplib

def patch_http_response_read(func):
    def inner(*args):
        try:
            return func(*args)
        except httplib.IncompleteRead, e:
            # swallow the error and return whatever was read so far
            return e.partial
    return inner

httplib.HTTPResponse.read = patch_http_response_read(httplib.HTTPResponse.read)

I got it from: http://bobrochel.blogspot.com/2010/11/bad-servers-chunked-encoding-and.html

Ben Darnell

Sep 22, 2011, 11:28:01 AM
to python-...@googlegroups.com
This masks the error, but the problem is still there, so I wouldn't recommend it - it's the server that's buggy, not httplib.  If you swallow this exception you'll have no way of knowing whether you got the whole file or not.  
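If you really must intercept the exception, at least surface the failure instead of hiding it. A rough sketch (the helper name is just for illustration):

import httplib
import logging

def read_fully(response):
    # Return the body, but log and re-raise on a short chunked read
    # instead of silently handing back partial data.
    try:
        return response.read()
    except httplib.IncompleteRead, e:
        logging.error("short read: only %d bytes received", len(e.partial))
        raise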

-Ben

Jeffrey Jenkins

Sep 22, 2011, 1:19:15 PM
to Tornado Web Server
I had tried that "fix" after finding it online, but I wasn't getting
all of the data.

I think I put the extra flush in when I was messing around to see what
did/didn't produce the behaviour. I'll double-check my code.

If I'm using the flush callback, what would be the idiomatic way to do
it? I would prefer not to put the rest of my code into the flush
callback, since I'd like it to execute while the first flush is
happening. The first thing that comes to mind is putting the finish()
into a callback of the final flush() (since I assume flushes happen
serially).

Ben Darnell

Sep 22, 2011, 1:50:55 PM
to python-...@googlegroups.com
Flush callbacks were originally added for flow control, so the idiomatic way is to put the rest of the code into the flush callback. The flush callback will normally be called almost immediately (when the data is passed from IOStream to the kernel), and will only be delayed when the socket buffers are full, in which case you are likely to be better off waiting than pumping more data into the IOStream buffer.

If you don't want this flow control, it would work to just use a flush callback on the final flush and then call finish (but that would be very non-idiomatic; in the absence of this bug you'd just call finish immediately after writing the last of the data).

I've just checked in a fix; it turns out to be a one-line change: https://github.com/facebook/tornado/commit/4b346bdde80c1e677ca0e235e04654f8d64b365c
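To sketch what that flow-controlled style looks like (the handler, method names, and chunk data are just for illustration; assumes Tornado 2.x):

class StreamingHandler(tornado.web.RequestHandler):
    @tornado.web.asynchronous
    def get(self):
        self.chunks = iter(['B' * 12083, 'C' * 12083])
        self.send_next_chunk()

    def send_next_chunk(self):
        # Write one chunk, then wait until it has been handed to the
        # kernel before producing the next; finish only after the last
        # flush has completed.
        try:
            chunk = next(self.chunks)
        except StopIteration:
            self.finish()
            return
        self.write(chunk)
        self.flush(callback=self.send_next_chunk)

Each chunk is only generated after the previous one has left the IOStream buffer, so a slow client can't force the server to buffer unbounded amounts of data.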

-Ben

Jeffrey Jenkins

Sep 23, 2011, 7:40:44 AM
to Tornado Web Server
Great, thanks! I'll try going with 2.1, and if I run into the issue
again I'll just apply that patch! Thank you for being so responsive!
