
Newbie question: Why does read() method of urllib hang?


Andrew Ward

Feb 11, 2002, 12:58:18 PM
I can't help feeling I'm doing something stupid. Most of the time:

mystring=u.read()

works fine, but sometimes it just hangs. Is this simply because the remote
server is not responding? I would have thought that would cause
urllib.urlopen() to hang, not u.read().

Or have I made some fundamental mistake in error handling or failing to
clean up, or do I need threads, or...!

Any help greatly appreciated.

Andrew Ward
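[A likely explanation, for anyone reading along: urlopen() returns as soon as the connection is made and the response headers arrive, but read() blocks on the socket until the whole body is delivered, so a server that stalls mid-response hangs the read(), not the open. The urllib of this era had no timeout support at all; later Python releases added socket.setdefaulttimeout() and a per-call timeout argument. A minimal sketch in modern Python (the urlopen call is shown only as a comment, since it needs network access):]

```python
import socket
import urllib.request  # modern spelling of the urllib/urllib2 modules

# A global default timeout makes every socket created after this call
# (including the recv() behind read()) raise socket.timeout instead of
# blocking forever on a stalled server.
socket.setdefaulttimeout(10)

# Later releases also grew a per-call timeout on urlopen() itself:
#   urllib.request.urlopen('http://example.com/', timeout=10).read()
```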

Alan Runyan

Feb 11, 2002, 5:59:12 PM
> mystring=u.read()
>
> works fine, but sometimes it just hangs. Is this simply because the remote
> server is not responding? I would have thought that would cause
> urllib.urlopen() to hang, not u.read().

Andrew, what version of Python are you running? A friend of mine whom I am
trying to convert to Python ran into this exact problem. He was trying to
do an HTTP POST to a web page, which was assigning him cookies and
redirecting him (the Real World ;). urllib doesn't handle this very well at
all ;'(. He reported to me that urlopen() was hanging, so I gave it a go.
I'm using Python 2.1.2 and I could not reproduce this.

So what I attempted was to rewrite what he assumed urlopen() would do for
him, and now I am stuck. I'm not quite sure how cookies and redirects work
together. I know urllib2 gives you some more options, but I believe this is
*very* unintuitive; we really need examples. Here is my code; if someone
could take a look at it and see what I am trying to do, I would greatly
appreciate it.

-- snip! --
import urllib2, urllib, urlparse
from urllib2 import Request
import httplib

DEBUG = 1

class CookieHTTPRedirectHandler(urllib2.HTTPRedirectHandler,
                                urllib2.HTTPHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        if DEBUG:
            print 'was going to ' + req._Request__original + \
                str(req.headers)

        import pdb; pdb.set_trace()

        if headers.has_key('location'):
            newurl = headers['location']
        elif headers.has_key('uri'):
            newurl = headers['uri']
        else:
            print 'returning'
            return
        newurl = urlparse.urljoin(req.get_full_url(), newurl)

        # XXX Probably want to forget about the state of the current
        # request, although that might interact poorly with other
        # handlers that also use handler-specific request attributes

        response_headers = {}
        for head in headers.headers:
            cookie = 'Set-Cookie:'
            if head[:len(cookie)] == cookie:
                response_headers[cookie] = head[len(cookie)+1:]
        print 'redirect headers ' + str(response_headers)
        new = Request(newurl, req.get_data(), response_headers)

        if DEBUG:
            print 'redirected to ' + new._Request__original

        new.error_302_dict = {}
        if hasattr(req, 'error_302_dict'):
            if len(req.error_302_dict) > 10 or \
               req.error_302_dict.has_key(newurl):
                raise urllib2.HTTPError(req.get_full_url(), code,
                                        self.inf_msg + msg, headers, fp)
            new.error_302_dict.update(req.error_302_dict)
        new.error_302_dict[newurl] = newurl

        # Don't close the fp until we are sure that we won't use it
        # with HTTPError.
        fp.read()
        fp.close()
        print 'returning : ' + str(new.headers)
        return self.parent.open(new)

    def http_open(self, req):
        return self.do_open(httplib.HTTP, req)

class HTTPConnection:
    def __init__(self, url, request_data, headers):
        self._request = urllib2.Request(url, urllib.urlencode(request_data),
                                        {})
        self._director = urllib2.OpenerDirector()
        self._director.add_handler(CookieHTTPRedirectHandler())
        self._conn = self._director.open(self._request)

if __name__ == '__main__':
    url = 'http://www.winemag.com/buyingGuide/login.asp'
    req_data = {'LoginID': 'wine',
                'LoginPassword': 'enthusiast',
                'Submit': 'Login'}
    winemag = HTTPConnection(url, req_data, {})
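[For what it's worth, the cookie-plus-redirect dance attempted above was
later solved in the standard library itself: a cookie-aware opener collects
Set-Cookie headers and replays them across redirect hops automatically. A
minimal sketch in modern Python (http.cookiejar and urllib.request are the
later spellings of these modules; the login call is shown only as a
comment, since it needs network access):]

```python
import http.cookiejar
import urllib.request

# A CookieJar collects Set-Cookie headers from each response and
# replays them on follow-up requests -- including 302 redirect hops,
# which is exactly what the hand-rolled handler above tries to do.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(jar))

# opener.open('http://example.com/login', data=...) would now keep
# session cookies across the redirect without any custom handler code.
```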

Andrew Ward

Feb 12, 2002, 2:33:05 AM
> Andrew, what version of python are you running?

I am using 2.2 under Windows XP, but in my case, the read() is of a plain
old single page of html text, no cookies, no nothing complicated!

Yes, urlopen() is what I would expect to hang, rather than the read(), so
I'm very puzzled.

Andrew Ward

dwfreeze

Feb 12, 2002, 8:50:54 AM
"Andrew Ward" <wine...@spamcop.net> wrote in message news:<Ry3a8.40197$bP3.337803@NewsReader>...

Standard "Me too" post follows:

>>> sys.version
'2.1.2 (#31, Jan 15 2002, 17:28:11) [MSC 32 bit (Intel)]'
>>> import urllib
>>> f = urllib.urlopen('http://slashdot.org').read()

After I execute the urlopen, the IDLE shell is definitely "hung".

Matthias Huening

Feb 12, 2002, 9:15:33 AM
dwfr...@yahoo.com (dwfreeze) wrote in
news:8f070afd.02021...@posting.google.com:

>
> Standard "Me too" post follows:
>
>>>> sys.version
> '2.1.2 (#31, Jan 15 2002, 17:28:11) [MSC 32 bit (Intel)]'
>>>> import urllib
>>>> f = urllib.urlopen('http://slashdot.org').read()
>
> After I execute the urlopen, the IDLE shell is definitely "hung".

Hmm, I can't reproduce this. Everything works just fine.

>>> import sys
>>> import urllib
>>> sys.version
'2.1.1 (#20, Jul 26 2001, 11:38:51) [MSC 32 bit (Intel)]'


>>> f = urllib.urlopen('http://slashdot.org').read()

>>> print f
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML><HEAD><TITLE>Slashdot: News for nerds, stuff that matters</TITLE>
...
etc.etc.

Andrew Ward

Feb 12, 2002, 4:30:51 PM
It's very intermittent for me. I can leave it running for hours in the
morning and it's fine; in the evenings (UK time) it usually fails fairly
soon.

I'm rewriting the url stuff as individual httplib calls to see if I can pin
down the exact point of failure, although I assume it will still hang, and
that it will also be at the read() (I imagine that the urllib read() maps
directly onto the httplib read() method).

Andrew Ward

Fredrik Lundh

Feb 17, 2002, 4:01:01 AM
Andrew Ward wrote:
> It's very intermittent for me. I can leave it running for hours in the
> morning and it's fine, in the evenings (UK time) it usually does fail fairly
> soon.
>
> I'm rewriting the url stuff as individual httplib calls to see if I can pin
> down the exact point of failure, although I assume it will still hang, and
> that it will also be at the read() (I imagine that the urllib read() maps
> directly to the http read() method).

did you get anywhere on this one?

I just realized that a program that has been running
successfully for several years (as a cronjob) recently
started hanging about 99% of the time it's run, after
we upgraded the linux version on that box.

it also hangs when you run it from the command line,
under both 1.5.2 and 2.0.1 -- and when I press control-c,
I find that it's stuck on this line:

oldcontent = file.read()

where "file" is an urllib stream.

and to confuse things even more, if I fire up an interpreter
and type in the corresponding urllib calls, it never
ever hangs.

bug in linux?

</F>


Andrew Ward

Feb 18, 2002, 2:37:53 AM
> did you get anywhere on this one?

Sort of. I replaced the urllib stuff with the individual httplib statements
and so far, I haven't seen it hang.

Too soon to be 100% sure, really, but looks promising.

Andrew Ward
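[The "individual httplib statements" Andrew describes would look roughly
like this, in a minimal modern-Python sketch (http.client is the later
spelling of httplib; the host is only a placeholder, and the explicit
timeout is the part that turns a hang into an exception):]

```python
import http.client

# Breaking urlopen() into its httplib steps makes each phase visible --
# connect, send the request, read the response -- and lets a timeout
# bound them all, so a stalled server raises instead of hanging.
conn = http.client.HTTPConnection('example.com', timeout=10)
# conn.request('GET', '/')          # send the request
# body = conn.getresponse().read()  # this is the step that used to hang
# conn.close()
```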
