> Dear All,
>
> I recently stumbled upon gevent (btw: amazing tool!), and I am trying
> to modify the examples (concurrent_download.py) to suit my crawler
> application. I have had some success in doing this, and was able to
> fetch around 200 pages without any problems (at times).
>
> However, at times, I get the following error:
>
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/dist-packages/gevent/greenlet.py", line 390, in run
>     result = self._run(*self.args, **self.kwargs)
>   File "multithread_ge.py", line 45, in print_head
>     content = urllib2.urlopen(url).read()
>   File "/usr/lib/python2.7/socket.py", line 351, in read
>     data = self._sock.recv(rbufsize)
>   File "/usr/lib/python2.7/httplib.py", line 541, in read
>     return self._read_chunked(amt)
>   File "/usr/lib/python2.7/httplib.py", line 601, in _read_chunked
>     value.append(self._safe_read(chunk_left))
>   File "/usr/lib/python2.7/httplib.py", line 649, in _safe_read
>     raise IncompleteRead(''.join(s), amt)
> IncompleteRead: IncompleteRead(3206 bytes read, 4978 more expected)
> <Greenlet at 0xd5f870: print_head('http://www.crawledwebsite.com/259')> failed with IncompleteRead
This looks like the remote end is closing the connection before the
full chunked response has arrived. It doesn't look like a gevent issue.
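If the server is just flaky, one client-side workaround is to catch
httplib's IncompleteRead and retry. A minimal sketch (the fetch()
helper below is hypothetical, and assumes a simple retry is acceptable
for your crawler):

import urllib2
from httplib import IncompleteRead

def fetch(url, retries=3):
    # retry a few times when the server truncates a chunked response
    for attempt in range(retries):
        try:
            return urllib2.urlopen(url).read()
        except IncompleteRead as e:
            # e.partial holds the bytes received before the drop
            if attempt == retries - 1:
                return e.partial  # give up: return the truncated body
    return None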
--
Cheers
Ralf
I believe that using urllib is a bad choice for crawling, for the
simple reason that it doesn't support HTTP/1.1 persistent connections.
It means your program is going to create a new TCP connection for
every request, which is damn slow.
Other choices are:
- httplib2
- Python's httplib (a bit too low-level in my opinion)
- geventhttpclient: https://github.com/gwik/geventhttpclient, I wrote it ;) (see the sketch after this list)
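For reference, here is a minimal sketch of fetching a page directly
with geventhttpclient, following its README; the exact API may differ
between versions, so treat the calls below as an assumption to verify:

from geventhttpclient import HTTPClient
from geventhttpclient.url import URL

url = URL('http://www.python.org/')
client = HTTPClient.from_url(url)   # keeps the TCP connection alive (HTTP/1.1)
response = client.get(url.request_uri)
print('status: %d' % response.status_code)
body = response.read()
client.close()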
urllib2 and httplib2 are both based on Python's httplib. If you want to
use one of them, you might want to try geventhttpclient's
monkey patching of httplib.
The monkey patching replaces httplib's parser, which I found a bit
buggy. If you are using gevent 0.13 with patch_all, you are already
using a different parser, the one from libevent.
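For the patching route, the setup might look like this (a sketch;
patching before importing urllib2 is the safe order, since urllib2
builds on httplib):

from gevent import monkey
monkey.patch_all()                 # cooperative sockets for gevent

import geventhttpclient.httplib
geventhttpclient.httplib.patch()   # swap in geventhttpclient's HTTP parser

import urllib2                     # now served by the patched httplib
data = urllib2.urlopen('http://www.python.org/').read()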
My (biased) choice would be to use gevent 1.0b1 and geventhttpclient.
HTH
Antonin
Antonin AMAND:
> The monkey patching replaces httplib's parser which I found a bit
> buggy.
Hmm. Did you submit that to the Python stdlib people?
--
Matthias Urlichs
=====================================
import gevent
from gevent import monkey
monkey.patch_all()

import geventhttpclient.httplib
geventhttpclient.httplib.patch()

import urllib2

urls = ['http://www.python.org']

def print_head(url):
    print('Starting %s' % url)
    data = urllib2.urlopen(url).read()
    print('%s: %s bytes: %r' % (url, len(data), data[:150]))
    print('done with %s' % url)
    return data

jobs = [gevent.spawn(print_head, url) for url in urls]
gevent.joinall(jobs)

# collect results
h = [job.value for job in jobs]
print('%d results collected' % len(h))

# write each result to its own file
for n, data in enumerate(h):
    with open('k' + str(n) + '.txt', 'w') as myfile:
        myfile.write(str(data))
=====================================
Is this correct?
Many thanks for your time.
regards,
James WH.