> Dear All,
>
> I recently stumbled upon gevent (btw: amazing tool!), and I am trying
> to modify the examples (concurrent_download.py) to suit my crawler
> application. I have had some success in doing this, and was able to
> fetch around 200 pages without any problems (at times).
>
> However, at times, I get the following error:
>
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/dist-packages/gevent/greenlet.py", line 390, in run
>     result = self._run(*self.args, **self.kwargs)
>   File "multithread_ge.py", line 45, in print_head
>     content = urllib2.urlopen(url).read()
>   File "/usr/lib/python2.7/socket.py", line 351, in read
>     data = self._sock.recv(rbufsize)
>   File "/usr/lib/python2.7/httplib.py", line 541, in read
>     return self._read_chunked(amt)
>   File "/usr/lib/python2.7/httplib.py", line 601, in _read_chunked
>     value.append(self._safe_read(chunk_left))
>   File "/usr/lib/python2.7/httplib.py", line 649, in _safe_read
>     raise IncompleteRead(''.join(s), amt)
> IncompleteRead: IncompleteRead(3206 bytes read, 4978 more expected)
> <Greenlet at 0xd5f870: print_head('http://www.crawledwebsite.com/259')> failed with IncompleteRead
This looks like the remote end is closing the connection before the
full chunked response has arrived. It doesn't look like a gevent issue.
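If the server is just flaky, one client-side workaround is to catch
httplib's IncompleteRead and retry. A minimal sketch (the fetch()
helper below is hypothetical, and assumes a simple retry is acceptable
for your crawler):

import urllib2
from httplib import IncompleteRead

def fetch(url, retries=3):
    # retry a few times when the server truncates a chunked response
    for attempt in range(retries):
        try:
            return urllib2.urlopen(url).read()
        except IncompleteRead as e:
            # e.partial holds the bytes received before the drop
            if attempt == retries - 1:
                return e.partial  # give up: return the truncated body
    return None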
--
Cheers
Ralf
I believe that using urllib is a bad choice for crawling, for the
simple reason that it doesn't support HTTP/1.1 persistent connections.
It means your program is going to create a new TCP connection for
every request, which is damn slow.
Other choices are:
- httplib2
- Python's httplib (a bit too low-level in my opinion)
- geventhttpclient: https://github.com/gwik/geventhttpclient, I wrote it ;) (see the sketch after this list)
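For reference, here is a minimal sketch of fetching a page directly
with geventhttpclient, following its README; the exact API may differ
between versions, so treat the calls below as an assumption to verify:

from geventhttpclient import HTTPClient
from geventhttpclient.url import URL

url = URL('http://www.python.org/')
client = HTTPClient.from_url(url)   # keeps the TCP connection alive (HTTP/1.1)
response = client.get(url.request_uri)
print('status: %d' % response.status_code)
body = response.read()
client.close()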
urllib2 and httplib2 are both based on Python's httplib. If you want to
use one of them, you might want to try geventhttpclient's
monkey patching of httplib.
The monkey patching replaces httplib's parser, which I found a bit
buggy. If you are using gevent 0.13 with patch_all, you are already
using a different parser, the one from libevent.
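For the patching route, the setup might look like this (a sketch;
patching before importing urllib2 is the safe order, since urllib2
builds on httplib):

from gevent import monkey
monkey.patch_all()                 # cooperative sockets for gevent

import geventhttpclient.httplib
geventhttpclient.httplib.patch()   # swap in geventhttpclient's HTTP parser

import urllib2                     # now served by the patched httplib
data = urllib2.urlopen('http://www.python.org/').read()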
My (biased) choice would be to use gevent 1.0b1 and geventhttpclient.
HTH
Antonin
Antonin AMAND:
> The monkey patching replaces httplib's parser which I found a bit
> buggy.
Hmm. Did you submit that to the Python stdlib people?
--
Matthias Urlichs
=====================================
import gevent
from gevent import monkey
monkey.patch_all()

import geventhttpclient.httplib
geventhttpclient.httplib.patch()

import urllib2

urls = ['http://www.python.org']

def print_head(url):
    print('Starting %s' % url)
    data = urllib2.urlopen(url).read()
    print('%s: %s bytes: %r' % (url, len(data), data[:150]))
    print('done with %s' % url)
    return data

jobs = [gevent.spawn(print_head, url) for url in urls]
gevent.joinall(jobs)

# collect results
h = [job.value for job in jobs]
print('%d results collected' % len(h))

# write each result to its own file
for n, data in enumerate(h):
    with open('k' + str(n) + '.txt', 'w') as myfile:
        myfile.write(str(data))
=====================================
Is this correct?
Many thanks for your time.
regards,
James WH.