Re: [gevent] Basic question about using gevent to crawl web pages

Shaun Lindsay

Jan 18, 2013, 2:32:05 PM
to gev...@googlegroups.com
You can spawn greenlets from within greenlets.

If you express your crawler recursively (visit/parse a url, then visit/parse all the urls within that, etc etc), you can replace your recursions with greenlet spawns.  You won't need to join until the very end.

Don't get hung up on trying to parallelize within each step -- try to parallelize across the steps.
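
To illustrate the idea of replacing recursion with greenlet spawns, here is a minimal sketch. It assumes a hypothetical in-memory FAKE_WEB dict in place of real HTTP requests, and the names crawl, crawl_all, and seen are made up for this example. A gevent.pool.Group tracks every greenlet, so the only join happens once at the very end.

```python
import gevent
from gevent.pool import Group

# Hypothetical stand-in for the web: each "url" maps to the links on its page.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": [],
    "d": ["a"],
}

def visit(url):
    gevent.sleep(0)  # yield to other greenlets, as a real network call would
    return FAKE_WEB.get(url, [])

def parse(html):
    # In this sketch the "html" is already the list of links.
    return list(html)

def crawl(url, group, seen):
    # Instead of recursing into each link, spawn a greenlet for it.
    for link in parse(visit(url)):
        if link not in seen:
            seen.add(link)
            group.spawn(crawl, link, group, seen)

def crawl_all(start_urls):
    group = Group()
    seen = set(start_urls)
    for url in start_urls:
        group.spawn(crawl, url, group, seen)
    group.join()  # the single join, at the very end
    return seen
```

Children are added to the group before their parent finishes, so the group only becomes empty once the whole crawl is done.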

On Fri, Jan 18, 2013 at 8:16 AM, Cen Wang <iwar...@gmail.com> wrote:
I just learned about gevent and want to ask a very basic question here. I want to use gevent to write a web spider. The whole crawling process is divided into three parts.

def visit(url): ...     # visits a url and returns the html of the page

def parse(html): ...    # uses beautifulsoup to parse the html from visit(url) and returns the urls of desired items

def download(url): ...  # uses the results returned from parse to download something

My question: the official documentation explains very well to beginners how to use gevent for single-step tasks, which is to use spawn and then joinall. But my problem has three steps, and using this pattern (spawn and joinall) at each step will make it block. For example, I have 10 urls to visit at the beginning. Using joinall will block until all pages are visited. Likewise, there is no need to wait until all pages are parsed before downloading. That seems wasteful. We could visit pages and, as soon as we get the html, start parsing and then downloading.

So are there any answers? Thanks in advance.

Harry Waye

Jan 18, 2013, 4:05:07 PM
to gev...@googlegroups.com
Is that not a slightly unwieldy way of going about this?  You'll still want to deal with host concurrency, for which you could use a thread pool, but then you're at risk of deadlocking when you reach pool capacity with this recursive style.

I think it makes more sense here to maintain a queue of urls and a pool for each host you come across.  Maybe have a look at http://sdiehl.github.com/gevent-tutorial/ for some example usage.
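
As a rough sketch of the "queue of urls plus a pool per host" idea: the code below assumes a hypothetical download function and an arbitrary PER_HOST_LIMIT of 2, neither of which comes from this thread. It uses gevent.queue.Queue for the urls and one gevent.pool.Pool per host, created lazily.

```python
from collections import defaultdict
from urllib.parse import urlsplit

import gevent
from gevent.pool import Pool
from gevent.queue import Queue, Empty

PER_HOST_LIMIT = 2  # assumed cap on simultaneous requests to one host

def download(url, done):
    # Hypothetical stand-in for a real HTTP request.
    gevent.sleep(0.01)
    done.append(url)

def crawl(urls):
    url_queue = Queue()
    for url in urls:
        url_queue.put(url)

    # One pool per host, created the first time that host is seen.
    pools = defaultdict(lambda: Pool(PER_HOST_LIMIT))
    done = []
    while True:
        try:
            url = url_queue.get_nowait()
        except Empty:
            break
        host = urlsplit(url).netloc
        # Pool.spawn blocks this loop while that host's pool is full.
        pools[host].spawn(download, url, done)

    for pool in pools.values():
        pool.join()
    return done, pools
```

One limitation of this simple dispatcher: while it waits for a slot on a busy host, urls for idle hosts further back in the queue also wait.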
--
Harry Waye, Co-founder/CTO

Follow us on Twitter: @arachnys

---
Arachnys Information Services Limited is a company registered in England & Wales. Company number: 7269723. Registered office: 40 Clarendon St, Cambridge, CB1 1JX.

Matt Billenstein

Jan 18, 2013, 5:41:04 PM
to gev...@googlegroups.com
On Fri, Jan 18, 2013 at 09:05:07PM +0000, Harry Waye wrote:
> Is that not a slightly unwieldy way of going about this? You'll still
> want to deal with host concurrency, for which you could use a thread pool,
> but then you're at risk of deadlocking when you reach pool capacity with
> this recursive style.
> I think it makes more sense here to maintain a queue of urls and a pool
> for each host you come across. Maybe have a look
> at http://sdiehl.github.com/gevent-tutorial/ for some example usage.

+1 on queue, but for some reasonable level of concurrency (20-50) you'd
probably be alright even if all workers were hitting the same host.

So, the way I'd implement this is to have a global queue, spawn some workers
that in a loop read a url from the queue, download and parse the content from
that url, and then push back the new urls to the queue. Termination condition
would be not getting a url from the queue for 10 seconds or something.
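
A minimal sketch of that worker loop, under some assumptions: FAKE_WEB and fetch_links are hypothetical stand-ins for real downloading and parsing, and the idle timeout is shortened from the suggested 10 seconds so the example finishes quickly. Each worker exits when queue.get(timeout=...) raises Empty.

```python
import gevent
from gevent.queue import Queue, Empty

IDLE_TIMEOUT = 0.2  # the post suggests about 10 seconds; shortened here

# Hypothetical stand-in for the web: each "url" maps to the links on its page.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": [],
    "d": ["a"],
}

def fetch_links(url):
    # Stand-in for "download and parse the content from that url".
    gevent.sleep(0.01)
    return FAKE_WEB.get(url, [])

def worker(url_queue, seen):
    while True:
        try:
            url = url_queue.get(timeout=IDLE_TIMEOUT)
        except Empty:
            return  # nothing arrived within the timeout: assume we are done
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                url_queue.put(link)  # push new urls back onto the queue

def run(start_urls, num_workers=3):
    url_queue = Queue()
    seen = set(start_urls)
    for url in start_urls:
        url_queue.put(url)
    workers = [gevent.spawn(worker, url_queue, seen) for _ in range(num_workers)]
    gevent.joinall(workers)
    return seen
```

The timeout has to be longer than the slowest single fetch, otherwise idle workers may give up while another worker is still about to enqueue new urls.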

m

--
Matt Billenstein
ma...@vazor.com
http://www.vazor.com/

Matthias Urlichs

Jan 19, 2013, 3:50:12 AM
to gev...@googlegroups.com
Hi,

Matt Billenstein:
> Termination condition would be not getting a url from the queue for 10
> seconds or something.
>
Termination condition would be no active workers and an empty queue.

You can easily notice that condition with a global state variable which
each task increments after taking a value from the queue, and decrements
again just before taking the next value.

Trivially:

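from gevent.queue import Queue

class Done(object):
    """Sentinel passed through the queue to tell every worker to stop."""

queue = Queue()  # assumed: one gevent queue shared by all workers
count = 0        # number of workers currently processing a url

def worker():
    global count
    while True:
        url = queue.get()
        if url is Done:
            queue.put(Done)  # pass the sentinel on to the next worker
            break
        count += 1

        try:
            process(url)  # assumed to put any newly found urls on the queue
        except Exception as err:
            report_error(url, err)
        finally:
            count -= 1
        # No worker is busy and nothing is queued: the crawl is finished.
        if count == 0 and queue.empty():
            queue.put(Done)
            break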
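
A related option, which is not what the snippet above does, is to let gevent.queue.JoinableQueue do the counting: put() increments its count of unfinished tasks and task_done() decrements it, so join() returns once the queue is empty and no worker is busy. A minimal sketch, assuming a hypothetical FAKE_WEB dict in place of real fetching and parsing:

```python
import gevent
from gevent.queue import JoinableQueue

# Hypothetical stand-in for the web: each "url" maps to the links on its page.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": [],
    "d": ["a"],
}

def crawl(start_urls, num_workers=3):
    tasks = JoinableQueue()
    seen = set(start_urls)

    def worker():
        while True:
            url = tasks.get()
            try:
                gevent.sleep(0)  # stand-in for the network round trip
                for link in FAKE_WEB.get(url, []):
                    if link not in seen:
                        seen.add(link)
                        tasks.put(link)  # queued before task_done() below
            finally:
                tasks.task_done()

    for url in start_urls:
        tasks.put(url)
    workers = [gevent.spawn(worker) for _ in range(num_workers)]
    tasks.join()             # returns when every queued url has been processed
    gevent.killall(workers)  # workers are blocked in get(); stop them
    return seen
```

Because new links are put on the queue before task_done() is called for their parent, the unfinished count cannot reach zero while work remains.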

--
-- Matthias Urlichs

Matt Billenstein

Jan 19, 2013, 4:58:59 AM
to gev...@googlegroups.com
Ah, yes, that is better.

m

--
