I have only just learned about gevent and want to ask a very basic question. I want to use gevent to build a web spider. The whole crawling process is divided into three parts:

def visit(url)     # visits a URL and returns the HTML of the page
def parse(html)    # uses BeautifulSoup to parse the HTML from visit(url) and returns the URLs of the desired items
def download(url)  # downloads each URL returned by parse

The official documentation explains very well to beginners how to use gevent for single-step tasks: spawn the greenlets, then call joinall. But my problem has three steps, and using that pattern (spawn and joinall) at each step makes the whole pipeline block. For example, say I have 10 URLs to visit at the beginning. Calling joinall blocks until all 10 pages have been visited, and the same happens again at the parsing step, even though there is no need to wait until every page has been fetched or parsed. That is clearly wasteful. We could visit pages and, as soon as we have the HTML for one, start parsing it and then downloading.

Is there a way to do this? Thanks in advance.
+1 on queue, but for some reasonable level of concurrency (20-50) you'd
probably be alright even if all workers were hitting the same host.
So, the way I'd implement this is to have a global queue, spawn some workers
that in a loop read a url from the queue, download and parse the content from
that url, and then push back the new urls to the queue. Termination condition
would be not getting a url from the queue for 10 seconds or something.
m
--
Matt Billenstein
ma...@vazor.com
http://www.vazor.com/