Depth crawling of a web site using Scrapy


prem

Jul 16, 2009, 8:11:19 AM
to scrapy-users
Hi ,

How can I achieve depth-based crawling using Scrapy? For example, if I
want to crawl google.com, Scrapy should give me all the links, then crawl
each of those links to extract their inner links in turn. This should
continue until no links are left to crawl.

Thanks
Prem.

Pablo Hoffman

Jul 16, 2009, 8:56:47 AM
to scrapy...@googlegroups.com
Is this what you're looking for?
http://doc.scrapy.org/ref/settings.html#scheduler-order
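
For illustration, a sketch of the relevant settings in a project's
settings.py (setting names follow the 2009-era docs; later Scrapy versions
replaced SCHEDULER_ORDER with DEPTH_PRIORITY and queue settings):

    # settings.py -- a sketch, assuming the 2009-era SCHEDULER_ORDER setting.
    # 'DFO' crawls depth-first (follow links as deep as possible first);
    # 'BFO' crawls breadth-first (finish each depth level before going deeper).
    SCHEDULER_ORDER = 'DFO'

    # Optionally cap how deep the crawl may go; 0 means no limit.
    DEPTH_LIMIT = 0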

doridori Jo

Jul 16, 2009, 2:28:54 PM
to scrapy...@googlegroups.com
My only gripe with this method is that it's extremely slow when crawling unknown sites, as it will go through all links.

Does Scrapy offer some sort of multithread_http_get_page(), so it doesn't have to go through all the links one by one?

Pablo Hoffman

Jul 16, 2009, 5:02:06 PM
to scrapy...@googlegroups.com
On Thu, Jul 16, 2009 at 11:28:54AM -0700, doridori Jo wrote:
> My only gripe with this method is that it's extremely slow when crawling
> unknown sites, as it will go through all links.

You can filter which links to follow (with regexes and other criteria) using
the CrawlSpider and Link Extractors.
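
A minimal sketch of such a spider (names and regexes are illustrative;
modern import paths are shown here, while 2009-era Scrapy kept these
classes under scrapy.contrib, e.g. SgmlLinkExtractor):

    # A sketch of link filtering with CrawlSpider; regexes are illustrative.
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class FilteredSpider(CrawlSpider):
        name = 'filtered'
        allowed_domains = ['example.com']
        start_urls = ['http://example.com/']

        rules = (
            # Follow only URLs matching the allow regexes; skip the deny ones.
            Rule(LinkExtractor(allow=[r'/category/', r'/item/\d+'],
                               deny=[r'/logout', r'/search\?']),
                 callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            # Record the URL of every page that passed the filters.
            yield {'url': response.url}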

> Does Scrapy offer some sort of multithread_http_get_page(), so it doesn't
> have to go through all the links one by one?

I don't follow. Can you provide an example?

Pablo.

doridori Jo

Jul 16, 2009, 5:55:40 PM
to scrapy...@googlegroups.com
In cURL + PHP, you can do multi_curl: simultaneously fetch a number of pages using a single process. So instead of fetching pages one by one, you can have several transfers working together.

Pablo Hoffman

Jul 16, 2009, 6:10:17 PM
to scrapy...@googlegroups.com
doridori,

This is how Scrapy works, except that it uses asynchronous programming and
non-blocking IO instead of threading, which is more efficient for highly
concurrent network applications (such as web crawlers).
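
Concurrency is controlled through settings rather than code, for example
(a sketch; exact setting names have varied across versions, and early
releases used CONCURRENT_REQUESTS_PER_SPIDER):

    # settings.py -- Scrapy keeps this many requests in flight at once on a
    # single thread, using Twisted's non-blocking IO; no manual threading.
    CONCURRENT_REQUESTS = 16
    CONCURRENT_REQUESTS_PER_DOMAIN = 8   # per-domain politeness cap
    DOWNLOAD_DELAY = 0                   # optional delay between requests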

For more info see:
http://twistedmatrix.com/projects/core/documentation/howto/async.html
http://jessenoller.com/2009/02/11/twisted-hello-asynchronous-programming/

Pablo.

doridori Jo

Jul 16, 2009, 6:15:36 PM
to scrapy...@googlegroups.com
Excellent!

prem

Jul 17, 2009, 1:09:51 AM
to scrapy-users
Hi Pablo,

Thanks for the quick response. But I would like to know whether Scrapy
supports depth crawling. For example, if I supply Scrapy a root URL as
input (google.com, say), can Scrapy give me all the links (i.e. the
anchor tags) on google.com, crawl those links, and then give me all the
links (the anchor tags) of the inner pages as well? Scrapy should
continue crawling the inner links until there are no more links left to
crawl.
I would be obliged if you could tell me how to achieve the above using
Scrapy, with some example code snippet.

Thanks
Prem

On Jul 16, 5:56 pm, Pablo Hoffman <pablohoff...@gmail.com> wrote:
> Is this what you're looking for?
> http://doc.scrapy.org/ref/settings.html#scheduler-order

Pablo Hoffman

Jul 17, 2009, 7:37:20 AM
to scrapy...@googlegroups.com
Yes, that's how Scrapy works. Take a look at the example project in the code
(in the examples/googledir directory). It can be used to scrape the entire
Google Directory:

http://directory.google.com/
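
In case you can't check out the example project, here is a minimal sketch
of a spider that keeps following links until none are left (not the
googledir code itself; names are illustrative and modern import paths are
used):

    # A sketch of a depth crawl that follows every link it finds.
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class FollowAllSpider(CrawlSpider):
        name = 'followall'
        allowed_domains = ['example.com']   # keep the crawl on one site
        start_urls = ['http://example.com/']

        # An empty LinkExtractor matches every link; follow=True makes the
        # spider crawl the pages it finds, recursively.
        rules = (
            Rule(LinkExtractor(), callback='parse_page', follow=True),
        )

        def parse_page(self, response):
            # Duplicate requests are filtered out by default, so the crawl
            # ends once every reachable link has been visited.
            yield {'url': response.url}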

Pablo.