Confused about Scrapy "depth-first" or "breadth-first" order


Vincent Férotin

Sep 15, 2015, 9:03:18 AM
to scrapy...@googlegroups.com
Hi, Scrapy community!


I am relatively new to Scrapy, and I am confused by its crawl order.
As stated in the documentation (
http://doc.scrapy.org/en/1.0/faq.html#does-scrapy-crawl-in-breadth-first-or-depth-first-order
),
Scrapy is meant to crawl in depth-first order by default.
But, as far as I can tell, this is not how it actually behaves.

For a given scraping project, I need Scrapy to crawl URLs in
depth-first order.
I observed that this does not happen as expected.
To check, or at least to illustrate the behavior I observed,
I created a small scraping project here:
https://github.com/vincent-ferotin/scraping-github

This project crawls GitHub and the trees of a few given projects,
and records the order in which requests and responses are processed.
For details, please refer to its README (readable directly at the
project URL above).

I illustrate the results as images, for both request and response order,
with two configurations (the default, said to be "depth-first" order, and
the other for "breadth-first").
For "depth-first" order, the request order is:
https://raw.githubusercontent.com/vincent-ferotin/scraping-github/master/tree/github-tree-requests-depth_priority_0.png
and the response order is:
https://raw.githubusercontent.com/vincent-ferotin/scraping-github/master/tree/github-tree-responses-depth_priority_0.png
For "breadth-first" order, the request order is:
https://raw.githubusercontent.com/vincent-ferotin/scraping-github/master/tree/github-tree-requests-depth_priority_1.png
and the response order is:
https://raw.githubusercontent.com/vincent-ferotin/scraping-github/master/tree/github-tree-responses-depth_priority_1.png
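
For reference, the two configurations can be sketched as the settings below. The "breadth-first" values are the ones given in the FAQ linked above, and the "depth-first" values are Scrapy's defaults. The dictionary names are hypothetical, and that the project uses exactly these values is an assumption based on the image file names (depth_priority_0 and depth_priority_1); the repository's own settings are authoritative.

```python
# Sketch only: the dictionary names are hypothetical, chosen for illustration.
# In a real project these keys would be top-level names in settings.py.

# Default configuration, described in the FAQ as "depth-first" (LIFO queues).
DEPTH_FIRST_SETTINGS = {
    "DEPTH_PRIORITY": 0,
    "SCHEDULER_DISK_QUEUE": "scrapy.squeues.PickleLifoDiskQueue",
    "SCHEDULER_MEMORY_QUEUE": "scrapy.squeues.LifoMemoryQueue",
}

# Configuration given in the FAQ for "breadth-first" order (FIFO queues).
BREADTH_FIRST_SETTINGS = {
    "DEPTH_PRIORITY": 1,
    "SCHEDULER_DISK_QUEUE": "scrapy.squeues.PickleFifoDiskQueue",
    "SCHEDULER_MEMORY_QUEUE": "scrapy.squeues.FifoMemoryQueue",
}
```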

What I understand (please correct me if I am wrong) is that crawling
is done in *breadth-first* order in both cases.
What changes is that in the "breadth-first" configuration, the order respects
the left-to-right order of the graph being crawled,
whereas in the "depth-first" configuration, left-to-right order is not
respected (which I do not understand either).
Please verify this yourself by running the project's code.
(The orders are pretty-printed to the log at the end of the crawl.)
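
To make the observation concrete, here is a toy model in plain Python. It is not Scrapy's scheduler, and it is only a guess at one possible cause: it assumes that several pending requests are taken from the queue before any of their responses add new links (a crude stand-in for concurrent downloads). The tree, the function name and the `concurrency` parameter are all hypothetical.

```python
from collections import deque

# Hypothetical tree: each node maps to its children, listed left to right.
TREE = {
    "root": ["A", "B"],
    "A": ["A1", "A2"],
    "B": ["B1", "B2"],
    "A1": [], "A2": [], "B1": [], "B2": [],
}


def crawl_order(tree, start, lifo, concurrency):
    """Return the order in which nodes are 'requested' in a toy crawl.

    Up to `concurrency` pending requests are taken from the queue in one
    batch; their children are enqueued only after the whole batch is taken.
    """
    pending = deque([start])
    order = []
    while pending:
        batch = []
        for _ in range(min(concurrency, len(pending))):
            # LIFO takes the newest pending request, FIFO the oldest.
            batch.append(pending.pop() if lifo else pending.popleft())
        order.extend(batch)
        for node in batch:
            pending.extend(tree[node])
    return order


# LIFO, one request at a time: depth-first, but right to left.
print(crawl_order(TREE, "root", lifo=True, concurrency=1))
# LIFO, many requests at a time: level by level, right to left.
print(crawl_order(TREE, "root", lifo=True, concurrency=8))
# FIFO, many requests at a time: level by level, left to right.
print(crawl_order(TREE, "root", lifo=False, concurrency=8))
```

In this model, the LIFO queue with a large batch size produces a level-by-level order that runs right to left, which resembles what the "depth-first" images show. Whether this is what actually happens inside Scrapy is exactly what I am unsure about.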

So my first question is: am I right? (Or: what am I
misunderstanding?)
If so, should the documentation on "depth-first" vs. "breadth-first"
order be rewritten?
And is there a way to obtain a true depth-first crawl order?


Thanks,

-- Vincent