Why does the engine fetch requests from the scheduler first, rather than the ones generated from start_urls?

Jianhao Chen

Mar 30, 2016, 5:52:13 AM

to scrapy-users

From HERE I found that the Scrapy engine fetches requests from the scheduler before the ones generated from start_urls.

In my case, I enqueue thousands of start URLs (from various domains), and the crawl is not very fast (maybe due to networking issues). The problem I ran into is that the spider itself extracts links and follows them, so Scrapy fetches those requests from the scheduler first. This lowers the concurrency.

I would like to understand the design rationale behind this mechanism.
Best regards.

Dimitris Kouzis - Loukas

Apr 2, 2016, 7:59:53 AM

to scrapy-users

Are you asking about http://doc.scrapy.org/en/latest/topics/broad-crawls.html ? That is, finishing all the start_urls before going wide?

Jianhao Chen

Apr 5, 2016, 1:38:16 AM

to scrapy-users

Yes. However, the Scrapy engine takes requests from the scheduler first, not from start_urls.
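
The ordering described in this thread can be illustrated with a toy model. This is only a sketch under assumptions, not Scrapy's actual engine code: the function name `drain`, the `free_slots` parameter, and the example URLs are all hypothetical. It assumes that queued requests (such as followed links) are served first, and that a start request is pulled lazily, one at a time, only when the queue is empty and a download slot is still free.

```python
from collections import deque


def drain(scheduler, start_requests, free_slots):
    """Toy model (not Scrapy code): pick up to free_slots requests."""
    picked = []
    # Requests already queued (e.g. links the spider followed) go first.
    while len(picked) < free_slots and scheduler:
        picked.append(scheduler.popleft())
    # Only when the queue ran dry and a slot is still free,
    # pull a single request from the lazy start-request iterator.
    if len(picked) < free_slots:
        nxt = next(start_requests, None)
        if nxt is not None:
            picked.append(nxt)
    return picked


# Hypothetical data: three start URLs, two followed links already queued.
start = iter(["http://a.example/", "http://b.example/", "http://c.example/"])
scheduler = deque(["http://a.example/page1", "http://a.example/page2"])

first = drain(scheduler, start, free_slots=2)   # both followed links
second = drain(scheduler, start, free_slots=2)  # one start URL only
print(first)
print(second)
```

In this model, as long as followed links keep refilling the queue, the remaining start URLs (and therefore the other domains) are not reached, which matches the lower concurrency reported above.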