That's right, and it's due to the asynchronous nature of Scrapy. Requests may
arrive in a different order due to differences in each site's bandwidth and
response size. Even downloading the same page twice may cause the responses to
arrive at different times.
> Even if you add a priority parameter to the Request (like Request(url,
> priority=1, callback=self.test) ) there is no guarantee that your
> requests (and therefore the order of callback's execution which
> extracts the Items which in the end will go to a Pipeline ... you've
> lost me here i know :) ... are executed in the order of priority.
Correct, the priority is only for the scheduler, and it only defines the order
in which requests are pulled from the scheduler. After they're sent to the
downloader, the time it takes to "finish" each one depends on the request and
can't be known beforehand.
> I think priority matters only if your downloader's queue is full. In
> this case, Requests which have higher priority will enter the download
> queue faster.
> But if all the Requests returned from parse() can be queued in the
> downloader, the first callback will belong to the Request which
> finishes first ... and so on.
Exactly.
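The size of that download "queue" is bounded by Scrapy's concurrency settings;
a sketch of the relevant knobs (the values shown are the documented defaults):

```python
# settings.py -- concurrency limits that bound the downloader
CONCURRENT_REQUESTS = 16             # max requests downloading at once
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # per-domain cap
```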
> Please notice that I'm also a scrapy beginner and I might be wrong
> about these theories. I'm not sure if there is any setting you can use
> to force a certain order of callbacks execution. Maybe at some point
> you need a DeferredList to wait for all the requests to finish before
> starting calling the callbacks.
If you really want to control the order of your requests, you need to set up a
request/callback chain like I explained to Dave in my previous responses. This,
however, is not the most common case, as you typically want to download as
"fast" as possible to avoid idle time. If you were downloading all requests
sequentially, you'd lose the benefits of asynchronous programming.
By building your own request/callback chain you can have certain requests
executed sequentially, while others are being executed in parallel, like a
tree with branches.
In practice, sequential requests are very common for "initializing" a site
(login, set currency, etc) but may also be used for collecting additional data
for items. For example:
            4    5    6
            o----o----X
           /
o----o----o
1    2    3\
            o----o----X
            7    8    9
1. set currency
2. login
3. some item list
4 & 7. first page of item info - you collect some fields here
5 & 8. second page of item info - you continue adding more fields here
6 & 9. third page of item info - you collect the final fields for an item here
and return it. This is the only callback that actually returns an item.
> I believe that another consequence of the asynchronous requests is
> that this setting "SCHEDULER_ORDER" doesn't work quite as expected.
> I would expect that SCHEDULER_ORDER = BFO would reach leaf pages
> (the websites can be seen as graphs of pages) first, before processing
> the sibling pages. Maybe I didn't realize the true meaning
> of SCHEDULER_ORDER
Actually, it's just the opposite. With Breadth First Order (BFO) you visit
siblings first, then children. With Depth First Order (DFO) you visit leaves
first. But, as the setting's name says, it only controls the order in which
the requests are *scheduled* (not *downloaded*), which is typically what you
want.
However, it's worth noting that the SCHEDULER_ORDER is only applied to requests
of the *same priority* which means that priority has more priority (if you'll
forgive the repetition) than scheduler order.
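For reference, later Scrapy versions replaced SCHEDULER_ORDER; breadth-first
crawling is instead configured through DEPTH_PRIORITY and the scheduler queue
classes. A sketch, to be checked against the FAQ of the version you're running:

```python
# settings.py -- breadth-first order in newer Scrapy versions,
# equivalent in spirit to the old SCHEDULER_ORDER = 'BFO'
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"
```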
> I am certain that Pablo will shed some light over these issues soon :)
I hope I have :)
Pablo.