How do you perform sequential requests?


Dave Marble

Oct 22, 2009, 4:06:20 PM
to scrapy-users
It would be nice to be able to perform sequential requests when
absolutely needed. I come from using mechanize, where doing the
following is really straightforward -- though mechanize obviously
doesn't handle async/multiple simultaneous requests like scrapy does
through Twisted.

A couple examples of what I'm looking to do:

Example 1: Go to page A, scrape 20 links, crawl those 20 links in
order.

Example 2: Go to page A, find POST variables, POST to get to page B,
find POST variables, POST to get to page C, .... etc. until you reach
page F where you scrape 20 links.
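
For reference, the mechanize version of example 2 looks roughly like
this (a rough sketch from memory -- the form index, the number of hops
and the link handling are made up):

----- snip -----
import mechanize

br = mechanize.Browser()
br.open('http://example.com/pageA')

# POST our way through pages B..F, one form per page, picking up
# whatever hidden POST variables each page carries
for _ in range(5):
    br.select_form(nr=0)
    br.submit()          # blocks until the next page has loaded

# now on page F: grab the 20 links and crawl them in order
for link in list(br.links())[:20]:
    response = br.follow_link(link)
    # ... scrape response.read() here ...
    br.back()
----- snip -----

Each call blocks, so everything happens strictly in order -- which is
exactly the part scrapy's async model doesn't give me out of the box.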

Pablo Hoffman

Oct 22, 2009, 5:46:28 PM
to scrapy...@googlegroups.com
Doing this should be trivial by implementing your own spider. You don't have to
use the CrawlSpider for simple, straightforward spiders like this.

Here's an (untested) example:


from scrapy.spider import BaseSpider
from scrapy.http import Request, FormRequest


class CustomSpider(BaseSpider):

    domain_name = 'example.com'

    def start_requests(self):
        return [Request('http://example.com/page1', callback=self.parse_page1)]

    def parse_page1(self, response):
        return Request('http://example.com/page2', callback=self.parse_page2)

    def parse_page2(self, response):
        # this one performs a POST
        return FormRequest('http://example.com/page3_post',
                           formdata={'user': 'john', 'pass': 'secret'},
                           callback=self.parse_page3_post)

    def parse_page3_post(self, response):
        # parse the POST result page here
        pass

Is this what you were asking for, or did I misunderstand you?

Do you think these cases should be made clearer in the documentation?
Perhaps we could write some side-by-side comparison of spiders written in
Mechanize and Scrapy, as a quick start guide for people coming from Mechanize.

Pablo.

Dave Marble

Oct 22, 2009, 7:10:55 PM
to scrapy-users
That's clear and I did figure that out from the documentation, though
it could be made more clear. However, if I have a lot of sequential
calls to make, it's ugly and a lot of extra code.

No idea if it's something useful or do-able, but I'd prefer:

----- snip -----
class CustomSpider(BaseSpider):
    domain_name = 'example.com'

    def start_requests(self):
        ExecuteImmediately(Request('http://example.com/page1'))
        ExecuteImmediately(FormRequest('http://example.com/page2',
                                       formdata={'user': 'john', 'pass': 'secret'}))
        ExecuteImmediately(Request('http://example.com/page3'))
        ExecuteImmediately(Request('http://example.com/page4'))
        return Request('http://example.com/page5',
                       callback=self.rest_of_crawler)

    def rest_of_crawler(self, response):
        # continue with rest of crawler
        pass
----- snip -----

or something like this:

----- snip -----
class CustomSpider(BaseSpider):
    domain_name = 'example.com'

    def start_requests(self):
        requests = []
        requests.append(Request('http://example.com/page1'))
        requests.append(FormRequest('http://example.com/page2',
                                    formdata={'user': 'john', 'pass': 'secret'}))
        requests.append(Request('http://example.com/page3'))
        requests.append(Request('http://example.com/page4'))
        return Synchronous_Requests(requests, callback=self.rest_of_crawler)

    def rest_of_crawler(self, response):
        # continue with rest of crawler
        pass
----- snip -----

Aníbal Pacheco

Oct 22, 2009, 4:35:09 PM
to scrapy...@googlegroups.com
Another possibility is to wrap the returned requests inside a recursive
callback, with a selector parameter (or similar) that decides which of
the requests is triggered next; this parameter would be updated on each
recursive call.

cheers!

Florentin

Oct 22, 2009, 7:36:51 PM
to scrapy-users
I think Dave was referring to the case where you return several
requests from a response.
def parse(self, response):
    url1 = 'http://www.google.com'
    url2 = 'http://www.yahoo.com'
    url3 = 'http://www.bing.com'
    return [Request(url1, callback=self.parse2),
            Request(url2, callback=self.parse2),
            Request(url3, callback=self.parse2)]

def parse2(self, response):
    item = TestItem()
    hxs = HtmlXPathSelector(response)
    item['title'] = hxs.select('//title/text()').extract().pop()
    return item

Doing this, the current scrapy implementation will download all those
links asynchronously, and the engine won't wait until all the downloads
have finished to execute parse2(). If you are using a File Export
Pipeline, there will be no guarantee that your export file will have
its rows in the same order every time you run the spider. So you can't
assume that the first title in a CSV file belongs to Google, the second
one to Yahoo, etc.

Even if you add a priority parameter to the Request (like Request(url,
priority=1, callback=self.test)), there is no guarantee that your
requests (and therefore the order of callback execution, which extracts
the Items, which in the end go to a Pipeline... you've lost me here,
I know :) ) are executed in the order of priority.

I think priority matters only if your downloader's queue is full. In
this case, Requests which have higher priority will enter the download
queue faster.
But if all the Requests returned from parse() can be queued in the
downloader, the first callback will belong to the Request which
finishes first ... and so on.

Please notice that I'm also a scrapy beginner and I might be wrong
about these theories. I'm not sure if there is any setting you can use
to force a certain order of callback execution. Maybe at some point
you need a DeferredList to wait for all the requests to finish before
starting to call the callbacks.
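
For what it's worth, in plain Twisted (outside scrapy) that
DeferredList idea looks something like this -- just a sketch using
Twisted's simple getPage client, not something scrapy exposes:

from twisted.internet import reactor
from twisted.internet.defer import DeferredList
from twisted.web.client import getPage

def parse_all(results):
    # results is a list of (success, body) tuples, in the order the
    # requests were issued, no matter which download finished first
    for success, body in results:
        if success:
            pass  # extract the <title> here, in a deterministic order
    reactor.stop()

urls = ['http://www.google.com', 'http://www.yahoo.com',
        'http://www.bing.com']
DeferredList([getPage(url) for url in urls]).addCallback(parse_all)
reactor.run()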

I believe that another consequence of the asynchronous requests is
that the "SCHEDULER_ORDER" setting doesn't work quite as expected.
I would expect SCHEDULER_ORDER = BFO to reach leaf pages (websites
can be seen as graphs of pages) before processing the sibling pages.
Maybe I haven't realized the true meaning of SCHEDULER_ORDER.

I am certain that Pablo will shed some light over these issues soon :)

Florentin.

Pablo Hoffman

Oct 28, 2009, 6:37:17 AM
to scrapy...@googlegroups.com
On Thu, Oct 22, 2009 at 04:10:55PM -0700, Dave Marble wrote:
>
> That's clear and I did figure that out from the documentation, though
> it could be made more clear. However, if I have a lot of sequential
> calls to make, it's ugly and a lot of extra code.
>
> No idea if it's something useful or do-able, but I'd prefer:
>
> ----- snip -----
> class CustomSpider(BaseSpider):
>     domain_name = 'example.com'
>
>     def start_requests(self):
>         ExecuteImmediately(Request('http://example.com/page1'))
>         ExecuteImmediately(FormRequest('http://example.com/page2',
>                                        formdata={'user': 'john', 'pass': 'secret'}))
>         ExecuteImmediately(Request('http://example.com/page3'))
>         ExecuteImmediately(Request('http://example.com/page4'))
>         return Request('http://example.com/page5',
>                        callback=self.rest_of_crawler)
>
>     def rest_of_crawler(self, response):
>         # continue with rest of crawler
>         pass
> ----- snip -----

I presume "ExecuteImmediately" would block and return the response?. That
probably won't work with Twisted aynchronous programming (which Scrapy is based
on). You may be able to achieve a "non-blocking like" behaviour by using
Twisted inlineCallbacks, but:

1. inlineCallbacks also add a lot of boilerplate code, which is what you're
trying to avoid

2. Scrapy is designed so that spider callbacks process the response and *return
immediately*, instead of executing further requests on the spider side. This
assumption allows Scrapy to properly control the flow of Requests, Responses and
Items through the engine, and to prevent any of them from overloading memory.
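
To illustrate the kind of boilerplate I mean, a raw Twisted version
(outside Scrapy, using Twisted's getPage instead of Scrapy requests)
would look roughly like this:

from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks
from twisted.web.client import getPage

@inlineCallbacks
def crawl_sequentially():
    # each yield suspends the generator until that page is downloaded,
    # so it *reads* like blocking code but still runs in the reactor
    page1 = yield getPage('http://example.com/page1')
    page2 = yield getPage('http://example.com/page2')
    page3 = yield getPage('http://example.com/page3')
    # ... parse the pages here ...
    reactor.stop()

crawl_sequentially()
reactor.run()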

> or something like this:
>
> ----- snip -----
> class CustomSpider(BaseSpider):
>     domain_name = 'example.com'
>
>     def start_requests(self):
>         requests = []
>         requests.append(Request('http://example.com/page1'))
>         requests.append(FormRequest('http://example.com/page2',
>                                     formdata={'user': 'john', 'pass': 'secret'}))
>         requests.append(Request('http://example.com/page3'))
>         requests.append(Request('http://example.com/page4'))
>         return Synchronous_Requests(requests, callback=self.rest_of_crawler)
>
>     def rest_of_crawler(self, response):
>         # continue with rest of crawler
>         pass
> ----- snip -----

This could make more sense, as you're not *executing* the requests in the
spider but *returning* them instead, only inside a wrapper
(Synchronous_Requests) to signal they must be executed sequentially. I think
this could all be done on the spider side (i.e. no need to change the core),
perhaps with an extra spider method: self.process_synchronous_requests() or
something like that. Also, how would the responses be returned to the final
callback? As a list? What if any of them fail? Your rest_of_crawler() callback
receives only one response -- would that be the last response in the list?
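
Anyway, a rough sketch of the kind of helper I have in mind (the method
name and the decision to pass the collected responses on as a list are
just for illustration, and it assumes Request.replace() can swap a
request's callback):

class SequentialRequestsMixin(object):

    def process_synchronous_requests(self, requests, callback):
        # Chain the requests so each one is sent only after the previous
        # response arrives; the final callback gets all responses as a list.
        requests = list(requests)
        responses = []

        def chain(response, index):
            responses.append(response)
            if index + 1 < len(requests):
                return requests[index + 1].replace(
                    callback=lambda r, i=index + 1: chain(r, i))
            return callback(responses)

        return requests[0].replace(callback=lambda r: chain(r, 0))

A spider would then write more or less what you proposed:
return self.process_synchronous_requests([...], callback=self.rest_of_crawler).
The open questions above (what happens when one of them fails) would still
need an answer.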

As an example of a generic spider which does something similar, take a look at
contrib.spiders.init.InitSpider, which can be used for sites that require some
"initialization phase": logging in, setting the country/currency, or something
like that. It's not in the official documentation, but it's a stable contrib.

Pablo.

Pablo Hoffman

Oct 29, 2009, 5:10:23 PM
to scrapy...@googlegroups.com
On Thu, Oct 22, 2009 at 04:36:51PM -0700, Florentin wrote:
>
> I think Dave was referring to the case where you return several
> requests from a response.
> def parse(self, response):
>     url1 = 'http://www.google.com'
>     url2 = 'http://www.yahoo.com'
>     url3 = 'http://www.bing.com'
>     return [Request(url1, callback=self.parse2),
>             Request(url2, callback=self.parse2),
>             Request(url3, callback=self.parse2)]
>
> def parse2(self, response):
>     item = TestItem()
>     hxs = HtmlXPathSelector(response)
>     item['title'] = hxs.select('//title/text()').extract().pop()
>     return item
>
> Doing this, the current scrapy implementation will download all those
> links asynchronously and the engine won't wait until all the downloads
> have finished to
> execute parse2(). If you are using a File Export Pipeline, there will
> be no guarantee that your export file will have it's rows in the same
> order every time you run the spider. So you can't assume that first
> title in a CSV file belongs to google, second one is yahoo etc...

That's right, and it's due to the asynchronous nature of Scrapy. Responses may
arrive in a different order due to differences in each site's bandwidth and
response size. Even downloading the same page twice may cause it to arrive at
different times.

> Even if you add a priority parameter to the Request (like Request(url,
> priority=1, callback=self.test) ) there is no guarantee that your
> requests (and therefore the order of callback's execution which
> extracts the Items which in the end will go to a Pipeline ... you've
> lost me here i know :) ... are executed in the order of priority.

Correct, the priority is only for the scheduler, and it only defines the order
in which the requests are pulled from the scheduler. After they're sent to the
downloader, the time it takes to "finish" each one depends on each request and
can't be known beforehand.

> I think priority matters only if your downloader's queue is full. In
> this case, Requests which have higher priority will enter the download
> queue faster.
> But if all the Requests returned from parse() can be queued in the
> downloader, the first callback will belong to the Request which
> finishes first ... and so on.

Exactly.

> Please notice that I'm also a scrapy beginner and I might be wrong
> about these theories. I'm not sure if there is any setting you can use
> to force a certain order of callbacks execution. Maybe at some point
> you need a DeferredList to wait for all the requests to finish before
> starting calling the callbacks.

If you really want to control the order of your requests, you need to set up a
request/callback chain like the one I explained to Dave in my previous
responses. This, however, is not the most common case, as you typically want to
download as "fast" as possible to avoid idle time. If you were downloading all
requests sequentially you'd lose the benefits of asynchronous programming.

By building your own request/callback chain you can have certain requests
executed sequentially, while others are being executed in parallel, like a
tree with branches.

In practice, sequential requests are very common for "initializing" a site
(login, set currency, etc) but may also be used for collecting additional data
for items. For example:

              4    5    6
              o----o----X
             /
o----o----o--
1    2    3  \
              o----o----X
              7    8    9

1. set currency
2. login
3. some item list
4 & 7. first page of item info - you collect some fields here
5 & 8. second page of item info - you continue adding more fields here
6 & 9. third page of item info - you collect some final fields for an item here
and return it. This is the only callback that actually returns an item.
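
In spider code, the chain above would look roughly like this (just a
sketch -- the URLs, XPaths and item fields are invented, a plain dict
stands in for a real Item class, and the partially-built item is carried
along through bound callback arguments):

class ShopSpider(BaseSpider):
    domain_name = 'example.com'

    def start_requests(self):
        # 1. set currency
        return [FormRequest('http://example.com/setcurrency',
                            formdata={'currency': 'USD'},
                            callback=self.login)]

    def login(self, response):
        # 2. login
        return FormRequest('http://example.com/login',
                           formdata={'user': 'john', 'pass': 'secret'},
                           callback=self.parse_item_list)

    def parse_item_list(self, response):
        # 3. item list -- from here on, one sequential chain per item,
        # with the different items downloaded in parallel
        hxs = HtmlXPathSelector(response)
        return [Request(url, callback=self.parse_item_page1)
                for url in hxs.select('//a[@class="item"]/@href').extract()]

    def parse_item_page1(self, response):
        # 4 & 7. first page of item info -- collect some fields
        hxs = HtmlXPathSelector(response)
        item = {'title': hxs.select('//title/text()').extract().pop()}
        next_url = hxs.select('//a[@rel="next"]/@href').extract().pop()
        return Request(next_url,
                       callback=lambda r, item=item: self.parse_item_page2(r, item))

    def parse_item_page2(self, response, item):
        # 5 & 8. second page -- keep adding fields
        hxs = HtmlXPathSelector(response)
        item['price'] = hxs.select('//span[@class="price"]/text()').extract().pop()
        next_url = hxs.select('//a[@rel="next"]/@href').extract().pop()
        return Request(next_url,
                       callback=lambda r, item=item: self.parse_item_page3(r, item))

    def parse_item_page3(self, response, item):
        # 6 & 9. last page -- the only callback that returns the item
        hxs = HtmlXPathSelector(response)
        item['description'] = hxs.select('//div[@id="desc"]/text()').extract().pop()
        return item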

> I believe that another consequence of the asynchronous requests is
> that this setting "SCHEDULER_ORDER" doesn't work quite as expected.
> I would expect that SCHEDULER_ORDER = BFO would reach leaves pages
> (the websites can be seen as graphs of pages) first before processing
> the siblings pages. Maybe I didn't realized what is the true meaning
> of SCHEDULER_ORDER

Actually, it's just the opposite. With Breadth First Order (BFO) you visit
siblings first, then children. With Depth First Order (DFO) you visit leaves
first. But, as the setting's name says, it only controls the order in which
the requests are *scheduled* (not *downloaded*), which is typically what you
want.

However, it's worth noting that SCHEDULER_ORDER is only applied to requests
of the *same priority*, which means that priority has more priority (if you'll
forgive the repetition) than scheduler order.
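
For reference, this is the settings.py knob being discussed (values as
used in this thread; check the docs of your Scrapy version for the
exact default):

# settings.py
SCHEDULER_ORDER = 'DFO'   # depth-first: tends to reach leaf pages sooner
# SCHEDULER_ORDER = 'BFO' # breadth-first: siblings before children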

> I am certain that Pablo will shed some light over these issues soon :)

I hope I have :)

Pablo.
