Recursive crawling and returning items


Jason Smith
Apr 16, 2010, 1:15:17 PM
to scrapy-users
Hi there,

My first crawler crawls fine after some experimentation, but I cannot
return items for further pipelining because of the structure I
created.


What I want to do is roughly the following:
-> crawl a list of similar websites (same domain), but I want to
provide the links dynamically (i.e. from an external application, so I
think I cannot use start_urls)
-> these similar sites have to be crawled recursively to get all the
data I want
-> after crawling I want to process the data, i.e. use it in a
pipeline like the one sketched below
The last part is what does not work in my current design.
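The pipeline side itself is nothing special; roughly something like
this (the class name is made up, hooked up via the ITEM_PIPELINES
setting):
----
# pipelines.py -- MyProcessingPipeline is just a placeholder name
class MyProcessingPipeline(object):

    def process_item(self, item, spider):
        # ... process/store the extracted data here ...
        return item
--------------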

So what I ended up with is a loop over the URLs, where each URL starts
a recursive crawling chain. The crawler does the following:
----
def start_requests(self):
    # open an url which contains a list of urls
    request = Request(settings['URL_FILE'],
                      callback=self.parse_url_list)
    yield request

def parse_url_list(self, response):
    # (... code to extract the urls from the HTML ...)
    for url in url_list:
        request = Request(url, callback=self.parse)
        yield request

def parse(self, response):
    # (... this is where I do the real extraction of data ...)

    # (... due to the nature of the site I do recursive requests to
    #      get all the data ...)
    request = Request(next_page, callback=self.parse)
    yield request
--------------

But how can I ever return the items in this design? I guess my design
is seriously flawed, but I cannot come up with a different one that
works. My guess is that all these yields make it impossible to get the
data back with a return statement, but my understanding of yield and
Python is not sufficient to come up with a working solution to this
problem.

Any ideas?


Rolando Espinoza La Fuente
Apr 16, 2010, 1:39:52 PM
to scrapy...@googlegroups.com
You can yield requests or items as many times as you need.

def parse(self, response):
    # extract items from response
    for item in items:
        yield item

    # extract next requests from response
    for req in requests:
        yield req
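
Applied to your spider, it would look roughly like this (the item
class and field name are made up, just to show the idea):

def parse(self, response):
    # ... your real extraction of data ...
    for data in extracted_data:
        item = MyItem()           # hypothetical Item subclass
        item['field'] = data      # hypothetical field
        yield item

    # ... and the recursive request for the rest of the data ...
    if next_page:
        yield Request(next_page, callback=self.parse)

Scrapy iterates over everything the callback yields: Request objects
are scheduled for crawling, items are passed on to the item pipelines.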


~Rolando

Jason Smith
Apr 16, 2010, 3:57:55 PM
to scrapy-users
On Apr 16, 7:39 pm, Rolando Espinoza La Fuente <dark...@gmail.com>
wrote:
> You can yield requests or items as many times as you need.
>
> def parse(self, response):
>      # extract items from response
>      for item in items:
>           yield item

So the "yield item" in your example is functionally the same as
"return item" in a "normal" crawler? Gonna try that, thanks. Will post
the results here.
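
If I understand it right, the difference in plain Python is roughly
this:

def with_return():
    return 1          # hands back one value and is done

def with_yield():
    yield 1           # hands back a generator that can produce
    yield 2           # several values, one after another

list(with_yield())    # -> [1, 2]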

Jason

Jason Smith
Apr 16, 2010, 4:10:32 PM
to scrapy-users
Thank you Rolando, that works beautifully. Thanks also for the quick
response! Scrapy is elegant and powerful; I'm pretty sure I will be
using it for many different projects.