How can I both yield requests and return items?

Sam Denton

May 8, 2012, 1:56:29 PM

to scrapy...@googlegroups.com

I am trying to turn a web site (not a single page) into a document. Each page has HTML for both navigation and content. My plan is to start at the first page and parse the navigation info, yielding requests to parse the listed pages for content. However, I'd also like to parse the initial page for content, ideally without re-fetching. The example spiders I've seen have a 'parse' method that either yields requests or returns a list of items. How can I do both? Yielding a list of items doesn't seem to work.

class MySpider(BaseSpider):

    [...]

    def parse(self, response):
        '''parse a response, yielding new requests for each link in any table of contents'''
        hxs = HtmlXPathSelector(response)
        base_url = response.url
        links = hxs.select(self.toc_xpath)
        for index, link in enumerate(links):
            href, text = link.select('@href').extract(), link.select('text()').extract()
            ## print index, urljoin(base_url, href[0])
            yield Request(urljoin(base_url, href[0]), callback=self.parse2)
        # I would also like to gather any content on this page...
        yield self.parse2(response) # But this doesn't work :(

    def parse2(self, response):
        '''parse a response, returning any useful content'''
        hxs = HtmlXPathSelector(response)
        elements = hxs.select('//div[@id="main"]')
        items = []
        for element in elements:
            item = ContentItem()
            item['content'] = element.select(self.content_xpath).extract()
            items.append(item)
        return items

EDUARDO ANTONIO BUITRAGO ZAPATA

May 8, 2012, 4:11:57 PM

to scrapy...@googlegroups.com

Hi,

I think you should handle both tasks in the same function; it doesn't matter that the same callback produces requests and items. You can either return a list of items, or yield every item and every request one at a time.
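As a rough illustration of that point, here is a minimal sketch in plain Python that runs without Scrapy. `FakeRequest`, `FakeItem`, `split_results` and the `page` dictionary are hypothetical stand-ins invented for this example, not Scrapy APIs; the only idea taken from the thread is that one generator callback can yield both kinds of object and the consumer tells them apart by type.

```python
# Hypothetical stand-ins for Scrapy's Request and Item classes,
# so that the sketch is runnable without Scrapy installed.
class FakeRequest:
    def __init__(self, url):
        self.url = url


class FakeItem(dict):
    pass


def parse(page):
    """Yield a request for every link, then an item for the page's own content."""
    for href in page["links"]:
        yield FakeRequest(href)
    yield FakeItem(content=page["content"])


def split_results(results):
    """Sort a callback's mixed output into requests and items by type."""
    requests, items = [], []
    for result in results:
        if isinstance(result, FakeRequest):
            requests.append(result)
        else:
            items.append(result)
    return requests, items


# Assumed input: a dict standing in for a fetched page.
page = {"links": ["/a", "/b"], "content": "intro text"}
requests, items = split_results(parse(page))
print([r.url for r in requests])
print(items)
```

The order in which requests and items are yielded does not matter to a consumer written this way, because each object is dispatched on its own type.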


Steven Almeroth

May 8, 2012, 4:23:38 PM

to scrapy...@googlegroups.com

You are right that a callback can yield Requests or return a list of Items, but that is not what you are attempting. You are attempting to yield a list of Items instead of returning it, so the whole list arrives as a single object rather than as individual Items. And since parse() is already a generator function, you cannot combine yield with a return that carries a value. But you can have as many yields as you like.

Try this:

def parse(self, response):

    hxs = HtmlXPathSelector(response)
    base_url = response.url
    links = hxs.select(self.toc_xpath)

    for index, link in enumerate(links):
        href, text = link.select('@href').extract(), link.select('text()').extract()

        yield Request(urljoin(base_url, href[0]), callback=self.parse2)

    for item in self.parse2(response):
        yield item
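To see why the loop makes the difference, here is a small plain-Python sketch that needs no Scrapy. The function names and the string `"request-1"` are placeholders made up for this example, and `parse2` here just splits a string into dictionaries instead of parsing HTML. The last variant uses `yield from`, which is available on Python 3.3 and later and behaves like the explicit loop for this purpose.

```python
def parse2(response):
    """Return a list of items, like the parse2 callback in the thread."""
    return [{"content": part} for part in response.split()]


def parse_wrong(response):
    yield "request-1"               # placeholder for a Request object
    yield parse2(response)          # yields ONE object: the whole list


def parse_loop(response):
    yield "request-1"
    for item in parse2(response):   # yields each item separately
        yield item


def parse_yield_from(response):
    yield "request-1"
    yield from parse2(response)     # Python 3.3+ shorthand for the loop above


wrong = list(parse_wrong("a b"))
looped = list(parse_loop("a b"))
delegated = list(parse_yield_from("a b"))

print(wrong)      # the second element is a nested list
print(looped)     # the items appear individually
print(delegated)  # same as looped
```

In the first variant the consumer receives a list where it expects a single item or request, which matches the symptom described in the question.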

Sam Denton

May 9, 2012, 9:26:54 AM

to scrapy...@googlegroups.com

That seems simple enough! Thanks!
