I am trying to turn a website (not a single page) into a document. Each page has HTML for both navigation and content. My plan is to start at the first page and parse its navigation info, yielding requests to parse the listed pages for content. However, I'd also like to parse the initial page for content, ideally without re-fetching it. The example spiders I've seen have a 'parse' method that either yields requests or returns a list of items. How can I do both? Yielding the list of items returned by parse2, as in the code below, doesn't seem to work.
def parse(self, response):
    '''parse a response, yielding new requests for each link in any table of contents'''
    hxs = HtmlXPathSelector(response)
    base_url = response.url
    links = hxs.select(self.toc_xpath)
    for index, link in enumerate(links):
        href, text = link.select('@href').extract(), link.select('text()').extract()
        ## print index, urljoin(base_url, href[0])
        yield Request(urljoin(base_url, href[0]), callback=self.parse2)
    # I would also like to gather any content on this page...
    yield self.parse2(response) # But this doesn't work :(

def parse2(self, response):
    '''parse a response, returning any useful content'''
    hxs = HtmlXPathSelector(response)
    elements = hxs.select('//div[@id="main"]')
    items = []
    for element in elements:
        item = ContentItem()
        item['content'] = element.select(self.content_xpath).extract()
        items.append(item)
    return items
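
One possible direction, sketched in plain Python rather than against the real spider: a generator can yield the requests first and then yield each item individually, instead of yielding the whole list as a single object. The names below (`parse_content`, the `page` dict, the `('request', href)` tuples) are hypothetical stand-ins for `parse2`, the response, and `Request`. The sketch assumes the framework accepts a callback that yields a mix of requests and items.

```python
def parse_content(page):
    """Stand-in for parse2: return a list of items found on the page."""
    return [{'content': text} for text in page['main']]


def parse(page):
    """Stand-in for parse: yield one request per link, then each item."""
    for href in page['links']:
        # Hypothetical placeholder for Request(url, callback=...)
        yield ('request', href)
    # Yield the items one at a time; yielding the list itself would
    # hand the caller a single list object instead of individual items.
    for item in parse_content(page):
        yield item


# Hypothetical page data standing in for a fetched response
page = {'links': ['a.html', 'b.html'], 'main': ['intro text']}
results = list(parse(page))
```

Applied to the code above, this would mean replacing the final `yield self.parse2(response)` with a loop over `self.parse2(response)` that yields each item, so the first page's content is collected from the response already in hand.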