How to do two-level nested parsing?

Alex

Jul 27, 2010, 3:18:44 AM
to scrapy-users
Hello All,

I need to get all the links from a Google search result, then follow
each link and extract the RSS link from that page, filling in this Item:

class GoogleparserItem(Item):
    num = Field()
    title = Field()
    link_to_page = Field()
    rss_link_on_page = Field()

1. I parse the Google result page to extract the links to the found
pages, and this works fine - I get a list of parsed URLs from the
Google search result.
2. At the same time I need to open each parsed URL and parse that page
to check whether it has an RSS link.

I'm not sure how to do step 2. The data from both steps 1 and 2 should
go into the same GoogleparserItem.

Using this method I can get all the data for GoogleparserItem except
'rss_link_on_page':

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        qqq = hxs.select('/html/head/title/text()').extract()[0]
        sites = hxs.select('/html/body/div[5]/div[3]/div/div/div/ol/li/h3')
        i = 1
        for site in sites:
            try:
                item = GoogleparserItem()
                title = remove_tags(site.select('a').extract().pop())
                link = site.select('a/@href').extract().pop()
                item['num'] = unicode(i)
                item['title'] = title
                item['link_to_page'] = link  # field is link_to_page, not link
                i = i + 1
                yield item
            except Exception as exc:
                print 'Exception in row ' + str(i) + ': ' + str(exc)

Is there a way to do a second level of parsing (open an already-parsed
URL and parse its response)?
Note that GoogleparserItem contains data from both steps 1 and 2.

Thanks for any suggestions.

Alex

Jul 27, 2010, 10:43:09 AM
to scrapy-users
Already found a solution.

You just need to use Request with meta.

In parse() instead of

return item

do

yield Request(item['link_to_page'], meta={'item': item},
              callback=self.parse_second)

then in parse_second() you can get the item back using

item = response.request.meta['item']
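To make the hand-off concrete, here is a minimal, self-contained sketch of the two-callback pattern. FakeRequest, FakeResponse, and the crawl() loop are stand-ins invented here so it runs without Scrapy installed (in a real spider, Scrapy's engine plays the role of crawl() and you use scrapy's own Request/Response); the hard-coded URLs and field values are placeholders for what the real XPath extraction would produce:

```python
# Simulation of Scrapy's Request/meta hand-off between two callbacks.
# FakeRequest and FakeResponse are illustrative stand-ins, NOT Scrapy classes.

class FakeRequest(object):
    def __init__(self, url, meta=None, callback=None):
        self.url = url
        self.meta = meta or {}
        self.callback = callback

class FakeResponse(object):
    def __init__(self, request, body):
        self.request = request
        self.body = body

def parse(response):
    # First-level callback: build a partial item from the search result,
    # then request the linked page and carry the item along in meta.
    item = {'title': 'Example', 'link_to_page': 'http://example.com/'}
    yield FakeRequest(item['link_to_page'],
                      meta={'item': item},
                      callback=parse_second)

def parse_second(response):
    # Second-level callback: recover the partial item from meta and fill
    # in the field that needed the second page (parsed from response.body
    # in a real spider; hard-coded here).
    item = response.request.meta['item']
    item['rss_link_on_page'] = 'http://example.com/rss.xml'
    yield item

def crawl():
    # Tiny stand-in for the Scrapy engine: drain pending requests, feed
    # each one a fake response, and collect the finished items.
    items = []
    seed = FakeRequest('http://www.google.com/search?q=test')
    pending = list(parse(FakeResponse(seed, '<html/>')))
    while pending:
        req = pending.pop()
        for result in req.callback(FakeResponse(req, '<html/>')):
            if isinstance(result, FakeRequest):
                pending.append(result)
            else:
                items.append(result)
    return items
```

Running crawl() yields one item with both the first-level fields and the second-level rss_link_on_page filled in, which is exactly the behavior the yield Request(..., meta=...) trick gives you inside a real spider.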