Hey guys,
Disclaimer: I'm new to this group, and fairly new to Scrapy as well (but certainly not Python).
Here is the issue I'm having. In my Scrapy project, I point to a page and hopefully grab everything I need for the item. However, some domains (I'm scraping a large number of separate domains) have certain item properties located on another page linked from the initial page (for example, "location" might only be found by following the "Get Directions" link on the page). I can't seem to get those "secondary" pages to work: the initial item goes through the pipelines without those properties, and I never see another item with those properties come through.
from scrapy import Request, Selector, Spider

class SiteSpider(Spider):
    site_loader = SiteLoader
    ...

    def parse(self, response):
        item = Place()
        sel = Selector(response)
        bl = self.site_loader(item=item, selector=sel)
        bl.add_value('domain', self.parent_domain)
        bl.add_value('origin', response.url)
        for place_property in item.fields:
            parse_xpath = template.get(place_property)
            # parse_xpath will look like either:
            #   '//path/to/property/text()'
            #   {'url': '//a[@id="Location"]/@href', 'xpath': '//div[@class="directions"]/span[contains(@class, "address")]/text()'}
            if isinstance(parse_xpath, dict):  # if True, then this place_property is in another URL
                # Request needs a single absolute URL string, not a list of matches
                url = response.urljoin(sel.xpath(parse_xpath['url']).extract_first())
                yield Request(url, callback=self.get_url_property,
                              meta={'loader': bl, 'parse_xpath': parse_xpath, 'place_property': place_property})
            else:  # process normally
                bl.add_xpath(place_property, parse_xpath)
        yield bl.load_item()

    def get_url_property(self, response):
        loader = response.meta['loader']
        parse_xpath = response.meta['parse_xpath']
        place_property = response.meta['place_property']
        sel = Selector(response)
        # The loader is bound to the first page's selector, so add extracted values directly
        loader.add_value(place_property, sel.xpath(parse_xpath['xpath']).extract())
        return loader.load_item()
Basically, the part I'm confused about is where you see "yield Request". I only put it there to illustrate where the problem lies; I know that this will cause the item to get processed without the properties found at that Request. So in my example, if the Place().location property is located at another link on the page, I'd like to load that page and fill that property with the appropriate value. If a single loader can't do it, that's fine; maybe I can use loader.item or something. I don't know, and that's pretty much where my Google trail has ended.
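To make that concrete, here is a rough sketch of the flow I'm hoping for, written in plain Python with no Scrapy involved. Everything in it is a made-up stand-in and not a Scrapy API: `FakeRequest`, `FakeResponse`, `crawl`, the `PAGES` dict, and the example.com URLs are all assumptions for illustration. The idea is that the item is not emitted from the first callback; instead each secondary page's callback fills in one property and then either chains to the next pending page or finally emits the finished item.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical stand-in for the web: URL -> data "found" on that page.
PAGES = {
    "http://example.com/place": {"name": "Cafe Uno"},
    "http://example.com/directions": {"location": "12 Main St"},
    "http://example.com/hours": {"hours": "9-5"},
}


@dataclass
class FakeRequest:
    """Stand-in for a request: a URL, a callback, and data to carry along."""
    url: str
    callback: object
    meta: dict = field(default_factory=dict)


@dataclass
class FakeResponse:
    """Stand-in for a response: the page data plus the request's meta."""
    url: str
    data: dict
    meta: dict


def next_step(item, pending):
    """Emit the item if nothing is pending, else request the next secondary page."""
    if not pending:
        yield item
        return
    prop, url = pending[0]
    # Carry the partially filled item and the remaining work in meta.
    yield FakeRequest(url, fill_property,
                      {"item": item, "prop": prop, "pending": pending[1:]})


def parse(response):
    """First callback: fill what is on this page, defer the rest."""
    item = {"origin": response.url, "name": response.data["name"]}
    pending = [
        ("location", "http://example.com/directions"),
        ("hours", "http://example.com/hours"),
    ]
    return next_step(item, pending)


def fill_property(response):
    """Secondary callback: fill one property, then continue the chain."""
    item = response.meta["item"]
    prop = response.meta["prop"]
    item[prop] = response.data[prop]
    return next_step(item, response.meta["pending"])


def crawl(start_url):
    """Tiny driver loop that plays the role of the scheduler."""
    queue = deque([FakeRequest(start_url, parse)])
    items = []
    while queue:
        req = queue.popleft()
        resp = FakeResponse(req.url, PAGES[req.url], req.meta)
        for result in req.callback(resp):
            if isinstance(result, FakeRequest):
                queue.append(result)
            else:
                items.append(result)
    return items


if __name__ == "__main__":
    print(crawl("http://example.com/place"))
```

In this sketch exactly one item comes out per place, and only after every pending page has been visited, which is the behavior I can't get out of my spider above.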
Is what I want possible? I would prefer to keep the request asynchronous somehow, but if I really have to, making a synchronous request would suffice. Can someone kinda lead me in the right direction? I'd appreciate it. Thanks!
--
Joey "JoeLinux" Espinosa