Process Multiple Requests For Single Item


Joey Espinosa

Mar 5, 2014, 8:41:12 AM
to scrapy...@googlegroups.com
Hey guys,

Disclaimer: I'm new to this group, and fairly new to Scrapy as well (but certainly not Python).

Here is the issue I'm having. In my Scrapy project, I point the spider at a page and ideally grab everything I need for the item there. However, some domains (I'm scraping a significant number of separate domains) keep certain item properties on a secondary page linked from the initial one (for example, "location" might only be found by following the "Get Directions" link on the page). I can't get those "secondary" pages to work: the initial item goes through the pipelines without those properties, and I never see another item with those properties come through.

class SiteSpider(Spider):
    site_loader = SiteLoader
    ...

    def parse(self, response):
        item = Place()
        sel = Selector(response)
        bl = self.site_loader(item=item, selector=sel)
        bl.add_value('domain', self.parent_domain)
        bl.add_value('origin', response.url)
        for place_property in item.fields:
            parse_xpath = template.get(place_property)

            # parse_xpath will look like either:
            # '//path/to/property/text()'
            # {'url': '//a[@id="Location"]/@href', 'xpath': '//div[@class="directions"]/span[contains(@class, "address")]/text()'}
            if isinstance(parse_xpath, dict):    # if True, then this place_property is in another URL
                url = sel.xpath(parse_xpath['url_elem']).extract()
                yield Request(url, callback=self.get_url_property, meta={'loader': bl, 'parse_xpath': parse_xpath, 'place_property': place_property})
            else:  # process normally
                bl.add_xpath(event_property, template.get(event_property))
        yield bl.load_item()

    def get_url_property(self, response):
        loader = response.meta['loader']
        parse_xpath = response.meta['parse_xpath']
        place_property = response.meta['place_property']
        sel = Selector(response)
        loader.add_value(place_property, sel.xpath(parse_xpath['xpath'])
        return loader


Basically, the part I'm confused about is where you see "yield Request". I only put it there to illustrate where the problem lies; I know that this will cause the item to get processed without the properties found at that Request. So in my example, if the Place().location property is located at another link on the page, I'd like to load that page and fill that property with the appropriate value. Even if a single loader can't do it, that's fine, maybe I can use loader.item or something. I don't know, that's pretty much where my Google trail has ended.

Is what I want possible? I would prefer to keep the request asynchronous somehow, but if I really have to, making a synchronous request would suffice. Can someone kinda lead me in the right direction? I'd appreciate it. Thanks!

--
Joey "JoeLinux" Espinosa

Joey Espinosa

Mar 5, 2014, 8:47:56 AM
to scrapy...@googlegroups.com
HOLY TYPOS. Sorry. Revised:

class SiteSpider(Spider):
    site_loader = SiteLoader
    ...

    def parse(self, response):
        item = Place()
        sel = Selector(response)
        bl = self.site_loader(item=item, selector=sel)
        bl.add_value('domain', self.parent_domain)
        bl.add_value('origin', response.url)
        for place_property in item.fields:
            parse_xpath = self.template.get(place_property)

            # parse_xpath will look like either:
            # '//path/to/property/text()'
            # {'url': '//a[@id="Location"]/@href', 'xpath': '//div[@class="directions"]/span[contains(@class, "address")]/text()'}
            if isinstance(parse_xpath, dict):    # if True, then this place_property is in another URL
                url = sel.xpath(parse_xpath['url_elem']).extract()
                yield Request(url, callback=self.get_url_property, meta={'loader': bl, 'parse_xpath': parse_xpath, 'place_property': place_property})
            else:  # process normally
                bl.add_xpath(place_property, parse_xpath)
        yield bl.load_item()

    def get_url_property(self, response):
        loader = response.meta['loader']
        parse_xpath = response.meta['parse_xpath']
        place_property = response.meta['place_property']
        sel = Selector(response)
        loader.add_value(place_property, sel.xpath(parse_xpath['xpath']))
        return loader


--
Joey "JoeLinux" Espinosa


Pablo Hoffman

Apr 18, 2014, 3:34:01 PM
to scrapy-users
You shouldn't return/yield the item until it's complete. In other words, you should return the item from the "get_url_property" callback, not the main one. Each item must be returned only once, and only once its data has been fully populated.
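To make that concrete, here is a framework-free sketch of the control flow: keep a queue of "remote" properties, follow one link per callback, and yield the finished item only from the last callback. In real Scrapy you would carry the item and the remaining queue between callbacks via Request.meta (as the original code already does for the loader) and yield actual Request objects; the property values and the `crawl` driver below are hypothetical stand-ins for what the engine and XPath extraction would do.

def parse(item, pending):
    """Entry point: either the item is already complete, or hand
    off to the next callback with the remaining work attached
    (in Scrapy, via Request.meta)."""
    if not pending:
        yield ('item', item)
    else:
        prop, value = pending[0]
        yield ('request', prop, value, item, pending[1:])

def get_url_property(prop, value, item, pending):
    """Callback for one secondary page: fill in the property, then
    either chain another request or emit the finished item."""
    item[prop] = value
    if pending:
        nxt_prop, nxt_value = pending[0]
        yield ('request', nxt_prop, nxt_value, item, pending[1:])
    else:
        yield ('item', item)

def crawl(item, pending):
    """Drive the callback chain the way a crawler engine would:
    keep dispatching requests until an item is finally yielded."""
    results = list(parse(item, pending))
    while results and results[0][0] == 'request':
        _, prop, value, item, pending = results[0]
        results = list(get_url_property(prop, value, item, pending))
    return results[0][1]

# Hypothetical data: two properties that live on secondary pages.
place = crawl({'domain': 'example.com'},
              [('location', '123 Main St'), ('phone', '555-0100')])

The key point matches Pablo's advice: `parse` never yields the item when work remains, and exactly one item comes out, fully populated, from the final callback.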


--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scrapy-users...@googlegroups.com.
To post to this group, send email to scrapy...@googlegroups.com.
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/groups/opt_out.
