scrapy not retrieving all items requested

Raf Roger

Sep 18, 2016, 3:26:50 PM
to scrapy-users
Hi,

I have a behavior I need to understand.
My Scrapy spider requests 53 URLs (and I checked the web pages: there are 53 URLs matching my request), but it returns only 43 scraped items.

If my code is:
  
  allowed_domains = ['vsetkyfirmy.sk']
  start_urls = [
      'https://www.vsetkyfirmy.sk/autokempy/',
  ]
  rules = [
      Rule(
          LinkExtractor(restrict_xpaths=u'//*[text()[contains(., "Ďalšie")]]'),
          callback='parse_start_url',
          follow=True,
      )
  ]
  page_num = 1
  counter = 1

  def parse_start_url(self, response):
      urls = Selector(response).xpath('//td/a[contains(@id, "detaily")]/@href').extract()
      for u in urls:
          yield {'link': u}


it correctly returns 53 URLs.

But if my code is:

  allowed_domains = ['vsetkyfirmy.sk']
  start_urls = [
      'https://www.vsetkyfirmy.sk/autokempy/',
  ]
  rules = [
      Rule(
          LinkExtractor(restrict_xpaths=u'//*[text()[contains(., "Ďalšie")]]'),
          callback='parse_start_url',
          follow=True,
      )
  ]
  page_num = 1
  counter = 1

  def parse_start_url(self, response):
      urls = Selector(response).xpath('//td/a[contains(@id, "detaily")]/@href').extract()
      for u in urls:
          yield scrapy.Request(u, callback=self.parse_company)

  def parse_company(self, response):
      job = Selector(response).xpath('//body/div/table[2]/tbody/tr[3]/td[2]/a/text()').extract()
      name = Selector(response).xpath('//body/div/table[1]/tbody/tr[1]/td[1]/h1/span/text()').extract()

      yield {
          "count": self.counter,
          "job": job,
          "company page url": response.url,
          "company": name,
      }
      self.counter = self.counter + 1

it returns only 43 items.

Why?
Thanks

Erik Dominguez

Nov 4, 2016, 9:25:04 PM
to scrapy-users

The most likely scenario is that your links are getting filtered by Scrapy's built-in duplicate-request filter. Try adding dont_filter=True to your Request instances:

yield scrapy.Request(u, callback=self.parse_company, dont_filter=True)
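To see why filtering shrinks the item count, here is a toy sketch of the seen-set behavior (this is not Scrapy's actual implementation, which fingerprints the whole request, but the effect on repeated URLs is the same): a URL that was already scheduled is silently dropped unless dont_filter is set.

```python
def schedule(urls, dont_filter=False):
    """Return the URLs that would actually be requested.

    Mimics the effect of a duplicate filter: without dont_filter,
    a URL that was already scheduled is dropped.
    """
    seen = set()
    scheduled = []
    for u in urls:
        if dont_filter or u not in seen:
            seen.add(u)
            scheduled.append(u)
    return scheduled

# Toy example: one detail page linked from two listing pages.
links = ['/a', '/b', '/a', '/c', '/d']
print(len(schedule(links)))                    # 4 - duplicate '/a' dropped
print(len(schedule(links, dont_filter=True)))  # 5 - every link requested
```

So if 10 of your 53 extracted hrefs point at pages that were already requested, only 43 callbacks fire, which matches the numbers you are seeing.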
