Scrapy sometimes downloads the response and sometimes does not

shiva krishna

Jul 17, 2012, 8:15:36 AM

to scrapy...@googlegroups.com

I am currently working with Scrapy. Below is my spider.py code:

    from scrapy.http import Request
    from scrapy.selector import HtmlXPathSelector
    from scrapy.spider import BaseSpider


    class ExampleSpider(BaseSpider):
        name = "example"
        allowed_domains = ["careers-preftherapy.icims.com"]
        start_urls = [
            "https://careers-preftherapy.icims.com/jobs/search"
        ]

        def parse(self, response):
            # Read the total page count from the paging cell, then request every page.
            hxs = HtmlXPathSelector(response)
            pageCount = hxs.select('//td[@class = "iCIMS_JobsTablePaging"]/table/tr/td[2]/text()').extract()[0].strip()[-2:].strip()
            for i in range(1, int(pageCount) + 1):
                yield Request("https://careers-preftherapy.icims.com/jobs/search?pr=%d" % i, callback=self.parsePage)

        def parsePage(self, response):
            # Collect the job links from the odd and the even table rows.
            hxs = HtmlXPathSelector(response)
            urls_list_odd_id = hxs.select('//table[@class="iCIMS_JobsTable"]/tr/td[@class="iCIMS_JobsTableOdd iCIMS_JobsTableField_1"]/a/@href').extract()
            print urls_list_odd_id, ">>>>>>>odddddd>>>>>>>>>>>>>>>>"
            urls_list_even_id = hxs.select('//table[@class="iCIMS_JobsTable"]/tr/td[@class="iCIMS_JobsTableEven iCIMS_JobsTableField_1"]/a/@href').extract()
            # Print the even list here (the odd list was printed twice before).
            print urls_list_even_id, ">>>>>>>Evennnn>>>>>>>>>>>>>>>>"
            urls_list = []
            urls_list.extend(urls_list_odd_id)
            urls_list.extend(urls_list_even_id)
            for i in urls_list:
                yield Request(i.encode('utf-8'), callback=self.parseJob)

        def parseJob(self, response):
            pass

After opening the start page, I paginate through URLs such as:

https://careers-preftherapy.icims.com/jobs/search?pr=1

https://careers-preftherapy.icims.com/jobs/search?pr=2

and so on.
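A minimal sketch of the pagination described above, written for Python 3 and the standard library only so it can be run without Scrapy. It assumes the paging cell text ends with the total page count; the sample text "Page 1 of 6" is made up, and `page_count_from_text` and `page_urls` are hypothetical helper names, not part of the spider.

```python
import re

BASE_URL = "https://careers-preftherapy.icims.com/jobs/search"


def page_count_from_text(paging_text):
    """Return the trailing integer in the paging text, or 1 if none is found."""
    match = re.search(r"(\d+)\s*$", paging_text)
    return int(match.group(1)) if match else 1


def page_urls(paging_text):
    """Build one search URL per page, numbered from 1."""
    count = page_count_from_text(paging_text)
    return ["%s?pr=%d" % (BASE_URL, page) for page in range(1, count + 1)]


if __name__ == "__main__":
    # "Page 1 of 6" is a made-up sample, not text copied from the site.
    for url in page_urls("  Page 1 of 6 \n"):
        print(url)
```

Compared with slicing the last two characters, matching the trailing digits also copes with page counts of three or more digits, and falling back to 1 avoids an exception when the cell is missing or empty.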

I yield a request for each page URL (suppose there are 6 pages). When Scrapy reaches the first URL (`https://careers-preftherapy.icims.com/jobs/search?pr=1`), I try to collect all the href attributes on that page, and I do the same when it reaches the second URL.

As you can see in my code, each page has 20 href attributes in total: 10 are under `td[@class="iCIMS_JobsTableOdd iCIMS_JobsTableField_1"]` and the remaining 10 are under `td[@class="iCIMS_JobsTableEven iCIMS_JobsTableField_1"]`.
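Since the odd and even cells share the class token `iCIMS_JobsTableField_1`, both groups could in principle be collected in one pass. The sketch below illustrates the idea with Python 3's standard `html.parser` rather than Scrapy's selectors; the sample markup is made up to resemble the classes quoted above, and `JobLinkCollector` and `collect_job_links` are hypothetical names. In XPath the rough equivalent would be a substring test such as `//td[contains(@class, "iCIMS_JobsTableField_1")]/a/@href`.

```python
from html.parser import HTMLParser

# Made-up markup that mimics the class names from the post.
SAMPLE_HTML = """
<table class="iCIMS_JobsTable">
  <tr><td class="iCIMS_JobsTableOdd iCIMS_JobsTableField_1"><a href="/jobs/1/job">A</a></td></tr>
  <tr><td class="iCIMS_JobsTableEven iCIMS_JobsTableField_1"><a href="/jobs/2/job">B</a></td></tr>
  <tr><td class="iCIMS_JobsTableOdd iCIMS_JobsTableField_2"><a href="/other">C</a></td></tr>
</table>
"""


class JobLinkCollector(HTMLParser):
    """Collect hrefs of <a> tags inside <td> cells carrying a given class token."""

    def __init__(self, class_token):
        super().__init__()
        self.class_token = class_token
        self.in_target_cell = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            # Compare whole class tokens, so Field_1 does not match Field_10.
            classes = (attrs.get("class") or "").split()
            self.in_target_cell = self.class_token in classes
        elif tag == "a" and self.in_target_cell and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_target_cell = False


def collect_job_links(html, class_token="iCIMS_JobsTableField_1"):
    """Return the job links found in document order (nested tables are not handled)."""
    parser = JobLinkCollector(class_token)
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    print(collect_job_links(SAMPLE_HTML))
```

Collecting both groups together keeps the links in page order and removes the need to merge two lists afterwards.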

The problem is that Scrapy sometimes extracts these links and sometimes does not, and I don't know why. When I run the spider several times, some runs return the links and other runs return empty lists, as shown below.

**First run:**

2012-07-17 17:05:20+0530 [Preferredtherapy] DEBUG: Crawled (200) <GET https://careers-preftherapy.icims.com/jobs/search?pr=2> (referer: https://careers-preftherapy.icims.com/jobs/search)

[] >>>>>>>odddddd>>>>>>>>>>>>>>>>

[] >>>>>>>Evennnn>>>>>>>>>>>>>>>>

**Second run:**

2012-07-17 17:05:20+0530 [Preferredtherapy] DEBUG: Crawled (200) <GET https://careers-preftherapy.icims.com/jobs/search?pr=2> (referer: https://careers-preftherapy.icims.com/jobs/search)

[u'https://careers-preftherapy.icims.com/jobs/1836/job', u'https://careers-preftherapy.icims.com/jobs/1813/job', u'https://careers-preftherapy.icims.com/jobs/1763/job']>>>>>>>odddddd>>>>>>>>>>>>>>>>

[u'https://careers-preftherapy.icims.com/jobs/1811/job', u'https://careers-preftherapy.icims.com/jobs/1787/job']>>>>>>>Evennnn>>>>>>>>>>>>>>>>
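Because both runs report a 200 status for the same URL, one way to narrow this down might be to compare what the server actually returned in each run. The helper below is only a diagnostic sketch in standard-library Python; `summarize_body` and its `marker` default are hypothetical, the two bodies are made up, and it does not by itself explain why the lists come back empty.

```python
import hashlib


def summarize_body(body, marker=b"iCIMS_JobsTable"):
    """Return a few facts about a raw response body for logging."""
    return {
        "length": len(body),               # size of the downloaded page in bytes
        "has_marker": marker in body,      # is the jobs table class present at all?
        "sha1": hashlib.sha1(body).hexdigest(),  # fingerprint to compare runs
    }


if __name__ == "__main__":
    # Two made-up bodies standing in for a run with links and a run without.
    with_table = b'<table class="iCIMS_JobsTable"><tr></tr></table>'
    without_table = b"<html><body>placeholder</body></html>"
    print(summarize_body(with_table))
    print(summarize_body(without_table))
```

Inside `parsePage`, the summary of `response.body` could be logged next to the extracted lists. If the empty runs show a different length or no marker, the server returned different HTML; if the body looks the same, the XPath expressions would be the next thing to check.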

My question is: why does it sometimes download the links and sometimes not? Any reply would be really helpful.