I am currently working with Scrapy. Below is my spider.py code:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request


    class ExampleSpider(BaseSpider):
        name = "example"
        allowed_domains = ["careers-preftherapy.icims.com"]
        start_urls = [
            "https://careers-preftherapy.icims.com/jobs/search"
        ]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            # The last two characters of the paging cell's text hold the page count.
            pageCount = hxs.select('//td[@class="iCIMS_JobsTablePaging"]/table/tr/td[2]/text()').extract()[0].strip()[-2:].strip()
            # Request every result page, starting from pr=1.
            for i in range(1, int(pageCount) + 1):
                yield Request("https://careers-preftherapy.icims.com/jobs/search?pr=%d" % i,
                              callback=self.parsePage)

        def parsePage(self, response):
            hxs = HtmlXPathSelector(response)
            # Job links sit in alternating "Odd" and "Even" table cells.
            urls_list_odd_id = hxs.select('//table[@class="iCIMS_JobsTable"]/tr/td[@class="iCIMS_JobsTableOdd iCIMS_JobsTableField_1"]/a/@href').extract()
            print urls_list_odd_id, ">>>>>>>odddddd>>>>>>>>>>>>>>>>"
            urls_list_even_id = hxs.select('//table[@class="iCIMS_JobsTable"]/tr/td[@class="iCIMS_JobsTableEven iCIMS_JobsTableField_1"]/a/@href').extract()
            print urls_list_even_id, ">>>>>>>Evennnn>>>>>>>>>>>>>>>>"
            urls_list = []
            urls_list.extend(urls_list_odd_id)
            urls_list.extend(urls_list_even_id)
            for i in urls_list:
                yield Request(i.encode('utf-8'), callback=self.parseJob)

        def parseJob(self, response):
            pass

After opening the start page, I paginate through the results with URLs like:
https://careers-preftherapy.icims.com/jobs/search?pr=1
https://careers-preftherapy.icims.com/jobs/search?pr=2
... and so on.
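
To make the pagination scheme concrete, here is a minimal sketch that builds the same list of page URLs outside of Scrapy. The helper name `build_page_urls` and the `BASE_URL` constant are hypothetical names introduced only for this illustration; the `pr` query parameter is the one used in the spider.

```python
# Hypothetical constant for illustration; the URL is the spider's start URL.
BASE_URL = "https://careers-preftherapy.icims.com/jobs/search"


def build_page_urls(page_count, base_url=BASE_URL):
    """Return one search URL per result page, numbered 1..page_count."""
    return ["%s?pr=%d" % (base_url, page) for page in range(1, page_count + 1)]


if __name__ == "__main__":
    # With 6 pages this prints the URLs for pr=1 through pr=6.
    for url in build_page_urls(6):
        print(url)
```

In the spider, each of these URLs is what gets passed to `Request` with `callback=self.parsePage`.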
I yield a request for each URL (suppose there are 6 pages). When Scrapy reaches the first URL
`(https://careers-preftherapy.icims.com/jobs/search?pr=1)`
I try to collect all the href values from that page, and when it reaches the second URL I collect the href values from that page in the same way.
As you can see in my code, there are 20 href links in total on each page:
10 of them are under `td[@class="iCIMS_JobsTableOdd iCIMS_JobsTableField_1"]`
and the rest are under `td[@class="iCIMS_JobsTableEven iCIMS_JobsTableField_1"]`.
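
Since both kinds of cell share the `iCIMS_JobsTableField_1` class, the odd and even links could be collected in one pass by matching on that shared class token. The sketch below illustrates the idea with the standard library's `xml.etree.ElementTree` on a small hand-written sample, not on the real page and not with Scrapy's selector; the sample markup, the `example.com` URLs, and the `job_links` helper are assumptions made for illustration. In the spider itself, the XPath 1.0 function `contains(@class, "iCIMS_JobsTableField_1")` would be the closest equivalent. This only simplifies the extraction; it does not by itself explain the empty results.

```python
import xml.etree.ElementTree as ET

# Hand-written sample that mimics the structure described above (an assumption,
# not the real page source).
SAMPLE = """
<table class="iCIMS_JobsTable">
  <tr><td class="iCIMS_JobsTableOdd iCIMS_JobsTableField_1"><a href="https://example.com/jobs/1/job">A</a></td></tr>
  <tr><td class="iCIMS_JobsTableEven iCIMS_JobsTableField_1"><a href="https://example.com/jobs/2/job">B</a></td></tr>
  <tr><td class="iCIMS_JobsTableOdd iCIMS_JobsTableField_2">not a job link cell</td></tr>
</table>
"""


def job_links(markup, field_class="iCIMS_JobsTableField_1"):
    """Collect hrefs from every <td> whose class list contains field_class."""
    root = ET.fromstring(markup)
    links = []
    for cell in root.iter("td"):
        # Split the class attribute into tokens so Odd and Even cells both match.
        if field_class in cell.get("class", "").split():
            for anchor in cell.findall("a"):
                href = anchor.get("href")
                if href:
                    links.append(href)
    return links


if __name__ == "__main__":
    print(job_links(SAMPLE))
```

Matching on the shared token keeps the links in document order and avoids maintaining two nearly identical queries.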
The problem is that Scrapy sometimes extracts these links and sometimes does not, and I don't know what is happening. When I run the spider several times, some runs return the links and other runs return an empty list, as shown below.
**First run:**

    2012-07-17 17:05:20+0530 [Preferredtherapy] DEBUG: Crawled (200) <GET https://careers-preftherapy.icims.com/jobs/search?pr=2> (referer: https://careers-preftherapy.icims.com/jobs/search)
    [] >>>>>>>odddddd>>>>>>>>>>>>>>>>
    [] >>>>>>>Evennnn>>>>>>>>>>>>>>>>

**Second run:**

    2012-07-17 17:05:20+0530 [Preferredtherapy] DEBUG: Crawled (200) <GET https://careers-preftherapy.icims.com/jobs/search?pr=2> (referer: https://careers-preftherapy.icims.com/jobs/search)
    [u'https://careers-preftherapy.icims.com/jobs/1836/job',
     u'https://careers-preftherapy.icims.com/jobs/1813/job',
     u'https://careers-preftherapy.icims.com/jobs/1763/job'] >>>>>>>odddddd>>>>>>>>>>>>>>>>
    [u'https://careers-preftherapy.icims.com/jobs/1811/job',
     u'https://careers-preftherapy.icims.com/jobs/1787/job'] >>>>>>>Evennnn>>>>>>>>>>>>>>>>

My question is: why are the links sometimes extracted and sometimes not? Any help would be greatly appreciated.
Thanks in advance.