Hello everybody,
I'm new to Scrapy and I've run into a small problem using it.
My goal is the following: I'm doing data mining for a French company (I'm French), and I would like to enrich my data by scraping some information from these pages:
http://www.capgeris.com/hebergements-personnes-agees-1402/-a1.htm,
http://www.capgeris.com/hebergements-personnes-agees-1402/-a2.htm ... etc.
Each page contains the same kind of information (name and accommodation capacity).
Here is my code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from tutorial.items import DmozItem

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["capgeris.com"]
    start_urls = [
        "http://www.capgeris.com/hebergements-personnes-agees-1402/"
    ]
    rules = [Rule(SgmlLinkExtractor(allow=['-a\d+.htm']), 'parse_torrent')]

    def parse(self, response):
        x = HtmlXPathSelector(response)
        items = []
        item = DmozItem()
        item['type_etab'] = x.select('//div[@class="detretraite"]/ul[1]/li').extract()
        item['nb_places'] = x.select('//div[@class="detretraite"]/ul[2]/li').extract()
        items.append(item)
        return items
In fact, I can't work out how to write the kind of loop that would cover each of the pages:
http://www.capgeris.com/hebergements-personnes-agees-1402/-aINTEGER.htm with INTEGER ranging from 1 to some upper limit between 9000 and 10000.
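
The loop described above could be sketched in plain Python as follows. This is only a minimal illustration, not a tested spider: the helper name `page_urls` is made up for the example, and the default upper bound of 10000 is an assumption, since the real maximum is only known to lie between 9000 and 10000.

```python
# Base listing URL taken from the post; each page adds the suffix "-a<N>.htm".
BASE_URL = "http://www.capgeris.com/hebergements-personnes-agees-1402/"


def page_urls(first=1, last=10000):
    """Return the page URLs -a<first>.htm through -a<last>.htm (inclusive).

    `last=10000` is an assumed upper bound, not a value confirmed by the site.
    """
    return ["%s-a%d.htm" % (BASE_URL, n) for n in range(first, last + 1)]


if __name__ == "__main__":
    # Print the first three generated URLs as a quick sanity check.
    for url in page_urls(1, 3):
        print(url)
```

In the spider, something like `start_urls = page_urls()` could then replace the single listing URL. With every page listed explicitly, a plain spider whose `parse` method handles each response would probably be enough, and the link-extraction rule would no longer be needed.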
The problem must be in this script, because the robot works if I give it just one page to scrape.
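
Two details of the rule in the script may be worth checking. First, the rule names `parse_torrent` as its callback, while the only method defined is `parse`; the Scrapy documentation also advises against using `parse` as a rule callback in a `CrawlSpider`, because the class uses that method itself to apply its rules. Second, the `.` in the `allow` pattern is unescaped, so it matches any character. The sketch below illustrates only the second point with the standard `re` module, assuming the link extractor searches for the pattern inside each URL; the `STRICT` pattern and the second URL are made up for the comparison.

```python
import re

# Pattern as written in the rule: the "." is unescaped and matches any character.
LOOSE = re.compile(r'-a\d+.htm')
# Stricter variant (an assumption about the intent): literal dot, anchored at the end.
STRICT = re.compile(r'-a\d+\.htm$')


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL (re.search semantics)."""
    return pattern.search(url) is not None


# A real page URL from the post, and a made-up URL that should not be followed.
PAGE = "http://www.capgeris.com/hebergements-personnes-agees-1402/-a2.htm"
ODD = "http://www.capgeris.com/hebergements-personnes-agees-1402/-a2xhtm"

if __name__ == "__main__":
    print(matches(LOOSE, PAGE), matches(STRICT, PAGE))
    print(matches(LOOSE, ODD), matches(STRICT, ODD))
```

For the callback mismatch, renaming the method to something like `parse_item` (a hypothetical name) and passing that same name in the `Rule` would be one way to make the two agree.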
Thanks for your help!