Hello,
I have a problem with the latest Scrapy release on my Windows 10 system. The goal of my project is to crawl some French government web pages in order to collect every article of law from a given code. For example, I am trying to crawl the "code général des impôts" from this page:
http://www.legifrance.gouv.fr/affichCode.do?cidTexte=LEGITEXT000006069577&dateTexte=20160202 . I also have to crawl each article linked in this summary. In the end, I have to store every title of the summary in a database, together with its associated article.
I tried a few things to do this, following the latest Scrapy tutorial to write my script.
This is the beginning of my script:
    import scrapy
    from tutorial.items import GouvItem

    class GouvSpider(scrapy.Spider):
        name = "gouv"
        start_urls = [
        ]
Then comes the part of the script that crawls each title of the summary:
    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = GouvItem()
            # Extract each nesting level once and keep it only if non-empty
            title1 = sel.xpath('span/text()').extract()
            if title1:
                item['title1'] = title1
            title2 = sel.xpath('ul/li/span/text()').extract()
            if title2:
                item['title2'] = title2
            title3 = sel.xpath('ul/li/ul/li/span/text()').extract()
            if title3:
                item['title3'] = title3
            title4 = sel.xpath('ul/li/ul/li/ul/li/span/text()').extract()
            if title4:
                item['title4'] = title4
            title5 = sel.xpath('ul/li/ul/li/ul/li/ul/li/span/text()').extract()
            if title5:
                item['title5'] = title5
            title6 = sel.xpath('ul/li/ul/li/ul/li/ul/li/ul/li/span/text()').extract()
            if title6:
                item['title6'] = title6
            link = sel.xpath('ul/li/span/a/@href').extract()
            if link:
                item['link'] = link
            yield item
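For what it's worth, the nested ul/li selection above can be exercised outside Scrapy with the standard library's xml.etree.ElementTree, which supports a subset of these relative paths. The HTML snippet below is an invented stand-in, not the real Legifrance markup:

```python
import xml.etree.ElementTree as ET

# Invented, simplified stand-in for the summary's nested list markup
html = """
<ul>
  <li><span>Title level 1</span>
    <ul>
      <li><span>Title level 2</span>
        <ul>
          <li><span>Title level 3</span></li>
        </ul>
      </li>
    </ul>
  </li>
</ul>
"""

root = ET.fromstring(html)
for li in root.findall("li"):
    # Same relative paths as the spider's XPath expressions
    title1 = [s.text for s in li.findall("span")]
    title2 = [s.text for s in li.findall("ul/li/span")]
    title3 = [s.text for s in li.findall("ul/li/ul/li/span")]
    print(title1, title2, title3)
```

Each deeper list level needs one more ul/li hop, which is why the expressions in the spider grow the way they do.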
And now I am trying to crawl each article referenced in the summary with this script:
    def parse(self, response):
        for href in response.xpath("//a/@href"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_article)

    def parse_article(self, response):
        for art in response.xpath("//div[@class='corpsArt']"):
            item = GouvItem()
            item['article'] = art.xpath('p/text()').extract()
            yield item
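If it helps: the response.urljoin(...) call above resolves relative hrefs against the page URL, just like the standard library's urllib.parse.urljoin. A quick check (the relative href here is only an illustration, not a real Legifrance link):

```python
from urllib.parse import urljoin

base = "http://www.legifrance.gouv.fr/affichCode.do?cidTexte=LEGITEXT000006069577"

# A relative href is resolved against the base page's directory
relative = "affichCodeArticle.do?idArticle=XXX"
print(urljoin(base, relative))

# An absolute href passes through unchanged
print(urljoin(base, "http://example.org/other"))
```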
To test things faster, I don't crawl the summary and the articles at the same time, because that takes a long time. My issue is with the crawl of the articles: the results it returns seem random. If I run the script twice, I get two different outputs, and I don't understand why.
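To illustrate the kind of nondeterminism I mean, here is a minimal thread-based sketch (not Scrapy itself, just an analogy): when requests complete concurrently, the set of results is stable but the order they arrive in varies from run to run:

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(page_id):
    # Simulate a network request with a random delay
    time.sleep(random.uniform(0.0, 0.05))
    return page_id

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(fetch, i) for i in range(8)]
    # as_completed yields futures in completion order, not submission order
    results = [f.result() for f in as_completed(futures)]

print(results)         # order differs between runs
print(sorted(results)) # the set of results is always the same
```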
I hope I have explained the problem clearly enough and that you will be able to help me =)
Thank you so much!