Scrapy Random answer with the same script

18 views

Skip to first unread message

fifou weiron

unread,

Feb 5, 2016, 6:30:02 AM2/5/16

to scrapy-users

Hello,

I have a problem with the last scrapy's release on my windows 10 system. Actually, the goal of my project is to crawl some french government web page to have every article of the law from a code. For exemple, I'm trying to crawl the "code général des impôts" from this page http://www.legifrance.gouv.fr/affichCode.do?cidTexte=LEGITEXT000006069577&dateTexte=20160202 . Moreover, I have to crawl each article linked in this summary. At the end, I have to put every titles of the summuray in a database with its article associated.

I tried some things to do this. I did that with the latest tutorial of scrapy to write my script.

this is the begining of my script :

import scrapy

from tutorial.items import GouvItem

class GouvSpider(scrapy.Spider):

name = "gouv"

allowed_domains = ["legifrance.gouv.fr"]

start_urls = [

"http://www.legifrance.gouv.fr/affichCode.do?cidTexte=LEGITEXT000006069577&dateTexte=20160128"

]

Then the script allowing me to crawl each title of the summary :

def parse(self, response):

for sel in response.xpath('//ul/li'):

item = GouvItem()

if len(sel.xpath('span/text()').extract()) > 0:

item['title1'] = sel.xpath('span/text()').extract()

if len(sel.xpath('ul/li/span/text()').extract()) > 0:

item['title2'] = sel.xpath('ul/li/span/text()').extract()

if len(sel.xpath('ul/li/ul/li/span/text()').extract()) > 0:

item['title3'] = sel.xpath('ul/li/ul/li/span/text()').extract()

if len(sel.xpath('ul/li/ul/li/ul/li/span/text()').extract()) > 0:

item['title4'] = sel.xpath('ul/li/ul/li/ul/li/span/text()').extract()

if len(sel.xpath('ul/li/ul/li/ul/li/ul/li/span/text()').extract()) > 0:

item['title5'] = sel.xpath('ul/li/ul/li/ul/li/ul/li/span/text()').extract()

if len(sel.xpath('ul/li/ul/li/ul/li/ul/li/ul/li/span/text()').extract()) > 0:

item['title6'] = sel.xpath('ul/li/ul/li/ul/li/ul/li/ul/li/span/text()').extract()

if len(sel.xpath('ul/li/ul/li/span/a/@href').extract()) > 0:

item['link'] = sel.xpath('ul/li/ul/li/span/a/@href').extract()

yield item

And now i'm trying to cawl each article called in the summary with this script :

def parse(self, response):

for href in response.xpath("//a/@href"):

url = response.urljoin(href.extract())

yield scrapy.Request(url, callback=self.parse_article)

def parse_article(self, response):

for art in response.xpath("//div[@class='corpsArt']"):

item = GouvItem()

item['article'] = art.xpath('p/text()').extract()

yield item

To try these things faster, I don't crawl the summary and each article at the same time to test because it takes a long time. So, with the crawl of articles, my issue is that the things that it return to me are random. I mean that if I execute my script two times it will return to me two differents answers... I don't get why...

I hope that you will be able to help me and that you understood my subject =) I tried to explain it as much as possible.

Thank you so much !!

Steven Almeroth

unread,

Mar 5, 2016, 7:21:37 PM3/5/16

to scrapy-users

Weiron, response order is not guaranteed because the server you are crawling may return responses differently each time; in other words if you make a request and then make another request, the server may return the second request's response before the first, or not; but, this is beyond our control.

Try setting `CONCURRENT_REQUESTS = 1` but the downloader is also "asynchronous" so response order is still not guaranteed.

http://doc.scrapy.org/en/latest/topics/settings.html#concurrent-requests

Also if you make requests manually you can set the `priority`.

http://doc.scrapy.org/en/latest/topics/request-response.html#request-objects

Finally you can listen for the `spider-idle` signal in a middleware and manually feed in requests in your own queue.