loop on scrapy


yohann...@hotmail.fr

Jun 25, 2013, 4:38:36 AM6/25/13
to scrapy...@googlegroups.com

Hello everybody,

I'm new to Scrapy and I've run into a small problem using it.

My goal is the following: I'm doing data mining for a French company (I'm French) and I would like to enrich my data by scraping some information from these pages:

http://www.capgeris.com/hebergements-personnes-agees-1402/-a1.htm, http://www.capgeris.com/hebergements-personnes-agees-1402/-a2.htm ... etc.

Each time it's the same information (name and host capacity).

Here is my code:


from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from tutorial.items import DmozItem

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["capgeris.com"]
    start_urls = [
        "http://www.capgeris.com/hebergements-personnes-agees-1402/"
    ]
    rules = [Rule(SgmlLinkExtractor(allow=['-a\d+.htm']), 'parse_torrent')]

    def parse(self, response):
        x = HtmlXPathSelector(response)
        items = []
        item = DmozItem()
        item['type_etab'] = x.select('//div[@class="detretraite"]/ul[1]/li').extract()
        item['nb_places'] = x.select('//div[@class="detretraite"]/ul[2]/li').extract()
        items.append(item)
        return items


In fact, I can't manage to write the kind of loop that would cover each of the pages http://www.capgeris.com/hebergements-personnes-agees-1402/-aINTEGER.htm with INTEGER from 1 to somewhere between 9000 and 10000.

That is the problem with this script, because the spider works if I give it just one page to scrape.

Thanks for the help!

Prannoy Pilligundla

Jun 25, 2013, 5:08:29 AM6/25/13
to scrapy...@googlegroups.com
To scrape all those pages you need to include all those links in start_urls.
One approach:
Write all those links to a CSV file and then read it from the script.

import csv

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    cr = csv.reader(open("Write path to ur csv file", "rb"))
    start_urls = []
    for row in cr:
        start_urls.append(row[0])  # assuming all links are in the first column
    rules = [Rule(SgmlLinkExtractor(restrict_xpaths='//div[@class="box box-shadow link-box"]'), 'parse_torrent')]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="detretraite"]/ul[2]/li')
        items = []
        for site in sites:
            item = BasicsItem()
            item['type_etab'] = site.select('text()').extract()
            item['nb_places'] = site.select('@href').extract()
            items.append(item)
        return items
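Since the page URLs follow the -aINTEGER.htm pattern described in the question, the start_urls list could also be generated directly instead of being read from a CSV file. A minimal sketch, assuming the upper bound of roughly 9000 pages mentioned in the original post:

```python
# Build start_urls from the numeric pattern -a1.htm ... -a9000.htm
# (the 9000 upper bound comes from the original question and is an assumption)
BASE = "http://www.capgeris.com/hebergements-personnes-agees-1402/-a%d.htm"
start_urls = [BASE % i for i in range(1, 9001)]
```

This avoids maintaining a separate file, at the cost of hard-coding the page count.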

Hope this helps

Cheers
Prannoy Pilligundla




Paul Tremberth

Jun 25, 2013, 6:01:41 AM6/25/13
to scrapy...@googlegroups.com
Hello (I'm French too),
In a CrawlSpider you should not override the parse() method.
Change your parse to parse_item or something similar and you should get the individual links
(you set 'parse_torrent' as the callback in your Rule but never defined that method).

The page you gave as start_url also has next-page links, but those links are generated by a tiny bit of JavaScript,
so you'll probably need to create Requests for them in addition to what SgmlLinkExtractor detects.

Regards,
Paul.

Paul Tremberth

Jun 25, 2013, 6:03:21 AM6/25/13
to scrapy...@googlegroups.com
For the JavaScript links at the bottom of the page,
you can use the process_value argument of SgmlLinkExtractor.
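A minimal sketch of such a process_value callback, assuming the JavaScript hrefs wrap the real URL in single quotes (the exact format used on the page is an assumption and would need checking in the page source):

```python
import re

def extract_js_url(value):
    """process_value callback: pull the real URL out of a javascript: href.

    The "javascript:...'URL'" shape assumed here is hypothetical; inspect
    the actual links on the page before relying on it. Plain URLs are
    passed through unchanged.
    """
    m = re.search(r"javascript:.*?'(.+?)'", value)
    if m:
        return m.group(1)
    return value

# It would then be handed to the link extractor, e.g.:
# SgmlLinkExtractor(process_value=extract_js_url)
```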

Paul Tremberth

Jun 25, 2013, 6:12:24 AM6/25/13
to scrapy...@googlegroups.com
Well, it's even simpler: only the "Suivant" ("Next") link seems to be generated by JavaScript.
So you could just follow the links in the TD element with class "pgs" (which contains links to pages 1, 2, 3... 10, 100, 200...)
with Rules like this:

    rules = (
        Rule(SgmlLinkExtractor(allow=('a\d+\.htm$',)), callback='parse_item'),
        Rule(SgmlLinkExtractor(restrict_xpaths='//td[@class="pgs"]')),
    )
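The allow argument in the first Rule is a plain regular expression matched against each extracted URL, so it can be checked on its own. A quick sanity check of which URLs the pattern accepts:

```python
import re

# same pattern as in allow=('a\d+\.htm$',)
page_url = re.compile(r'a\d+\.htm$')

# a numbered listing page matches; the section index page does not
matches = bool(page_url.search('http://www.capgeris.com/hebergements-personnes-agees-1402/-a2.htm'))
no_match = bool(page_url.search('http://www.capgeris.com/hebergements-personnes-agees-1402/'))
```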

yohann...@hotmail.fr

Jun 26, 2013, 3:52:42 AM6/26/13
to scrapy...@googlegroups.com

Hi Prannoy and Paul.

I finally managed to scrape my data yesterday afternoon (one of my classmates helped me, in fact).

Thanks Prannoy for your idea, but I think Paul's approach was a better way to solve my problem. In any case, I didn't know how to write all my links to a file.

For Paul, thanks a lot, it was exactly that. It took me some time to understand what you were saying because I was going in the wrong direction, trying to loop over all the URLs instead of following the links from the main page.

Your code was perfectly right too.

The result is really good because my spider never visits the same URL twice (which I wasn't sure about), and it actually managed to scrape every single piece of information I wanted about all the institutions in the website's database (this is crazy).

So now I have to work on my text file to add this information to my dataset.

Thanks again,
Yohann

Paul Tremberth

Jun 26, 2013, 4:01:57 AM6/26/13
to scrapy...@googlegroups.com
Hi Yohann,
You're welcome, I had tested the crawler a bit before posting the code ;)

Yeah, Scrapy is crazy good, and fast and smart!
So you don't scrape URLs twice.

Glad I could help.
Good luck with the rest,
Paul.

yohann...@hotmail.fr

Jul 1, 2013, 8:29:39 AM7/1/13
to scrapy...@googlegroups.com

For information, here is the code that works:


from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from tutorial.items import EhpadItem


class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["capgeris.com"]
    start_urls = [
        "http://www.capgeris.com/hebergements-personnes-agees-1402"
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=['a\d+\.htm$']), 'parse_item'),
        Rule(SgmlLinkExtractor(restrict_xpaths='//td[@class="pgs"]')),
    ]

    def parse_item(self, response):
        #filename = response.url.split("/")[-2]
        #open(filename, 'wb').write(response.body)

        x = HtmlXPathSelector(response)
        items = []
        item = EhpadItem()

        item['type_etab'] = x.select('//div[@class="detretraite"]/ul[1]/li').extract()
        item['nb_places'] = x.select('//div[@class="detretraite"]/ul[2]/li').extract()
        item['nom'] = x.select('//title/text()').extract()
        items.append(item)
        return items
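The EhpadItem imported from tutorial.items is never shown in the thread; a minimal definition matching the three fields used above would look something like this (a sketch only, the actual field list in Yohann's project may differ):

```python
from scrapy.item import Item, Field

class EhpadItem(Item):
    type_etab = Field()  # establishment type
    nb_places = Field()  # host capacity
    nom = Field()        # name, taken from the page <title>
```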
 