loop on scrapy


yohann...@hotmail.fr

Jun 25, 2013, 4:38:36 AM6/25/13
to scrapy...@googlegroups.com

Hello everybody,

I'm new to Scrapy and I've run into a small problem using it.

My goal is the following: I'm doing data mining for a French company (I'm French) and I would like to enrich my data by scraping some information from these pages:

http://www.capgeris.com/hebergements-personnes-agees-1402/-a1.htm, http://www.capgeris.com/hebergements-personnes-agees-1402/-a2.htm ... etc.

Each time it's the same information (name and host capacity).

Here is my code:


from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from tutorial.items import DmozItem

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["capgeris.com"]
    start_urls = [
        "http://www.capgeris.com/hebergements-personnes-agees-1402/"
    ]
    rules = [Rule(SgmlLinkExtractor(allow=['-a\d+.htm']), 'parse_torrent')]

    def parse(self, response):
        x = HtmlXPathSelector(response)
        items = []
        item = DmozItem()
        item['type_etab'] = x.select('//div[@class="detretraite"]/ul[1]/li').extract()
        item['nb_places'] = x.select('//div[@class="detretraite"]/ul[2]/li').extract()
        items.append(item)
        return items


In fact, I can't manage to write the kind of loop that would cover each of the pages http://www.capgeris.com/hebergements-personnes-agees-1402/-aINTEGER.htm with INTEGER from 1 to somewhere between 9000 and 10000.

That is the problem with this script, because the spider works if I give it just one page to scrape.

Thanks for the help!

Prannoy Pilligundla

Jun 25, 2013, 5:08:29 AM6/25/13
to scrapy...@googlegroups.com
To scrape all those pages you need to include all those links in start_urls.
One approach:
Write all those links to a CSV file and then read it from the script.

import csv

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    cr = csv.reader(open("Write path to ur csv file", "rb"))
    start_urls = []
    for row in cr:
        start_urls.append(row[0])  # assuming all links are in the first column
    rules = [Rule(SgmlLinkExtractor(restrict_xpaths='//div[@class="box box-shadow link-box"]'), 'parse_torrent')]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="detretraite"]/ul[2]/li')
        items = []
        for site in sites:
            item = BasicsItem()
            item['type_etab'] = site.select('text()').extract()
            item['nb_places'] = site.select('@href').extract()
            items.append(item)
        return items
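Since the page URLs follow the -aINTEGER.htm pattern described in the question, the start_urls list could also be generated directly instead of being read from a CSV file. A minimal sketch, assuming the upper bound of roughly 9000 pages mentioned in the original post:

```python
# Build start_urls from the numeric pattern -a1.htm ... -a9000.htm
# (the 9000 upper bound comes from the original question and is an assumption)
BASE = "http://www.capgeris.com/hebergements-personnes-agees-1402/-a%d.htm"
start_urls = [BASE % i for i in range(1, 9001)]
```

This avoids maintaining a separate file, at the cost of hard-coding the page count.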

Hope this helps

Cheers
Prannoy Pilligundla




Paul Tremberth

Jun 25, 2013, 6:01:41 AM6/25/13
to scrapy...@googlegroups.com
Hello (I'm French too),
In a CrawlSpider you should not override the parse() method.
Change your parse to parse_item or something similar and you should get the individual links
(you set 'parse_torrent' as the callback in your Rule but never defined that method).

The page you gave as start_url also has next-page links, but those links are generated by a tiny bit of JavaScript,
so you'll probably need to create Requests for them in addition to what SgmlLinkExtractor detects.

Regards,
Paul.

Paul Tremberth

Jun 25, 2013, 6:03:21 AM6/25/13
to scrapy...@googlegroups.com
For the JavaScript links at the bottom of the page,
you can use the process_value argument of SgmlLinkExtractor.
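A minimal sketch of such a process_value callback, assuming the JavaScript hrefs wrap the real URL in single quotes (the exact format used on the page is an assumption and would need checking in the page source):

```python
import re

def extract_js_url(value):
    """process_value callback: pull the real URL out of a javascript: href.

    The "javascript:...'URL'" shape assumed here is hypothetical; inspect
    the actual links on the page before relying on it. Plain URLs are
    passed through unchanged.
    """
    m = re.search(r"javascript:.*?'(.+?)'", value)
    if m:
        return m.group(1)
    return value

# It would then be handed to the link extractor, e.g.:
# SgmlLinkExtractor(process_value=extract_js_url)
```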

Paul Tremberth

Jun 25, 2013, 6:12:24 AM6/25/13
to scrapy...@googlegroups.com
Well, it's even simpler: only the "Suivant" ("Next") link seems to be generated by JavaScript.
So you could just follow the links in the TD element with class "pgs" (which contains links to pages 1, 2, 3... 10, 100, 200...)
with Rules like this:

    rules = (
        Rule(SgmlLinkExtractor(allow=('a\d+\.htm$',)), callback='parse_item'),
        Rule(SgmlLinkExtractor(restrict_xpaths='//td[@class="pgs"]')),
    )
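The allow argument in the first Rule is a plain regular expression matched against each extracted URL, so it can be checked on its own. A quick sanity check of which URLs the pattern accepts:

```python
import re

# same pattern as in allow=('a\d+\.htm$',)
page_url = re.compile(r'a\d+\.htm$')

# a numbered listing page matches; the section index page does not
matches = bool(page_url.search('http://www.capgeris.com/hebergements-personnes-agees-1402/-a2.htm'))
no_match = bool(page_url.search('http://www.capgeris.com/hebergements-personnes-agees-1402/'))
```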

yohann...@hotmail.fr

Jun 26, 2013, 3:52:42 AM6/26/13
to scrapy...@googlegroups.com

Hi Prannoy and Paul.

I finally managed to scrape my data yesterday afternoon (one of my classmates helped me, in fact).

Thanks Prannoy for your idea, but I think Paul's approach was a better way to solve my problem. In any case, I didn't know how to write all my links to a file.

For Paul, thanks a lot, it was exactly that. It took me some time to understand what you were saying because I was going in the wrong direction, trying to loop over all the URLs instead of following the links from the main page.

Your code was perfectly right too.

The result is really good because my spider never visits the same URL twice (which I wasn't sure about), and it actually managed to scrape every single piece of information I wanted about all the institutions in the website's database (this is crazy).

So now I have to work on my text file to add this information to my dataset.

Thanks again,
Yohann

Paul Tremberth

Jun 26, 2013, 4:01:57 AM6/26/13
to scrapy...@googlegroups.com
Hi Yohann,
You're welcome, I had tested the crawler a bit before posting the code ;)

Yeah, Scrapy is crazy good, and fast and smart!
So you don't scrape URLs twice.

Glad I could help.
Good luck with the rest,
Paul.

yohann...@hotmail.fr

Jul 1, 2013, 8:29:39 AM7/1/13
to scrapy...@googlegroups.com

For information, here is the code that works:


from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from tutorial.items import EhpadItem


class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["capgeris.com"]
    start_urls = [
        "http://www.capgeris.com/hebergements-personnes-agees-1402"
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=['a\d+\.htm$']), 'parse_item'),
        Rule(SgmlLinkExtractor(restrict_xpaths='//td[@class="pgs"]')),
    ]

    def parse_item(self, response):
        #filename = response.url.split("/")[-2]
        #open(filename, 'wb').write(response.body)

        x = HtmlXPathSelector(response)
        items = []
        item = EhpadItem()

        item['type_etab'] = x.select('//div[@class="detretraite"]/ul[1]/li').extract()
        item['nb_places'] = x.select('//div[@class="detretraite"]/ul[2]/li').extract()
        item['nom'] = x.select('//title/text()').extract()
        items.append(item)
        return items
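The EhpadItem imported from tutorial.items is never shown in the thread; a minimal definition matching the three fields used above would look something like this (a sketch only, the actual field list in Yohann's project may differ):

```python
from scrapy.item import Item, Field

class EhpadItem(Item):
    type_etab = Field()  # establishment type
    nb_places = Field()  # host capacity
    nom = Field()        # name, taken from the page <title>
```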
 