Recursively Scraping with Scrapy


lass Nah

Jul 9, 2013, 5:54:51 AM7/9/13
to scrapy...@googlegroups.com
Hi guys, I have a technical problem in my code (scraping data from a wine web site): it couldn't crawl the next pages. Please help:

************************the items: items.py*********************************************

from scrapy.item import Item, Field

# the class name must match the one imported in the spider
class Projetvinnicolas1Item(Item):
    # define the fields for your item here like:
    nomVin = Field()
    appelation = Field()
    millesime = Field()
    prix = Field()

************************the spider: test2.py*********************************************

#!/usr/bin/python
# -*- coding: utf-8 -*-

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from ProjetVinNicolas1.items import Projetvinnicolas1Item

import sys
import codecs

### Kludge to set default encoding to utf-8
reload(sys)
sys.setdefaultencoding('utf-8')

class MySpider(CrawlSpider):
    name = "vino"
    allowed_domains = ["nicolas.com"]
    # the attribute must be named start_urls (plural), or the spider crawls nothing
    start_urls = ["http://www.nicolas.com/fr/commander_bordeaux.html/"]

    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths='//div[@class="glo_pagination_centre"]/a[1]')),
        Rule(SgmlLinkExtractor(allow=r'18_409~\d+~10\.htm'), callback='parse_items', follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//table[@class="cpt_fav_table_commande"]/tr[position()>1]')
        items = []
        for row in rows:
            item = Projetvinnicolas1Item()
            item["nomVin"] = row.select('td[3]/a/text()').extract()
            item["appelation"] = row.select('td[5]/text()').extract()
            item["millesime"] = row.select('td[7]/text()').extract()
            item["prix"] = row.select('td[9]/b/text()').extract()
            items.append(item)
        return items


Paul Tremberth

Jul 9, 2013, 7:01:54 AM7/9/13
to scrapy...@googlegroups.com
Hi
In your sample page, the links to the next pages are handled by a little bit of JavaScript:
...
<a href="javascript:val('/page.php/fr/18_409~80~10.htm')" class="glo_pagination_link_page">9</a> -
...

so you'll need to add a processing function to your SgmlLinkExtractor (via its process_value argument) to strip the javascript:val('...') wrapper and keep only the URL inside.
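A minimal sketch of such a process_value function (the regex and the name strip_js are illustrative, not from the thread; Paul's actual version is in the gist linked below):

```python
import re

def strip_js(value):
    # Pull the real URL out of javascript:val('/page.php/fr/18_409~80~10.htm')
    m = re.search(r"javascript:val\('(.+?)'\)", value)
    if m:
        return m.group(1)
    # plain URLs pass through unchanged
    return value

# It would then be plugged into the pagination rule, e.g.:
# Rule(SgmlLinkExtractor(restrict_xpaths='//div[@class="glo_pagination_centre"]',
#                        process_value=strip_js))
```

SgmlLinkExtractor calls the function once per extracted link value; returning None would drop the link entirely.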

Paul Tremberth

Jul 10, 2013, 6:49:28 AM7/10/13
to scrapy...@googlegroups.com
Working spider with process_value example @ https://gist.github.com/redapple/5965317#file-nicolas_spider-py