I have a Scrapy spider that keeps visiting and scraping a single link. It never stops; it just keeps scraping the same page over and over, which shouldn't happen, since the documentation says a duplicate-URL filter is already built in.
When I checked scrapy/spider.py I could see that dont_filter was set to True in make_requests_from_url, so I changed it to False, but that didn't help:
def make_requests_from_url(self, url):
    return Request(url, dont_filter=False)
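My understanding (and I may be reading the source wrong, so treat the snippet below as a sketch of that understanding, not code from my project) is that the scheduler's dedup filter (RFPDupeFilter) drops any request whose URL fingerprint it has already seen, unless the request was created with dont_filter=True. So a toy spider like the following, which keeps re-requesting its own page with the default dont_filter=False, should terminate on its own once the filter starts discarding the repeats:

from scrapy.spider import Spider
from scrapy.http import Request

class DedupSketchSpider(Spider):
    # Toy example, not my real spider: parse() keeps re-requesting the
    # same URL, but because dont_filter defaults to False the scheduler's
    # duplicate filter should start dropping the repeats, so the crawl
    # finishes instead of looping forever.
    name = "dedup_sketch"
    start_urls = ["http://www.datacaredubai.com/aj/link.html"]

    def parse(self, response):
        # Same URL again; expected to be discarded as a duplicate once
        # its fingerprint has been recorded by the filter.
        yield Request(response.url, callback=self.parse)

That is the behaviour I expected from my CrawlSpider as well, but it never stops.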
My code is below. Where could I be going wrong? The start URL contains only one link, to a page a.html, and the spider keeps scraping a.html recursively.
================================
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from kt.items import DmozItem


class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["datacaredubai.com"]
    start_urls = ["http://www.datacaredubai.com/aj/link.html"]

    rules = (
        # follow any link whose URL contains '/aj' and hand the response to parse_item
        Rule(SgmlLinkExtractor(allow=('/aj',), unique=True), callback='parse_item'),
    )

    def parse_item(self, response):
        sel = Selector(response)
        sites = sel.xpath('//*')
        items = []
        for site in sites:
            # build one item per matched element
            item = DmozItem()
            item['overview'] = site.xpath('//*[@id="overview"]/div/div[1]/div/div/div/dl[1]/dd').extract()
            item['specs'] = site.xpath('//*[@id="specs"]/div/div[1]/div/div/dl/dd[1]').extract()
            item['title'] = site.xpath('/html/head/meta[3]').extract()
            item['full'] = site.xpath('//*[@id="overview"]//dd').extract()
            item['req_url'] = response.url
            items.append(item)
        return items
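For debugging I can at least cap the crawl with the standard settings below, but that only hides the problem; I'd still like to know why the duplicate filter isn't stopping the re-scraping (these are stock Scrapy settings, nothing from my project):

# settings.py -- debugging sketch only, not a fix
DEPTH_LIMIT = 2              # stop following links more than two hops from the start URL
CLOSESPIDER_PAGECOUNT = 50   # hard stop after 50 responses so the runaway crawl can't go on forever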