Scrapy crawls and scrapes duplicate URLs, despite unique and dont_filter settings


Akash jain

Dec 31, 2014, 1:05:13 AM
to scrapy...@googlegroups.com
I have a Scrapy spider that keeps visiting and scraping a single link. It never stops; it just keeps scraping the same link over and over, which shouldn't happen, because the documentation says Scrapy has a duplicate-URL filter built in.
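
For context, here is a minimal sketch of how that built-in filter is supposed to treat two requests for the same URL, assuming the 0.24-era module layout (r1/r2 are throwaway names, not part of my spider):

    from scrapy.http import Request
    from scrapy.utils.request import request_fingerprint

    r1 = Request("http://www.datacaredubai.com/aj/a.html")
    r2 = Request("http://www.datacaredubai.com/aj/a.html")

    # Same URL, method and body produce the same fingerprint, so the
    # scheduler should drop the second request unless dont_filter=True
    # is set on it.
    print(request_fingerprint(r1) == request_fingerprint(r2))  # True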

When I checked scrapy/spider.py, I could see that dont_filter was set to True in make_requests_from_url, so I changed it to False, but it didn't help.

    def make_requests_from_url(self, url):
        return Request(url, dont_filter=False)

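For what it's worth, the same change can be made without editing Scrapy's own source, by overriding make_requests_from_url in the spider itself. A sketch, using a placeholder spider name (MySpider) rather than my real one:

    from scrapy.contrib.spiders import CrawlSpider
    from scrapy.http import Request

    class MySpider(CrawlSpider):
        name = "my_spider"
        start_urls = ["http://www.datacaredubai.com/aj/link.html"]

        def make_requests_from_url(self, url):
            # Requests built from start_urls now go through the scheduler's
            # duplicate filter, the same effect as editing scrapy/spider.py.
            return Request(url, dont_filter=False)
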
My code is as follows. Where could I be going wrong? The start_urls page has only one link, to a page a.html, and the spider keeps scraping a.html recursively.
================================

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from kt.items import DmozItem


class DmozSpider(CrawlSpider):

    name = "dmoz"
    allowed_domains = ["datacaredubai.com"]
    start_urls = ["http://www.datacaredubai.com/aj/link.html"]

    rules = (
    Rule(SgmlLinkExtractor(allow=('/aj'), unique=('Yes')), callback='parse_item'),
    )

    def parse_item(self, response):
        sel = Selector(response)
        sites = sel.xpath('//*')
        items = []
        for site in sites:
            item = DmozItem()
            item['overview'] = site.xpath('//*[@id="overview"]/div/div[1]/div/div/div/dl[1]/dd').extract()
            item['specs'] = site.xpath('//*[@id="specs"]/div/div[1]/div/div/dl/dd[1]').extract()
            item['title']= site.xpath('/html/head/meta[3]').extract()
            item['full']= site.xpath('//*[@id="overview"]//dd').extract()
            item['req_url']= response.url

            items.append(item)
        return items