0.13 duplicate middleware not working?


xyzgrid

Aug 4, 2011, 1:06:40 PM
to scrapy...@googlegroups.com
Hi all,
I created a project to crawl one site.
To avoid revisiting content pages, I enabled DuplicatesFilterMiddleware:

SCHEDULER_MIDDLEWARES = {
    'scrapy.contrib.schedulermiddleware.duplicatesfilter.DuplicatesFilterMiddleware': 2,
}

DUPEFILTER_CLASS = "c1.serialDupefilter.RequestFingerprintDupeFilter"

RequestFingerprintDupeFilter (in c1/serialDupefilter.py):


# module-level imports needed by this class
from scrapy import log
from scrapy.conf import settings


    def open_spider(self, spider):
        self.fingerprints[spider] = set()
        dedup_file = settings.get('DEDUP_FILE')
        self.log("read %s" % dedup_file)
        # preload the fingerprints recorded by earlier runs
        input = open(dedup_file, "r")
        for line in input:
            self.log("We have seen %s" % line, log.WARNING)
            # strip the newline, otherwise the stored value can never match
            # a freshly computed fingerprint
            self.fingerprints[spider].add(line.strip())
        input.close()
        self.log("read %s finished" % dedup_file)


My spider (ArgiGovCnSpider.py):


# module-level imports needed by this spider
import re

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.request import request_fingerprint
from scrapy.utils.url import urljoin_rfc


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for link in hxs.select("//a[string-length(@href)>10]"):
            # on this site the anchor text is produced by an inline <script>,
            # so recover it from the generated <a ...>...</a> markup
            anchor = ""
            raw_data = link.select("./script/text()").extract()
            if raw_data:
                m = re.search("<a .*>(.*)</a>", raw_data[0])
                if m is not None:
                    anchor = m.group(1)

            url = urljoin_rfc(response.url, link.select("./@href").extract()[0])
            self.log("link %s,%s" % (url, anchor))

            metas = {}
            metas['anchor'] = anchor
            metas['refer'] = response.url
            # remember the fingerprint of the request we are about to send
            metas['fp'] = request_fingerprint(Request(url, callback=self.parse_doc))

            yield Request(url, callback=self.parse_doc, meta=metas, dont_filter=False)

But when I run
scrapy crawl ArgiGovCn
all the new requests are sent and I get the results. Only then does the log show
RequestFingerprintDupeFilter's open_spider being called, after all the new requests were already handled by my parse_doc. Isn't that too late?

Pablo Hoffman

Aug 5, 2011, 8:53:46 PM
to scrapy...@googlegroups.com
Scheduler middleware was removed in 0.13:
http://doc.scrapy.org/0.13/topics/architecture.html

Duplicate filtering is now done in the scheduler itself:
http://dev.scrapy.org/browser/scrapy/core/scheduler.py#L45

The default dupe filter is the same (based on request fingerprints), but it was
moved to the scrapy.dupefilter module and renamed to RFPDupeFilter:
http://dev.scrapy.org/browser/scrapy/dupefilter.py#L22

The setting used to define the dupe filter class (DUPEFILTER_CLASS) is the same:
http://dev.scrapy.org/browser/scrapy/settings/default_settings.py#L91
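
In other words, on 0.13 you drop the SCHEDULER_MIDDLEWARES entry and keep only DUPEFILTER_CLASS. A minimal sketch of that wiring, assuming a placeholder module c1/dupefilters.py and reusing the DEDUP_FILE idea from the first post (the names below are not from this thread; check the linked dupefilter.py for the exact base-class hooks):

# settings.py: DUPEFILTER_CLASS = 'c1.dupefilters.SeededRFPDupeFilter'

# c1/dupefilters.py
from scrapy.conf import settings
from scrapy.dupefilter import RFPDupeFilter


class SeededRFPDupeFilter(RFPDupeFilter):
    """RFPDupeFilter pre-seeded with fingerprints saved by an earlier run."""

    def __init__(self, *args, **kwargs):
        RFPDupeFilter.__init__(self, *args, **kwargs)
        dedup_file = settings.get('DEDUP_FILE')
        if dedup_file:
            for line in open(dedup_file):
                # self.fingerprints is the set the base filter consults in
                # request_seen(), which the scheduler now calls directly
                self.fingerprints.add(line.strip())

With that in place the spider keeps yielding requests normally; any request whose fingerprint is already in the set is dropped by the scheduler, unless the request was created with dont_filter=True.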


alex

Aug 6, 2011, 10:09:12 PM
to scrapy-users
Thanks, Pablo.
Maybe I didn't state my question clearly. My filter does get called by
Scrapy, but only after my spider's parse function (which parses a page and
yields some page requests). All the requests had already finished by the time
I saw my filter log "beginning to load request fingerprints". That is too
late.

xyzgrid

Aug 8, 2011, 11:10:20 PM
to scrapy-users
I upgraded my 0.13 install from scrapy-384a63fb1abb to
scrapy-45e237ef5d4e, and it works!

PS: I had to change the shebang in my /usr/local/bin/scrapy
from #!/usr/bin/python
to #!/bin/env python
Maybe this helps someone else.

Thanks again, Pablo!


