xyzgrid
Aug 4, 2011, 1:06:40 PM
to scrapy...@googlegroups.com
Hi all,
I created a project to crawl one site. To avoid revisiting content pages, I enabled the DuplicatesFilterMiddleware:
    SCHEDULER_MIDDLEWARES = {
        'scrapy.contrib.schedulermiddleware.duplicatesfilter.DuplicatesFilterMiddleware': 2,
    }
DUPEFILTER_CLASS = "c1.serialDupefilter.RequestFingerprintDupeFilter"
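(For reference, DEDUP_FILE holds one fingerprint per line; these are the SHA1 hex digests that scrapy.utils.request.request_fingerprint returns. A quick illustration, with a made-up URL:)

    from scrapy.http import Request
    from scrapy.utils.request import request_fingerprint

    # example.com is only a placeholder to show the format:
    # request_fingerprint() returns a 40-character SHA1 hex string,
    # and DEDUP_FILE stores one such string per line
    fp = request_fingerprint(Request("http://www.example.com/index.html"))
    print fp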
my RequestFingerprintDupeFilter (in c1/serialDupefilter.py, subclassing the stock filter):

    from scrapy import log
    from scrapy.conf import settings
    from scrapy.contrib import dupefilter

    class RequestFingerprintDupeFilter(dupefilter.RequestFingerprintDupeFilter):
        def open_spider(self, spider):
            # self.fingerprints is the per-spider dict the base class creates
            self.fingerprints[spider] = set()
            # preload the fingerprints saved by an earlier run
            dedup_file = open(settings.get('DEDUP_FILE'), "r")
            log.msg("read %s" % settings.get('DEDUP_FILE'))
            for line in dedup_file:
                # strip the trailing newline, otherwise the stored
                # fingerprints can never match the computed ones
                fp = line.strip()
                log.msg("We have seen %s" % fp, level=log.WARNING)
                self.fingerprints[spider].add(fp)
            dedup_file.close()
            log.msg("read %s finished" % settings.get('DEDUP_FILE'))
my spider (ArgiGovCnSpider.py):
    import re

    from scrapy.http import Request
    from scrapy.selector import HtmlXPathSelector
    from scrapy.utils.request import request_fingerprint
    from scrapy.utils.url import urljoin_rfc

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for link in hxs.select("//a[string-length(@href)>10]"):
            # the site emits its anchor text from a <script> block,
            # so recover it from the script source with a regex
            anchor = ""
            raw_data = link.select("./script/text()").extract()
            if raw_data:
                m = re.search("<a .*>(.*)</a>", raw_data[0])
                if m is not None:
                    anchor = m.group(1)
            url = urljoin_rfc(response.url, link.select("./@href").extract()[0])
            metas = {'anchor': anchor, 'refer': response.url}
            request = Request(url, callback=self.parse_doc, meta=metas)
            # metas is the same dict the request now holds, so the
            # fingerprint travels with the request
            metas['fp'] = request_fingerprint(request)
            self.log("link %s,%s" % (url, anchor))
            # dont_filter defaults to False, so the dupefilter applies
            yield request
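(parse_doc itself isn't important here; a stripped-down, hypothetical stand-in just to show what it receives:)

    def parse_doc(self, response):
        # hypothetical placeholder for my real callback: it just logs the
        # anchor text and fingerprint carried in the request meta
        self.log("doc %s anchor=%s fp=%s" %
                 (response.url, response.meta['anchor'], response.meta['fp']))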
But when I run

    scrapy crawl ArgiGovCn

all the new requests are sent and I get results, and only after every one of them has been handled by my parse_doc does the log show RequestFingerprintDupeFilter's open_spider running. Is open_spider called too late to be useful here?