How to enable the Scrapy's duplicate urls filter for start_urls?


Antoine Brunel

May 2, 2016, 4:04:34 PM5/2/16
to scrapy-users
Hello,

I found out that Scrapy's duplicate URL filter, RFPDupeFilter, is disabled for URLs set in start_urls.
How can I enable it?

Thanks!

张昊

May 3, 2016, 5:44:08 AM5/3/16
to scrapy-users
It will not work as you expect. You have to remove the duplicate URLs in your own code; a simple "in" check against a set of seen URLs can help. RFPDupeFilter only applies to URLs visited during the crawl.
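The manual deduplication suggested above can be sketched like this; the URL list is a hypothetical example, and the "in" check is done against a set of already-seen URLs:

```python
# Hypothetical start_urls list containing a duplicate.
start_urls = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/a",  # duplicate
]

seen = set()
unique_urls = []
for url in start_urls:
    if url not in seen:  # the "in" check mentioned above
        seen.add(url)
        unique_urls.append(url)

print(unique_urls)
```

A set keeps the membership test fast even for long URL lists, while the separate list preserves the original order.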

On Tuesday, May 3, 2016 at 4:04:34 AM UTC+8, Antoine Brunel wrote:

Paul Tremberth

May 3, 2016, 7:21:43 AM5/3/16
to scrapy-users
Hi Antoine,

You can override the start_requests method of your spider.
The default implementation is this (it explicitly disables filtering via dont_filter=True):

    def start_requests(self):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def make_requests_from_url(self, url):
        # dont_filter=True bypasses the RFPDupeFilter
        return Request(url, dont_filter=True)

To enable filtering, override start_requests so it yields plain Requests (dont_filter defaults to False):

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url)


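The difference between the two overrides can be illustrated with a self-contained sketch. The Request and DupeFilter classes below are simplified stand-ins, not Scrapy's real classes: the actual RFPDupeFilter fingerprints whole requests (method, URL, body), while this sketch hashes only the URL.

```python
from hashlib import sha1

class Request:
    """Stand-in for scrapy.Request with just the fields we need."""
    def __init__(self, url, dont_filter=False):
        self.url = url
        self.dont_filter = dont_filter

class DupeFilter:
    """Simplified RFPDupeFilter-style filter keyed on a URL hash."""
    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, request):
        fp = sha1(request.url.encode()).hexdigest()  # simplified fingerprint
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False

def schedule(requests):
    """Return the URLs that would actually be scheduled."""
    dupefilter = DupeFilter()
    scheduled = []
    for req in requests:
        # dont_filter=True bypasses the dupe filter entirely, which is
        # what the default make_requests_from_url does for start_urls.
        if req.dont_filter or not dupefilter.request_seen(req):
            scheduled.append(req.url)
    return scheduled

urls = ["http://example.com/a", "http://example.com/a"]
# With dont_filter=True (the start_urls default), both requests go through:
print(schedule([Request(u, dont_filter=True) for u in urls]))
# With plain Request(url), the duplicate is dropped:
print(schedule([Request(u) for u in urls]))
```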

Regards,
Paul.

Antoine Brunel

May 4, 2016, 10:15:24 AM5/4/16
to scrapy-users
It works perfectly, Paul, thank you very much!