How to enable the Scrapy's duplicate urls filter for start_urls?


Antoine Brunel

May 2, 2016, 4:04:34 PM5/2/16
to scrapy-users
Hello,

I found out that Scrapy's duplicate URL filter, RFPDupeFilter, is disabled for URLs set in start_urls.
How can I enable it?

Thanks!

张昊

May 3, 2016, 5:44:08 AM5/3/16
to scrapy-users
It will not work as you expect. You have to remove the duplicate URLs in your own code; a simple "in" check against a set of seen URLs can help. RFPDupeFilter only applies to URLs visited during the crawl.
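The manual deduplication suggested above can be sketched like this; the URL list is a hypothetical example, and the "in" check is done against a set of already-seen URLs:

```python
# Hypothetical start_urls list containing a duplicate.
start_urls = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/a",  # duplicate
]

seen = set()
unique_urls = []
for url in start_urls:
    if url not in seen:  # the "in" check mentioned above
        seen.add(url)
        unique_urls.append(url)

print(unique_urls)
```

A set keeps the membership test fast even for long URL lists, while the separate list preserves the original order.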

On Tuesday, May 3, 2016 at 4:04:34 AM UTC+8, Antoine Brunel wrote:

Paul Tremberth

May 3, 2016, 7:21:43 AM5/3/16
to scrapy-users
Hi Antoine,

You can override the start_requests method of your spider.
The default implementation is this (it explicitly disables filtering via dont_filter=True):

    def start_requests(self):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def make_requests_from_url(self, url):
        # dont_filter=True bypasses the RFPDupeFilter
        return Request(url, dont_filter=True)

To enable filtering, override start_requests so it yields plain Requests (dont_filter defaults to False):

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url)


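The difference between the two overrides can be illustrated with a self-contained sketch. The Request and DupeFilter classes below are simplified stand-ins, not Scrapy's real classes: the actual RFPDupeFilter fingerprints whole requests (method, URL, body), while this sketch hashes only the URL.

```python
from hashlib import sha1

class Request:
    """Stand-in for scrapy.Request with just the fields we need."""
    def __init__(self, url, dont_filter=False):
        self.url = url
        self.dont_filter = dont_filter

class DupeFilter:
    """Simplified RFPDupeFilter-style filter keyed on a URL hash."""
    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, request):
        fp = sha1(request.url.encode()).hexdigest()  # simplified fingerprint
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False

def schedule(requests):
    """Return the URLs that would actually be scheduled."""
    dupefilter = DupeFilter()
    scheduled = []
    for req in requests:
        # dont_filter=True bypasses the dupe filter entirely, which is
        # what the default make_requests_from_url does for start_urls.
        if req.dont_filter or not dupefilter.request_seen(req):
            scheduled.append(req.url)
    return scheduled

urls = ["http://example.com/a", "http://example.com/a"]
# With dont_filter=True (the start_urls default), both requests go through:
print(schedule([Request(u, dont_filter=True) for u in urls]))
# With plain Request(url), the duplicate is dropped:
print(schedule([Request(u) for u in urls]))
```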

Regards,
Paul.

Antoine Brunel

May 4, 2016, 10:15:24 AM5/4/16
to scrapy-users
It works perfectly, Paul, thank you very much!