dupefilter redirects

Michael Mata

Dec 16, 2010, 2:46:51 PM
to scrapy-users
Using the crawl spider, I noticed that if a site redirects to an
already visited page, the page ends up getting crawled again. Has the
dupefilter already worked its magic before we get to this point? Is
it feasible to look at the final page URL before crawling the page?

Pablo Hoffman

Dec 16, 2010, 4:19:31 PM
to scrapy...@googlegroups.com
Hi Michael,

Yes, the dupefilter only checks requests as they leave the spider, so
requests generated by a redirect are never seen by it.

Since the dupefilter and the redirect middleware components are decoupled now,
it would be awkward to implement what you suggest, but nevertheless I think it
would be useful.
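
In the meantime, one possible spider-side workaround is to remember the final
URL of every response and skip any that has already been processed. The sketch
below is plain Python rather than Scrapy code: the FinalUrlFilter class, its
is_duplicate method, and the example.com URLs are all made up for illustration.
It assumes the final URL is available once the response arrives (in a Scrapy
callback that would be response.url). It does not save the download itself,
only the repeated processing of the page.

```python
class FinalUrlFilter:
    """Hypothetical helper: remembers the final (post-redirect) URL of each response."""

    def __init__(self):
        self._seen = set()

    def is_duplicate(self, final_url):
        # Drop the fragment so "page#a" and "page#b" count as the same page.
        key = final_url.split("#", 1)[0]
        if key in self._seen:
            return True
        self._seen.add(key)
        return False


final_urls = FinalUrlFilter()

# Made-up crawl: the third response is a redirect that landed on /a,
# which had already been visited directly.
responses = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/a",
]

# Keep only responses whose final URL has not been processed before.
processed = [url for url in responses if not final_urls.is_duplicate(url)]
```

In a spider, the check would probably sit at the top of the callback, returning
early when is_duplicate(response.url) is true.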

Could you create a ticket in the issue tracker so we don't forget about this?

Thanks!

Michael Mata

Dec 16, 2010, 5:00:24 PM
to scrapy-users
Sure. I created ticket #299 to track this enhancement issue.