As you know, one of the main components of Scrapy is the Scheduler; it
is in charge of determining the order in which Scrapy must download requests.
But one of the features we introduced to it long ago, and came to
regret, was the built-in "request filtering" functionality.
A few minutes ago, duplicate filtering was stripped out of the scheduler
component and reimplemented as a spider middleware.
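For those curious how this kind of middleware works in principle, here is a minimal, self-contained sketch (not Scrapy's actual implementation, and the class/method names here are illustrative only): the middleware keeps a set of fingerprints for requests it has already let through, and silently drops any repeat.

```python
# Illustrative sketch of duplicate filtering as a spider middleware.
# Real Scrapy fingerprints the full request; here we just use the URL.

class DuplicatesFilterSketch:
    """Drop requests whose fingerprint (here: URL) was already seen."""

    def __init__(self):
        self.seen = set()

    def process_spider_output(self, requests):
        # Yield only first occurrences; repeats are filtered out.
        for url in requests:
            if url not in self.seen:
                self.seen.add(url)
                yield url

f = DuplicatesFilterSketch()
result = list(f.process_spider_output(
    ["http://a.example/", "http://b.example/", "http://a.example/"]))
print(result)
```

The second request for `http://a.example/` never reaches the scheduler, which is exactly the behavior the scheduler used to implement internally.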
New projects come with this middleware enabled by default, but projects
created with a revision prior to r845 don't.
To enable it, add the new spider middleware path to your project's
SPIDER_MIDDLEWARES setting. I recommend adding it at the bottom of the
list (the closer to the spider the better), but it is up to you:
SPIDER_MIDDLEWARES = (
    # Engine side
    ...,
    ...,
    'scrapy.contrib.spidermiddleware.duplicatesfilter.DuplicatesFilterMiddleware',
    # Spider side
)
Also take a look at the commit comment for more details:
http://dev.scrapy.org/ticket/49#comment:3
Thanks!
dan