I just wanted to let you know that the persistent scheduler has finally landed
in trunk (thanks to Daniel and Shane for their useful feedback while reviewing the patch).
Here's the commit, for the curious: http://dev.scrapy.org/changeset/2737
If you're interested in this functionality, we'd like to hear your feedback.
To use it, just set the JOBDIR setting to the directory where the crawl state
should be persisted. If you don't set it, only in-memory queues will be used
(as has been the case so far).
For example:
scrapy crawl somespider --set JOBDIR=crawl1
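You can also put the setting in your project's settings.py instead of passing
it on the command line (a minimal sketch; the directory name is just an
example):

# settings.py
JOBDIR = 'crawl1'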
Then you can stop the crawl at any time by issuing a warm shutdown (for
example, by hitting Ctrl-C once) and waiting until the in-progress requests
finish (this is important: don't hit Ctrl-C a second time, as that forces a
cold shutdown).
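If the crawl is running in the background rather than in a terminal, you can
trigger the same warm shutdown by sending SIGINT to the process, which is what
a single Ctrl-C does (the pid placeholder below is yours to fill in):

kill -INT <scrapy_pid>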
You can then resume the crawl later by running the exact same command (i.e.
passing the same JOBDIR).
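So a full pause-and-resume session would look like this (same spider and
JOBDIR as in the example above):

scrapy crawl somespider --set JOBDIR=crawl1
# hit Ctrl-C once, wait for the in-progress requests to finish
scrapy crawl somespider --set JOBDIR=crawl1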
Looking forward to hearing your feedback,
Pablo.