Persistent scheduler (to pause and resume crawls)

Pablo Hoffman

Aug 2, 2011, 11:19:26 AM
to Scrapy Users
Hi guys,

I just wanted to let you know that the persistent scheduler has finally landed
in trunk (thanks to Daniel and Shane for their useful feedback reviewing the patch).

Here's the commit, for the curious: http://dev.scrapy.org/changeset/2737

If you're interested in this functionality, we'd like to hear your feedback.

To use it, you just have to set the JOBDIR setting, which points to a
directory where the crawl state will be persisted. If you don't set it, only
in-memory queues will be used (as has been the case so far).

For example:

scrapy crawl somespider --set JOBDIR=crawl1
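
If you'd rather not pass it on the command line every time, you can also put
JOBDIR in your project's settings.py. A minimal sketch (the directory name is
just an example):

# settings.py
# Persist the crawl state to this directory (example path)
JOBDIR = 'crawls/somespider-1'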

Then you can stop the crawl at any time by issuing a warm shutdown (for
example, by hitting Ctrl-C) and waiting until the requests in progress have
finished (this is important: don't hit Ctrl-C twice, as that forces a cold
shutdown).

And you can resume the crawl later by running the exact same command (i.e.
giving it the same JOBDIR).
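
To put it all together, a typical session might look like this (same example
spider and JOBDIR as above):

scrapy crawl somespider --set JOBDIR=crawl1
# ... hit Ctrl-C once and wait for the in-progress requests to finish ...

scrapy crawl somespider --set JOBDIR=crawl1
# same command, same JOBDIR -- the crawl picks up where it left off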

Looking forward to hearing your feedback,
Pablo.

massabuntu

Sep 15, 2011, 2:58:37 AM
to scrapy...@googlegroups.com
Saw it only now!

This is great!

Sergey

Mar 16, 2012, 4:52:06 AM
to scrapy...@googlegroups.com
It looks very nice. If you implemented the persistent scheduler on top of
MongoDB it would be very nice (a distributed crawler, with very good
performance for large crawls).

ilovett

Jun 4, 2012, 1:59:04 AM
to scrapy...@googlegroups.com
Hmmm, I wish I had read the 'important: don't hit Ctrl-C twice' note -- I did
that, and now when I try to resume my spider using scrapy crawl browse -s
JOBDIR=crawls/initial-crawl

nothing happens... it doesn't seem to resume...

Any way to fix this?