Persistent scheduler (to pause and resume crawls)

Pablo Hoffman

Aug 2, 2011, 11:19:26 AM
to Scrapy Users
Hi guys,

I just wanted to let you know that the persistent scheduler has finally landed
in trunk (thanks to Daniel and Shane for their useful feedback reviewing the patch).

Here's the commit, for the curious: http://dev.scrapy.org/changeset/2737

If you're interested in this functionality, we'd like to hear your feedback.

To use it, you just have to set the JOBDIR setting, which points to a
directory where the crawl state will be persisted. If you don't set it, only
in-memory queues will be used (as has been the case so far).

For example:

scrapy crawl somespider --set JOBDIR=crawl1
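
If you'd rather not pass it on the command line every time, you can also put
JOBDIR in your project's settings.py. A minimal sketch (the directory name is
just an example):

# settings.py
# Persist the crawl state to this directory (example path)
JOBDIR = 'crawls/somespider-1'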

Then you can stop the crawl at any time by issuing a warm shutdown (for
example, by hitting Ctrl-C) and waiting until the requests in progress have
finished (this is important: don't hit Ctrl-C twice, as that forces a cold
shutdown).

And you can resume the crawl later by running the exact same command (i.e.
giving it the same JOBDIR).
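
To put it all together, a typical session might look like this (same example
spider and JOBDIR as above):

scrapy crawl somespider --set JOBDIR=crawl1
# ... hit Ctrl-C once and wait for the in-progress requests to finish ...

scrapy crawl somespider --set JOBDIR=crawl1
# same command, same JOBDIR -- the crawl picks up where it left off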

Looking forward to hearing your feedback,
Pablo.

massabuntu

Sep 15, 2011, 2:58:37 AM
to scrapy...@googlegroups.com
Saw it only now!

This is great!

Sergey

Mar 16, 2012, 4:52:06 AM
to scrapy...@googlegroups.com
It looks very nice. If you implemented the persistent scheduler on top of
MongoDB it would be very nice (a distributed crawler, with very good
performance for large crawls).

ilovett

Jun 4, 2012, 1:59:04 AM
to scrapy...@googlegroups.com
Hmmm, I wish I had read the 'important: don't hit Ctrl-C twice' note -- I did
that, and now when I try to resume my spider using scrapy crawl browse -s
JOBDIR=crawls/initial-crawl

nothing happens... it doesn't seem to resume...

Any way to fix this?