Using Scrapy from a script


JoE

Oct 20, 2010, 7:21:53 PM
to scrapy-users
There have been a lot of requests for a way to control scrapy from a
script rather than the stand-alone application.

This is the simplest method I have been able to come up with:

#!/usr/bin/python
import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'scraper.settings')
# Must be at the top before other imports

from scrapy import log, signals, project
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess

class CrawlerScript():

    def __init__(self):
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def start(self):
        self.deferred = self.crawler.start()

    def stop(self):
        self.crawler.stop()

    def addSpider(self, spider_name):
        spider = self.crawler.spiders.create(spider_name)
        if spider:
            self.crawler.queue.append_spider(spider)

# Usage
if __name__ == "__main__":
    log.start()

    c = CrawlerScript()
    c.addSpider('spider1')
    c.addSpider('spider2')
    c.addSpider('spider3')
    c.start()
    print c.items
    c.stop()


This seems to work fine when used as above, but one problem I run into
is when I want to start the crawler, then add another spider and run it
again, as in:

c = CrawlerScript()
c.addSpider('spider1')
c.addSpider('spider2')
c.start()
print c.items
c.stop()
c.addSpider('spider3')
c.start()
print c.items
c.stop()


What happens is the crawler runs the first two spiders, prints the
items, runs the third spider, and then hangs indefinitely and becomes
uninterruptible. I have to run "kill -9" to stop it.

My only guess is that either I am doing something wrong or this is a
bug in Scrapy.

Also, c.stop() doesn't appear to do anything, because I get exactly the
same results whether I include it or not.

The reason I want this functionality is that I am trying to come up
with a way to continuously rerun spiders based on some control logic,
as in:

c = CrawlerScript()
while True:
    c.addSpider('spider')
    c.start()
    c.stop()
    yield c.items
    if some_control_logic():
        break


I'm surprised that controlling Scrapy from a script through an API like
this is not built in, because that is mainly how I intend to use it.

Using a stand-alone application and/or daemon is not very useful to me.

JoE

Oct 20, 2010, 7:24:23 PM
to scrapy-users


On Oct 20, 4:21 pm, JoE <joehil...@gmail.com> wrote:
> There have been a lot of requests for a way to control scrapy from a
> script rather than the stand alone application.
>
> This is the simplest method I have been able to come up with:
>
> #!/usr/bin/python
> import os
> os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'scraper.settings')
> #Must be at the top before other imports
>
> from scrapy import log, signals, project
> from scrapy.xlib.pydispatch import dispatcher
> from scrapy.conf import settings
> from scrapy.crawler import CrawlerProcess
>
> class CrawlerScript():
>
>     def __init__(self):
>                 self.crawler = CrawlerProcess(settings)
>                 if not hasattr(project, 'crawler'):
>                         self.crawler.install()
>                 self.crawler.configure()
>                 self.items = []
>                 dispatcher.connect(self._item_passed, signals.item_passed)
>
>     def _item_passed(self, item):
>         self.items.append(item)
>
>     def start(self):
>         self.deferred = self.crawler.start()
^^ This line should be: ^^


Pablo Hoffman

Oct 21, 2010, 3:31:54 PM
to scrapy...@googlegroups.com
Hi JoE,

The problem you're experiencing is due to a well-known limitation of Twisted,
which doesn't support restarting reactors:
http://twistedmatrix.com/trac/wiki/FrequentlyAskedQuestions#WhycanttheTwistedsreactorberestarted
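To illustrate the limitation with nothing but Twisted itself (a minimal sketch,
no Scrapy involved): a reactor that has been stopped refuses to start again and
raises ReactorNotRestartable.

# Minimal sketch of the Twisted limitation: a stopped reactor cannot be
# started a second time. Uses only Twisted, no Scrapy.
from twisted.internet import reactor
from twisted.internet.error import ReactorNotRestartable

reactor.callWhenRunning(reactor.stop)  # stop as soon as the reactor is up
reactor.run()                          # first run: starts, then stops

try:
    reactor.run()                      # second run is refused
except ReactorNotRestartable:
    print "the reactor cannot be restarted"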

So if you want to run multiple crawlers, you need to start a single reactor
when the process starts and then run all the crawlers on it.

Here's an example:
http://snippets.scrapy.org/snippets/9/

I reckon it would be nicer if reactors were restartable, because it would hide
the asynchronous API inside the blocking reactor.run() call, so you wouldn't
have to worry about using threads to simulate blocking behaviour. But until
someone fixes the restartable-reactor issue, there's no alternative.
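
For what it's worth, here is a rough sketch of that thread-based workaround
(an illustration of the idea only, not the snippet linked above; _async_job is
a placeholder, not Scrapy API): run the reactor once in a background thread for
the lifetime of the process, and use blockingCallFromThread to turn any
Deferred-returning call into a blocking one.

# Rough sketch of the threads-for-blocking-behaviour workaround.
from threading import Thread
from twisted.internet import reactor, threads, defer

# start the single, long-lived reactor in a background thread
reactor_thread = Thread(target=reactor.run, kwargs={'installSignalHandlers': 0})
reactor_thread.setDaemon(True)
reactor_thread.start()

def _async_job():
    # placeholder for kicking off a crawl; must return a Deferred
    d = defer.Deferred()
    reactor.callLater(1, d.callback, 'done')
    return d

# blocks the calling thread until the Deferred fires, then returns its result
result = threads.blockingCallFromThread(reactor, _async_job)
print result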

According to the FAQ entry, it shouldn't be too difficult to fix, and the main
reason it hasn't been done is lack of interest in the feature, but I've never
looked into it in detail, so I couldn't say.

Pablo.


Joe Hillenbrand

Oct 21, 2010, 5:04:25 PM
to scrapy...@googlegroups.com
I've been trying to use that snippet a lot, and it simply does not work.

For one, there is an error. It is missing this import:
from twisted.internet import threads, reactor, defer

Second, it drops me into a shell, which is not at all what I want.

Third, when I run:
items = crawler.crawl('spider')

I get the same uninterruptible hang (requiring kill -9) that I got in my script, the only difference being that the spider doesn't even run.

JoE

Oct 22, 2010, 10:52:36 AM
to scrapy-users
Aha! I have found a solution using multiprocessing.

#!/usr/bin/python
import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings')
# Must be at the top before other imports

from scrapy import log, signals, project
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process, Queue

class CrawlerScript():

    def __init__(self):
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def _crawl(self, queue, spider_name):
        spider = self.crawler.spiders.create(spider_name)
        if spider:
            self.crawler.queue.append_spider(spider)
        self.crawler.start()
        self.crawler.stop()
        queue.put(self.items)

    def crawl(self, spider):
        queue = Queue()
        p = Process(target=self._crawl, args=(queue, spider,))
        p.start()
        p.join()
        return queue.get(True)

# Usage
if __name__ == "__main__":
    log.start()

    items = list()
    crawler = CrawlerScript()
    for i in range(10):
        items.append(crawler.crawl('spider'))
    print items



This is great because I can call crawler.crawl('spider') as much as I
want without an issue. I've tried using Twisted threads and Python
threading, and both kept hitting the same restart issue. This is the
first solution that actually works the way I want.
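
Stripped of the Scrapy specifics, the idea is simply to do the one-shot work in
a child process and hand the results back through a Queue, so every crawl gets
a fresh process and a fresh reactor. A minimal, framework-free sketch of that
pattern (the names are placeholders):

# Framework-free sketch of the pattern used above: run work that can only
# happen once per process in a child process, return results via a Queue.
from multiprocessing import Process, Queue

def _work(queue, name):
    # stand-in for crawler.start()/crawler.stop(); anything single-shot
    queue.put(['item from %s' % name])

def run_once(name):
    queue = Queue()
    p = Process(target=_work, args=(queue, name))
    p.start()
    result = queue.get(True)  # read before join(); a child blocked on a full
                              # queue pipe cannot exit, so get() first is safer
    p.join()
    return result

if __name__ == '__main__':
    for i in range(3):
        print run_once('spider')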

On Oct 21, 2:04 pm, Joe Hillenbrand <joehil...@gmail.com> wrote:
> I've been trying to use that snippet a lot, and it simply does not work.
>
> For one, there is an error. It is missing this import:
> from twisted.internet import threads, reactor, defer
>
> Second, it drops me into a shell, which is not at all what I want.
>
> Third, when I run:
> items = crawler.crawl('spider')
>
> I get the same uninterruptable hang (kill -9) that I got in my script, the
> only difference being that the spider doesn't even run.
>

Pablo Hoffman

Oct 23, 2010, 1:19:55 AM
to scrapy...@googlegroups.com
Hey JoE,

That's a rather elegant way to circumvent the reactor restart issue with the
multiprocessing library, thanks for sharing! By the way, would you mind posting
that code on http://snippets.scrapy.org for future reference?

Pablo.


Joe Hillenbrand

Oct 24, 2010, 4:15:57 PM
to scrapy...@googlegroups.com
Sure, no problem. I was actually planning to do that but was waiting to hear what you thought of it first.

Also, is there any reason this functionality (or something like it) couldn't be built into scrapy as an API?

Joe Hillenbrand

Oct 24, 2010, 4:32:30 PM
to scrapy...@googlegroups.com

Pablo Hoffman

Oct 24, 2010, 5:53:03 PM
to scrapy...@googlegroups.com
On Sun, Oct 24, 2010 at 01:15:57PM -0700, Joe Hillenbrand wrote:
> Sure, no problem. I was actually planning to do that but was waiting to hear
> what you thought of it first.
>
> Also, is there any reason this functionality (or something like it) couldn't
> be built into scrapy as an API?

No reason. On the contrary, it would be useful, but we'd have to define the
API, write some tests and, if possible, document it.

I think the simplest API would be to return an iterator over the scraped items,
so you would call it like this:

crawler = CrawlerScript(settings)
for item in crawler.crawl_spider("spider_name"):
    print "Got item: %s" % item

What do you think?
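
For illustration only, a hypothetical sketch of that API layered on top of the
multiprocessing-based CrawlerScript posted earlier in the thread (assuming its
crawl() returns the list of scraped items, as in that script):

# Hypothetical sketch only; CrawlerScript is the multiprocessing-based class
# from earlier in the thread, whose crawl() returns a list of items.
class IterCrawlerScript(CrawlerScript):

    def crawl_spider(self, spider_name):
        # expose the scraped items as an iterator instead of a list
        for item in self.crawl(spider_name):
            yield item

# usage:
# crawler = IterCrawlerScript()
# for item in crawler.crawl_spider("spider_name"):
#     print "Got item: %s" % item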

Error handling is another thing to think about.

But it doesn't need to be perfect from the start. Until it's stable enough, we
can add this into the scrapy.contrib_exp package (which is used for
experimental features).

Pablo.

Joe Hillenbrand

Oct 24, 2010, 9:30:13 PM
to scrapy...@googlegroups.com
On Sun, Oct 24, 2010 at 2:53 PM, Pablo Hoffman <pabloh...@gmail.com> wrote:
> On Sun, Oct 24, 2010 at 01:15:57PM -0700, Joe Hillenbrand wrote:
> > Sure, no problem. I was actually planning to do that but was waiting to hear
> > what you thought of it first.
> >
> > Also, is there any reason this functionality (or something like it) couldn't
> > be built into scrapy as an API?
>
> No reason. On the contrary, it would be useful, but we'd have to define the
> API, write some tests and, if possible, document it.
>
> I think the simplest API would be to return an iterator over the scraped items,
> so you would call it like this:
>
>     crawler = CrawlerScript(settings)
>     for item in crawler.crawl_spider("spider_name"):
>         print "Got item: %s" % item
>
> What do you think?

Yeah, that would be much better. 


> Error handling is another thing to think about.
>
> But it doesn't need to be perfect from the start. Until it's stable enough, we
> can add this into the scrapy.contrib_exp package (which is used for
> experimental features).

Cool, I'll work on porting it there.

> Pablo.

Steven Almeroth

Sep 11, 2011, 7:33:14 PM
to scrapy...@googlegroups.com
Has anyone got any new insights into this script? I'm trying to get it updated to Scrapy 0.12.

barcklay

Oct 3, 2011, 1:17:35 AM
to scrapy-users
I'm trying to do the same. I got this snippet working (using the updated
version from the comments): http://snippets.scrapy.org/snippets/7/
It's very simple, but when I try to re-use the crawler I run into the
Twisted reactor restart issue mentioned above.
I suppose the next step is to try to use the script above in place of
the simpler snippet.