how to run multiple spiders concurrently in code?


carrier24sg

Dec 26, 2011, 7:54:54 PM
to scrapy-users
How do I start several spiders concurrently, as in "scrapy crawl spider1
spider2 ..." (for 0.14), but in code? Apparently CrawlerProcess does not
allow more than one spider.

Pablo Hoffman

Dec 27, 2011, 6:37:01 PM
to scrapy...@googlegroups.com
Short answer: running multiple spiders in the same "scrapy crawl"
process is no longer supported (since 0.14), in favour of using scrapyd
to run multiple spiders (one per process).

Longer answer: support for running multiple spiders in the same
process will be restored in 0.16 (and it's on the roadmap for 1.0) by
instantiating multiple scrapy.crawler.Crawler objects, which will make it
possible to override any Scrapy setting per spider (something you cannot do
with the old approach of running multiple spiders in the same process).
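For the scrapyd route, scheduling spiders boils down to POSTing to its schedule.json endpoint. A minimal sketch, assuming scrapyd is running on its default port (6800) and the project has been deployed under the hypothetical name "myproject", with spider1/spider2 standing in for your spider names:

import urllib

# Each POST to schedule.json queues one spider run in its own process.
for spider in ['spider1', 'spider2']:
    params = urllib.urlencode({'project': 'myproject', 'spider': spider})
    print urllib.urlopen('http://localhost:6800/schedule.json', params).read()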

carrier24sg

Dec 28, 2011, 2:06:33 AM
to scrapy-users
I am curious (as I'm not familiar with Twisted):

(1) Does one spider per process, with multiple spiders running, mean
multiple reactors?
(2) Does one spider per process affect logging to the same file and
exporting the items to the same file?
(3) You mentioned in another thread that to run multiple spiders in one
process, you should use the Crawler object and configure the reactor. I've
browsed through the code; can I simply register another spider using
the ExecutionEngine.open_spider function?

Pablo Hoffman

Jan 16, 2012, 9:13:59 AM
to scrapy...@googlegroups.com
(1) No, you can't have more than one reactor running in a single process
(this is a Twisted limitation). You may have multiple crawlers running
in a single Twisted reactor (hence, in a single process).

(2) This is a good question. Logging will probably be kept as a single,
globally available facility, meaning there won't be a different logger
per spider (remember that log lines contain the spider name).

(3) No, that is how it used to work before, when you could open more
than a single spider per Crawler. In the future, you will instantiate
multiple Crawlers (one per spider), and crawl a single spider with each
of them. The Crawler may receive a Downloader as a constructor argument,
so you can share the same Downloader among many Crawlers.

I hope that's clear. This is something not implemented yet, but the idea
is to implement it for Scrapy 0.16.

Pablo Hoffman

Mar 14, 2013, 12:22:40 PM
to scrapy...@googlegroups.com
Hi Beto,

These ideas have been pretty much implemented by now. Scrapy 0.16 is singleton-free and you can have multiple Crawler objects running in a single process/Twisted reactor. There's a small section in the documentation that explains how: http://doc.scrapy.org/en/latest/topics/practices.html

Pablo.
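
A minimal sketch of the pattern that documentation section describes, pieced together from the 0.16-era API used elsewhere in this thread (Crawler(settings), configure(), crawl(), start()); MySpiderA, MySpiderB and myproject are hypothetical names, and the spider_closed bookkeeping is just one way to stop the reactor once every crawler has finished:

from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings

from myproject.spiders import MySpiderA, MySpiderB   # hypothetical spiders

running = []

def spider_closed(spider):
    # Stop the reactor once every crawler has finished.
    running.remove(spider)
    if not running:
        reactor.stop()

def setup_crawler(spider_cls):
    spider = spider_cls()
    crawler = Crawler(Settings())
    crawler.signals.connect(spider_closed, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    running.append(spider)

setup_crawler(MySpiderA)
setup_crawler(MySpiderB)
log.start()
reactor.run()

Since each Crawler gets its own Settings object, this is also the place where per-spider setting overrides would go.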


On Tue, Mar 12, 2013 at 10:37 AM, Beto Boullosa <betobo...@gmail.com> wrote:
Hi, Pablo, I have one question, since this thread is a little bit older already: has anything of this (multiple crawlers per process/spider) been implemented in Scrapy 0.16 or 0.17 yet? Are these ideas in the roadmap?

Thanks for your attention,
Beto


Beto Boullosa

Mar 14, 2013, 1:08:39 PM
to scrapy...@googlegroups.com
Hi, Pablo,

Thanks for your answer. I had already found that piece of documentation, nice one. I understand now that I can have multiple crawlers in my process; I've made some tests and it works fine.

Nevertheless, I'm facing a bigger problem now: I'm trying to develop a kind of "crawler consumer" that consumes a queue with the descriptions of the domains to be crawled. I'm using RabbitMQ to do all the queue work: the consumer pops the next domain to be crawled from the queue, instantiates a new Scrapy crawler, runs it, and so on.

The problem is: for some odd reason, when I integrate RabbitMQ and Scrapy, the crawler that I instantiate never crawls anything, although its spider is initialised fine and everything else seems OK. It's as though the crawler callbacks for scraping the items are never reached.

Incidentally, if I force the connection to the RabbitMQ queue to close (or if I shut down RabbitMQ completely), then the crawlers work again.

So, it looks like there is some kind of interference between the RabbitMQ mechanism and the scrapy/twisted internals that, for some reason, blocks the reactor from working properly when both are running.

I've also tested creating every spider in its own thread, but the problem remains.

Would you have any ideas to share on that?

Thanks a lot,
Beto

Pablo Hoffman

Mar 18, 2013, 11:04:05 PM
to scrapy...@googlegroups.com
If the RabbitMQ library you are using provides a blocking API, you have two options:

1. Poll (instead of doing a blocking read) to check for more work.
2. Do a blocking read, but do it in a thread. Leave the main thread for running the Twisted reactor (and the Scrapy Crawlers).

Don't spawn multiple threads for multiple Crawlers; that's not how it works. All crawlers should run (asynchronously) in the same thread where the Twisted reactor is running.

For option 2, you should probably use callFromThread to ensure thread safety (see "Using Threads in Twisted": http://twistedmatrix.com/documents/12.0.0/core/howto/threading.html).
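
A rough sketch of option 2 (the blocking read in a worker thread, the reactor and all Crawlers in the main thread). Here blocking_read() and start_crawl() are placeholders rather than real APIs: the first stands for whatever blocking consume call the RabbitMQ client provides, the second for a function that builds and starts a Crawler:

import threading
from twisted.internet import reactor

def consume_forever():
    # Worker thread: it may block on the queue as long as it likes,
    # because the Twisted reactor keeps running in the main thread.
    while True:
        domain = blocking_read()            # hypothetical blocking RabbitMQ read
        if domain is None:
            break
        # Never touch Scrapy/Twisted objects directly from this thread;
        # hand the work over to the reactor thread instead.
        reactor.callFromThread(start_crawl, domain)

worker = threading.Thread(target=consume_forever)
worker.daemon = True
worker.start()

reactor.run()   # main thread: the reactor and every Crawler live here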

Beto Boullosa

Mar 22, 2013, 9:13:48 AM
to scrapy...@googlegroups.com
Hi, Pablo,

Thanks for your answer. Option 2 with callFromThread is working fine. :)

Now we're doing some tests to integrate it with Celery, but we haven't succeeded yet, mainly because we haven't found a way of running Celery outside the main thread.

Cheers,
Beto

Pablo Hoffman

Mar 22, 2013, 9:19:28 AM
to scrapy...@googlegroups.com
I think you *should* be able to run the Twisted reactor in a non-main thread, if you disable (OS) signals.

Beto Boullosa

Mar 22, 2013, 9:33:44 AM
to scrapy...@googlegroups.com
Thanks for your quick answer! :)

What do you mean by the OS signals? Anything configurable in twisted itself?


Beto Boullosa

Mar 25, 2013, 1:02:48 PM
to scrapy...@googlegroups.com
Hi, Pablo, 

I've already figured out what you meant. I've succeeded in disabling the signals through reactor.run(installSignalHandlers=False).

Doing this, we were able to run the reactor in another thread without problems, and thus we've successfully managed to control the flow of execution of the reactor from Celery in the main thread, using reactor.callFromThread.

Thanks again!
Beto

carrier24sg

Apr 2, 2013, 10:38:22 PM
to scrapy-users
Hi Beto,

Just wanted to check with you whether you had any success integrating Scrapy
with Celery.

Right now I wish to integrate Scrapy into my celeryd workers. I've been
following http://stackoverflow.com/questions/11528739/running-scrapy-spiders-in-a-celery-task,
but so far I haven't been able to get multiprocessing working with that.

Any help you can render?

Pablo?


Beto Boullosa

Apr 4, 2013, 1:18:04 PM
to scrapy...@googlegroups.com
Hi, I've been able to run scrapy with celery by putting the reactor in another thread, something like this:

import thread
from twisted.internet import reactor

def init_reactor():
    reactor.run(installSignalHandlers=False)  # no signal handlers, so it can run off the main thread

thread.start_new_thread(init_reactor, ())

Doing this, the reactor will run in another thread and won't interfere with Celery processing. Then, your celery tasks will do something like this in order to control scrapy (simplified example code; the celery app, settings and mySpider come from your own project):

from twisted.internet import threads
from scrapy.crawler import Crawler

@celery.task
def crawl():
    # Runs in a Celery worker; hands the real work over to the reactor thread and waits for it.
    threads.blockingCallFromThread(reactor, do_crawl)

def do_crawl():
    crawler = Crawler(settings)
    crawler.configure()
    crawler.crawl(mySpider)
    crawler.start()

carrier24sg

Apr 4, 2013, 10:02:32 PM
to scrapy-users
Hi Beto,

Thanks for your advice. One question: blockingCallFromThread returns as
soon as the do_crawl function returns, without waiting for the spider.
How can I block the call until the spider finishes crawling?



Beto Boullosa

Apr 4, 2013, 10:34:38 PM
to scrapy...@googlegroups.com
Unfortunately, we haven't managed to accomplish that. The spider runs asynchronously and we haven't found a way of waiting or blocking until it finishes. We've tried to capture the end of crawling through a callback, but it didn't work, apparently because Scrapy is running in another thread outside Celery's main thread.

Any ideas on how to do that are welcome.
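
One untested possibility, sketched from the pieces already in this thread: threads.blockingCallFromThread waits for the result of a Deferred if the called function returns one, so do_crawl could return a Deferred that gets fired from the crawler's spider_closed signal:

from twisted.internet import defer
from scrapy import signals
from scrapy.crawler import Crawler

def do_crawl():
    # settings and mySpider as in the earlier snippet.
    crawler = Crawler(settings)
    crawler.configure()

    d = defer.Deferred()

    def on_spider_closed(spider, reason):
        d.callback(reason)

    crawler.signals.connect(on_spider_closed, signal=signals.spider_closed)
    crawler.crawl(mySpider)
    crawler.start()
    # blockingCallFromThread waits for the returned Deferred to fire, so the
    # celery task would only return once the spider has closed.
    return d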

carrier24sg

Apr 4, 2013, 10:41:31 PM
to scrapy-users
Also, you don't need crawler.install()? I tried to omit crawler.install()
and got the exception "cannot import name crawler".

Beto Boullosa

Apr 9, 2013, 2:48:59 PM
to scrapy...@googlegroups.com
Sorry for the late answer: no, I don't use crawler.install().


