The problem you're experiencing is due to a well-known limitation of Twisted:
the reactor cannot be restarted once it has been stopped:
http://twistedmatrix.com/trac/wiki/FrequentlyAskedQuestions#WhycanttheTwistedsreactorberestarted
So if you want to run multiple crawlers, you need to start a single reactor
when the process starts and then run all the crawlers within it.
Here's an example:
http://snippets.scrapy.org/snippets/9/
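In outline, the approach looks something like this (a minimal sketch:
crawl_spider() here is a hypothetical helper standing in for whatever starts a
single crawl and returns a Deferred that fires when it finishes):

from twisted.internet import reactor, defer

@defer.inlineCallbacks
def run_crawlers(spider_names):
    for name in spider_names:
        # crawl_spider() is hypothetical: it should start one Scrapy
        # crawl and return a Deferred that fires when the crawl ends
        yield crawl_spider(name)
    reactor.stop()  # stop the reactor only after the last crawl

run_crawlers(["spider_one", "spider_two"])
reactor.run()  # started exactly once, never restarted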
I reckon it would be nicer if reactors were restartable, because it would hide
the asynchronous API inside the blocking reactor.run() call, so you wouldn't
have to worry about using threads to simulate blocking behaviour. But until
someone fixes the restartable reactor issue, there's no alternative.
According to the FAQ entry, it shouldn't be too difficult to fix, and the main
reason it hasn't been done is lack of interest in the feature; but I've never
looked into it in detail, so I couldn't say.
Pablo.
That's a rather elegant way to circumvent the reactor restart issue with the
multiprocessing library, thanks for sharing! Btw, would you mind posting that
code on http://snippets.scrapy.org for future reference?
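For readers of the archive in the meantime, the gist of the trick looks
roughly like this (a minimal sketch: the spider names are placeholders, and it
assumes it runs from inside a Scrapy project so scrapy.cmdline.execute can
find the settings):

from multiprocessing import Process
from scrapy import cmdline

def crawl(spider_name):
    # Each child process gets a brand-new Twisted reactor, so the
    # "reactor is not restartable" limitation never comes into play.
    cmdline.execute(["scrapy", "crawl", spider_name])

if __name__ == "__main__":
    for name in ["spider_one", "spider_two"]:
        p = Process(target=crawl, args=(name,))
        p.start()
        p.join()  # run crawls one at a time, each in its own process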
Pablo.
On Sun, Oct 24, 2010 at 01:15:57PM -0700, Joe Hillenbrand wrote:
> Sure, no problem. I was actually planning to do that but was waiting to hear
> what you thought of it first.
>
> Also, is there any reason this functionality (or something like it) couldn't
> be built into scrapy as an API?
No reason. On the contrary, it would be useful, but we'd have to define the
API, write some tests and, if possible, document it.
I think the simplest API would be to return an iterator over the scraped items,
so you would call it like this:
crawler = CrawlerScript(settings)
for item in crawler.crawl_spider("spider_name"):
    print "Got item: %s" % item
What do you think?
Error handling is another thing to think about.
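To make that concrete, here's a rough sketch of how it could sit on top of the
multiprocessing trick from before (the _run() body is deliberately a stub, and
none of this is existing API; CrawlerScript and crawl_spider are just the
names proposed above):

from multiprocessing import Process, Queue

class CrawlerScript(object):
    """Sketch of the proposed API: each crawl runs in a child
    process (fresh reactor), pushing items onto a queue that the
    parent exposes as a plain blocking iterator."""

    def __init__(self, settings):
        self.settings = settings

    def _run(self, spider_name, queue):
        # Stub: start the crawl here and arrange (e.g. via an item
        # pipeline) for every scraped item to be queue.put(item),
        # with a final queue.put(None) sentinel on spider close.
        raise NotImplementedError

    def crawl_spider(self, spider_name):
        queue = Queue()
        process = Process(target=self._run, args=(spider_name, queue))
        process.start()
        while True:
            item = queue.get()  # blocks until the child sends an item
            if item is None:    # sentinel: spider finished
                break
            yield item
        process.join()

Errors could travel over the same queue (say, as a special sentinel carrying
the failure), which would give the caller a natural place to re-raise them.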
But it doesn't need to be perfect from the start. Until it's stable enough, we
can keep it in the scrapy.contrib_exp package (which is used for experimental
features).
Pablo.