Designing multiple spiders orchestration


Jérémy Subtil

Apr 8, 2014, 8:51:17 AM
to scrapy...@googlegroups.com
Hi there,

I'm designing a crawler that scrapes the same kind of item from different sources A, B and C. Items are first found on source A. Then some business logic decides whether the same item should also be looked up on sources B and C. If so, new requests are sent from the corresponding spiders B and C.

AFAIK, a spider can't communicate with another spider.

To solve this within the Scrapy stack, I suppose I have to write a Python script instantiating spiders A, B and C. This script would embed the orchestration logic by listening to spider signals like item_scraped and spider_idle. Does Scrapy handle resources correctly this way?
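Something along these lines is what I have in mind (a rough sketch, assuming a recent Scrapy where CrawlerRunner is available; the spider names, the seed_items argument and the "price is None" rule are placeholders of mine):

from twisted.internet import defer, reactor
from scrapy import signals
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
runner = CrawlerRunner(get_project_settings())
pending = []  # items that should also be looked up on sources B and C

def on_item_scraped(item, response, spider):
    # business logic deciding whether the item needs a second lookup
    if item.get("price") is None:
        pending.append(item)

@defer.inlineCallbacks
def crawl_all():
    crawler_a = runner.create_crawler("spider_a")
    crawler_a.signals.connect(on_item_scraped, signal=signals.item_scraped)
    yield runner.crawl(crawler_a)                       # wait for A to finish
    yield runner.crawl("spider_b", seed_items=pending)  # B looks up the flagged items
    yield runner.crawl("spider_c", seed_items=pending)  # so does C
    reactor.stop()

crawl_all()
reactor.run()  # blocks until all three spiders are done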

An alternative would be to deploy the spiders to scrapyd and write the orchestration logic in a separate program that talks to scrapyd and the spiders through its REST API. Resources and the job queue would be handled better by scrapyd, wouldn't they?
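In that case the orchestrator would just talk to scrapyd's JSON API, something like this (another sketch; the endpoint URL, project and spider names are placeholders, while schedule.json and listjobs.json are scrapyd's standard endpoints):

import requests

SCRAPYD = "http://localhost:6800"
PROJECT = "mycrawler"  # placeholder project name

def schedule(spider, **spider_args):
    # start a spider run through scrapyd and return its job id
    payload = {"project": PROJECT, "spider": spider}
    payload.update(spider_args)
    resp = requests.post(SCRAPYD + "/schedule.json", data=payload)
    resp.raise_for_status()
    return resp.json()["jobid"]

def finished_jobs():
    # list finished jobs so the orchestrator can poll for spider A's completion
    resp = requests.get(SCRAPYD + "/listjobs.json", params={"project": PROJECT})
    resp.raise_for_status()
    return resp.json()["finished"]

# orchestration: schedule A, poll until its job shows up as finished,
# apply the business logic, then schedule B and C as needed
job_a = schedule("spider_a")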

My feeling is that scrapyd + external orchestration through REST is a better approach. Do you think so too?

Thanks in advance for your feedback.

Cheers,

Jeremy

Bill Ebeling

Apr 11, 2014, 7:45:10 AM
to scrapy...@googlegroups.com
If I were tasked with writing spiders that scrape based on other spiders' activity, I would let one spider run fully, persist the data, then read the data into the next spider.
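For example, if spider A exports its items with a feed (scrapy crawl spider_a -o items_a.jl), the next spider can just read that file in start_requests. Everything below (file name, URL, field names) is made up for illustration:

import json
import scrapy

class SourceBSpider(scrapy.Spider):
    name = "source_b"

    def start_requests(self):
        # read the items persisted by spider A's run
        with open("items_a.jl") as fh:
            for line in fh:
                item = json.loads(line)
                if item.get("price") is None:  # placeholder business rule
                    url = "https://source-b.example/search?q=" + item["title"]
                    yield scrapy.Request(url, callback=self.parse, meta={"item": item})

    def parse(self, response):
        # enrich the item found on source A with data from source B
        item = response.meta["item"]
        item["price"] = response.css(".price::text").get()
        yield item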

If for some reason it were critical that the item be processed immediately, I would write one spider that allows all the relevant domains and use logic to route the requests: have a bunch of methods that scrape the sites and call a router method as their callback. The router method inspects the item and calls the next required scraping method. When the item isn't routed anywhere else, it finally gets sent to the pipeline.
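Roughly like this (a sketch with made-up URLs, selectors and fields):

import scrapy

class MultiSourceSpider(scrapy.Spider):
    name = "multi_source"
    allowed_domains = ["source-a.example", "source-b.example", "source-c.example"]
    start_urls = ["https://source-a.example/items"]

    def parse(self, response):
        # source A: build the initial item
        for listing in response.css(".listing"):
            item = {
                "title": listing.css("h2::text").get(),
                "price": listing.css(".price::text").get(),
                "sources_left": ["https://source-b.example/search?q=",
                                 "https://source-c.example/search?q="],
            }
            yield from self.route_item(item)

    def parse_other_source(self, response):
        # sources B/C: enrich the item carried in the request meta
        item = response.meta["item"]
        if item.get("price") is None:
            item["price"] = response.css(".price::text").get()
        yield from self.route_item(item)

    def route_item(self, item):
        # the "router": look the item up on the next source if the business
        # rule says so, otherwise let it go to the item pipeline
        if item.get("price") is None and item["sources_left"]:
            next_url = item["sources_left"].pop(0) + item["title"]
            yield scrapy.Request(next_url, callback=self.parse_other_source,
                                 meta={"item": item})
        else:
            yield item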

Nice and simple.