Hi there,
I'm designing a crawler that scrapes the same kind of item from different sources A, B and C. First, items are found on source A. Then some business logic decides whether the same item should be looked up on sources B and C. If so, new requests are sent by the corresponding spiders B and C.
AFAIK, a spider can't communicate with another spider.
To solve this within the Scrapy stack, I suppose I have to write a Python script that instantiates spiders A, B and C. This script would embed the orchestration logic by listening to spider signals like item_scraped and spider_idle. Does Scrapy handle resources correctly this way?
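Here is a rough sketch of what I have in mind, using CrawlerRunner so the script can chain crawls with business logic in between. The URLs, needs_lookup() and the spider internals are placeholders, and spider C is omitted for brevity:

    import scrapy
    from scrapy import signals
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from twisted.internet import defer, reactor

    class SpiderA(scrapy.Spider):
        name = "a"
        start_urls = ["https://source-a.example/items"]

        def parse(self, response):
            # placeholder: the real spider extracts item fields here
            yield {"id": "some-item-id", "source": "A"}

    class SpiderB(scrapy.Spider):
        name = "b"

        def __init__(self, items=(), **kwargs):
            super().__init__(**kwargs)
            self.start_urls = [
                f"https://source-b.example/lookup?id={item['id']}" for item in items
            ]

        def parse(self, response):
            yield {"url": response.url, "source": "B"}

    def needs_lookup(item):
        return True  # placeholder for the real business rule

    configure_logging()
    runner = CrawlerRunner()
    found = []  # items collected from spider A

    def collect(item, response, spider):
        found.append(item)

    @defer.inlineCallbacks
    def orchestrate():
        crawler_a = runner.create_crawler(SpiderA)
        crawler_a.signals.connect(collect, signal=signals.item_scraped)
        yield runner.crawl(crawler_a)  # run A to completion
        to_lookup = [item for item in found if needs_lookup(item)]
        if to_lookup:
            # B and C would run concurrently here; only B is shown
            yield runner.crawl(SpiderB, items=to_lookup)
        reactor.stop()

    orchestrate()
    reactor.run()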
An alternative would be to deploy the spiders to scrapyd, with the orchestration logic living in a separate program that communicates with scrapyd (and through it, the spiders) over its REST/JSON API. Wouldn't scrapyd handle resources and the job queue better?
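The orchestrator would then look roughly like this. The scrapyd address, project name and the load_items_needing_lookup() helper are made up; in reality the business logic would read A's output from a feed export or a database written by an item pipeline:

    import time
    import requests

    SCRAPYD = "http://localhost:6800"  # assumed scrapyd address
    PROJECT = "myproject"              # assumed project name

    def schedule(spider, **args):
        # POST /schedule.json queues a job and returns its id
        r = requests.post(
            f"{SCRAPYD}/schedule.json",
            data={"project": PROJECT, "spider": spider, **args},
        )
        r.raise_for_status()
        return r.json()["jobid"]

    def wait_for(jobid, poll=5):
        # poll /listjobs.json until the job shows up as finished
        while True:
            r = requests.get(f"{SCRAPYD}/listjobs.json", params={"project": PROJECT})
            r.raise_for_status()
            if any(job["id"] == jobid for job in r.json().get("finished", [])):
                return
            time.sleep(poll)

    def load_items_needing_lookup():
        # hypothetical: read A's items from wherever the pipeline stored them
        return []

    job_a = schedule("a")
    wait_for(job_a)
    for item in load_items_needing_lookup():
        schedule("b", item_id=item["id"])
        schedule("c", item_id=item["id"])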
My feeling is that scrapyd plus external orchestration through REST is the better approach. Do you think so too?
Thanks in advance for your feedback.
Cheers,
Jeremy