Yes, you can run as many instances of a single spider in parallel as you want; Scrapyd spawns a separate process per run, so it can make use of many cores. This was one of the design goals of Scrapyd: to work around Python's concurrency limitations and take advantage of multiple cores. The max_proc setting of Scrapyd lets you set how many concurrent processes you want to run, and it defaults to the number of cores available on the system.
Of course, each run is a separate process, completely isolated from the others, so it doesn't share the request queue. If you have a predefined list of URLs to crawl, you can partition the start URLs and give each run a different partition. Another option that has been mentioned (but which I haven't tested myself) is
scrapy-redis, which lets you spawn many runs of a spider that share the same request queue.
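To make the partitioning idea concrete, here is a rough, untested sketch that schedules several runs of the same spider through scrapyd's schedule.json endpoint, giving each run a different slice of the start URLs via spider arguments. The project name "myproject", the spider name "myspider" and the "part"/"total_parts" arguments are placeholders; your spider would have to read those arguments and build its own start URLs from its slice:

    import requests  # assumes the requests package is installed

    SCRAPYD_URL = "http://localhost:6800/schedule.json"  # default scrapyd endpoint
    NUM_PARTS = 4  # roughly one run per core; scrapyd's max_proc caps how many run at once

    for part in range(NUM_PARTS):
        # Each POST to schedule.json queues an independent spider process.
        # Extra form fields are passed to the spider as arguments.
        response = requests.post(SCRAPYD_URL, data={
            "project": "myproject",         # placeholder project name
            "spider": "myspider",           # placeholder spider name
            "part": str(part),              # hypothetical arg: which slice of URLs to crawl
            "total_parts": str(NUM_PARTS),  # hypothetical arg: how many slices there are
        })
        print(response.json())  # scrapyd returns a job id for each scheduled run

Inside the spider, start_requests() would then yield only the URLs whose index modulo total_parts equals part. For the scrapy-redis route (again, I haven't tried it), the setup as I understand it is mostly a matter of pointing Scrapy's scheduler and duplicate filter at the Redis-backed implementations in your project's settings.py, something along these lines (check the scrapy-redis README for the exact setting names):

    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    REDIS_URL = "redis://localhost:6379"  # the shared request queue lives in Redis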
Good luck!
On Sat, Sep 29, 2012 at 11:47 AM, Ilya Persky
<ilya....@gmail.com> wrote:
Hello guys!
Recently I came across the documentation section that describes scrapyd. It says that scrapyd can run several spiders in parallel. So my questions are:
1) Can scrapyd run several instances of _one_ spider at a time? Say I have a CPU-bound spider, and it would be great to see it running on two cores. If so, how exactly can I do that?
2) Again, if this is possible, would these spiders share one request queue, or would I need some additional programming to make this work?
Thank you in advance!
Regards,
Ilya.