Multiple spiders interleaving requests

Nikolay Surovenko

Mar 24, 2012, 4:41:16 AM

to scrapy-users

Hi, I would like to crawl multiple sites (thousands) and interleave
requests so that each IP gets no more than 1 request at a time, or no
more than 1 request every n seconds.

I would like to have site-specific code for each of those sites, and
keep persistent state so that I can (rarely) check for updates.

Here are my options, as far as I understand Scrapy:
a) write spider code for each of the sites, set download delay = 1,
concurrent requests = 1, launch all spiders through scrapyd, in their
own process.
This one is probably a terrible idea.
b) construct a single spider from my domain-specific code and use
limit connections per IP = 1
This doesn't seem to be the proper way to use spiders. Persistent state
would be intermingled for all domains, so my only option for updates
would be 'update everything'.
c) use multiple crawler-spider objects with 1 shared Downloader, and
limit connections per IP
I'm concerned about managing thousands of persistent states
independently. Would it have a noticeable effect on performance?
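
For reference, the limits named in the options above roughly map onto Scrapy settings like the ones below. This is only a sketch: it assumes a Scrapy version that provides these setting names (check the settings documentation for the version in use), and the values are the ones from this post, not recommendations.

```python
# settings.py (sketch): politeness limits discussed in options a) to c).
# Assumption: the installed Scrapy version supports these setting names.

DOWNLOAD_DELAY = 1                  # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # at most one in-flight request per domain
CONCURRENT_REQUESTS_PER_IP = 1      # at most one in-flight request per IP address
```

As far as I know, when the per-IP limit is non-zero Scrapy applies it instead of the per-domain limit, and the download delay is then enforced per IP as well.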

Anything I'm missing? Which way is the best?

Nikolay Surovenko.

Matthew Leon

Mar 26, 2012, 1:24:56 PM

to scrapy...@googlegroups.com

> a) write spider code for each of the sites, set download delay = 1,
> concurrent requests = 1, launch all spiders through scrapyd, in their
> own process.
> This one is probably a terrible idea.

Why? It's certainly the most straightforward way to do things. Are you worried about performance? Why don't you try it out and see how it scales?

You can always test it by just issuing "scrapy crawl spider1 spider2 spider3 ..."

-Matthew

Nikolay Surovenko

Mar 26, 2012, 1:53:43 PM

to scrapy...@googlegroups.com

> Why? It's certainly the most straightforward way to do things. Are you worried about performance? Why don't you try it out and see how it scales?
>
> You can always test it by just issuing "scrapy crawl spider1 spider2 spider3 ..."

crawl: error: running 'scrapy crawl' with more than one spider is no
longer supported

My primary concern is that I will be crawling many sites that are
trying to stop people from stealing their content (student research
papers). Ironically, I'm not interested in their content per se; I
want to index it for our source-tracking app. With these annoying
sites I would need extra-long delays, and each individual spider would
spend a lot of time idling.
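
To illustrate the idling concern: if one process interleaved requests across sites, the waiting time of a slow site could be spent on the others. Below is a minimal, Scrapy-independent sketch of that idea in plain Python. It only simulates the order and timing of requests (nothing is downloaded, nothing sleeps), and the names `interleave`, `queues`, `delays` and the example hosts are made up for illustration.

```python
import heapq


def interleave(queues, delays, default_delay=1.0):
    """Return (time, site, url) tuples in the order requests would be sent.

    queues: dict mapping site -> list of URLs to fetch (hypothetical input)
    delays: dict mapping site -> minimum seconds between requests to that site
    Time is simulated; fetch duration is ignored.
    """
    # Min-heap of (next_allowed_time, site); every non-empty site is ready at t=0.
    ready = [(0.0, site) for site in sorted(queues) if queues[site]]
    heapq.heapify(ready)
    positions = {site: 0 for site in queues}  # index of the next URL per site
    schedule = []
    while ready:
        t, site = heapq.heappop(ready)
        i = positions[site]
        schedule.append((t, site, queues[site][i]))
        positions[site] = i + 1
        # Re-queue the site only after its own delay has elapsed.
        if positions[site] < len(queues[site]):
            heapq.heappush(ready, (t + delays.get(site, default_delay), site))
    return schedule


if __name__ == "__main__":
    queues = {"a.example": ["/1", "/2", "/3"], "b.example": ["/1", "/2"]}
    delays = {"a.example": 1.0, "b.example": 5.0}
    for t, site, url in interleave(queues, delays):
        print(t, site, url)
```

In this simulation the slow site (5 second delay) never blocks the fast one: requests to `a.example` go out at t=0, 1 and 2 while `b.example` is waiting for its next slot at t=5.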

I have now dug into some of the Scrapy code, and I think I will indeed
go with option a). I probably won't have more than 50 of these
extra-clever sites, and Scrapy might support running multiple spiders
in 0.16.