Concurrent settings in scrapy/scrapyd


k bez

Jan 6, 2017, 12:53:43 PM1/6/17
to scrapy-users
I have two scrapyd instances with max_proc=8 each.
I am aware of the CONCURRENT_REQUESTS_PER_IP and CONCURRENT_REQUESTS_PER_DOMAIN settings, but I read in a previous post that they apply per slot/process.
I tested it, and even with both settings set to 1 it keeps downloading 8x2 concurrent items from the same domain.
Is there some way to limit concurrent requests when I have max_proc > 1 in scrapyd?
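For reference, the per-spider side of the setup described above would look roughly like this (a sketch; the actual project settings may differ, and max_proc is configured separately in scrapyd's own config file):

```python
# settings.py for each spider (values from the question above).
# Each scrapyd instance is separately configured with max_proc = 8,
# so these limits apply independently in every crawl process.
CONCURRENT_REQUESTS_PER_DOMAIN = 1
CONCURRENT_REQUESTS_PER_IP = 1
```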

Nikolaos-Digenis Karagiannis

Jan 7, 2017, 6:24:42 AM1/7/17
to scrapy-users
Those settings limit the concurrent requests per IP/domain per downloader.
Since each process has its own downloader,
the effective limit is multiplied by the number of processes.
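To make the multiplication concrete, here is a quick back-of-the-envelope calculation using the numbers from the question:

```python
# Each downloader enforces the per-domain limit independently,
# so the worst-case concurrency against one domain is the product:
instances = 2          # scrapyd instances
max_proc = 8           # processes per instance
per_domain_limit = 1   # CONCURRENT_REQUESTS_PER_DOMAIN in each process

effective = instances * max_proc * per_domain_limit
print(effective)  # 16 concurrent requests against the same domain
```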

What you want to do totally makes sense.
You are crawling from the same machine,
so if multiple processes start crawling the same host
you may get blocked.

There are several ways to solve this:

You can ensure that your crawls don't overlap (in terms of crawled hosts)
by configuring your spiders and whatever schedules their runs.
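One simple way to keep crawls from overlapping is to assign each host deterministically to a single process. A minimal sketch of the idea (this helper is purely illustrative, not part of scrapy or scrapyd):

```python
import hashlib

def owner_slot(domain: str, total_slots: int) -> int:
    """Deterministically map a domain to one slot (process/instance).

    If the scheduler only ever hands a domain to its owning slot,
    the per-process concurrency limits effectively become global,
    because no two processes crawl the same host.
    """
    digest = hashlib.sha1(domain.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % total_slots

# Example: 16 slots total (2 instances x 8 processes each).
slot = owner_slot("example.com", 16)
```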

You can write a custom downloader
that communicates with the other instances of itself,
so that together they track some global (per IP/domain) slot utilization.
I wonder if the developers would be interested in integrating such a solution into scrapy.
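No such component ships with scrapy, but the shared-bookkeeping idea can be sketched as follows. The plain dict here stands in for a store all processes can reach (e.g. Redis, where acquire/release would need to be atomic); everything else is a hypothetical illustration:

```python
class GlobalSlotLimiter:
    """Tracks per-domain in-flight requests in a store shared by all processes.

    `store` is an in-memory dict for illustration only; in a real deployment
    it would be a shared backend, with try_acquire/release made atomic.
    """

    def __init__(self, store, limit):
        self.store = store
        self.limit = limit

    def try_acquire(self, domain: str) -> bool:
        in_flight = self.store.get(domain, 0)
        if in_flight >= self.limit:
            return False  # some process already holds all slots for this domain
        self.store[domain] = in_flight + 1
        return True

    def release(self, domain: str) -> None:
        self.store[domain] = max(0, self.store.get(domain, 0) - 1)

# Two "processes" sharing one store: at most `limit` requests per domain overall.
store = {}
a = GlobalSlotLimiter(store, limit=1)
b = GlobalSlotLimiter(store, limit=1)
assert a.try_acquire("example.com") is True
assert b.try_acquire("example.com") is False  # blocked by the global limit
a.release("example.com")
assert b.try_acquire("example.com") is True
```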

Finally, it's theoretically possible to do this on a router/firewall.
Traffic shaping usually works the other way round,
so I don't know if you'll find enough information on this.
Note that this is different from limiting the requests per IP;
it would only limit the bandwidth per IP.