About scrapy's concurrency model


Edward Cui

Jun 6, 2012, 10:29:58 PM
to scrapy-users
Hey guys,

Now I plan to use Scrapy in a more distributed way, and I'm not sure whether the spiders/pipelines/downloaders/schedulers and the engine are all hosted in separate processes or threads. Could anyone share some info about this? And can we change the process/thread count for each component? I know there are two settings, "CONCURRENT_REQUESTS" and "CONCURRENT_ITEMS", which determine the number of concurrent threads for the downloaders and pipelines, right? And if I want to deploy the spiders/pipelines/downloaders on different machines, I will need to serialize the items/requests/responses, right?

Thanks very much for your help!

Thanks,
Edward.

Shane Evans

Jun 6, 2012, 11:02:20 PM
to scrapy...@googlegroups.com
On 07/06/12 11:29, Edward Cui wrote:
Hey guys,

Now I plan to use Scrapy in a more distributed way, and I'm not sure whether the spiders/pipelines/downloaders/schedulers and the engine are all hosted in separate processes or threads. Could anyone share some info about this?
Everything in Scrapy runs in a single Python process.

It uses Twisted, and the network concurrency comes from issuing multiple requests asynchronously instead of using threads and blocking requests.
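
So CONCURRENT_REQUESTS and CONCURRENT_ITEMS don't set thread counts; they cap how many requests and items are in flight at once inside that single process. A minimal settings.py sketch (the values here are only illustrative):

    # settings.py -- concurrency limits for the single Scrapy process
    CONCURRENT_REQUESTS = 32            # max requests the downloader keeps in flight at once
    CONCURRENT_REQUESTS_PER_DOMAIN = 8  # politeness limit per domain
    CONCURRENT_ITEMS = 100              # max items (per response) processed in parallel by the pipelines
    DOWNLOAD_DELAY = 0.5                # optional delay between requests to the same site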

And can we change the process/thread count for each component? I know there are two settings, "CONCURRENT_REQUESTS" and "CONCURRENT_ITEMS", which determine the number of concurrent threads for the downloaders and pipelines, right? And if I want to deploy the spiders/pipelines/downloaders on different machines, I will need to serialize the items/requests/responses, right?
If you want to implement a distributed crawler, you can run Scrapy on each machine (at least one process per core) and have some way to distribute the work among them.

The best way to do this really depends on your crawling task.

If you just want to run existing spiders on more than one machine, you can set up multiple scrapyd processes (one on each machine) and write some job control logic to submit and control the jobs. Alternatively, you could use Scrapinghub's Scrapy cloud to do this, along with a lot more.
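
For instance, submitting a job to scrapyd is just an HTTP POST to its schedule.json endpoint, so the job control logic can be a small script. A rough sketch, assuming two machines each running a scrapyd instance (the host, project and spider names are made up):

    import json
    import urllib
    import urllib2

    # hypothetical machines, each running scrapyd on the default port
    SCRAPYD_HOSTS = ["http://crawler1:6800", "http://crawler2:6800"]

    def schedule(spider, project="myproject"):
        # naive distribution: pick a host by hashing the spider name
        host = SCRAPYD_HOSTS[hash(spider) % len(SCRAPYD_HOSTS)]
        data = urllib.urlencode({"project": project, "spider": spider})
        response = urllib2.urlopen(host + "/schedule.json", data)  # POST, since data is given
        return json.loads(response.read())["jobid"]

    print schedule("example_spider")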

If you have a very large crawl and want to use multiple machines, you could try either:
  • Splitting the crawl up into parts and letting each spider crawl a section (see my answer on SO for some ideas; a rough sketch follows below)
  • Replacing some Scrapy components to read from remote shared data structures instead of the local filesystem. Scrapy-redis is an example of this approach.
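
For the first option, one simple way to split the work is to shard the start URLs by a hash, so each machine only keeps its own slice. A sketch that assumes the URLs live in a local file; the spider name, arguments and file name are made up:

    import hashlib
    from scrapy.spider import BaseSpider

    class ShardedSpider(BaseSpider):
        name = "sharded_example"

        def __init__(self, shard=0, num_shards=1, *args, **kwargs):
            super(ShardedSpider, self).__init__(*args, **kwargs)
            self.shard = int(shard)
            self.num_shards = int(num_shards)

        def start_requests(self):
            for line in open("all_urls.txt"):
                url = line.strip()
                # keep only the URLs that hash into this machine's shard
                if int(hashlib.md5(url).hexdigest(), 16) % self.num_shards == self.shard:
                    yield self.make_requests_from_url(url)

Each machine then runs something like "scrapy crawl sharded_example -a shard=2 -a num_shards=4".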


Cheers,

Shane

xucheng

Jun 6, 2012, 11:17:11 PM
to scrapy...@googlegroups.com
Just as Edward says, you may try Gearman for job dispatching.

Keep in touch!




Geek Gamer

Jun 6, 2012, 11:36:47 PM
to scrapy...@googlegroups.com
There are two scenarios:

1. You want to crawl more sites using multiple machines. This can be achieved simply by using scrapyd with some centralized queue like Redis, as already mentioned (I am using a MySQL-based queue since I already need MySQL for other updates; a rough sketch of it follows below).

2. You want to use multiple machines to crawl the same site in parallel. This would be a very interesting situation, giving better utilization of resources and also helping with the per-IP limit for any particular site. As I understand it, there is a request queue for each crawler process, and I assume this could also be centralized so that processes pick up requests from it. I would be very interested to hear if anyone else has tried it.
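
To illustrate scenario 1, the queue does not need to be anything fancy; the claim step just has to be atomic so two machines never grab the same job. A rough sketch with made-up table and column names (it assumes each worker runs only one job at a time):

    import MySQLdb  # any DB-API driver works the same way

    conn = MySQLdb.connect(host="dbhost", user="crawler", passwd="secret", db="crawl")

    def claim_next_job(worker_id):
        # atomically mark one pending job as ours
        cur = conn.cursor()
        claimed = cur.execute(
            "UPDATE jobs SET status='running', worker=%s "
            "WHERE status='pending' ORDER BY id LIMIT 1", (worker_id,))
        conn.commit()
        if not claimed:
            return None  # nothing left to do
        cur.execute("SELECT id, spider FROM jobs WHERE worker=%s AND status='running'",
                    (worker_id,))
        return cur.fetchone()  # (job id, spider name) to hand to a local scrapyd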

Edward Cui

Jun 7, 2012, 3:55:11 AM
to scrapy...@googlegroups.com
Thanks Shane! 

Edward Cui

Jun 7, 2012, 3:56:17 AM
to scrapy...@googlegroups.com
Thanks Jerry! 

Edward Cui

Jun 7, 2012, 3:56:51 AM
to scrapy...@googlegroups.com
Thanks Umar!

Shane Evans

Jun 7, 2012, 6:06:12 AM
to scrapy...@googlegroups.com
On 07/06/12 12:36, Geek Gamer wrote:
> There are two scenarios:
>
> 1. You want to crawl more sites using multiple machines. This can be
> achieved simply by using scrapyd with some centralized queue like Redis,
> as already mentioned (I am using a MySQL-based queue since I already
> need MySQL for other updates).
This is the first approach I mentioned as "write some job control logic to submit and control the jobs". While Redis is a good choice for that approach too, scrapy-redis is something different: it is an example of crawling the same site from multiple machines, like your point 2. The idea is to replace the Scheduler with one that stores the request queues and duplicate filters in Redis.
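
For anyone who wants to try it, wiring scrapy-redis in is mostly a settings change, roughly along these lines (check the scrapy-redis README for the exact setting names in the version you install; the Redis host is a placeholder):

    # settings.py -- every machine points at the same redis server
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # request queue lives in redis
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # shared seen-requests filter
    REDIS_HOST = "redis-server"
    REDIS_PORT = 6379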