Scrapy slows down when the scheduler queue gets big


Federico Feroldi

Feb 25, 2009, 10:13:34 AM
to scrapy-users
Hi all,

I'm crawling a very large site and I've noticed that the crawler slows
down a lot once the scheduler's pending queue goes beyond 100k requests.
My first thought was the heapq module that is used to implement the
priority queues: do you think its efficiency drops on very large queues?
What else could be the bottleneck?
Does anybody have experience with crawling very large sites?
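(In case it helps frame the heapq question: here is a minimal,
hypothetical micro-benchmark of heapq alone at different queue sizes.
The (priority, request_id) tuples and the sizes are just assumptions,
not Scrapy's actual scheduler entries.)

    import heapq
    import random
    import time

    def bench(size, ops=10000):
        # Build a heap of `size` (priority, request_id) tuples, then time
        # a mix of pushes and pops to see whether the per-operation cost
        # grows noticeably with the heap size.
        heap = [(random.random(), i) for i in range(size)]
        heapq.heapify(heap)
        start = time.time()
        for i in range(ops):
            heapq.heappush(heap, (random.random(), size + i))
            heapq.heappop(heap)
        return (time.time() - start) / ops

    for size in (10000, 100000, 1000000):
        print("size=%d: %.2e s per push+pop" % (size, bench(size)))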

cheers

-
Federico Feroldi
http://cloudify.me

Pablo Hoffman

Mar 22, 2009, 1:10:41 PM
to scrapy...@googlegroups.com
Federico, is Scrapy more usable now for very large sites, after the
merge of your priority queue implementation?

Pablo.

Federico Feroldi

Mar 23, 2009, 9:44:40 AM
to scrapy-users
Hi Pablo,
unfortunately not. I think Scrapy has a scalability limitation for
densely connected sites (think of social networks, where you may have
thousands of connections between pages): the queue grows a lot, and so
does the memory it uses.
I experienced memory usage of about 5GB during one crawl, so if you
crawl a very large site you need a lot of memory to hold all the
objects. I've tried to implement a persistent priority queue based on
a key/value dbm, but I found it very complicated to manage the
deferred objects, and you cannot easily serialize/deserialize the
requests in the priority queue because you lose the references to
the callbacks.
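(A rough, hypothetical sketch of one possible workaround, with made-up
names like request_cls: store the callback by method name, so a request
pulled back out of the dbm can be re-bound to the spider instance. I
haven't verified this covers all cases.)

    import pickle

    def serialize_request(request):
        # Keep only plain data; store the callback as a method *name*
        # instead of a bound method, so the blob can go into a
        # key/value dbm.
        return pickle.dumps({
            'url': request.url,
            'priority': request.priority,
            'callback': request.callback.__name__ if request.callback else None,
        })

    def deserialize_request(blob, spider, request_cls):
        # Rebuild the request and re-bind the callback by looking it up
        # on the spider again.
        d = pickle.loads(blob)
        callback = getattr(spider, d['callback']) if d['callback'] else None
        return request_cls(url=d['url'], priority=d['priority'],
                           callback=callback)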
On the other hand, even with plenty of memory (so no swapping), I've
found that performance drops during the crawl; I don't know whether
it's due to the size of the queue or because there are so many object
references that the Python VM becomes less efficient.
Here's some data from one of these crawls. As you can see, the crawler
starts at a speed of about 7 pages per second and later drops to 0.5
pages per second once the queue grows to more than 400k items.
I believe that to make the Scrapy framework really scalable you must
add the option of persistent storage for the queue and make sure the
scheduler stays O(1); a rough sketch of such a disk-backed queue
follows after the data below.
It would also be nice to decouple the downloader/spider/scheduler
components with something like a message queue, to be able to run
many instances of them on multiple hosts.

  Time (s)    Items   Pages    Queue  items/s  pages/s  items/page  queued/page
    114.37     2763     398     7848    24.16     3.48        6.94        19.72
    264.57     6151     736    16833    23.25     2.78        8.36        22.87
    326.38     7531     864    19837    23.07     2.65        8.72        22.96
    738.62    15886    1630    37981    21.51     2.21        9.75        23.30
  1,172.55    27064    2523    57737    23.08     2.15       10.73        22.88
  1,327.45    30648    2815    64171    23.09     2.12       10.89        22.80
  1,623.96    36477    3288    73804    22.46     2.02       11.09        22.45
  1,953.17    41115    3688    80954    21.05     1.89       11.15        21.95
  2,156.45    42864    3831    83583    19.88     1.78       11.19        21.82
  3,150.46    51575    4536    96032    16.37     1.44       11.37        21.17
  4,201.18    59269    5153   105699    14.11     1.23       11.50        20.51
  5,295.83    67220    5841   115858    12.69     1.10       11.51        19.84
  6,595.78    75952    6590   126885    11.52     1.00       11.53        19.25
 64,307.18   333010   29965   409072     5.18     0.47       11.11        13.65
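(To illustrate what I mean by persistent, O(1) storage for the queue,
here is a minimal, hypothetical sketch of a shelve-backed FIFO. A real
scheduler would still need one such queue per priority, plus proper
request serialization as sketched above.)

    import shelve

    class DiskQueue(object):
        # Minimal disk-backed FIFO: each entry is a serialized request
        # blob stored under an integer key, so push and pop are O(1)
        # and the pending requests do not have to stay in memory.

        def __init__(self, path):
            self.db = shelve.open(path)
            self.head = self.db.get('head', 0)
            self.tail = self.db.get('tail', 0)

        def push(self, blob):
            self.db[str(self.tail)] = blob
            self.tail += 1
            self.db['tail'] = self.tail

        def pop(self):
            if self.head >= self.tail:
                return None
            blob = self.db[str(self.head)]
            del self.db[str(self.head)]
            self.head += 1
            self.db['head'] = self.head
            return blob

        def close(self):
            self.db.close()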