How to increase Scrapy crawling speed?


tanpure...@gmail.com

Oct 14, 2013, 7:56:48 AM
to scrapy...@googlegroups.com
I am using Scrapy to crawl websites and extract data into a JSON file, but I have found that for some sites the crawler takes ages to crawl the complete website. My question is: how can I increase the crawler's speed so that the time taken to crawl is minimized?
I tried setting values for CONCURRENT_ITEMS, CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN in my spider file. But the website I am extracting data from is quite big, and it is difficult for me to test the time taken to crawl the entire site by trial and error with these values, as I have to wait quite a while for the crawler to finish before I can change the values and test again.
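For reference, the settings I have been experimenting with look roughly like this (the values are only examples, not what I necessarily ended up using):

# settings.py -- example values only
CONCURRENT_ITEMS = 200
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 16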
So then I tried to add a limit to the number of pages crawled using the extension scrapy.contrib.closespider.CloseSpider and set CLOSESPIDER_PAGECOUNT = 100, but it still takes a lot of time. I also reduced the value to 5 and it is still not working for me. Is the reason that I have set this rule?

rules = [Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')]
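For context, the spider is roughly of this shape (the name, domain and callback body are placeholders, not my real code), with CLOSESPIDER_PAGECOUNT = 100 set in the settings:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ExampleSpider(CrawlSpider):
    name = 'example'
    start_urls = ['http://www.example.com/']
    # follow every link and pass each page to parse_item
    rules = [Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')]

    def parse_item(self, response):
        # extract fields and yield the item here
        pass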
Any help will be appreciated.

Rolando Espinoza La Fuente

Oct 14, 2013, 8:36:43 AM
to scrapy...@googlegroups.com
You can consider running multiple spiders in parallel for large websites. But be aware that crawling at the speed of light can be considered a (D)DoS attack against the website.
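For example, a quick way to run several spiders in one process (a rough sketch using scrapy.crawler.CrawlerProcess from more recent Scrapy releases; the spider classes are placeholders, each covering a different part of the site):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders import SectionOneSpider, SectionTwoSpider  # placeholder spiders

process = CrawlerProcess(get_project_settings())
process.crawl(SectionOneSpider)
process.crawl(SectionTwoSpider)
process.start()  # blocks until both crawls finish

Alternatively, each spider can simply be started from its own shell with a separate "scrapy crawl" invocation.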

scrapy-redis (https://github.com/darkrho/scrapy-redis) can help you distribute the request queue across multiple spider processes. This assumes your bottleneck is bandwidth and that you run each spider on a different host. If your bottleneck is the CPU, then you need to add either more processing power or more hosts.
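As a rough sketch, hooking a project into scrapy-redis is mostly a settings change (setting names as documented in the scrapy-redis README; verify them against the version you install):

# settings.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # requests queued in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # shared duplicate filter
SCHEDULER_PERSIST = True        # keep the queue between runs
REDIS_HOST = 'localhost'        # every spider process points at the same Redis
REDIS_PORT = 6379

Then you start the same spider on each host and they all pull from the shared queue.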

Also, if your bottleneck is the CPU, you might want to consider fine-tuning the parsing, i.e. using lxml directly instead of the SgmlLinkExtractor or an lxml-based link extractor, avoiding the creation of lists (link extractors, selectors) in favour of generators, and so on.
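For example, a callback along these lines parses with lxml directly and yields follow-up requests one by one instead of building intermediate lists (just a sketch; scrapy.Spider and response.urljoin are from newer Scrapy releases, so adjust the imports for older ones):

import lxml.html
import scrapy

class FastSpider(scrapy.Spider):
    name = 'fast'
    start_urls = ['http://www.example.com/']   # placeholder

    def parse(self, response):
        doc = lxml.html.fromstring(response.body)
        # generator: yield each request as soon as its link is found
        for href in doc.xpath('//a/@href'):
            yield scrapy.Request(response.urljoin(href), callback=self.parse)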

