What makes Scrapy so much faster than other spiders/data miners?

scrapbook

Apr 19, 2011, 2:40:35 PM4/19/11
to scrapy-users
Hi, I would love some detailed information about Scrapy and why it runs
so fast.

I'm amazed at its speed compared to when I used to scrape with
PHP and cURL.

Thanks.

bruce

Apr 19, 2011, 7:11:16 PM4/19/11
to scrapy...@googlegroups.com
Who said it is?

I would imagine that if you're really comparing apples to apples, i.e., using
the same threading model/approach, there shouldn't be a great deal of
difference in overhead...

but one would have to understand the gut-level approach of each process.


asmith

Apr 27, 2011, 1:39:51 PM4/27/11
to scrapy-users
Sync vs. async.

Scrapy uses Twisted under the covers: http://twistedmatrix.com

A nice discussion of asynchronous programming can be found here:
http://krondo.com/?p=1209

Try these two scripts out -- each one is invoked with a count and a
URL, where the count is the number of times to fetch that URL. You should
see that the larger the number of URLs to fetch, the better the async
model performs.

-------8<---------sync-url.py----------8<----------
#!/usr/bin/python
# Python 2: fetches the URL `limit` times, one request after another.

import time
import urllib

def main(args):
    limit = args[0]
    url = args[1]
    start = time.time()
    for i in xrange(1, int(limit) + 1):
        urllib.urlopen(url)  # blocks until the response arrives
        print "finished request %s" % i
    end = time.time()
    print end - start

if __name__ == '__main__':
    import sys
    main(sys.argv[1:])


-------8<---------async-url.py----------8<----------
#!/usr/bin/python
# Python 2 / Twisted: issues all requests up front; the reactor
# multiplexes them and fires a callback as each response completes.

from twisted.internet import reactor
from twisted.web.client import getPage

import time

jobs = list()
start = 0

def shutdown():
    global start
    print time.time() - start
    reactor.stop()

def cb(result, jobid):
    print "finished request %d" % jobid
    jobs.remove(jobid)
    if not jobs:  # all requests done -- report the elapsed time and stop
        shutdown()

def main(args):
    global start, jobs
    limit, url = args
    jobs = range(1, int(limit) + 1)
    start = time.time()
    for i in jobs:
        d = getPage(url)  # returns a Deferred immediately, without blocking
        d.addCallback(cb, i)

if __name__ == '__main__':
    import sys
    main(sys.argv[1:])
    reactor.run()
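For what it's worth, the same sync-vs-async effect can be sketched without Twisted or a network connection. This is a minimal Python 3 sketch, not equivalent to the scripts above: it simulates each request's latency with a sleep (the 0.1 s delay and the count of 10 are arbitrary assumptions), so you can see the timing difference directly. Run sequentially the waits add up; run under asyncio they overlap.

```python
#!/usr/bin/env python3
"""Sketch: sequential vs. concurrent waits, with sleep standing in for I/O."""
import asyncio
import time

DELAY = 0.1   # simulated per-request latency (arbitrary)
COUNT = 10    # number of simulated "requests" (arbitrary)

def fetch_sync():
    # Blocking version: nothing else can run while we wait.
    time.sleep(DELAY)

async def fetch_async():
    # Non-blocking version: the event loop runs other waits meanwhile.
    await asyncio.sleep(DELAY)

def timed_sync():
    # Waits happen one after another: total is roughly DELAY * COUNT.
    start = time.time()
    for _ in range(COUNT):
        fetch_sync()
    return time.time() - start

async def timed_async():
    # All waits overlap: total is roughly DELAY, regardless of COUNT.
    start = time.time()
    await asyncio.gather(*(fetch_async() for _ in range(COUNT)))
    return time.time() - start

if __name__ == '__main__':
    print("sync:  %.2fs" % timed_sync())
    print("async: %.2fs" % asyncio.run(timed_async()))
```

The sync total grows linearly with COUNT while the async total stays near a single DELAY, which is the same reason the Twisted script above pulls ahead as the number of URLs grows.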