What makes Scrapy so much faster than other spiders/data miners?

scrapbook

Apr 19, 2011, 2:40:35 PM4/19/11
to scrapy-users
Hi, I would love some detailed information about Scrapy and why it runs
so fast.

I'm amazed at its speed compared to when I used to scrape with
PHP and cURL.

Thanks.

bruce

Apr 19, 2011, 7:11:16 PM4/19/11
to scrapy...@googlegroups.com
Who said it is?

I would imagine that if you're really comparing apples to apples, i.e., using
the same threading model/approach, there shouldn't be a great deal of
difference in overhead...

but one would have to understand the gut-level approach of each process.


asmith

Apr 27, 2011, 1:39:51 PM4/27/11
to scrapy-users
Sync vs. async.

Scrapy uses Twisted under the covers: http://twistedmatrix.com

A nice discussion of asynchronous programming can be found here:
http://krondo.com/?p=1209

Try these two scripts out -- each one is invoked with a count and a
URL, where the count is the number of times to fetch that URL. You should
see that the larger the number of URLs to fetch, the better the async
model performs.

-------8<---------sync-url.py----------8<----------
#!/usr/bin/python
# Python 2: fetches the URL `limit` times, one request after another.

import time
import urllib

def main(args):
    limit = args[0]
    url = args[1]
    start = time.time()
    for i in xrange(1, int(limit) + 1):
        urllib.urlopen(url)  # blocks until the response arrives
        print "finished request %s" % i
    end = time.time()
    print end - start

if __name__ == '__main__':
    import sys
    main(sys.argv[1:])


-------8<---------async-url.py----------8<----------
#!/usr/bin/python
# Python 2 / Twisted: issues all requests up front; the reactor
# multiplexes them and fires a callback as each response completes.

from twisted.internet import reactor
from twisted.web.client import getPage

import time

jobs = list()
start = 0

def shutdown():
    global start
    print time.time() - start
    reactor.stop()

def cb(result, jobid):
    print "finished request %d" % jobid
    jobs.remove(jobid)
    if not jobs:  # all requests done -- report the elapsed time and stop
        shutdown()

def main(args):
    global start, jobs
    limit, url = args
    jobs = range(1, int(limit) + 1)
    start = time.time()
    for i in jobs:
        d = getPage(url)  # returns a Deferred immediately, without blocking
        d.addCallback(cb, i)

if __name__ == '__main__':
    import sys
    main(sys.argv[1:])
    reactor.run()
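For what it's worth, the same sync-vs-async effect can be sketched without Twisted or a network connection. This is a minimal Python 3 sketch, not equivalent to the scripts above: it simulates each request's latency with a sleep (the 0.1 s delay and the count of 10 are arbitrary assumptions), so you can see the timing difference directly. Run sequentially the waits add up; run under asyncio they overlap.

```python
#!/usr/bin/env python3
"""Sketch: sequential vs. concurrent waits, with sleep standing in for I/O."""
import asyncio
import time

DELAY = 0.1   # simulated per-request latency (arbitrary)
COUNT = 10    # number of simulated "requests" (arbitrary)

def fetch_sync():
    # Blocking version: nothing else can run while we wait.
    time.sleep(DELAY)

async def fetch_async():
    # Non-blocking version: the event loop runs other waits meanwhile.
    await asyncio.sleep(DELAY)

def timed_sync():
    # Waits happen one after another: total is roughly DELAY * COUNT.
    start = time.time()
    for _ in range(COUNT):
        fetch_sync()
    return time.time() - start

async def timed_async():
    # All waits overlap: total is roughly DELAY, regardless of COUNT.
    start = time.time()
    await asyncio.gather(*(fetch_async() for _ in range(COUNT)))
    return time.time() - start

if __name__ == '__main__':
    print("sync:  %.2fs" % timed_sync())
    print("async: %.2fs" % asyncio.run(timed_async()))
```

The sync total grows linearly with COUNT while the async total stays near a single DELAY, which is the same reason the Twisted script above pulls ahead as the number of URLs grows.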