Running Scrapy at scale


adi....@cortica.com

Dec 3, 2014, 7:01:52 AM
to scrapy...@googlegroups.com
Hi,
I am building a back end in which one module needs to scrape various websites. The URL originates from an end user, so the domain is known beforehand, but the full URL is dynamic.

The back-end is planned to support thousands of requests per second.
I like what I see in Scrapy regarding feature coverage, extensibility, ease of use and more, but I am concerned about these two points:

1. Passing the URL to Scrapy in real time as an argument, where only the domain (and therefore the specific spider) is known.
2. I've read that to invoke Scrapy via an API one should use scrapyd with its JSON API, which starts a process per scraping job. That means one process runs per request, which is not scalable (imagine each request takes 1.5 seconds).

Please advise,

Travis Leleu

Dec 3, 2014, 10:24:36 AM
to scrapy...@googlegroups.com
Hi Adi,

I believe Scrapy would meet your needs, especially since you have a decentralized queue to feed the URLs into it.

1. If you use the start_requests() method (see http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spider.Spider.start_requests), you can consume from the queue to feed URLs into Scrapy: pop the queue, modify the URL as needed, and yield a request for it to the Scrapy core engine.

2. scrapyd is a convenient way to send jobs around to different systems without having to copy your codebase. It's essentially a deployment tool, so you don't need it to get concurrency. Scrapy itself is pretty efficient for web scraping. Scraping is I/O bound, and Scrapy is built on Twisted, an asynchronous networking framework. So Scrapy fires off a request, then forgets about it until the response comes back through Twisted. In the interim, the same process can handle other responses or fire off other requests.
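For point 1, here is a rough sketch of the queue-draining part, not a definitive implementation. The helper name `urls_from_queue`, the in-process `queue.Queue` (standing in for whatever queue you really use) and the domain check are my own assumptions for illustration. Only the commented-out wiring touches Scrapy itself.

```python
import queue
from urllib.parse import urlsplit


def urls_from_queue(url_queue, allowed_domain):
    """Yield queued URLs whose host belongs to allowed_domain, until the queue is empty."""
    while True:
        try:
            # Non-blocking pop, so the caller's event loop is never stalled.
            url = url_queue.get_nowait()
        except queue.Empty:
            return
        host = urlsplit(url).hostname or ""
        # Accept the bare domain and its subdomains; silently skip anything else.
        if host == allowed_domain or host.endswith("." + allowed_domain):
            yield url


# Hypothetical wiring inside a spider (needs Scrapy installed, not run here):
#
#     class ExampleSpider(scrapy.Spider):
#         name = "example"
#
#         def start_requests(self):
#             for url in urls_from_queue(self.url_queue, "example.com"):
#                 yield scrapy.Request(url, callback=self.parse)

if __name__ == "__main__":
    q = queue.Queue()
    for u in ("http://example.com/a", "http://other.org/x", "http://www.example.com/b"):
        q.put(u)
    print(list(urls_from_queue(q, "example.com")))
```

Note that the generator stops as soon as the queue is empty, and a blocking get() there would stall the whole crawler. To keep a spider alive while waiting for new URLs you would probably need to hook the spider_idle signal, but I have not shown that here.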

Processing requirements vary, but I would expect you could have hundreds, if not thousands, of concurrent scraping requests on a medium-sized EC2 server.
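To make the single-process concurrency idea concrete, here is a small sketch using Python's asyncio instead of Twisted (a deliberate swap so it runs with the standard library only). The `fake_fetch` function simulates network latency with a sleep and is purely hypothetical, as are the names and numbers. It is not how Scrapy is implemented, only the same principle.

```python
import asyncio
import time


async def fake_fetch(url, latency=0.05):
    """Stand-in for an HTTP request: waits without blocking the event loop."""
    await asyncio.sleep(latency)
    return url, 200


async def fetch_all(urls, max_concurrency=100, latency=0.05):
    """Run all fetches in one process, with at most max_concurrency in flight."""
    limit = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with limit:
            return await fake_fetch(url, latency)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


if __name__ == "__main__":
    urls = [f"http://example.com/page/{i}" for i in range(200)]
    start = time.perf_counter()
    results = asyncio.run(fetch_all(urls))
    elapsed = time.perf_counter() - start
    print(f"{len(results)} responses in {elapsed:.2f}s")
```

Because the waits overlap, total time is governed by latency and the concurrency cap rather than by latency multiplied by the number of requests. In Scrapy the comparable cap is the CONCURRENT_REQUESTS setting.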

In my experience, the only shortcomings of Scrapy are its architectural complexity (it takes some time to master) and the lack of JavaScript support. So many sites are single-page apps that load their content via JS, and Scrapy (to my knowledge) can't do anything with that on its own.

Hope this helps,
Travis


Erik van de Ven

May 10, 2015, 6:01:01 AM
to scrapy...@googlegroups.com, m...@travisleleu.com
Use this for JavaScript, it works perfectly: https://github.com/brandicted/scrapy-webdriver

On Wednesday, December 3, 2014 at 16:24:36 UTC+1, Travis Leleu wrote:

Andres Vargas - zodman

May 11, 2015, 8:50:25 PM
to scrapy...@googlegroups.com, m...@travisleleu.com
What about the speed?
--
Andres Vargas
www.zodman.com.mx