Running scrapy at scale

adi....@cortica.com

Dec 3, 2014, 7:01:52 AM

to scrapy...@googlegroups.com

Hi,

I am building a back-end in which one module needs to scrape various websites. Each URL comes from an end user: the domain is known beforehand, but the full URL is dynamic.

The back-end is planned to support thousands of requests per second.

I like what I see in scrapy regarding feature coverage, extensibility, ease of use and more, but I am concerned about these two points:

1. Passing the URL to scrapy in real time as an argument, when only the domain (and therefore the specific spider) is known in advance.

2. I've read that to invoke scrapy via an API one should use scrapyd with its JSON API, which starts a process per scraping job. That means one process runs per request, which does not scale (imagine each request takes 1.5 seconds).

Please advise,

Travis Leleu

Dec 3, 2014, 10:24:36 AM

to scrapy...@googlegroups.com

Hi Adi,

I believe scrapy would meet your needs, especially since you have a decentralized queue to feed the urls into it.

1. If you use the start_requests() method (see more: http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spider.Spider.start_requests), you can just consume from the queue to feed URLs into scrapy. You can pop the queue, modify the URL as needed, and yield it to the scrapy core engine.

2. scrapyd is a convenient way to send jobs around to different systems without having to copy your codebase. It's essentially a deployment tool. Scrapy itself is pretty efficient for web scraping: scraping is I/O bound, and scrapy uses Twisted, an asynchronous networking framework. So scrapy fires off a request, then forgets about it until the response comes back through Twisted. In the interim, it can process or fire off other requests.
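For point 1, a minimal sketch of a spider fed from a queue might look like the following. The names `pop_urls`, `build_spider`, `QueueSpider` and the domain `example.com` are made up for illustration, and a plain in-process `queue.Queue` stands in for whatever queue you actually run. The scrapy import sits inside `build_spider` only so the queue helper can be tried on its own.

```python
import queue
from urllib.parse import urlparse

ALLOWED_DOMAIN = "example.com"  # hypothetical domain handled by this spider


def pop_urls(url_queue, allowed_domain):
    """Drain the queue, yielding only URLs that belong to the expected domain."""
    while True:
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            return  # nothing left to consume right now
        host = urlparse(url).hostname or ""
        if host == allowed_domain or host.endswith("." + allowed_domain):
            yield url


def build_spider(url_queue):
    """Return a spider class whose start_requests() consumes url_queue."""
    import scrapy  # imported lazily; requires scrapy to be installed

    class QueueSpider(scrapy.Spider):
        name = "queue_spider"
        allowed_domains = [ALLOWED_DOMAIN]

        def start_requests(self):
            # Pop user-supplied URLs and hand them to the scrapy engine.
            for url in pop_urls(url_queue, ALLOWED_DOMAIN):
                yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            # Placeholder extraction: just the page title.
            yield {"url": response.url, "title": response.css("title::text").get()}

    return QueueSpider
```

Note that this sketch drains the queue once, when the spider starts. A long-running service would need to keep pulling from the queue after that, for example by scheduling more requests when the spider goes idle, and the details depend on your scrapy version.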

Processing requirements vary, but I would expect you could have hundreds, if not thousands, of concurrent scraping requests using a medium-sized EC2 server.
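To illustrate why a single process can hold many requests in flight, here is a small stand-alone sketch. It is not scrapy code: it uses Python's asyncio instead of Twisted, and `asyncio.sleep` stands in for the time spent waiting on the network. The names `fake_fetch`, `fetch_all` and `run_demo` are hypothetical.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.1):
    """Stand-in for an HTTP request: waits without blocking the event loop."""
    await asyncio.sleep(delay)
    return url


async def fetch_all(urls):
    # All "requests" wait concurrently inside one process and one thread.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


def run_demo(n=200):
    """Run n simulated requests and return (results, elapsed_seconds)."""
    urls = [f"http://example.com/page/{i}" for i in range(n)]
    started = time.monotonic()
    results = asyncio.run(fetch_all(urls))
    elapsed = time.monotonic() - started
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = run_demo()
    print(f"{len(results)} simulated requests finished in {elapsed:.2f}s")
```

Run one after another, 200 waits of 0.1 s would take about 20 s. Run concurrently, the total should be close to a single wait. In scrapy itself, the number of requests kept in flight is governed by settings such as `CONCURRENT_REQUESTS` and `CONCURRENT_REQUESTS_PER_DOMAIN`, so real throughput depends on how you tune those and on the target sites.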

In my experience, the only shortcomings of scrapy are the architectural complexity (it takes some time to master) and the lack of JavaScript support. So many sites are single-page apps that load their content via JavaScript, and scrapy (to my knowledge) can't do anything with that.

Hope this helps,

Travis

Erik van de Ven

May 10, 2015, 6:01:01 AM

to scrapy...@googlegroups.com, m...@travisleleu.com

Use this for JavaScript, it works perfectly: https://github.com/brandicted/scrapy-webdriver

On Wednesday, December 3, 2014 at 16:24:36 UTC+1, Travis Leleu wrote:

Andres Vargas - zodman

May 11, 2015, 8:50:25 PM

to scrapy...@googlegroups.com, m...@travisleleu.com

What about the speed?

--

Andres Vargas
www.zodman.com.mx
