Development box with Scrapy(d)s, ES, MySQL, Redis and Spark.

Dimitris Kouzis-Loukas

Apr 2, 2016, 1:01:56 PM
to scrapy-users
Hello,

I've just realised that some of you might find my Vagrant box quite useful. I used it for my book, and it works on Windows, Mac and Linux. It contains (see the diagram below) a few scrapyd servers, Elasticsearch, Redis, MySQL and Apache Spark. It's about 2.6 GB, and once you download it, it starts within seconds. Note that this box shouldn't be used in production: it skips several best practices in order to provide significant functionality while remaining easy to use and able to run on a typical laptop (the VM requires 2 GB of RAM).

Here's how to use it:

1. Download and add it (this assumes you already have VirtualBox and Vagrant installed):

$ vagrant box add scrapybook scrapybook.box
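
You can verify that the box is now registered:

$ vagrant box list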

2. Get rid of book-specific stuff:

$ mv Vagrantfile.dockerhost.boxed Vagrantfile.dockerhost
$ ls -A1 | egrep -v "(Vagrant|insecure_key)" | xargs rm -r

That's it! You should end up with just these three files:
 
$ ls
Vagrantfile   Vagrantfile.dockerhost  insecure_key

3. You can start the system by doing:

$ vagrant up --no-parallel
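
If you want to confirm that all the machines came up, vagrant status lists them (the machine names come from the Vagrantfile; dev and spark appear in the ssh steps below):

$ vagrant status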

The vagrant up command takes just a few seconds. This is the system it sets up:

[Architecture diagram: the dev box alongside the scrapyd, MySQL, Redis, Elasticsearch and Spark servers]
It has 8 independent servers, each running different software, which is much more realistic than the typical dev environment that crams everything into a single box. You can connect to the dev machine:

$ vagrant ssh
$ cd book

Whatever you do in the book directory is reflected in your host's directory and vice versa. This means you can edit your files with your native IDE, even if you are on Windows, and still run and test your code from Vagrant's Linux box. You might find that running Scrapy code on Ubuntu 14.04 is smoother and faster than developing in your host environment, despite the fact that it runs in a VM. In any case, the dev machine has all the tools you need to use MySQL, Redis and Elasticsearch:

$ mysql -h mysql -uroot -ppass
$ redis-cli -h redis
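
Elasticsearch doesn't have a dedicated CLI, but you can query its REST API with curl from the dev machine (I'm assuming the host is named es, mirroring the mysql and redis hostnames above; adjust if the box names it differently):

$ curl http://es:9200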

You can also access them from your host machine; for example, open http://localhost:9200 in your web browser. You can install whatever you like with the usual sudo apt-get etc. You can also access the Spark server directly. As an example of what you can do with it, open another terminal (command prompt on Windows) and connect to the Spark server:

$ vagrant ssh spark

Then you can type in a minimal Spark Streaming application that monitors the special directory /root/items, which I've set up to be written to by an FTP service running on the same server:

$ pyspark
>>> from pyspark.streaming import StreamingContext
>>> ssc = StreamingContext(sc, 5)  # sc is the shell's SparkContext; 5-second batch interval
>>> ssc.textFileStream("file:///root/items").pprint()  # print lines from any new files
>>> ssc.start()  # start the streaming computation
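
When you're done experimenting, you can stop the stream without killing the shell's SparkContext:

>>> ssc.stop(stopSparkContext=False)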

This is an easy way to connect your dev and scrapyd machines without infrastructure that requires tons of CPU and RAM. In production it's trivial to replace this functionality with e.g. S3 or Kafka (see the S3 sketch at the end). If you go back to your dev terminal, you can exercise it with a trivial Scrapy application:

$ scrapy startproject tutorial
$ cd tutorial
$ scrapy genspider example example.com
$ echo '        return {"foo": "bar"}' >> tutorial/spiders/example.py
$ scrapy crawl example -o ftp://anonymous@spark/foobar.$RANDOM.jl
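
For reference, after that echo the spider should look roughly like this (a sketch based on Scrapy's default genspider template; the appended return makes parse() emit a single item):

# -*- coding: utf-8 -*-
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        pass
        return {"foo": "bar"}  # the line appended by the echo command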

When you run the scrapy crawl command, you should see the {"foo": "bar"} item printed on Spark's side within a few seconds (the streaming batch interval above is 5 seconds).
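
As for the production note above: Scrapy's feed exports can write straight to S3 through the same -o mechanism (a sketch; the bucket name is hypothetical, and you'd need the boto/botocore library plus AWS credentials such as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in your settings):

$ scrapy crawl example -o s3://mybucket/items/%(time)s.jl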

I really hope you find this Vagrant box useful.

Cheers,
Dimitris