Hello all,
I've recently been offered the chance to write a 50-70 page beginner's guide to Scrapy, aimed at
non-Python users and web developers who are building apps that rely on data scraped from the web.
I would be happy to know what the Scrapy community thinks about this project, and to hear any of your ideas!
I'm still working on the outline, but so far, I'm thinking about the following:
• Install + development environment setup
• Writing first project
• Scraping a template based website
• Define the data you want to collect
• Exporting data to generic formats (XML, JSON, CSV, YAML, etc.) and writing your own item exporter
• Storing the data in a database (MongoDB?, PostgreSQL?, ...)
• Scraping an RSS/ATOM feed
• Scraping a highly consulted website
• Writing a middleware to limit scraping to recent pages (i.e., avoid scraping a static page multiple times)
• Scraping AJAX pages / using AJAX calls as an API
• Scraping in the cloud (I'm thinking Scrapy Cloud and Heroku or OpenShift)
• Scraping JS-generated pages (ghost.py? webkit?)
If you think I've missed a major point, or if you have any questions or suggestions, I'd be happy to hear them.
Cheers
--
B
PhantomJS is a good option for scraping JS pages.
On Wednesday, March 6, 2013 at 5:56:47 AM UTC+8, Balthazar Rouberol wrote:
How about introducing scrapyd and performance optimization? Thanks.
Hi
+1 for distributed crawling