Hello all -
I mentioned this in a previous email to the list, but I wanted to
introduce an Open Source project I've been working on, called
pjscrape. It's a small-ish JavaScript framework meant to simplify
page-scraping using (Py)PhantomJS and jQuery:
https://github.com/nrabinowitz/pjscrape
All I've ever wanted in a web-scraping tool is the ability to use
jQuery selectors and functions to get the data I'm interested in, with
a tool I can run from the command line, without having to use an
actual browser (so it can run in the background, or with cron, or with
celery, or...). The PhantomJS project made this incredibly easy to put
together.
Features:
* Client-side, JavaScript-based scraping environment with full access
to jQuery functions
* Easy, flexible syntax for setting up one or more scrapers
* Recursive/crawl scraping
* Delay scrape until a "ready" condition occurs (checks for
$(document).ready() by default)
* Load your own scripts on the page before scraping
* Modular architecture for logging and writing/formatting scraped
items
* Client-side utilities for common tasks
* Growing set of unit tests
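To illustrate the "ready" condition item in the list above, here is a generic polling sketch in plain JavaScript. It is not pjscrape's actual implementation: the function name waitFor, its options object, and the default interval and timeout values are all hypothetical, chosen only for this example.

```javascript
// Generic sketch (NOT pjscrape internals): poll a "ready" test until it
// passes or a timeout elapses, then hand control to a callback.
// waitFor, options.interval and options.timeout are hypothetical names.
function waitFor(isReady, onReady, options) {
    options = options || {};
    var interval = options.interval || 100;  // ms between checks (assumed default)
    var timeout = options.timeout || 5000;   // ms before giving up (assumed default)
    var start = Date.now();

    function check() {
        if (isReady()) {
            // Condition met: continue with no error.
            onReady(null);
        } else if (Date.now() - start >= timeout) {
            // Gave up: report the timeout to the callback.
            onReady(new Error('ready condition not met within ' + timeout + ' ms'));
        } else {
            // Not ready yet: check again after the interval.
            setTimeout(check, interval);
        }
    }

    check();
}

// Usage: the condition becomes true on the third check.
var checks = 0;
waitFor(
    function() { checks += 1; return checks >= 3; },
    function(err) {
        console.log(err ? err.message : 'ready after ' + checks + ' checks');
    },
    { interval: 10, timeout: 1000 }
);
// prints "ready after 3 checks"
```

Passing the timeout to the callback as an error, rather than throwing, lets the caller decide whether to skip the page or scrape whatever is there.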
Basic syntax, in a config file:
    pjs.addScraper(
        // url or array of urls
        'http://www.example.com/page.html',
        // function or array of functions, returning text, an object,
        // or an array of same, run in the client via page.evaluate()
        function() {
            return $('h1').first().text();
        }
    );
Which you can then run like this:
phantomjs /path/to/pjscrape.js my_config.js
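As a slightly larger sketch of the same syntax, the config below passes an array of URLs and returns an array of objects, both of which the comments in the basic example say are allowed. The URLs are placeholders and the choice of fields (text, href) is only an illustration. It is a config fragment that assumes pjscrape supplies pjs and that jQuery is available on the page as $, so it is meant to be run with the command above rather than on its own.

```javascript
// my_config.js: a sketch using only the addScraper(urls, scraper) form
// shown in the basic example; the URLs are placeholders.
pjs.addScraper(
    // array of urls
    [
        'http://www.example.com/page1.html',
        'http://www.example.com/page2.html'
    ],
    // runs in the client; returns an array of objects, one per link
    function() {
        return $('a').map(function() {
            return {
                text: $(this).text(),
                href: $(this).attr('href')
            };
        }).get(); // .get() converts the jQuery result to a plain array
    }
);
```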
I hope you find this interesting - while there are some things I'd
still need to write a custom script for, I think this probably covers
95% of the web scraping I might do, and I do a fair amount of
scraping. I know the unit testing question has come up on this list,
so you might be interested in how I run the unit tests here (using
Python with a simple server for the test framework). I'd love to hear
any thoughts, comments, or code critique you might have. Thanks again
for all your work on PhantomJS; it's a fantastic project.
-Nick