Introducing pjscrape - web-scraping using PhantomJS and jQuery

Nick R

Jul 6, 2011, 9:00:25 PM
to phantomjs
Hello all -

I mentioned this in a previous email to the list, but I wanted to
introduce an open-source project I've been working on, called
pjscrape. It's a small-ish JavaScript framework meant to simplify
page-scraping using (Py)PhantomJS and jQuery:
https://github.com/nrabinowitz/pjscrape

All I've ever wanted in a web-scraping tool is the ability to use
jQuery selectors and functions to get the data I'm interested in, with
a tool I can run from the command line, without having to use an
actual browser (so it can run in the background, or with cron, or with
celery, or...). The PhantomJS project made this incredibly easy to put
together.

Features:
* Client-side, JavaScript-based scraping environment with full access
to jQuery functions
* Easy, flexible syntax for setting up one or more scrapers
* Recursive/crawl scraping
* Delay scrape until a "ready" condition occurs (checks for
$(document).ready() by default)
* Load your own scripts on the page before scraping
* Modular architecture for logging and writing/formatting scraped
items
* Client-side utilities for common tasks
* Growing set of unit tests
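
To make the recursive/crawl item concrete, here is a minimal sketch of the general idea in plain JavaScript: follow links breadth-first, stop at a maximum depth, and never visit the same URL twice. It is not pjscrape's actual implementation; the `site` map, `crawl`, `linksOf`, and the `maxDepth` parameter are all hypothetical names invented for illustration, and the "pages" are an in-memory object rather than real HTTP requests.

```javascript
// Hypothetical in-memory "site": each URL maps to the links found on that page.
var site = {
    '/': ['/a', '/b'],
    '/a': ['/b', '/c'],
    '/b': ['/'],
    '/c': []
};

// Breadth-first crawl from startUrl, following links at most maxDepth hops
// away and visiting each URL only once.
function crawl(startUrl, getLinks, maxDepth) {
    var visited = Object.create(null); // URLs already queued or scraped
    var order = [];                    // URLs in the order they were visited
    var frontier = [startUrl];         // URLs to visit at the current depth
    visited[startUrl] = true;
    for (var depth = 0; frontier.length > 0 && depth <= maxDepth; depth++) {
        var next = [];
        frontier.forEach(function (url) {
            order.push(url);
            getLinks(url).forEach(function (link) {
                if (!visited[link]) {
                    visited[link] = true;
                    next.push(link);
                }
            });
        });
        frontier = next;
    }
    return order;
}

// Stand-in for "load the page and collect its links".
function linksOf(url) {
    return site[url] || [];
}

console.log(crawl('/', linksOf, 1)); // visits '/', '/a', '/b'
```

With a depth limit of 1 the crawl stops before reaching '/c', which is only linked from '/a'; raising the limit to 2 would include it.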

Basic syntax, in a config file:

    pjs.addScraper(
        // url or array of urls
        'http://www.example.com/page.html',
        // function or array of functions, returning text, an object,
        // or an array of same, run in the client via page.evaluate()
        function() {
            return $('h1').first().text();
        }
    );

Which you can then run like this:

phantomjs /path/to/pjscrape.js my_config.js
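
As a further sketch of the same syntax, the config below passes an array of URLs and a scraper that returns an array of objects rather than a single string. The URLs and the `scrapeLinks` name are placeholders, and the `typeof pjs` guard is only an assumption added so the file can also be loaded outside pjscrape (for example, to exercise the function against a stubbed `$`); I haven't verified this exact config against a live page.

```javascript
// Scraper function: evaluated in the page, where jQuery's $ is available.
// Returns an array of plain objects, one per link on the page.
var scrapeLinks = function() {
    return $('a').map(function() {
        return {
            text: $(this).text(),
            href: $(this).attr('href')
        };
    }).toArray();
};

// Register the scraper only when the global `pjs` object is present,
// i.e. when this file is loaded as a pjscrape config.
if (typeof pjs !== 'undefined') {
    pjs.addScraper(
        // array of urls (placeholders)
        ['http://www.example.com/page1.html',
         'http://www.example.com/page2.html'],
        // function returning an array of objects
        scrapeLinks
    );
}
```

It runs the same way as the config above; each object in the returned array should be treated as a separate scraped item.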

I hope you find this interesting - while there are some things I'd
still need to write a custom script for, I think this probably covers
95% of the web scraping I might do, and I do a fair amount of
scraping. I know the unit testing question has come up on this list,
so you might be interested in how I run the unit tests here (using
Python with a simple server for the test framework). I'd love to hear
any thoughts, comments, or code critique you might have - thanks again
for all your work on PhantomJS, it's a fantastic project.

-Nick

Ariya Hidayat

Jul 7, 2011, 7:50:07 PM
to phan...@googlegroups.com
Hi Nick,

pjscrape looks really good indeed, excellent job! I think now we can
use it as one of the real-world examples on how to use PhantomJS.

I would like to list the project at
http://code.google.com/p/phantomjs/wiki/WhoUsesPhantomJS, unless you
have any objection.

Thank you!

Regards,

Ariya

Nick Rabinowitz

Jul 7, 2011, 7:54:43 PM
to phan...@googlegroups.com
Please do! Glad you like it.

-Nick

Joe Norton

Jul 8, 2011, 2:02:04 AM
to phan...@googlegroups.com
Indeed Nick, this is really good stuff. I have had a copy of your source in my text editor all week as I try to figure out how you do it! I'm working on my own scraper for doing SEO analysis, but I won't link to it here yet as it's not where I'd like it to be.