Web Scraping Frameworks for Node.JS? (e.g. like Python's Scrapy)

Victor Hooi

Jan 15, 2014, 9:09:48 PM
to nod...@googlegroups.com
Hi,

I'm wondering if anybody knows of any web-scraping frameworks in Node.JS?

Previously, there was node.io (https://github.com/chriso/node.io), however, the project was recently discontinued.

Googling for Node.JS and web scraping, most of the guides online just talk about using request and cheerio. That works, but you need to handle a whole bunch of things yourself (throttling, distributing jobs, configuration, managing jobs, etc.).

On the Python side, I know of Scrapy (https://github.com/scrapy/scrapy), which uses Twisted for asynchronicity.

On the Ruby side, Nokogiri (http://nokogiri.org/) is meant to be good, although I haven't dived into it much.

Is there anything equivalent in the Node world? Or what approaches are people using to tackle this problem?

Cheers,
Victor

// ravi

Jan 15, 2014, 10:36:20 PM
to nod...@googlegroups.com
On Jan 15, 2014, at 9:09 PM, Victor Hooi <victo...@gmail.com> wrote:

> I'm wondering if anybody knows of any web-scraping frameworks in Node.JS?
>
> Previously, there was node.io (https://github.com/chriso/node.io), however, the project was recently discontinued.
>
> Googling for Node.JS and web scraping, most of the guides online just talk about using request and cheerio. That works, but you need to handle a whole bunch of things yourself (throttling, distributing jobs, configuration, managing jobs, etc.).


There are a few modules (node-crawler, simplecrawler, etc.) that might help you. Ultimately you may have to wrap something around PhantomJS to deal with JS modifications to the DOM (which can in turn be a bit of a pain, since PhantomJS for various reasons has to be run as a separate process).

—ravi

Matt

Jan 16, 2014, 10:44:39 AM
to nod...@googlegroups.com
And PhantomJS crashes randomly a lot (well, "a lot" depends on how much you're doing), so you have to deal with that. And the libraries for controlling it all suck except for the one I wrote (obviously I state that completely unbiasedly! /s).

But no, I don't know of anything that deals with all the issues around throttling and so on. But it's not that hard to use something like Kue or just async.queue to get some sane level of throttling implemented.
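
For what it's worth, the throttling part can be sketched without any dependency. The snippet below is a hand-rolled concurrency limiter in the spirit of async.queue; it is not async.queue's actual API. The name createQueue, the worker(task, callback) shape, and the example.com URLs are all made up for illustration, and the "fetch" is faked with a timer.

```javascript
// Minimal concurrency limiter: at most `concurrency` workers run at once.
// Hypothetical helper for illustration, not the API of async.queue or Kue.
function createQueue(worker, concurrency) {
  const pending = [];
  let running = 0;

  function next() {
    // Start queued tasks until the concurrency limit is reached.
    while (running < concurrency && pending.length > 0) {
      const { task, done } = pending.shift();
      running += 1;
      let finished = false;
      worker(task, (err, result) => {
        if (finished) return; // ignore accidental double callbacks
        finished = true;
        running -= 1;
        done(err, result);
        next(); // a slot is free again, pick up the next task
      });
    }
  }

  return {
    push(task, done = () => {}) {
      pending.push({ task, done });
      next();
    },
    get running() { return running; },
    get waiting() { return pending.length; },
  };
}

// Demo: two "requests" in flight at a time; swap the timer for a real HTTP call.
const queue = createQueue((url, callback) => {
  setTimeout(() => callback(null, `fetched ${url}`), 50);
}, 2);

['http://example.com/a', 'http://example.com/b', 'http://example.com/c'].forEach((url) => {
  queue.push(url, (err, result) => console.log(err || result));
});
```

This only limits how many requests are in flight; a per-host delay or a persistent job store (the sort of thing Kue is aimed at) would have to be layered on top.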


--
--
Job Board: http://jobs.nodejs.org/
Posting guidelines: https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
You received this message because you are subscribed to the Google
Groups "nodejs" group.
To post to this group, send email to nod...@googlegroups.com
To unsubscribe from this group, send email to
nodejs+un...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/nodejs?hl=en?hl=en
 
---
You received this message because you are subscribed to the Google Groups "nodejs" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nodejs+un...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

// ravi

Jan 16, 2014, 11:11:44 AM
to nod...@googlegroups.com
On Jan 16, 2014, at 10:44 AM, Matt <hel...@gmail.com> wrote:
> And PhantomJS crashes randomly a lot (well, "a lot" depends on how much you're doing), so you have to deal with that. And the libraries for controlling it all suck except for the one I wrote (obviously I state that completely unbiasedly! /s).


If you are going to shamelessly promote yourself :-) you might as well give us the link! Kidding aside, I am curious to see what you have.

—ravi

Arvind Gupta

Jan 16, 2014, 11:16:02 AM
to nod...@googlegroups.com
You can also use scraper for web scraping. Have a look at it.

Matt

Jan 16, 2014, 11:19:50 AM
to nod...@googlegroups.com
On Thu, Jan 16, 2014 at 11:11 AM, // ravi <ravi-...@g8o.net> wrote:
> On Jan 16, 2014, at 10:44 AM, Matt <hel...@gmail.com> wrote:
>> And PhantomJS crashes randomly a lot (well, "a lot" depends on how much you're doing), so you have to deal with that. And the libraries for controlling it all suck except for the one I wrote (obviously I state that completely unbiasedly! /s).
>
> If you are going to shamelessly promote yourself :-) you might as well give us the link! Kidding aside, I am curious to see what you have.

node-phantom-simple on npm.

This was developed after 12 months of trying different phantom modules, having weird failures with each (some don't work under cluster, some don't work under load, some just randomly fail). It's used in production at the last company I worked at, and has proved pretty rock solid compared to the other options.

Matt. 

// ravi

Jan 16, 2014, 11:25:15 AM
to nod...@googlegroups.com
On Jan 16, 2014, at 11:19 AM, Matt <hel...@gmail.com> wrote:

> node-phantom-simple on npm.
>
> This was developed after 12 months of trying different phantom modules, having weird failures with each (some don't work under cluster, some don't work under load, some just randomly fail). It's used in production at the last company I worked at, and has proved pretty rock solid compared to the other options.


Thank you, I’ll give it a go next time I need to use PhantomJS.

—ravi


Tim Killian

Jan 20, 2014, 1:44:47 PM
to nod...@googlegroups.com
Someone has already mentioned using cheerio, and I second that. I built a basic web crawler/scraper using nothing but the request and cheerio libraries, and it worked great. If you already know jQuery you already know how to use cheerio, which was also a big plus for me.

Mikeal Rogers

Jan 20, 2014, 1:46:38 PM
to nod...@googlegroups.com
I wrote spider for this but I haven't kept up with maintaining it:


If you end up using it and have improvements to make, I'll add you as a contributor.

-Mikeal

Matthew Page

Jan 22, 2014, 4:15:25 PM
to nod...@googlegroups.com
I have used noodle and had pretty good results.



Alexey Petrushin

Jan 24, 2014, 5:02:07 AM
to nod...@googlegroups.com
Let's start with a simpler question: what browser emulators work with node.js? (I mean full emulation, not just an HTML processor like cheerio.)

I know two options:

- Zombie.js - a nice thing, simple and fast, but not very stable.
- Selenium - has all possible features, but is slow and complex to use (can be used from node.js via an adapter).

Any others? I heard PhantomJS may also work with node.js, but I'm not sure about it.

Matt

Jan 24, 2014, 9:26:17 AM
to nod...@googlegroups.com
PhantomJS works extremely well if you can deal with the occasional random segfaults, and is much faster than Selenium.



Alexey Petrushin

Jan 24, 2014, 11:30:36 AM
to nod...@googlegroups.com
> PhantomJS ...

But does it work with node.js? I heard it needs to maintain its own control over the loop.

// ravi

Jan 24, 2014, 12:02:16 PM
to nod...@googlegroups.com
On Jan 24, 2014, at 11:30 AM, Alexey Petrushin <alexey.p...@gmail.com> wrote:
> > PhantomJS ...
>
> But does it work with node.js? I heard it needs to maintain its own control over the loop.



IIUC it does not integrate with NodeJS for the reason you mention: control over the loop. Here’s the section from the FAQ:


Q: Why is PhantomJS not written as Node.js module?

A: The short answer: "No one can serve two masters."

A longer explanation is as follows.

As of now, it is technically very challenging to do so.

Every Node.js module is essentially "a slave" to the core of Node.js, i.e. "the master". In its current state, PhantomJS (and its included WebKit) needs to have the full control (in a synchronous matter) over everything: event loop, network stack, and JavaScript execution.

If the intention is just about using PhantomJS right from a script running within Node.js, such a "loose binding" can be achieved by launching a PhantomJS process and interact with it.


—ravi


Mark Hahn

Jan 24, 2014, 3:11:47 PM
to nodejs
I have no problem using phantom as a child process. You can communicate with it while it is running.  I would imagine one could write a module to make the interaction quite transparent.


// ravi

Jan 24, 2014, 3:48:06 PM
to nod...@googlegroups.com
On Jan 24, 2014, at 3:11 PM, Mark Hahn <ma...@reevuit.com> wrote:
> I have no problem using phantom as a child process. You can communicate with it while it is running. I would imagine one could write a module to make the interaction quite transparent.


There are some modules, I think, that in fact claim to implement what you suggest. I just interface directly with PhantomJS using child_process (but oh what I would give to get my trusty Unix fork()/exec() back!). I agree that this method is quite usable, but in the child process I often catch myself adding node-like code out of habit only to have it crash. Which is a small nit. I am happy with PhantomJS.

—ravi

Alexey Petrushin

Jan 24, 2014, 5:14:49 PM
to nod...@googlegroups.com
> I have no problem using phantom as a child process ...

How do you control it? Does it support commands issued via stdout or somehow else?

Alexey Petrushin

Jan 24, 2014, 5:18:15 PM
to nod...@googlegroups.com
I meant interactive control of PhantomJS via child_process (not just issuing one command by supplying argv when it starts). Is that possible?

// ravi

Jan 24, 2014, 5:37:08 PM
to nod...@googlegroups.com
On Jan 24, 2014, at 5:18 PM, Alexey Petrushin <alexey.p...@gmail.com> wrote:
> I meant interactive control of PhantomJS via child_process (not just issuing one command by supplying argv when it starts). Is that possible?


I haven’t had the need, so this is not from experience, but you may be able to use stdin/stdout support in PhantomJS:


Combined with child.stdin.write on the Node end, this might provide a unidirectional way to send data to the PhantomJS process.
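
A rough sketch of that idea, with caveats: the child here is a stand-in Node one-liner rather than PhantomJS, so it only demonstrates the Node side (spawn, child.stdin.write, and splitting replies on newlines). The newline-delimited JSON framing, the command fields (id, action), and createLineParser are assumptions made up for the example; a real PhantomJS script would need its own stdin loop on the other end.

```javascript
const { spawn } = require('child_process');

// Turn a stream of text chunks into newline-delimited JSON messages.
// Chunks can split a line anywhere, so partial data is buffered.
function createLineParser(onMessage) {
  let buffer = '';
  return function feed(chunk) {
    buffer += chunk;
    let newline;
    while ((newline = buffer.indexOf('\n')) !== -1) {
      const line = buffer.slice(0, newline).trim();
      buffer = buffer.slice(newline + 1);
      if (line) onMessage(JSON.parse(line)); // throws on malformed JSON
    }
  };
}

// Stand-in child (NOT PhantomJS): echoes every command it reads on stdin.
const childSource = `
  const rl = require('readline').createInterface({ input: process.stdin });
  rl.on('line', (line) => {
    const cmd = JSON.parse(line);
    process.stdout.write(JSON.stringify({ id: cmd.id, echoed: cmd.action }) + '\\n');
  });
`;

const child = spawn(process.execPath, ['-e', childSource], {
  stdio: ['pipe', 'pipe', 'inherit'],
});
child.stdout.setEncoding('utf8');

child.stdout.on('data', createLineParser((msg) => {
  console.log('reply:', msg);
  if (msg.id === 2) child.stdin.end(); // closing stdin lets the child exit
}));

// Commands are sent one per line while the child keeps running.
child.stdin.write(JSON.stringify({ id: 1, action: 'open' }) + '\n');
child.stdin.write(JSON.stringify({ id: 2, action: 'render' }) + '\n');
```

Tagging each command with an id is one simple way to match replies to requests once several commands are in flight.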

Caveat emptor,

—ravi


Jamie Popkin

Jan 24, 2014, 11:56:49 PM
to nod...@googlegroups.com
At the risk of sounding stupid... Can't you just use jQuery? It's got everything you need for fetching and parsing web content.

Matt

Jan 25, 2014, 2:18:39 PM
to nod...@googlegroups.com

On Fri, Jan 24, 2014 at 5:14 PM, Alexey Petrushin <alexey.p...@gmail.com> wrote:
> > I have no problem using phantom as a child process ...
>
> How do you control it? Does it support commands issued via stdout or somehow else?

Use node-phantom-simple. It gives you full API access.

Victor Hooi

Feb 9, 2014, 3:07:22 PM
to nod...@googlegroups.com
Hi,

If you look at the docs and feature list for Scrapy, you'll see it has a whole bunch of scraping features more than just selecting a DOM element. E.g.:


So for us, this would also cover handling things like throttling, configuration, distributing jobs, managing jobs etc.

PhantomJS seems to be the way to go, from other people's comments.

Most of the full-featured Node frameworks seem to be inactive. E.g.:


The only one I've found that is still actively maintained is Noodle, which Matthew Page mentioned above:


Cheers,
Victor


On Sat, Jan 25, 2014 at 3:56 PM, Jamie Popkin <pop...@gmail.com> wrote:
> At the risk of sounding stupid... Can't you just use jQuery? It's got everything you need for fetching and parsing web content.

Alexey Petrushin

Feb 12, 2014, 8:12:07 AM
to nod...@googlegroups.com
After working with https://github.com/sgentle/phantomjs-node I must say it's unstable and hard to work with; I don't recommend it.



Matt

Feb 12, 2014, 9:04:56 AM
to nod...@googlegroups.com

On Wed, Feb 12, 2014 at 8:12 AM, Alexey Petrushin <alexey.p...@gmail.com> wrote:
> After working with https://github.com/sgentle/phantomjs-node I must say it's unstable and hard to work with; I don't recommend it.

That's because you didn't find node-phantom-simple. Which I wrote specifically because those other libraries are unstable.

(Although you can't get around the instability of phantomjs itself, which does crash rather too often for my liking).
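
One way to live with a child process that crashes is to supervise it and restart it with a backoff. The following is only a sketch under assumptions: supervise, backoffDelay, and the restart limits are invented for this example and are not part of node-phantom-simple, and the demo child is a Node one-liner that always exits with code 1 rather than a real PhantomJS process.

```javascript
const { spawn } = require('child_process');

// Exponential backoff, capped: 100 ms, 200 ms, 400 ms, ... up to maxMs.
function backoffDelay(attempt, baseMs = 100, maxMs = 5000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Restart `command` whenever it exits abnormally, up to maxRestarts times.
function supervise(command, args, options = {}) {
  const { maxRestarts = 3, onGiveUp = () => {} } = options;
  let restarts = 0;

  function start() {
    const child = spawn(command, args, { stdio: 'inherit' });
    let handled = false;

    function onEnd(reason) {
      if (handled) return; // 'error' and 'exit' can both fire
      handled = true;
      if (reason === 0) return; // clean exit, nothing to do
      if (restarts >= maxRestarts) {
        onGiveUp(reason);
        return;
      }
      const delay = backoffDelay(restarts);
      restarts += 1;
      setTimeout(start, delay);
    }

    // 'error' covers failures to spawn (for example a missing binary).
    child.on('error', (err) => onEnd(err.code || 'spawn-error'));
    // A crash such as a segfault shows up as code === null plus a signal name.
    child.on('exit', (code, signal) => onEnd(code !== null ? code : signal));
  }

  start();
}

// Demo with a stand-in child that always fails.
supervise(process.execPath, ['-e', 'process.exit(1)'], {
  maxRestarts: 2,
  onGiveUp: (reason) => console.log('giving up, last exit reason:', reason),
});
```

Any job that was in flight when the child died still has to be re-queued by the caller; the supervisor only brings the process back.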

Alexey Petrushin

Feb 12, 2014, 12:33:17 PM
to nod...@googlegroups.com
Tried it, it's really simpler. But I still have the same problem: it works fine on my Mac OS, but when I try to deploy it to an EC2 Ubuntu Server, phantomjs crashes without even being able to log a dump. If I run it standalone it works OK. PhantomJS v1.9.

Has anyone tried to deploy to EC2? What OS do you use? Maybe the crash happens on Ubuntu only?

Matt

Feb 12, 2014, 1:47:20 PM
to nod...@googlegroups.com

On Wed, Feb 12, 2014 at 12:33 PM, Alexey Petrushin <alexey.p...@gmail.com> wrote:
> Tried it, it's really simpler. But I still have the same problem: it works fine on my Mac OS, but when I try to deploy it to an EC2 Ubuntu Server, phantomjs crashes without even being able to log a dump. If I run it standalone it works OK. PhantomJS v1.9.

Odd, it was developed on Ubuntu, so I have no idea why that is.

How did you install Phantom? Most people who have problems installed it via npm. I highly recommend installing by downloading from phantomjs.org and compiling yourself.

Matt.

Alexey Petrushin

Feb 12, 2014, 2:50:18 PM
to nod...@googlegroups.com
Tried multiple ways - via apt-get & compiling from source. Strange, it works ok on my local machine, and also works ok in standalone mode on ubuntu. But when you try to couple it with node.js it crashes. 

I'll try Linux EC2 Image tomorrow, maybe it's some weirdness in Ubuntu Server that causes problems.

Alexey Petrushin

Feb 13, 2014, 9:32:56 AM
to nod...@googlegroups.com
Guess what - it was a problem with RAM. When I switched from a micro to a medium EC2 instance, the problem disappeared ...



Alexey Petrushin

Apr 25, 2014, 8:42:12 PM
to nod...@googlegroups.com
I finished such a project recently: a crawler for JavaScript sites, with a browser emulator (Selenium).

It's a private project, but I wrote up some details about it and how it works; maybe it will be interesting to someone.

Duy Nguyen

Apr 26, 2014, 9:29:56 AM
to nod...@googlegroups.com
I did a scraper with phantomjs before, and it works great, but I think you should take a look at https://import.io/







--
Nguyen Hai Duy
Mobile : 0914 72 1900
Skype: nguyenhd2107
Yahoo: nguyenhd_lucky

Alexey Petrushin

Apr 26, 2014, 1:00:10 PM
to nod...@googlegroups.com
In my case PhantomJS had issues like:

- sometimes a site uses broken HTML, and jQuery gives a different result in phantom than in Chrome
- there were cases when I couldn't trigger a 'click' event from phantom, when the site used some strange way to register its onclick function
- PhantomJS is incompatible with node.js; there are some non-standard bindings. I tried 3 such bindings, but for me all of them were very unstable, so I just gave up in the end.

import.io is an interesting idea and seems like a useful service. But sadly our case was a little more complicated: there are lots of complex interactions (like click here, wait till something appears there; if it appears, go here next; if it doesn't appear, go there; etc.). I doubt you can program such behavior using a GUI or some sort of DSL.

Also, we use it heavily and it consumes a huge amount of resources (99% is consumed by Selenium + browser emulators), so it is costly even if you pay only for the physical servers. If, on the other hand, you use a service provided by another company, you pay twice - for the servers and for their service - and it would cost us even more. In our case it was cheaper to spend one month developing such a service ourselves.

Matt

Apr 27, 2014, 12:56:52 AM
to nod...@googlegroups.com

On Sat, Apr 26, 2014 at 1:00 PM, Alexey Petrushin <alexey.p...@gmail.com> wrote:
> - PhantomJS is incompatible with node.js; there are some non-standard bindings. I tried 3 such bindings, but for me all of them were very unstable, so I just gave up in the end.

You know - the authors of said bindings (myself in particular) are very open to bug reports and fixing any issues. I've done many updates to node-phantom-simple in the last year based on bugs reported.

I will say that it's unfortunate that there are so many broken bindings on npm, especially given they grabbed the "big" names (I'm looking at you, "phantom" and "node-phantom" on npm). But node-phantom-simple follows the node style (error first), and has been battle tested at a large scale user. If you find bugs it helps everyone if you report them.

Matt. 

Alexey Petrushin

Apr 27, 2014, 2:12:29 PM
to nod...@googlegroups.com
Yes, that's true, and thanks for your work. I didn't mean it to sound like a complaint; I just mentioned that right now it's not very stable.

Also, it seems that the cause of the bugs is not in the bindings but in PhantomJS itself; it seems it doesn't support such a non-standard way of communicating very well.

Matt

Apr 27, 2014, 2:39:55 PM
to nod...@googlegroups.com
On Sun, Apr 27, 2014 at 2:12 PM, Alexey Petrushin <alexey.p...@gmail.com> wrote:
> Yes, that's true, and thanks for your work. I didn't mean it to sound like a complaint; I just mentioned that right now it's not very stable.

Like I said, please point out where this is the case and it will get fixed.
 
> Also, it seems that the cause of the bugs is not in the bindings but in PhantomJS itself; it seems it doesn't support such a non-standard way of communicating very well.

Phantom is definitely not the most stable software (it segfaults regularly). The communication between phantom and node was hard to get right, but node-phantom-simple definitely does it the right way for stability compared to other libraries.

Matt.

harish k

May 19, 2014, 1:17:12 AM
to nod...@googlegroups.com
Hi,
I just wrote an easy-to-use scraper in JS as part of my past work.
It can extract data from HTML pages based on a predefined schema, which consists of CSS selectors and a data extraction function.
It uses cheerio for DOM parsing.

Zugravu Eugen Marius

Jul 3, 2014, 9:15:52 AM
to nod...@googlegroups.com
Message has been deleted