I'm wondering if anybody knows of any web-scraping frameworks for Node.js?
Previously, there was node.io (https://github.com/chriso/node.io); however, the project was recently discontinued.
Googling for Node.js and web scraping, most of the guides online just talk about using request and cheerio. That works, but you need to handle a whole bunch of things yourself (throttling, distributing jobs, configuration, managing jobs, etc.).
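To make the "handle it yourself" part concrete, here is a minimal sketch of the throttling piece in plain Node.js: a queue that caps how many jobs run at once and spaces out their start times. The names `createThrottledQueue`, `concurrency` and `delayMs` are made up for this illustration and do not come from request, cheerio or node.io.

```javascript
// Sketch only: run async jobs with a concurrency cap and a minimum delay
// between job starts. All names here are illustrative assumptions.
function createThrottledQueue({ concurrency = 2, delayMs = 0 } = {}) {
  const pending = [];   // jobs waiting for a free slot
  let active = 0;       // jobs currently running
  let lastStart = 0;    // timestamp of the most recent job start
  let timer = null;     // pending wake-up while waiting out delayMs

  function next() {
    if (timer || active >= concurrency || pending.length === 0) return;

    // Respect the minimum spacing between two job starts.
    const wait = Math.max(0, lastStart + delayMs - Date.now());
    if (wait > 0) {
      timer = setTimeout(() => { timer = null; next(); }, wait);
      return;
    }

    const { job, resolve, reject } = pending.shift();
    active += 1;
    lastStart = Date.now();
    Promise.resolve()
      .then(job)
      .then(resolve, reject)
      .finally(() => { active -= 1; next(); });

    next(); // fill remaining slots, if any
  }

  return {
    // `job` is a function returning a value or a promise.
    push(job) {
      return new Promise((resolve, reject) => {
        pending.push({ job, resolve, reject });
        next();
      });
    },
  };
}
```

Each job would typically wrap one fetch-and-parse step, and `push` returns a promise for that job's result. Distributing jobs across machines, retries and configuration are still left out, which is the gap the question is about.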
--
--
Job Board: http://jobs.nodejs.org/
Posting guidelines: https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
You received this message because you are subscribed to the Google
Groups "nodejs" group.
To post to this group, send email to nod...@googlegroups.com
To unsubscribe from this group, send email to
nodejs+un...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/nodejs?hl=en?hl=en
---
You received this message because you are subscribed to the Google Groups "nodejs" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nodejs+un...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
And PhantomJS crashes randomly a lot (well, "a lot" depends on how much you're doing), so you have to deal with that. And the libraries for controlling it all suck except for the one I wrote (obviously I state that completely unbiasedly! /s).
On Jan 16, 2014, at 10:44 AM, Matt <hel...@gmail.com> wrote:
> If you are going to shamelessly promote yourself :-) you might as well give us the link! Kidding aside, I am curious to see what you have.
node-phantom-simple on npm.
This was developed after 12 months of trying different phantom modules, having weird failures with each (some don't work under cluster, some don't work under load, some just randomly fail). It's used in production at the last company I worked at, and has proved pretty rock solid compared to the other options.
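One generic way to "deal with" a child process that crashes at random is to supervise it and respawn it a bounded number of times. The sketch below uses only Node's built-in `child_process` module; `supervise` and `maxRestarts` are hypothetical names for this illustration, and this is not a description of how node-phantom-simple or any other module in this thread works.

```javascript
const { spawn } = require('child_process');

// Sketch only: respawn a child process when it exits abnormally, up to
// `maxRestarts` times. Resolves once the child exits cleanly or the
// restart budget is used up. Names are illustrative assumptions.
function supervise(command, args, { maxRestarts = 3 } = {}) {
  return new Promise((resolve, reject) => {
    let restarts = 0;
    let failed = false;

    function start() {
      const child = spawn(command, args, { stdio: 'ignore' });

      // Emitted when the process could not be spawned (e.g. binary missing).
      child.once('error', (err) => {
        failed = true;
        reject(err);
      });

      child.once('exit', (code, signal) => {
        if (failed) return;
        const crashed = code !== 0 || signal !== null;
        if (crashed && restarts < maxRestarts) {
          restarts += 1;
          start();
        } else {
          resolve({ code, signal, restarts });
        }
      });
    }

    start();
  });
}
```

In a real scraper the supervisor would also need to re-queue whatever job was in flight when the process died; that bookkeeping is omitted here.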
> PhantomJS ...
But does it work with Node.js? I heard it needs to keep its own control over the event loop.
Q: Why is PhantomJS not written as Node.js module?
A: The short answer: "No one can serve two masters."
A longer explanation is as follows.
As of now, it is technically very challenging to do so.
Every Node.js module is essentially "a slave" to the core of Node.js, i.e. "the master". In its current state, PhantomJS (and its included WebKit) needs to have full control (in a synchronous manner) over everything: event loop, network stack, and JavaScript execution.
If the intention is just to use PhantomJS right from a script running within Node.js, such a "loose binding" can be achieved by launching a PhantomJS process and interacting with it.
I have no problem using phantom as a child process. You can communicate with it while it is running. I would imagine one could write a module to make the interaction quite transparent.
I meant interactive control of PhantomJS via child_process (not just issuing one command by supplying argv when it starts). Is that possible?
> I have no problem using phantom as a child process ...
How do you control it? Does it support commands issued via stdout, or some other way?
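On the Node.js side, interactive control of a long-running child is possible in principle: `child_process.spawn` exposes the child's stdin and stdout as streams, so commands can be sent while the process runs. The sketch below shows one such arrangement with a line-based JSON protocol. The protocol, `createController` and `CHILD_SCRIPT` are assumptions made up for this illustration, and the child here is a stand-in Node.js script, not PhantomJS; how a PhantomJS script would read such commands is not covered in this thread.

```javascript
const { spawn } = require('child_process');
const readline = require('readline');

// Stand-in for the controlled process (hypothetical protocol): it reads one
// JSON command per line on stdin and answers with one JSON line on stdout.
const CHILD_SCRIPT = `
  const rl = require('readline').createInterface({ input: process.stdin });
  rl.on('line', (line) => {
    const msg = JSON.parse(line);
    const result = msg.cmd === 'echo' ? msg.arg : null;
    process.stdout.write(JSON.stringify({ id: msg.id, result }) + '\\n');
  });
`;

// Sketch only: send commands to a running child and match replies by id.
function createController(command, args) {
  const child = spawn(command, args, { stdio: ['pipe', 'pipe', 'inherit'] });
  const replies = readline.createInterface({ input: child.stdout });
  const waiting = new Map(); // request id -> resolve callback
  let nextId = 1;

  replies.on('line', (line) => {
    const msg = JSON.parse(line);
    const resolve = waiting.get(msg.id);
    if (resolve) {
      waiting.delete(msg.id);
      resolve(msg.result);
    }
  });

  return {
    // Write one command line to the child's stdin; resolves with its reply.
    send(cmd, arg) {
      return new Promise((resolve) => {
        const id = nextId++;
        waiting.set(id, resolve);
        child.stdin.write(JSON.stringify({ id, cmd, arg }) + '\n');
      });
    },
    // Closing stdin lets the stand-in child finish and exit.
    close() {
      child.stdin.end();
    },
  };
}
```

With the stand-in child, `createController(process.execPath, ['-e', CHILD_SCRIPT])` returns a controller whose `send('echo', 'hello')` resolves with `'hello'`. A real binding would also need timeouts and handling for a child that dies with requests still pending.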
At the risk of sounding stupid... Can't you just use jQuery? It's got everything you need for fetching and parsing web content.
After working with https://github.com/sgentle/phantomjs-node and https://github.com/sgentle/node-phantomjs, I must say they are unstable and hard to work with. I don't recommend them.
Has anyone tried deploying to EC2? What OS do you use? Maybe the crash happens on Ubuntu only?
- PhantomJS is incompatible with Node.js. There are some non-standard bindings; I tried three of them, but for me all of them were very unstable, so I just gave up in the end.
Yes, it is true, and thanks for your work. I didn't mean it to sound like a complaint; I just mentioned that it's not very stable right now.
Also, it seems that the cause of the bugs is not in the bindings but in PhantomJS itself. It doesn't seem to support such a non-standard way of communicating very well.