1) Consider using the Node `request` module to scrape instead of cURL.
2) A plain HTTP scrape only returns what the server sends for your request,
which is the initial static content. The response comes from the server,
not the client, so nothing the page's JavaScript does afterwards in a
browser will be in it.
3) To get the dynamic content you have to simulate the client and run the
JS inside your app. The easiest way to do this is with a "headless"
client. I suggest Zombie: http://zombie.labnotes.org
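
To illustrate point 2, here is a minimal sketch using only Node's built-in `http` module. The page content, the `price` id and the `staticText` helper are invented for the demo, and the local server merely stands in for the real site.

```javascript
const http = require("http");

// Hypothetical page: the server sends an empty <div>; a browser would run
// the script and fill it in, but a plain HTTP client never does.
const PAGE = [
  "<html><body>",
  '<div id="price"></div>',
  "<script>document.getElementById('price').textContent = '42';</script>",
  "</body></html>",
].join("\n");

// Naive helper for this demo only: return the text inside <div id="...">.
function staticText(html, id) {
  const match = html.match(new RegExp('<div id="' + id + '">([^<]*)</div>'));
  return match ? match[1] : null;
}

// Serve PAGE locally, fetch it the way cURL would, and hand back the body.
function fetchDemoPage(callback) {
  const server = http.createServer((req, res) => {
    res.writeHead(200, { "Content-Type": "text/html" });
    res.end(PAGE);
  });
  server.listen(0, "127.0.0.1", () => {
    const port = server.address().port;
    // agent: false avoids the shared connection pool for this one-off request.
    const options = { host: "127.0.0.1", port: port, path: "/", agent: false };
    http.get(options, (res) => {
      let body = "";
      res.setEncoding("utf8");
      res.on("data", (chunk) => { body += chunk; });
      res.on("end", () => {
        server.close();
        callback(null, body);
      });
    }).on("error", (err) => {
      server.close();
      callback(err);
    });
  });
}

fetchDemoPage((err, body) => {
  if (err) throw err;
  // The <script> tag is present in the body, but nothing has executed it.
  console.log(JSON.stringify(staticText(body, "price")));
});
```

Running it prints `""`: the script arrives as text, but the `<div>` it would populate is still empty, which is why a headless client is needed for this kind of page.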
On Sat, Oct 6, 2012 at 1:34 PM, Narek Musakhanyan <nmusa...@gmail.com> wrote:
> Hey guys. I tried to scrape data from a website using the PHP cURL lib, but
> I failed, since cURL only lets you scrape static content. The content I
> want to scrape changes via JavaScript (AJAX), and cURL can't handle that.
> I heard that this type of thing can be done via Node. Basically I need my
> Node app to handle this JS, wait for some time until the AJAX is done, and
> then pass the result to PHP. Is it possible to do this via Node.js? I don't
> know Node and have to start from scratch, so I am asking here for you to
> point out the right Node framework to use to get the result I explained.
I only just picked it up last week, but node.io worked well enough. It exposes a
jQuery-esque interface for querying scraped pages. It's an extremely high-level,
"just works" scraping module, in my book!
It also has a fairly sizable task-processing system built in, which I have not used.
Good suggestions so far, though I highly recommend you check out phantomjs.org. Phantom is a headless version of WebKit, the rendering engine behind Chrome and Safari. In my book it's the most comprehensive solution for handling AJAX content when scraping, since it's technically the same as interacting with a page loaded by your browser.
On Saturday, October 6, 2012 at 3:04 PM, rektide wrote:
> Only just picked it up last week, but it worked well enough-- node.io. It exposes a
> jQuery-esque interface for querying scraped pages. Extremely high level, "just works"
> scraping module, in my book!
> It also has a fairly sizable task-processing system built in, which I have not used.
From: nodejs@googlegroups.com [mailto:nodejs@googlegroups.com] On Behalf
Of Dave Kuhn
Sent: Saturday, October 06, 2012 11:46 PM
To: nodejs@googlegroups.com
Subject: Re: [nodejs] Dynamic content scrape with Node.js
Good suggestions so far, though i highly recommend you check out
phantomjs.org. Phantom is a headless version of WebKit which is the
rendering engine behind Chrome & Safari. It's the most comprehensive
solution to handling AJAX content when scraping in my book since it's
technically the same as interacting with a page loaded by your browser.
True, you can get pretty far doing that, but it gets difficult when crucial bits of information are hidden inside script tags and the like. Not to mention that managing cookies for ASP.NET pages, amongst others, is a pain in the butt. You can avoid all that hassle with a fully resolved DOM and automatic support for cookies, both of which PhantomJS will give you.
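
For comparison, the simpler approach being discussed here, calling the page's AJAX endpoint directly, could be sketched as follows using only Node's built-in `http` module. The `/api/items` path, the `session=abc123` cookie, the helper names and the local stand-in server are all made up for the demo; on a real site you would copy the URL, headers and cookies from the browser's network inspector.

```javascript
const http = require("http");

// Build request options that mimic what the page's own AJAX call might send.
// The host is fixed to localhost because this demo talks to a stand-in server.
function ajaxOptions(port, path, cookie) {
  const headers = {
    Accept: "application/json",
    "X-Requested-With": "XMLHttpRequest",
  };
  if (cookie) headers.Cookie = cookie; // cookies must be managed by hand
  return { host: "127.0.0.1", port: port, path: path, headers: headers, agent: false };
}

// Request the endpoint and parse the JSON body.
function fetchJson(options, callback) {
  http.get(options, (res) => {
    let body = "";
    res.setEncoding("utf8");
    res.on("data", (chunk) => { body += chunk; });
    res.on("end", () => {
      let data;
      try {
        data = JSON.parse(body);
      } catch (err) {
        return callback(err);
      }
      callback(null, data);
    });
  }).on("error", callback);
}

// Stand-in for the real site: only answers if the session cookie is present.
const server = http.createServer((req, res) => {
  const ok = req.url === "/api/items" && req.headers.cookie === "session=abc123";
  res.writeHead(ok ? 200 : 403, { "Content-Type": "application/json" });
  res.end(JSON.stringify(ok ? { items: ["a", "b"] } : { error: "forbidden" }));
});

server.listen(0, "127.0.0.1", () => {
  const options = ajaxOptions(server.address().port, "/api/items", "session=abc123");
  fetchJson(options, (err, data) => {
    server.close();
    if (err) throw err;
    console.log(data.items);
  });
});
```

This works when the endpoint and its parameters are easy to find. The cookie handling above is the part that has to be done manually, which is the hassle a headless browser takes care of for you.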
On Tuesday, October 9, 2012 at 12:25 AM, greelgorke wrote:
> Why so complicated? Just find out the URL of the AJAX request and make it yourself with whatever lib you want...
> On Monday, October 8, 2012 at 18:53:27 UTC+2, Chad Engler wrote:
> > This is probably the same person who asked this question on StackOverflow: