Hey All,
I'm working on a webpage scraper that uses PhantomJS, and one of the pages I'm scraping loads only about 40% of the time. I'm not sure whether I'm doing something wrong or whether it's a problem in PhantomJS.
The URL I'm trying to hit is 'http://www.truefitcollegecounseling.com'. The script below is supposed to print the source (src) of every iframe on the page. The trouble is that it's a crapshoot whether the code prints the iframes at all. This isn't the only script I've written that fails to load this site on some runs.
To run this, just point phantom at the JavaScript file. Example: /path/to/phantom/bin/phantom ~/path/to/file/test.js
You'll also need a copy of the latest jQuery. Just change the path passed to page.injectJs() to wherever your copy lives.
Some other details:
- The code runs on both CentOS 6.5 and LinuxMint 17.3 (LinuxMint is the dev environment); both OSes exhibit the same behavior.
- The script is usually called by a NodeJS project (Node version 4.2.3), in which case a phantom.exit(); is added to the end of the iterateUrls function. However, executing via the Node project or directly through phantom causes the same behavior.
- This code has been tested with PhantomJS v1.9.17, v1.9.18, and v1.9.19
- PhantomJS was downloaded through NPM.
Any thoughts? And, thanks in advance!
Code Sample:
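var page = require('webpage').create();
// Assumed definition: the original snippet omitted `urls`.
var urls = ['http://www.truefitcollegecounseling.com'];
page.settings.userAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36";
// Forward console messages from the page context to the PhantomJS console.
page.onConsoleMessage = function(msg){
    //system.stderr.writeLine('console: ' + msg);
    console.log(msg);
};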
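// Open one URL, print every iframe src on the page, then call done() if given.
var phantomOps = function(url, done)
{
    page.open(url, function(status){
        // Report failed loads instead of evaluating an empty page.
        if(status !== "success")
        {
            console.log("Failed to load " + url + " (status: " + status + ")");
        }
        else
        {
            // Use an absolute path here; "~" is a shell shortcut and is not expanded.
            if(!page.injectJs("~/Downloads/jquery-2.1.4.min.js"))
            {
                console.log("Could not inject jQuery; check the path.");
            }
            // Runs inside the page; fall back to [] if the evaluation fails.
            var iFrames = page.evaluate(function(){
                var urlArray = [];
                var frames = $("iframe");
                for(var i = 0; i < frames.length; i++)
                {
                    urlArray.push(frames[i].src);
                }
                return urlArray;
            }) || [];
            for(var i = 0; i < iFrames.length; i++)
            {
                console.log(iFrames[i]);
            }
        }
        if(done){ done(); }
    });
};
// Process the URLs one at a time: a second page.open() on the same page
// object would interrupt the load that is still in progress.
var iterateUrls = function()
{
    if(urls.length === 0)
    {
        // When called from Node, phantom.exit(); goes here, after the last page.
        console.log("Done - Press CTRL-C to exit.");
        return;
    }
    phantomOps(urls.shift(), iterateUrls);
};
iterateUrls();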
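As a stopgap while the root cause is unknown, a failed load could simply be retried a few times. The sketch below is only an illustration of that idea: openWithRetry is a hypothetical helper name, not part of PhantomJS, and it assumes only that page.open() passes "success" or "fail" to its callback. It is written in plain ES5 so it should run under PhantomJS 1.9.

```javascript
// Sketch only: openWithRetry is a hypothetical helper, not a PhantomJS API.
// `page` is any object with an open(url, callback) method, such as the
// object returned by require('webpage').create().
function openWithRetry(page, url, maxAttempts, done) {
    var attempt = 0;

    function tryOnce() {
        attempt += 1;
        page.open(url, function (status) {
            if (status === 'success') {
                // Loaded: report how many attempts it took.
                done(null, attempt);
            } else if (attempt < maxAttempts) {
                console.log('Load failed (' + status + '), retrying ' + url);
                tryOnce();
            } else {
                // Out of attempts: hand the failure to the caller.
                done(new Error('Failed to load ' + url + ' after ' +
                               attempt + ' attempts'), attempt);
            }
        });
    }

    tryOnce();
}
```

In the script above, the page.open(url, ...) call could be swapped for openWithRetry(page, url, 3, function(err, attempts){ ... }), with the iframe extraction running only when err is null. Retrying hides the symptom rather than explaining it, so logging the number of attempts is worth keeping.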