Page doesn't always load in PhantomJS

Anthony McKeever

Jan 11, 2016, 23:33:54
to phantomjs

Hey All,

I'm working on a web page scraper that uses PhantomJS, and one of the pages I'm scraping only loads about 40% of the time.  I'm not sure whether I'm doing something wrong or whether it's a problem in PhantomJS.

The URL I'm trying to hit is 'http://www.truefitcollegecounseling.com'.  The script below is supposed to print the source (src) of every iframe on the page.  The trouble is, it's a crapshoot whether the code prints the iframes or not.  This isn't the only script I've written that fails to load this site on some runs.

To run this, just point phantom at the JavaScript file.  Example: /path/to/phantom/bin/phantom ~/path/to/file/test.js

You'll also need a copy of the latest jQuery.  Just change the path passed to page.injectJs() to wherever your copy lives.
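One thing worth checking with the jQuery path: as far as I know, a leading "~" is expanded by the shell, not by PhantomJS, so a path like "~/Downloads/..." handed straight to page.injectJs() may not resolve. A minimal sketch of working around that follows. The helper name expandHome is hypothetical, and reading the home directory from require('system').env.HOME assumes that variable is set.

```javascript
// Hypothetical helper (not part of PhantomJS): replace a leading "~"
// with an explicit home directory so the path no longer depends on
// shell expansion.
function expandHome(path, homeDir) {
  if (path === '~') {
    return homeDir;
  }
  if (path.indexOf('~/') === 0) {
    return homeDir + path.slice(1);
  }
  return path; // absolute or relative paths are returned unchanged
}

// Possible use inside the PhantomJS script (assumes HOME is set):
//   var home = require('system').env.HOME;
//   var ok = page.injectJs(expandHome('~/Downloads/jquery-2.1.4.min.js', home));
//   if (!ok) { console.log('jQuery could not be injected'); }
```

page.injectJs() returns true or false, so logging the result is a cheap way to rule the path out as the cause.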

Some other details:
  • Code is being used in both CentOS 6.5 and LinuxMint 17.3 (where LinuxMint is the dev environment); both OSes exhibit the same behavior.
  • The script would usually be called by a NodeJS project (using Node version 4.2.3), in which case a phantom.exit(); would be added to the end of the iterateUrls function. However, executing via the Node project or directly through phantom causes the same behavior.
  • This code has been tested with PhantomJS v1.9.17, v1.9.18, and v1.9.19
  • PhantomJS was downloaded through NPM.

Any thoughts?  And, thanks in advance!

Code Sample:
var webPage = require('webpage');
var page = webPage.create();
var urls = ['http://www.truefitcollegecounseling.com'];

page.settings.userAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36";

// Relay console output from the page context to the PhantomJS console.
page.onConsoleMessage = function(msg){
 //system.stderr.writeline('console: ' + msg);
 console.log(msg);
};

var phantomOps = function(url)
{
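 page.open(url, function(status){
  // Report a failed load instead of evaluating an empty page.
  if (status !== 'success')
  {
   console.log("Failed to load " + url + " (status: " + status + ") - Press CTRL-C to exit.");
   return;
  }

  page.injectJs("~/Downloads/jquery-2.1.4.min.js");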
  var iFrames = page.evaluate(function(){
   var urlArray = [];
   var frames = $("iframe");

   for(var i = 0; i < frames.length; i++)
   {
    //console.log(frames[i].src);
    urlArray.push(frames[i].src);
   }

   return urlArray;
  });

  for(var i = 0; i < iFrames.length; i++)
  {
   console.log(iFrames[i]);
  }
  
  console.log("Done - Press CTRL-C to exit.");
 });
}

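var iterateUrls = function()
{
 // Drain the queue with a while loop: a for loop that also calls
 // shift() skips entries, because i grows while the array shrinks.
 while(urls.length > 0)
 {
  var url = urls.shift();
  phantomOps(url);
 }
}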

iterateUrls();
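
If the urls array ever holds more than one entry, the loads probably need to run one after another, since every call goes through the same page object. Here is a rough sketch of that, not a tested fix for the loading problem. processQueue and openPage are hypothetical names; openPage stands in for a function that calls page.open() and invokes its callback once the page has been handled.

```javascript
// Visit URLs strictly one at a time. `openPage(url, done)` is a
// hypothetical stand-in for a wrapper around page.open() that calls
// `done` when the page has finished loading and been processed.
function processQueue(urls, openPage, onFinished) {
  var queue = urls.slice(); // copy so the caller's array is untouched
  var visited = [];

  function next() {
    if (queue.length === 0) {
      onFinished(visited); // e.g. a place to call phantom.exit()
      return;
    }
    var url = queue.shift();
    openPage(url, function () {
      visited.push(url);
      next(); // start the next URL only after this one is done
    });
  }

  next();
}
```

In the script above, openPage would wrap the body of phantomOps, and onFinished would be the natural spot for the phantom.exit() that the Node project adds.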
