Node.js Multi-page Crawler

Ali Reza

Mar 28, 2015, 6:10:05 PM
to nod...@googlegroups.com
I'm trying to crawl the pages of a website. Here is my sample code. I'm using Stack Overflow only as a test target; I don't actually want to crawl Stack Overflow.

In this code I want to collect every link on the page and push it into an array, then visit each link and search for "Node" (it's just a test).

Here is my code:

var request = require('request');
var cheerio = require('cheerio');

var queue = [];
request(siteUrl, function(error, response, html){
    if(!error){
        var $ = cheerio.load(html);

        // Extract all links in page
        $('.question-summary').each(function(){
            var url = $(this).children().last().children().first().children().first().attr('href');
            queue.push("http://stackoverflow.com"+url);
        });
    }

    // Search for Node.js on every question.
    var i,
        item;

    for(i = 3; i < queue.length; i++) {
        item = queue[i];
        request(item, function(error, response, html){
            var page = cheerio.load(html);
            console.log(page);
        });
    }
});


When I run this code, it gives this error:

TypeError: Cannot read property 'parent' of undefined

I think there is something wrong with cheerio, but I don't know how to fix this.

Adrien Risser

Mar 28, 2015, 9:47:53 PM
to nod...@googlegroups.com
Ali,

Here's what I did:

$ npm install cheerio request
 
$ npm init
<snip>
 
$ cat package.json
<snip>
"dependencies": {
    "cheerio": "^0.19.0",
    "request": "^2.54.0"
  },
<snip>
 
$ node .
{ [Function]
  fn:
   { constructor: [Circular],
     _originalRoot:
      { type: 'root',
        name: 'root',
        attribs: {},
        children: [Object],
        next: null,
        prev: null,
        parent: null } },
  load: [Function],
  html: [Function],
  xml: [Function],
  text: [Function],
  parseHTML: [Function],
  root: [Function],
  contains: [Function],
  _root:
   { type: 'root',
     name: 'root',
     attribs: {},
     children: [ [Object], [Object], [Object] ],
     next: null,
     prev: null,
     parent: null },
  _options:
   { withDomLvl1: true,
     normalizeWhitespace: false,
     xmlMode: false,
     decodeEntities: true } }
{ [Function]
  fn:
   { constructor: [Circular],
     _originalRoot:
      { type: 'root',
        name: 'root',
        attribs: {},
...

No error, maybe check your package versions.

Cheers,

--
Adrien Risser,
Freelance Node.js Consultant

Satheesh Natesan

Mar 29, 2015, 9:33:17 PM
to nod...@googlegroups.com
The issue may not be in cheerio. There is no error check on the requests for the individual links. Probably one of those requests is failing, so there is no HTML to parse.
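
To illustrate the missing check, here is a minimal sketch of a guard that could run before cheerio.load(). The helper name usableHtml is hypothetical (it is not part of request or cheerio), and treating only a 200 status with a non-empty string body as usable is an assumption made for this example, not something the original code does.

```javascript
// Hypothetical helper (not part of request or cheerio): returns the body
// only when it looks safe to parse, otherwise null.
function usableHtml(error, response, html) {
  // The request itself failed (DNS error, timeout, connection reset, ...).
  if (error) return null;
  // Assumption for this sketch: anything other than HTTP 200 is skipped.
  if (!response || response.statusCode !== 200) return null;
  // No body, or a non-string body, cannot be parsed as HTML.
  if (typeof html !== 'string' || html.length === 0) return null;
  return html;
}

// Possible use inside the loop from the original post
// (request, cheerio and item as in that code):
//
// request(item, function (error, response, html) {
//   var body = usableHtml(error, response, html);
//   if (body === null) {
//     console.error('Skipping ' + item);
//     return;
//   }
//   var page = cheerio.load(body);
//   console.log(page('title').text());
// });
```

With a guard like this, a failed link is skipped instead of passing an undefined body on to cheerio, which would at least show whether a failing request is what triggers the error.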

Ali Reza

Mar 30, 2015, 12:42:21 PM
to nod...@googlegroups.com
Thanks Adrien, my package versions were
  "cheerio": "^0.18.0",
  "request": "^2.53.0"
I updated to your versions and it works!