Project question -> crawler behavior

Igor Araújo

Dec 5, 2012, 6:55:16 PM
to csc-32...@googlegroups.com
When I run the crawler and print the links, print sometimes raises an exception saying that a character is outside the ASCII range, and then it doesn't show the link... Is this supposed to happen?

Also, our site should run with the links database already generated, right? I mean, the crawler is supposed to run once before the site starts running, right? And should the crawler be run by the site itself or externally?

Wesley May

Dec 5, 2012, 7:40:30 PM
to csc-32...@googlegroups.com
You can ignore links if they contain non-ASCII Unicode characters.
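
If it helps, here's a minimal sketch (not the assignment's crawler code) of one way to filter them out before printing; the is_ascii helper and the example URLs are made up for illustration:

def is_ascii(url):
    # encode() raises a UnicodeEncodeError/UnicodeDecodeError whenever the
    # URL contains a character outside the ASCII range
    try:
        url.encode("ascii")
        return True
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False

for link in [u"http://example.com/page", u"http://example.com/p\u00e1gina"]:
    if is_ascii(link):
        print(link)  # only links made entirely of ASCII characters get printed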

Yeah, you run the crawler once to build the database, and then the search engine uses that. The crawler is run manually; it isn't operated by the site.

Igor Araújo

Dec 6, 2012, 4:23:01 PM
to csc-32...@googlegroups.com
Thank you!

I would like to ask one more thing (I tried to search for the answer but didn't find anything...)

For a disconnected graph, PageRank will not converge to the value it is supposed to, right? I mean, the sum of all the PageRanks won't be 1 or the number of sites (depending on which version of PageRank is implemented).

Igor Araújo

Dec 6, 2012, 4:29:53 PM
to csc-32...@googlegroups.com
Actually, what I would like to know is whether PageRank fails to converge if there are sites without outlinks...

Wesley May

Dec 6, 2012, 6:57:28 PM
to csc-32...@googlegroups.com
My intuition is that PageRank should definitely converge if the graph is disconnected, or if it has dead-end nodes. After all, the internet is such a graph.

If you have a small example graph that the given algorithm doesn't converge on, let me know.

Igor Araújo

Dec 6, 2012, 8:54:00 PM
to csc-32...@googlegroups.com
I'm asking because, after implementing my own PageRank, I've noticed that if I have, for example, 3 sites A, B and C that all link to each other, the sum of the PageRanks is the number of sites (for my implementation, this is the expected convergence). However, if I add one more site, D, that only has inlinks, then the sum is not 4, as it should be.

I've also noticed the same thing happens with the PageRank you gave us... If you use the examples that are given, the first example does not converge (the sum of the PageRanks is 0.27), but the second does (the sum is 1).
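
To make it concrete, here is a rough sketch of the kind of iteration I mean (not the handout's code), using pr[p] = (1 - d) + d * sum of pr[q]/outdeg(q) over the inlinks q; the exact links into D are just an example:

def pagerank(graph, d=0.85, iterations=50):
    # graph maps each page to the list of pages it links to
    ranks = {page: 1.0 for page in graph}
    for _ in range(iterations):
        new_ranks = {}
        for page in graph:
            incoming = sum(ranks[q] / len(graph[q])
                           for q in graph if page in graph[q])
            new_ranks[page] = (1 - d) + d * incoming
        ranks = new_ranks
    return ranks

# A, B and C all link to each other: every rank settles at 1, so the sum is 3.
triangle = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}
print(sum(pagerank(triangle).values()))   # ~3.0

# Add D with inlinks only: the ranks still settle on finite values, but the
# mass flowing into the dead end is lost, so the sum comes out near 1.5, not 4.
dead_end = {"A": ["B", "C", "D"], "B": ["A", "C", "D"],
            "C": ["A", "B", "D"], "D": []}
print(sum(pagerank(dead_end).values()))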

Wesley May

Dec 6, 2012, 10:34:59 PM
to csc-32...@googlegroups.com
Maybe we don't agree on what "converge" means :D

I don't think it really matters what the final sum is. You can always normalize the page ranks to get any sum you want. As long as there's some finite output (not infinity or NaN) for each page, you should be good to go.
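
For example, assuming ranks is whatever dict of scores your PageRank ends up with, rescaling it to sum to 1 is just:

total = sum(ranks.values())
normalized = {page: rank / total for page, rank in ranks.items()}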

Igor Araújo

Dec 6, 2012, 10:54:38 PM
to csc-32...@googlegroups.com
Hahaha... "convergence" definitely wasn't the best way to phrase the question, haha.
Thank you for the replies, anyway.