Computing LinkRank or PageRank on Common Crawl


srober...@gmail.com

Jun 10, 2014, 11:42:41 PM
to common...@googlegroups.com
How can I compute LinkRank or PageRank on Common Crawl? My understanding is that all the data to do that should be there already...

Stephen Merity

Jun 16, 2014, 7:11:03 PM
to common...@googlegroups.com
Hey there,

If you're interested in running LinkRank or PageRank, there are a number of options. If it's your first time learning about and running these algorithms, I'd suggest starting with either a smaller dataset or a preprocessed one, such as the Hyperlink Graph provided by the Web Data Commons, which is based on the Common Crawl dataset. You are correct, however, that the dataset already has all the information you need -- each page has a list of its outgoing links, which is the only information these algorithms require.

PageRank is an iterative algorithm, meaning you need multiple passes over the data. To run as fast as possible, the dataset should fit entirely in memory, which is where the sheer size of Common Crawl complicates things.
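To make the iteration concrete, here is a minimal sketch of PageRank as power iteration over per-page outgoing-link lists. The toy graph, damping factor, and iteration count are illustrative assumptions, not values taken from Common Crawl:

```python
# Minimal PageRank power iteration over a toy link graph.
# NOTE: the graph, damping factor, and iteration count below are
# illustrative assumptions -- not from the Common Crawl data itself.

DAMPING = 0.85      # standard damping factor from the PageRank paper
ITERATIONS = 20     # fixed iteration count; real code checks convergence

# Each page maps to its list of outgoing links -- the only
# information the algorithm needs, as noted above.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def pagerank(links, damping=DAMPING, iterations=ITERATIONS):
    pages = list(links)
    n = len(pages)
    # Start with a uniform distribution over all pages.
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Base rank from random teleportation.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        # Each page splits its current rank among its outgoing links.
        for page, outgoing in links.items():
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank
    return rank

ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Note that each iteration reads every page's rank and link list, which is why multiple full passes over the data are needed and why keeping the graph in memory matters so much at Common Crawl scale.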

I'd suggest reading "Wiki PageRank using Hadoop" to see how to implement the iterative algorithm I mentioned, or using a pre-existing package that provides PageRank, such as GraphX [which runs on Spark, which runs on top of Hadoop -- it's turtles all the way down!].

Robert Meusel

Jul 15, 2014, 8:57:33 AM
to common...@googlegroups.com
You can also have a look here: http://wwwranking.webdatacommons.org/ 

We used the data from the hyperlink graph, aggregated by host, to calculate the different metrics. We are also currently working on a new graph (2014 data) which will be released soon.