Question re host- and domain-level web graphs and rankings


Ed Coughlan

Mar 29, 2022, 7:08:58 AM
to Common Crawl
Hello all,

I've read in a number of articles that only Google knows the actual PageRank score for a page ( 

If that's the case, how can Common Crawl produce a list of 90 million domain ranks, ranked by Harmonic Centrality or PageRank? Is it to do with the fact that Google's patent "Producing a ranking for pages using distances in a web-link graph" is available?

Thanks (and apologies if I'm missing something obvious).


Sebastian Nagel

Mar 29, 2022, 8:04:07 AM
Hi Ed,

the answer depends on whether "PageRank" is read as
- an algorithm
- an (expired) patent
- a trademark owned by Google

Both PageRank and harmonic centrality are used as algorithms
to rank hosts (sites) and registered domains, respectively, using
the hyperlinks between them known to Common Crawl's crawler.
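
To make the "algorithm" reading concrete, here is a minimal sketch of PageRank by power iteration in plain Python. It is only an illustration under stated assumptions: the four hosts and their links are made up, the damping factor 0.85 and the 50 iterations are conventional choices rather than Common Crawl's settings, and this is not the code used to produce the published ranks.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {node: [nodes it links to]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outlinks is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        # Every other node passes its damped rank on in equal shares.
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical host-level graph: each key links to the hosts in its list.
host_links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}

ranks = pagerank(host_links)
for host, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(f"{host}\t{score:.4f}")
```

Anyone with a link graph can run this computation; the scores simply depend on which links the crawler has seen.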

Harmonic centrality ranks are also used as a relevance signal
to prioritize which sites are crawled.
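
For illustration, harmonic centrality of a node can be taken as the sum of 1/d over all other nodes that can reach it, where d is the shortest-path distance to that node. The sketch below computes it exactly with one breadth-first search per node on a made-up four-host graph. It is an assumption-laden toy: an exact computation like this does not scale to a web-sized graph, and it is not the code behind the published ranks.

```python
from collections import deque


def harmonic_centrality(links):
    """Exact harmonic centrality over a dict {node: [nodes it links to]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    # Reverse the edges so a BFS from x walks backwards along incoming paths.
    reverse = {node: [] for node in nodes}
    for source, targets in links.items():
        for target in targets:
            reverse[target].append(source)

    scores = {}
    for start in nodes:
        distance = {start: 0}
        queue = deque([start])
        total = 0.0
        while queue:
            current = queue.popleft()
            for previous in reverse[current]:
                if previous not in distance:
                    distance[previous] = distance[current] + 1
                    # Each node that can reach `start` contributes 1/distance.
                    total += 1.0 / distance[previous]
                    queue.append(previous)
        scores[start] = total
    return scores


# Hypothetical host-level graph: each key links to the hosts in its list.
host_links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}

scores = harmonic_centrality(host_links)
for host, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{host}\t{score:.4f}")
```

Hosts that nothing links to score zero, while hosts that are reachable in few hops from many others score highest.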

We use the WebGraph and LAW software libraries
to build the webgraphs and rank the nodes.

Finally, two recommended papers regarding page rank and centrality
measures in general:
There's also a nice talk by one of the authors, Paolo Boldi:
(there are newer talks by Boldi on the same topic on YouTube,
but I haven't seen them; this one is definitely worth watching)



Ed Coughlan

Mar 29, 2022, 8:33:33 AM
Thanks Sebastian, that's very helpful!




Bpm Tips

Jun 15, 2022, 2:26:42 PM
to Common Crawl
If you want to rank by number of URLs, that can be done with the columnar index.

E.g., the list of all domains with their URL counts is available at the following link.

Query used:

val sqlDF = sqlContext.sql("SELECT url_host_name AS domain, count(*) AS size FROM urls GROUP BY url_host_name ORDER BY size DESC")
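
To show what a per-host count returns, here is a small self-contained sketch in Python. The assumptions are loud ones: Python's built-in sqlite3 stands in for Spark SQL, and the `urls` table is a made-up toy with six rows rather than the real columnar index. The aggregation groups by the host column, which is what produces one row per host.

```python
import sqlite3

# In-memory toy table standing in for the columnar index (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (url TEXT, url_host_name TEXT)")
conn.executemany(
    "INSERT INTO urls VALUES (?, ?)",
    [
        ("https://a.example/", "a.example"),
        ("https://a.example/about", "a.example"),
        ("https://a.example/blog", "a.example"),
        ("https://b.example/", "b.example"),
        ("https://b.example/faq", "b.example"),
        ("https://c.example/", "c.example"),
    ],
)

# One row per host, largest hosts first.
query = """
SELECT url_host_name AS domain, count(*) AS size
FROM urls
GROUP BY url_host_name
ORDER BY size DESC
"""
rows = conn.execute(query).fetchall()
for domain, size in rows:
    print(f"{domain}\t{size}")
```

Note that this ranks hosts by how many of their URLs are in the table, which is a different signal from the link-based ranks discussed earlier in the thread.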

If you need to run certain Spark SQL queries on the columnar index, let us know; we can publicly post the query results in CSV format.
