Question re host- and domain-level web graphs and rankings


Ed Coughlan

Mar 29, 2022, 7:08:58 AM
to Common Crawl
Hello all,

I've read in a number of articles that only Google knows the actual PageRank score for a page (https://searchengineland.com/rip-google-pagerank-retrospective-244286). 

If that's the case, how can Common Crawl produce a list of 90 million domains ranked by harmonic centrality or PageRank? Does it have to do with the fact that Google's patent "Producing a ranking for pages using distances in a web-link graph" is publicly available?

Thanks (and apologies if I'm missing something obvious).

Ed


Sebastian Nagel

Mar 29, 2022, 8:04:07 AM
to common...@googlegroups.com
Hi Ed,

The answer depends on whether PageRank is read as
- an algorithm
- an (expired) patent
- a trademark owned by Google

Both PageRank and harmonic centrality are used as algorithms
to rank hosts (sites) and registered domains, based on the
hyperlinks between them that Common Crawl's crawler has seen.
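
To make the "algorithm" reading concrete, here is a minimal sketch of PageRank on a host-level graph. It is not Common Crawl's actual pipeline, only a plain power iteration on a toy graph. The function names, the sample URLs, the damping factor of 0.85, and the uniform handling of dangling nodes are all assumptions chosen for illustration.

```python
from urllib.parse import urlsplit


def host_edges(page_links):
    """Collapse page-level links into a set of distinct host-to-host edges."""
    edges = set()
    for src_url, dst_url in page_links:
        src = urlsplit(src_url).hostname
        dst = urlsplit(dst_url).hostname
        # Skip unparsable URLs and links that stay on the same host.
        if src and dst and src != dst:
            edges.add((src, dst))
    return edges


def pagerank(edges, damping=0.85, iterations=50):
    """Power iteration; nodes without outlinks spread their rank uniformly."""
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src in nodes:
            targets = out_links[src]
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical sample links, for illustration only.
page_links = [
    ("https://a.example/1", "https://b.example/"),
    ("https://a.example/2", "https://b.example/x"),
    ("https://a.example/1", "https://a.example/2"),  # same host, dropped
    ("https://c.example/", "https://b.example/"),
    ("https://b.example/", "https://c.example/about"),
]
edges = host_edges(page_links)
ranks = pagerank(edges)
for host in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{host}\t{ranks[host]:.4f}")
```

Anyone can run this computation on any link graph they have. The resulting scores depend on that graph, so they will not match scores computed on a different crawl.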

Harmonic centrality ranks are also used as a relevance signal
to prioritize which sites are crawled.
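
For comparison, harmonic centrality can be sketched in a few lines. It follows the definition in the "Axioms for Centrality" paper linked below: the score of a node x is the sum of 1/d(y, x) over all other nodes y, and unreachable nodes contribute 0. The exact breadth-first search per node shown here only illustrates the definition and would not be practical at web scale. The function name and the toy edge set are assumptions.

```python
from collections import deque


def harmonic_centrality(edges):
    """H(x) = sum over y != x of 1 / d(y, x); unreachable y contribute 0."""
    nodes = sorted({node for edge in edges for node in edge})
    predecessors = {node: [] for node in nodes}
    for src, dst in edges:
        predecessors[dst].append(src)
    scores = {}
    for target in nodes:
        # BFS backwards along links: dist[y] is the length of the
        # shortest path from y to target.
        dist = {target: 0}
        queue = deque([target])
        while queue:
            current = queue.popleft()
            for pred in predecessors[current]:
                if pred not in dist:
                    dist[pred] = dist[current] + 1
                    queue.append(pred)
        scores[target] = sum(1.0 / d for d in dist.values() if d > 0)
    return scores


# Hypothetical host-level edges (source host, target host).
sample_edges = {
    ("a.example", "b.example"),
    ("c.example", "b.example"),
    ("b.example", "c.example"),
}
scores = harmonic_centrality(sample_edges)
```

On this toy graph b.example scores 2.0 (two hosts at distance 1), c.example scores 1.5 (one host at distance 1, one at distance 2), and a.example scores 0 because nothing links to it.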

We use the WebGraph and LAW software libraries
https://webgraph.di.unimi.it/
https://law.di.unimi.it/
to build the webgraphs and rank the nodes.

Finally, two recommended papers regarding PageRank and centrality
measures in general:
https://vigna.di.unimi.it/ftp/papers/PageRankDependencies.pdf
https://vigna.di.unimi.it/ftp/papers/AxiomsForCentrality.pdf
There's also a nice talk by one of the authors, Paolo Boldi:
https://events.yandex.ru/events/science-seminars/boldi-23sep
https://www.youtube.com/watch?v=cnGJtGP4gL4
(there are newer talks by Boldi on the same topic on YouTube
but I haven't seen them - this one is definitely worth watching)

Best,
Sebastian


Ed Coughlan

Mar 29, 2022, 8:33:33 AM
to common...@googlegroups.com
Thanks Sebastian, that's very helpful!

Best,

Ed


Bpm Tips

Jun 15, 2022, 2:26:42 PM
to Common Crawl
If you want to rank by number of URLs, that can be done with the columnar index.

E.g., the list of all domains with their URL counts is available at the following link.

Query used:

val sqlDF = sqlContext.sql("SELECT url_host_name AS domain, COUNT(*) AS size FROM urls GROUP BY url_host_name ORDER BY size DESC")

If you need to run certain Spark SQL queries on the columnar index, let us know; we can publicly post the query results in CSV format.
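
For small samples, the aggregation this query aims at (group by host name, count rows, sort by count) can also be sketched without Spark. This is only an illustration on a hypothetical list of URLs. The function name is an assumption, and the host is taken from the URL itself rather than from the index's url_host_name column.

```python
from collections import Counter
from urllib.parse import urlsplit


def urls_per_host(urls):
    """Count URLs per host name, returned in descending order of count."""
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname
        if host:  # ignore strings that do not parse to a host
            counts[host] += 1
    return counts.most_common()


# Hypothetical sample URLs, for illustration only.
sample_urls = [
    "https://a.example/",
    "https://a.example/about",
    "https://b.example/",
    "https://a.example/contact",
]
host_counts = urls_per_host(sample_urls)
```

Note that this ranks by size (number of captured URLs), which is a different signal from the link-based ranks discussed above.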

