Page Rank in 2025

Michele Bertasi

Dec 20, 2025, 12:58:03 PM
to Common Crawl
Hey all,

I'm looking for some feedback on a side project I worked on back in October, around finding expired domains that might be worth buying.

https://blog.mbrt.dev/posts/domain-resurrect/

Let me know what you think!

Cheers,
Michele

Sebastian Nagel

Dec 20, 2025, 4:35:02 PM
to common...@googlegroups.com
Hi Michele,

thanks for the interesting blog post and for sharing the datasets.

Very interesting!

Actually, the first to extract links and build graphs from
Common Crawl data was the Web Data Commons research group
at the University of Mannheim:
https://webdatacommons.org/hyperlinkgraph/index.html
http://wwwranking.webdatacommons.org/

Common Search also calculated PageRank on a host-level graph
built on Common Crawl data:
https://github.com/commonsearch/cosr-back/tree/master/spark/jobs

Since 2017, Common Crawl has been creating its own web graphs:
https://commoncrawl.org/web-graphs
https://commoncrawl.github.io/cc-webgraph-statistics/

Spark is used to extract the links and to turn them into a numeric
graph representation. The job definitions are in
https://github.com/commoncrawl/cc-pyspark/
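
Roughly (this is not the actual cc-pyspark job, just a minimal
PySpark sketch with made-up host names and column names):

  from pyspark.sql import SparkSession, functions as F

  spark = SparkSession.builder.appName("host-graph-sketch").getOrCreate()

  # toy input: one (source host, target host) pair per extracted link
  edges = spark.createDataFrame(
      [("example.com", "example.org"), ("example.org", "example.net")],
      ["src_host", "dst_host"])

  # assign a consecutive numeric ID to every distinct host
  hosts = (edges.select(F.col("src_host").alias("host"))
           .union(edges.select(F.col("dst_host").alias("host")))
           .distinct())
  ids = (hosts.rdd.zipWithIndex()
         .map(lambda x: (x[0]["host"], x[1]))
         .toDF(["host", "id"]))

  # replace host names by their IDs -> numeric edge list
  src_ids = ids.withColumnRenamed("host", "src_host").withColumnRenamed("id", "src_id")
  dst_ids = ids.withColumnRenamed("host", "dst_host").withColumnRenamed("id", "dst_id")
  numeric = (edges.join(src_ids, "src_host")
             .join(dst_ids, "dst_host")
             .select("src_id", "dst_id"))
  numeric.show()

The numeric edge list can then be converted into the compressed
format used by the WebGraph tools.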

PageRank and harmonic centrality are calculated with the WebGraph
framework developed at the Laboratory for Web Algorithmics at the
University of Milan:
https://law.di.unimi.it/
https://law.di.unimi.it/software.php#webgraph

Interestingly, the WebGraph framework isn't "off-core": it requires
that the graph fits into memory, but it uses a very compact,
compressed graph representation. The domain-level graphs with about
100 million nodes can be processed on a laptop with 32 GiB of RAM.
Vigna and Boldi even wrote about "in-core" computation:
https://www.quantware.ups-tlse.fr/FETNADINE/papers/P4.7.pdf

You'll find more about the Common Crawl web graphs in
https://github.com/commoncrawl/cc-webgraph
https://github.com/commoncrawl/wac2025-webgraph-workshop/

You can calculate (Anti)TrustRank using the WebGraph tools. Just
initialize a preference vector that distributes a total score of 1.0
only over the trusted or spammy nodes, then calculate PageRank either
on the graph (AntiTrustRank) or on its transpose (TrustRank). Here's
a tool to initialize the preference vector:

https://github.com/commoncrawl/cc-webgraph/blob/main/src/main/java/org/commoncrawl/webgraph/CreatePreferenceVector.java
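
If it helps, the idea boils down to a tiny power iteration (plain
Python, not the WebGraph tooling; graph and seed set are made up):

  # Toy personalized PageRank: teleport only to the seed nodes.
  # Run it on the graph or on its transpose depending on which
  # direction you want the score to propagate.
  def personalized_pagerank(graph, seeds, d=0.85, iters=50):
      """graph: dict node -> list of successors, seeds: set of nodes."""
      nodes = list(graph)
      # preference vector: a total score of 1.0 spread only over the seeds
      pref = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
      rank = dict(pref)
      for _ in range(iters):
          nxt = {n: (1.0 - d) * pref[n] for n in nodes}
          for n in nodes:
              out = graph[n]
              if not out:  # dangling node: hand its mass back to the seeds
                  for m in nodes:
                      nxt[m] += d * rank[n] * pref[m]
              else:
                  share = d * rank[n] / len(out)
                  for m in out:
                      nxt[m] += share
          rank = nxt
      return rank

  # tiny example graph over four host IDs, with 0 as the only seed
  g = {0: [1], 1: [2], 2: [0, 1], 3: [2]}
  print(personalized_pagerank(g, seeds={0}))

On the real graphs you'd of course use the WebGraph implementation;
this is only meant to show where the preference vector enters.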

However, the issue I've had with both TrustRank and
AntiTrustRank is that you need very clean preference
vectors, and those are hard to get:
- as you mentioned, large, otherwise good domains may include
  links to spam
- spam sites are short-lived; what worked last month
  doesn't work the next.

I'll take a closer look at your "Spamicity" approach.

Regarding the "damping factor": yes, that's the explanation, see
https://www.cise.ufl.edu/~adobra/DaMn/talks/damn05-santini.pdf

Thanks again!

Best,
Sebastian


On 12/20/25 15:11, Michele Bertasi wrote:
> Hey all,
>
> I'm looking for some feedback on a side project I worked on back in
> October, around finding expired domains that might be worth buying.
>
> https://blog.mbrt.dev/posts/domain-resurrect/

Michele Bertasi

Dec 22, 2025, 10:18:55 AM
to common...@googlegroups.com
Hi Sebastian,

Thanks for all the pointers! Quite embarrassing that I missed Common Crawl's webgraph. I should look into how different it is from my own dataset. I expect them to match quite closely.

Best,
Michele