February 2026 Crawl and Web Graphs


Thom Vaughan

Feb 24, 2026, 8:41:11 AM
to Common Crawl
Howdy,


The February 2026 crawl consists of 2.1 billion web pages (or 363 TiB of uncompressed content). Captures are from 45.5 million hosts or 37.1 million registered domains.

The corresponding Web Graph release consists of 288.6 million nodes and 12.4 billion edges at the host level, and 134.2 million nodes and 5.4 billion edges at the domain level.
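For a rough sense of graph density, the node and edge counts above can be turned into an average number of outgoing edges per node. This is only a back-of-the-envelope sketch: the helper name `mean_out_degree` is made up for illustration, and the inputs are the rounded figures quoted above, so the results are approximate.

```python
def mean_out_degree(nodes: float, edges: float) -> float:
    """Average number of outgoing edges per node in a directed graph."""
    return edges / nodes


# Rounded figures quoted in the announcement above.
host_level = mean_out_degree(nodes=288.6e6, edges=12.4e9)
domain_level = mean_out_degree(nodes=134.2e6, edges=5.4e9)

print(f"host level:   {host_level:.1f} edges per node")
print(f"domain level: {domain_level:.1f} edges per node")
```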


We've also recently launched our new Examples & Resources browser, which you can use to discover tools and other projects in the community that make use of Common Crawl data. If you've got something you think would be a good addition, please let us know; we'd be thrilled to include it.

Have fun!
TV

Bahar Zafer

Feb 24, 2026, 12:12:19 PM
to common...@googlegroups.com
Thanks for the amazing work!

I wonder whether the Web Graph data contains edge weights (perhaps defined as the frequency of hyperlinks between two hosts or two domains). 

Best regards, 
Bahar Zafer


Thom Vaughan

Feb 24, 2026, 12:40:13 PM
to Common Crawl
Hi Bahar, thanks!

No, the Web Graph edges aren't weighted. They're represented as directed (from, to) pairs, so multiple links from host A to host B still result in a single edge with no count attached. The same holds at the domain level after aggregation.
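To make the difference concrete, here is a minimal sketch contrasting the two representations. The host names, the `links` list, and both helper functions are invented for illustration and are not part of cc-webgraph; the weighted variant assumes you have per-link observations from some other source, since the released graphs carry no counts.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

Link = Tuple[str, str]  # (from_host, to_host)


def unweighted_edges(links: Iterable[Link]) -> Set[Link]:
    """Collapse repeated links into distinct directed (from, to) pairs."""
    return set(links)


def weighted_edges(links: Iterable[Link]) -> Dict[Link, int]:
    """Count how often each directed (from, to) pair was observed."""
    return dict(Counter(links))


# Hypothetical link observations: a.example links to b.example three times.
links = [
    ("a.example", "b.example"),
    ("a.example", "b.example"),
    ("a.example", "b.example"),
    ("b.example", "a.example"),
    ("a.example", "c.example"),
]

edges = unweighted_edges(links)
weights = weighted_edges(links)

print(len(edges), "distinct directed edges")
print(weights[("a.example", "b.example")], "links from a.example to b.example")
```

Note that direction is preserved in both cases: (a.example, b.example) and (b.example, a.example) are separate edges.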

The cc-webgraph repo has the tools used to construct and process the graphs. The underlying WebGraph framework (by Boldi and Vigna), which is used to compress and process them, also doesn't natively support edge weights in its .graph format.

TV
