April 2026 Crawl and Web Graphs

Luca Foppiano

Apr 30, 2026, 11:22:41 AM
to Common Crawl
Hi all,


The April 2026 crawl consists of 2.19 billion web pages (or 379.2 TiB of uncompressed content). Captures are from 43.2 million hosts or 35.4 million registered domains and include 660.5 million new URLs, not visited in any of our prior crawls.

Starting from this crawl, revisit records now use the WARC header Content-Type: application/http;msgtype=response (previously message/http), aligning with iipc/warc-specifications#55 for consistency with other HTTP response records.
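
For anyone whose tooling keys on the Content-Type of revisit records, here is a minimal sketch of a check that accepts both the old and the new value, so the same code can process earlier crawls and this one. The function names are hypothetical, and the tolerance for optional whitespace, letter case, and quotes around the parameter is my own assumption rather than anything stated above. It only looks at the header string, so it works with whatever WARC reader you already use.

```python
# Content-Type values for revisit records, as described in the announcement.
# Each entry is (media type, msgtype parameter or None).
NEW_REVISIT_TYPE = ("application/http", "response")  # April 2026 crawl onward
OLD_REVISIT_TYPE = ("message/http", None)            # earlier crawls


def parse_content_type(value):
    """Split a Content-Type value into (media type, msgtype or None).

    Assumption: whitespace around ';' and '=' and letter case are ignored.
    """
    parts = [p.strip() for p in value.split(";")]
    media_type = parts[0].lower()
    msgtype = None
    for param in parts[1:]:
        name, _, val = param.partition("=")
        if name.strip().lower() == "msgtype":
            msgtype = val.strip().strip('"').lower()
    return media_type, msgtype


def is_revisit_content_type(value):
    """Return True if value matches the old or the new revisit Content-Type."""
    return parse_content_type(value) in (NEW_REVISIT_TYPE, OLD_REVISIT_TYPE)
```

In practice you would call `is_revisit_content_type(...)` on the Content-Type WARC header of records whose WARC-Type is `revisit`; matching on the header alone is not enough to identify a revisit, since `application/http;msgtype=response` is also what regular response records carry.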

The corresponding Web Graph release consists of 269.0 million nodes and 9.4 billion edges at the host level, and 124.6 million nodes and 4.8 billion edges at the domain level.


Live long and prosper!
Luca
