January 2025 Crawl and Web Graphs

Thom Vaughan

Feb 1, 2025, 11:17:35 PM
to Common Crawl
Hello everyone,

We're pleased to announce the January 2025 crawl (CC-MAIN-2025-05) and the corresponding Web Graph release (cc-main-2024-25-nov-dec-jan).

The January crawl contains 3 billion web pages (or 460 TiB of uncompressed content) fetched between the 12th and the 26th of January. Page captures are from 49 million hosts or 39 million registered domains and include 0.98 billion new URLs, not visited in any of our prior crawls.

An erratum regarding SURT URLs was fixed for this crawl; please see iipc/webarchive-commons#102 for further information. Thank you to Tom Morris for identifying this.

The Web Graph consists of 277.7 million nodes and 2.7 billion edges at the host level, and 100.8 million nodes and 1.9 billion edges at the domain level.

Further info can be found in the links below:

🔗 January 2025 Crawl Announcement
🔗 January 2025 Web Graph Announcement
🔗 Web Graph Statistics

TV

Pascal Wichmann

Feb 2, 2025, 12:07:17 AM
to Common Crawl
Hi CC team,

Many thanks for your hard work.
Are you sure the warc.paths.gz is valid / non-corrupted? I am not able to unpack it as usual.

Cheers
Pascal

Thom Vaughan

Feb 2, 2025, 10:59:39 AM
to Common Crawl
Hi Pascal,

Yes, the file is valid and contains 90,000 paths as expected:

$ zcat warc.paths.gz|wc -l
   90000

You might want to try downloading it again; perhaps your download did not complete successfully.
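
For anyone without zcat at hand, the same check can be sketched in Python. This is only a minimal illustration, not an official Common Crawl tool: the function names count_gzip_newlines and check_paths_file are made up for this example, and the expected count of 90,000 is simply the figure quoted above for this crawl. It decompresses the file in chunks, counts newline bytes (as wc -l does), and reports a failure if the gzip stream is truncated or not gzip at all.

```python
import gzip
import sys
import zlib


def count_gzip_newlines(path, chunk_size=1 << 20):
    """Count newline bytes in a gzip file, like `zcat file | wc -l`.

    Raises OSError, EOFError or zlib.error if the file is not valid gzip
    or the download was cut short.
    """
    total = 0
    with gzip.open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total


def check_paths_file(path, expected=90000):
    """Hypothetical helper: return (ok, message) for a downloaded paths file.

    `expected` defaults to the 90,000 paths quoted in this thread; pass a
    different value for other crawls.
    """
    try:
        n = count_gzip_newlines(path)
    except (OSError, EOFError, zlib.error) as exc:
        return False, f"could not decompress {path}: {exc}"
    if n != expected:
        return False, f"{path} has {n} lines, expected {expected}"
    return True, f"{path} is valid and has {n} lines"


if __name__ == "__main__":
    # Usage: python check_paths.py warc.paths.gz
    ok, message = check_paths_file(sys.argv[1])
    print(message)
    sys.exit(0 if ok else 1)
```

If this reports a decompression error, an incomplete download is the most likely cause; if it succeeds, the file itself is fine and the problem lies with the tool used to unpack it.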

TV

Pascal Wichmann

Feb 2, 2025, 12:46:39 PM
to Common Crawl
Very sorry, it was my default tool (Archive Utility) that errored on your file on two different machines: "Unable to expand "warc.paths.gz" into "Warc Paths January 2025". (Error 79 - Inappropriate file type or format.)"

But gunzip on Mac worked just fine.

I am sorry.


