February 2025 Crawl and Web Graphs

Thom Vaughan

Feb 25, 2025, 5:53:55 AM
to Common Crawl
Hi folks,

The February 2025 crawl archive and corresponding Web Graph release are now available.

The crawl (CC-MAIN-2025-08) contains 2.6 billion web pages (around 402 TiB uncompressed); page captures are from 47.6 million hosts or 38.5 million registered domains and include 1 billion new URLs, not visited in any of our prior crawls.

The Web Graph release (cc-main-2024-25-dec-jan-feb) contains 267.4 million nodes and 2.7 billion edges at the host level, and 106.5 million nodes and 1.9 billion edges at the domain level.

See these links for further info:


TV

William Roe

Feb 28, 2025, 6:19:09 AM
to common...@googlegroups.com
Hi Thom,

FYI It seems like the index server went down last night...

Thanks!

Bill

--
"And in the end, it's not the years in your life that count. It's the life in your years." --Abraham Lincoln

Thom Vaughan

Feb 28, 2025, 6:24:14 AM
to Common Crawl
Hi Bill,

Thanks for your message. The index server didn’t go down last night, but we are aware of ongoing issues which are documented on the index server's home page. Our team is actively monitoring the situation.

TV

William Roe

Feb 28, 2025, 8:39:23 AM
to common...@googlegroups.com
Thom,

lol.... I can't open that url...

This page isn’t working

index.commoncrawl.org didn’t send any data.

ERR_EMPTY_RESPONSE

Bill

Thom Vaughan

Feb 28, 2025, 11:00:56 AM
to Common Crawl
Hi Bill,

We can't know for sure what you did that was classified as misbehaviour without knowing your IP (and **please don't post your IP address in a reply to this message, this is a public group!**), but this is what it looks like when you are blocked for misbehaviour.

The sorts of things that count are malformed requests, or simply too many requests. Please wait a little while and try again, keeping this in mind. If you're still having trouble after waiting an hour, then please contact us privately at info [zat] commoncrawl [zot] org and we can investigate further.

All best,
TV

William Roe

Mar 1, 2025, 4:51:38 AM
to common...@googlegroups.com
Thom, 

Thanks. I recall we are supposed to limit access to two threads, which I've done... I'll try again over the weekend.

Bill
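
For readers following along, here is a minimal sketch of a client that takes the advice in this thread: it caps concurrency at two worker threads and waits progressively longer after each failed request. The function names (`fetch_with_backoff`, `fetch_all`), the retry count, and the delays are illustrative assumptions, not limits published by Common Crawl; the two-thread cap is simply the figure recalled above.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

# Assumption: the two-thread limit recalled in the thread above.
MAX_WORKERS = 2


def fetch_with_backoff(
    fetch: Callable[[str], str],
    url: str,
    retries: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url); after each OSError, wait twice as long and retry."""
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except OSError:
            # Network failures (connection resets, empty responses, URLError)
            # are OSError subclasses in the standard library.
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise ValueError("retries must be non-negative")


def fetch_all(
    fetch: Callable[[str], str], urls: Iterable[str], **kwargs
) -> List[str]:
    """Fetch every URL using at most MAX_WORKERS threads, preserving order."""
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(
            pool.map(lambda u: fetch_with_backoff(fetch, u, **kwargs), urls)
        )
```

The `fetch` callable is left to the caller, for example a small wrapper around `urllib.request.urlopen`, so the retry and concurrency logic can be tried out without touching the index server. If you have already been blocked, backing off inside the client is not a substitute for waiting the suggested hour before trying again.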
