Common Crawl - Google Groups

Welcome to the Common Crawl Group!

Common Crawl, a non-profit organization, provides an open repository of web crawl data that is freely accessible to all. In doing so, we aim to advance the open web and democratize access to information. Today, the Common Crawl Corpus encompasses over two petabytes of web crawl data collected over eight years, and crawling is ongoing. As the largest, most comprehensive open repository of web crawl data on the cloud, we contribute to the thriving open data commons that drives innovation, research, and education.

This group is for discussion and collaboration among all those who use or seek to use Common Crawl data and/or share an interest in the open data ecosystem.

Please use this forum to:

Discuss challenges
Offer advice and share methods
Ask questions and get advice from others
Share ideas for projects and products
Look for collaborators and partners
Convey new and/or derivative uses of the data
Show off cool stuff you build
Keep up to date on the latest news from Common Crawl.

Jul 18

Index for Common Crawl News Dataset

Hi, I could not find any indexes to the Common Crawl News Dataset, so I decided to create some: https

Jun 25

June 2026 Crawl Archive and Corresponding Web Graph are now available

Hi everyone, Our June 2026 Crawl Archive and corresponding Web Graph are now available. The June 2026

Jun 14

The Columnar Index is now the URL Index

Please see: https://commoncrawl.org/blog/the-columnar-index-is-now-the-url-index -- greg

Ayrton Vu-Guerin, Tom Morris (3)

Jun 14

504 error at index

Thank you for the response. This was for a university assignment, and I just wanted to get WARC files

Jun 7

Common Crawl Foundation - Q2 2026 Update

Our purpose We continue our mission to preserve and provide open access to the public web at global

May 29

May 2026 Crawl and Web Graph

Hi everyone, Our May 2026 Crawl Archive and corresponding Web Graph are now available. The May 2026

허예솔, Greg Lindahl (2)

May 13

Inclusion check for an AI-native company domain (neogenesis.app) not in recent CC-MAIN snapshots

In the latest 3 crawls, we have seen 0 links to your website. We don't crawl github or hugging

Apr 30

April 2026 Crawl and Web Graphs

Hi all, Our April 2026 Crawl Archive and corresponding Web Graph are now available. The April 2026

Aussie Picker “As I See It” No BS

Apr 29

Built two free SEO tools on CC Web Graph data — offline SQLite approach on a single Windows machine

Hi all, Long-time lurker, first post. I wanted to share two free Windows desktop tools I built on top

Mar 24

March 2026 Crawl and Web Graphs

Hi all, Our March 2026 Crawl Archive and corresponding Web Graph are now available. The March 2026

Leo Chester (Leo), Greg Lindahl (2)

Mar 21

https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2025-26-dec-jan-feb/index.html You

Jose Tafur, Greg Lindahl (2)

Mar 11

index.commoncrawl.org returning empty responses (March 2026)

That's our rate limiter in action. cdx-toolkit is an example of code that goes slow enough.

Thom Vaughan, Bahar Zafer (3)

Feb 24

February 2026 Crawl and Web Graphs

Hi Bahar, thanks! No, the Web Graph edges aren't weighted, they're represented as directed (

12/23/25

December 2025 Crawl and Web Graphs

Greetings, everyone! We are pleased to announce the release of the December 2025 Crawl Archive and

Michele Bertasi, Sebastian Nagel (3)

12/22/25

Page Rank in 2025

Hi Sebastian, Thanks for all the pointers! Quite embarrassing that I missed Common Crawl's

Onni Hakala, Greg Lindahl (2)

11/26/25

Altering the Cloudfront WAF rate limiting error from 403 to 429

Please see the solution I gave you on Discord. I realize that you disagree with it. This is not the

11/24/25

November 2025 Crawl and Web Graphs

Hi all, Our November 2025 crawl and Web Graphs are now available. The November 2025 crawl (CC-MAIN-

Core, Al-Meer Technology (2)

11/21/25

[Discussion] Detecting Spider Traps by analyzing offline URL-MD5 data

Hi Bohan, That's a very interesting approach using a Trie for URL path analysis! While your Trie-

11/2/25

WARC filtering and Rate Limits.

Apologies if this is answered, but the official FAQ didn't seem to mention it and I cannot find

Quanyi Hong, Sebastian Nagel (2)

9/24/25

Looking for 2008–2013 Common Crawl data

Hi Quanyi, unfortunately, the 2008 - 2012 crawls use a different data format. Consolidating the

Bruno A. H. Vincent (Webmaster & Tech), Greg Lindahl (4)

8/31/25

How to get a list of domain names, 200 million of them, plain text, domain only, no other data

https://commoncrawl.org/web-graphs In particular this 3.2 gigabyte file has what you want from the

8/26/25

July/August Common Crawl Newsletter

We are pleased to release our newsletter for July and August 2025, with updates on our team's

Vansh Devgan, … Greg Lindahl (4)

8/26/25

S3 Charges For Outbound Or Not

> I am thinking to use S3 method as it can pull data faster to my machines. If you are outside AWS

Yiyun Zhang, … Rich Skrenta (9)

8/15/25

Questions on News Common Crawl data

Publishers frequently expose content outside their paywall initially to attract crawlers and SEO, and

7/27/25

July 2025 Crawl and Web Graphs

Hi folks, Our July 2025 crawl and Web Graphs are now available. The July 2025 crawl (CC-MAIN-2025-30)

7/8/25

The First WMDQS-Masakhane LangID Hackathon

In June 2025 the Common Crawl Foundation, MLCommons, and EleutherAI had the pleasure of hosting a

7/3/25

Error in Web Graphs for June 2025

Hi all, A problem with the Web Graphs for June 2025 was discovered and promptly reported by Romain

7/2/25

June 2025 Crawl and Web Graphs

Hi all, Our June 2025 crawl archive and its corresponding Web Graph release are now available. The

6/30/25

Common Crawl at the United Nations Open Source Week, June 2025

The Common Crawl Foundation team took part in the United Nations Open Source Week in New York City

Vansh Devgan, Sebastian Nagel (2)

6/27/25

Retrieving all URLs and endpoints associated with a specific domain.

Hi, you should have a look at the columnar URL index: https://commoncrawl.org/blog/index-to-warc-
