Common Crawl

Welcome to the Common Crawl Group!

Common Crawl, a non-profit organization, provides an open repository of web crawl data that is freely accessible to all. In doing so, we aim to advance the open web and democratize access to information. Today, the Common Crawl Corpus encompasses over two petabytes of web crawl data collected over eight years, with collection ongoing. As the largest, most comprehensive open repository of web crawl data on the cloud, we contribute to the thriving open data commons that drives innovation, research, and education.

This group is for discussion and collaboration among all those who use or seek to use Common Crawl data and/or share an interest in the open data ecosystem.

Please use this forum to:


Showing 1-20 of 540 topics
How many Hops? Asha Patel 6/26/17
Advise Needed on Crawling Question Asha Patel 6/26/17
Instructions for Common Crawl Aaron Johnson 6/26/17
Get your sh*t together folks. Squidblacklist org 6/21/17
Different query results from CDX-client between UI of http://index.commoncrawl.org/ Nelson Jiao 6/12/17
May 2017 crawl archive now available Sebastian Nagel 6/6/17
Web Graph "\n" on domain name Mark Smith 6/2/17
Need help ASAP!!! serikbek...@gmail.com 6/2/17
cc-index-server returning errors Erik Wickstrom 6/1/17
Working with the Downloaded Index John Masone 5/29/17
In-House Web Graph vertices.txt format confusion Mark Smith 5/28/17
Host-level web graph data set released Sebastian Nagel 5/26/17
Website data has....@gmail.com 5/26/17
how to know the file format fatemek...@gmail.com 5/26/17
warc data has....@gmail.com 5/24/17
Crawl Data (.wet) has....@gmail.com 5/24/17
CCNEWS data - File integrity checks Matthias Petri 5/17/17
URL list nuli...@gmail.com 5/15/17
How to get the full list of domains Luca 5/12/17
Commoncrawl index page down? Yuheng Du 5/10/17