Common Crawl

Welcome to the Common Crawl Group!

Common Crawl, a non-profit organization, provides an open repository of web crawl data that is freely accessible to all. In doing so, we aim to advance the open web and democratize access to information. Today, the Common Crawl Corpus encompasses over two petabytes of web crawl data collected over eight years, and collection is ongoing. As the largest and most comprehensive open repository of web crawl data on the cloud, we contribute to the thriving open data commons that drives innovation, research, and education.

This group is for discussion and collaboration among all those who use or seek to use Common Crawl data and/or share an interest in the open data ecosystem.

Please use this forum to:

Recent topics (1-20 of 467):
Is non-html information being indexed? José González-Brenes 10/20/16
Number of URLs in March 2015 index & MIME type breakdown Tom Morris 10/19/16
https access José González-Brenes 10/19/16
Globus with s3 Common Crawl Nitin Chandra Badam 10/14/16
Common Crawl for News Articles Charu Arora 10/13/16
News Dataset Available Sebastian Nagel 10/13/16
URL indexes for 2012 - 2014 Sebastian Nagel 10/13/16
September 2016 crawl archive now available Sebastian Nagel 10/10/16
Unusual search question TheBean InABox 10/4/16
AWS S3 CP Christian Lund 10/3/16
s3://commoncrawl/ access denied Borislav Agapiev 9/29/16
Missing Open Graph property Christian Lund 9/29/16
New to Crawler Maytham Fahmi 9/28/16
get an WARC archive with all files from a domain David Portabella 9/27/16
how complete is CommonCrawl? David Portabella 9/27/16
Index / API for WAT files Christian Lund 9/25/16
Missing Indexes from November 2014 and before zbagz 9/24/16
java.lang.OutOfMemoryError: Java heap space David Portabella 9/22/16
Common Crawl Index Down Eddie Johnson 9/19/16
August 2016 crawl archive now available, release of robots.txt and redirects data set Sebastian Nagel 9/16/16