Common Crawl

Welcome to the Common Crawl Group!

Common Crawl, a non-profit organization, provides an open repository of web crawl data that is freely accessible to all. In doing so, we aim to advance the open web and democratize access to information. Today, the Common Crawl Corpus encompasses over two petabytes of web crawl data collected over eight years, and collection is ongoing. As the largest, most comprehensive open repository of web crawl data in the cloud, we contribute to the thriving open data commons that drives innovation, research, and education.

This group is for discussion and collaboration among all those who use or seek to use Common Crawl data and/or share an interest in the open data ecosystem.

Please use this forum to:

Showing 1-20 of 427 topics
Update of CC Index to use new S3 bucket Sebastian Nagel 5/24/16
New path to Common Crawl Corpus on AWS Sara Crouse 5/23/16
Is there plan to host the data on Google's public datasets? Derek Chia 5/21/16
Sebastian Nagel joins Common Crawl Sara Crouse 5/20/16
difference between crawls Hadar Rottenberg 5/18/16
generation of WET files Bjarne Andersen 5/17/16
How to restrict the time horizon (like Jan/2015-Dec/2015) of crawling on a specific website by python code?~ Vincent Wong 5/16/16
Is there any place where information on when new crawls is announced? Ryan Jones 5/11/16
URL list 5/6/16
Re: License to use when republishing parts of the WDC data. Tom Morris 5/1/16
ANN: WebDataCommons releases 24.4 billion quads RDFa, Microdata, Embedded JSON-LD and Microformat data originating from 2.7 million pay-level-domains Robert Meusel 4/25/16
Few questions about Common Crawl Olivier Lalonde 4/21/16
mrjob: mapReduce job fails with an error about - "The system cannot find the file specified" 4/7/16
Is Urdu content included in CommonCrawl repository? Zubair Mohsin 4/6/16
Access to crawler's results for 2008-2012. Dmytro Rakovskyi 3/30/16
First timer with Common Crawl Jin 3/29/16
Regarding a free data sets Parchuri Sindhu 3/16/16
How to access blekko hosts extract? One Speed 3/14/16
Duplicate URLs with similar contents 3/14/16
A new project: Common Search Sylvain Zimmer 3/6/16