Common Crawl

Welcome to the Common Crawl Group!

Common Crawl, a non-profit organization, provides an open repository of web crawl data that is freely accessible to all. In doing so, we aim to advance the open web and democratize access to information. Today, the Common Crawl Corpus encompasses over two petabytes of web crawl data collected over eight years, and crawling is ongoing. As the largest and most comprehensive open repository of web crawl data on the cloud, we contribute to the thriving open data commons that drives innovation, research, and education.

This group is for discussion and collaboration among all those who use or seek to use Common Crawl data and/or share an interest in the open data ecosystem.

Please use this forum to:


Showing 1-20 of 496 topics
CDX Files vs ArchiveReader for WARC file processing | Andrew Psaltis | 6:44 AM
Page crawl depth | Petzl Stephan | 3:06 AM
Rate Limits | Nigel Vickers | 1/16/17
ccTLD .ru hosts heavily over represented | Christian Lund | 1/13/17
Using Common Crawl for a new project | Besnik Hajredini | 1/11/17
Fraud News | Juan Pablo Torino | 1/4/17
December 2016 crawl archive now available | Sebastian Nagel | 12/28/16
Re: Early crawls (2008/2009) | Sebastian Nagel | 12/27/16
Apache Spark Tutorial for free @mindmajix | Aarusha | 12/26/16
IRC/Slack? | Oli Lalonde | 12/15/16
how to read all the data of Common Crawl from AWS with Java? | Pierre Therrode | 12/13/16
Cant get proper indexes from common crawl | Tal Golan | 12/7/16
Javascript | Olexiy Lytvynenko | 12/6/16
Meanpath Jan 2014 Torrent - 1.6TB of crawl data from 115m websites. | Adam Seabrook | 12/1/16
No images with common crawl .warc files and pywb | Gregory Petropoulos | 11/30/16
Duplicates | Olexiy Lytvynenko | 11/29/16
Re: .wet file encoding | Sebastian Nagel | 11/23/16
How to operate on Common Crawl Dataset to extract website URL and the related emails? | Jaffer Wilson | 11/18/16
Question about web-sites that are not allowed to be scraped by their owners | ekaterina...@gmail.com | 11/15/16
Updated getting started with Common Crawl | Matt Horridge | 11/14/16