Common Crawl

Welcome to the Common Crawl Group!

Common Crawl, a non-profit organization, provides an open repository of web crawl data that is freely accessible to all. In doing so, we aim to advance the open web and democratize access to information. Today, the Common Crawl Corpus encompasses over two petabytes of web crawl data collected over eight years, and collection is ongoing. As the largest, most comprehensive open repository of web crawl data on the cloud, we contribute to the thriving open data commons that drives innovation, research, and education.

This group is for discussion and collaboration among all those who use or seek to use Common Crawl data and/or share an interest in the open data ecosystem.

Please use this forum to:

Showing 1-20 of 440 topics
I have downloaded a single warc.gz file, but it has metadata information but not the content Bhavana 7/26/16
IP searching Dan Wolff 7/26/16
Common Crawl Index Down Eddie Johnson 7/18/16
Can you estimate the rate at which webpages' content change using Common Crawl? Uri Klarman 7/15/16
Need tech help validating if Common Crawl will work for our project J Curry 7/14/16
June 2016 crawl archive now available Sebastian Nagel 7/14/16
Commoncrawl mapreduce jobs using PHP how-to 7/13/16
I wanted to download .com, .net, .org, .gov files of common crawl. Bhavana 7/12/16
Using Common Crawl to Build a Specific Database Kim 7/11/16
extract data using offset value in CDX API Gautam Balasubramanian 7/4/16
Reminder: New path to Common Crawl corpus on AWS Sara Crouse 6/23/16
May 2016 crawl archive now available Sebastian Nagel 6/20/16
Update of CC Index to use new S3 bucket Sebastian Nagel 6/14/16
Re: License to use when republishing parts of the WDC data. Tom Morris 6/7/16
Is there plan to host the data on Google's public datasets? Derek Chia 6/7/16
URL list 6/2/16
How to build anchor's related data from CommonCrawl? Amir H. Jadidinejad 6/2/16
Big Data interview questions Info Cim 5/31/16
New path to Common Crawl Corpus on AWS Sara Crouse 5/23/16
Sebastian Nagel joins Common Crawl Sara Crouse 5/20/16